I'd be surprised if this were the case. next neurips I can survey some non native English speakers to see how many ML terms they know in English vs in their native language. I'm confident in my ability to administer this experiment on Chinese, French, and German speakers, which won't be an unbiased sample of non-native speakers, but hopefully still provides some signal.
only 2 people walked away without answering (after saying yes initially); they were not counted as yes or no. another several people refused to even answer, but this was also quite rare. the no responders seemed genuinely confused, as opposed to dismissive.
feel free to replicate this experiment at ICML or ICLR or next neurips.
not sure, i didn't keep track of this info. an important data point is that because essentially all ML literature is in english, non-anglophones generally either use english for all technical things, or at least codeswitch english terms into their native language. for example, i'd bet almost all chinese ML researchers would be familiar with the term CNN and it would be comparatively rare for people to say 卷积神经网络 (the chinese term for convolutional neural network). (some more common terms like 神经网络 or 模型 are used instead of their english counterparts - neural network / model - but i'd be shocked if people didn't know the english translations)
overall i'd be extremely surprised if there were a lot of people who knew conceptually the idea of AGI but didn't know that it was called AGI in english
the specific thing i said to people was something like:
excuse me, can i ask you a question to help settle a bet? do you know what AGI stands for? [if they say yes] what does it stand for? [...] cool thanks for your time
i was careful not to say "what does AGI mean".
most people who didn't know just said "no" and didn't try to guess. a few said something like "artificial generative intelligence". one said "amazon general intelligence" (??). the people who answered incorrectly were obviously guessing / didn't seem very confident in the answer.
if they seemed confused by the question, i would often repeat it and clarify with something like "the acronym AGI".
several people said yes but then started walking away the moment i asked what it stood for. this was kind of confusing and i didn't count those people.
I decided to conduct an experiment at neurips this year: I randomly surveyed people walking around in the conference hall to ask whether they had heard of AGI
I found that out of 38 respondents, only 24 could tell me what AGI stands for (63%)
we live in a bubble
I'm very excited about approaches to add hierarchy to SAEs - seems like an important step forward. In general, approaches that constrain latents in various ways that let us have higher L0 without reconstruction becoming trivial seem exciting.
I think it would be cool to get follow up work on bigger LMs. It should also be possible to do matryoshka with block size = 1 efficiently with some kernel tricks, which would be cool.
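To make the hierarchy idea concrete, here's a minimal sketch of what a matryoshka-style SAE loss could look like - not the actual implementation from any particular paper; the parameter names, shapes, and prefix sizes are all made up for illustration. The key idea is that nested prefixes of the latent vector each have to reconstruct the input on their own:

```python
import torch

def matryoshka_sae_loss(x, W_enc, b_enc, W_dec, b_dec, prefix_sizes=(64, 512, 4096)):
    # assumed shapes: x [batch, d_model]; W_enc [d_model, n_latents];
    # b_enc [n_latents]; W_dec [n_latents, d_model]; b_dec [d_model]
    z = torch.relu((x - b_dec) @ W_enc + b_enc)  # latent activations, [batch, n_latents]
    loss = torch.zeros((), device=x.device)
    for m in prefix_sizes:
        # reconstruct using only the first m latents; earlier latents are
        # pushed to carry the coarser, more general features
        x_hat_m = z[:, :m] @ W_dec[:m] + b_dec
        loss = loss + (x - x_hat_m).pow(2).sum(-1).mean()
    return loss
```

Block size = 1 would correspond to making every prefix length its own term, which is why the naive loop above would be too slow without some kernel trickery (you'd be doing n_latents separate reconstructions instead of a handful).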
I won't claim to be immune to peer pressure but at least on the epistemic front I think I have a pretty legible track record of believing things that are not very popular in the environments I've been in.
a medium with fewer limitations is strictly better for making good art, but it's also harder to identify good art among the sea of bad art because the medium alone is no longer as good a signal of quality
to be clear, a "winter/slowdown" in my typology is more about the vibes and might only amount to a few years of counterfactual slowdown. like the dot-com crash didn't take that long for companies like Amazon or Google to recover from, but it was still a huge vibe shift
also to further clarify, this is not an update I've made recently. I'm just making this post now as a regular reminder of my beliefs, because it seems good to have records of this kind of thing (though everyone who has heard me ramble about this irl can confirm I've believed something like this for a while now)
people often say that limitations of an artistic medium breed creativity. part of this could be the fact that when it is costly to do things, the only things done will be higher effort
also a lot of people will suggest that alignment people are discredited because they all believed AGI was 3 years away, because surely that's the only possible thing an alignment person could have believed. I plan on pointing to this and other statements similar in vibe that I've made over the past year or two as direct counter evidence against that
(I do think a lot of people will rightly lose credibility for having very short timelines, but I think this includes a big mix of capabilities and alignment people, and I think they will probably lose more credibility than is justified because the rest of the world will overupdate on the winter)
it's quite plausible (40% if I had to make up a number, but I stress this is completely made up) that someday there will be an AI winter or other slowdown, and the general vibe will snap from "AGI in 3 years" to "AGI in 50 years". when this happens it will become deeply unfashionable to continue believing that AGI is probably happening soonish (10-15 years), in the same way that suggesting that there might be a winter/slowdown is unfashionable today. however, I believe in these timelines roughly because I expect the road to AGI to involve both fast periods and slow bumpy periods. so unless there is some super surprising new evidence, I will probably only update moderately on timelines if/when this winter happens
a take I've expressed a bunch irl but haven't written up yet: feature sparsity might be fundamentally the wrong thing for disentangling superposition; circuit sparsity might be more correct to optimize for. in particular, circuit sparsity doesn't have problems with feature splitting/absorption
the most valuable part of a social event is often not the part that is ostensibly the most important, but rather the gaps between the main parts.
- at ML conferences, the headline keynotes and orals are usually the least useful part to go to; the random spontaneous hallway chats and dinners and afterparties are extremely valuable
- when doing an activity with friends, the activity itself is often of secondary importance. talking on the way to the activity, or in the gaps between doing the activity, carry a lot of the value
- at work, a lot of the best conversations don't happen in scheduled 1:1s and group meetings, but rather in spontaneous hallway conversations or dinner groups
related: https://xkcd.com/217/
People generally assume those around them agree with them (even when they don't see loud support of their position - see "silent majority"). So when you ask what their neighbors think, they will guess their neighbors have the same views as themselves, and will report their own beliefs with plausible deniability.
I'm claiming that even if you go all the way to BoN, it still doesn't necessarily leak less info to the model
for a sufficiently competent policy, the fact that BoN doesn't update the policy doesn't mean it leaks any fewer bits of info to the policy than normal RL
For ML, yes. I'm deriving this from the bitter lesson.
I think there are a whole bunch of inputs that determine a company's success. Research direction, management culture, engineering culture, product direction, etc. To be a really successful startup you often just need to have exceptional vision on one or a small number of these inputs, possibly even just once or twice. I'd guess it's exceedingly rare for a company to have leaders with consistently great vision across all the inputs that go into a company. Everything else will constantly revert towards local incentives. So, even in a company with top 1 percentile leadership vision quality, most things will still be messed up because of incentives most of the time.
For the purposes of the original question of whether people are overinvesting in interp due to it being useful for capabilities and therefore being incentivized, I think there's a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.
Separately, it's also not clear to me that the diffuse intuitions from interpretability have actually helped people a lot with capabilities. Obviously this is very hard to attribute, and I can't speak too much about details, but it feels to me like the most important intuitions come from elsewhere. What's an example of an interpretability work that you feel has affected capabilities intuitions a lot?
SAE steering doesn't seem like it obviously beats other steering techniques in terms of usefulness. I haven't looked closely into Hyena but my prior is that subquadratic attention papers probably suck unless proven otherwise.
Interpretability is certainly vastly more appealing to lab leadership than weird philosophy, but it's vastly less appealing than RLHF. But there are many many ML flavored directions and only a few of them are any good, so it's not surprising that most directions don't get a lot of attention.
Probably as interp gets better it will start to be helpful for capabilities. I'm uncertain whether it will be more or less useful for capabilities than just working on capabilities directly; on the one hand, mechanistic understanding has historically underperformed as a research strategy; on the other hand, it could be that this will change once we have a sufficiently good mechanistic understanding.
I don't think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers - they are basically noise.)
I think you should take into account the fact that before there are really good concrete capabilities results, the process that different labs use to decide what to invest in is highly contingent on a bunch of high variance things. Like, what kinds of research directions appeal to research leadership, or whether there happen to be good ICs around who are excited to work on that direction and not tied down to any other project.
I don't think you should be that surprised by interpretability being more popular than other areas of alignment. Certainly I think incentives towards capabilities are a small fraction of why it's popular and funded etc (if anything, its non-usefulness for capabilities to date may count against it). Rather, I think it's popular because it's an area where you can actually get traction and do well-scoped projects and have a tight feedback loop. This is not true of the majority of alignment research directions that actually could help with aligning AGI/ASI, and correspondingly those directions are utterly soul grinding to work on.
One could very reasonably argue that more people should be figuring out how to work on the low traction, ill-scoped, shitty feedback loop research problems, and that the field is looking under the streetlight for the keys. I make this argument a lot. But I think you shouldn't need to postulate some kind of nefarious capabilities incentive influence to explain it.
aiming directly for achieving some goal is not always the most effective way of achieving that goal.
there is an obvious utilitarian reason of not getting sick
also, from the perspective of an organization, it is highly effective to create lots of opportunities for people to find unique niches on the Pareto frontier of ability, as opposed to linearizing competence
in a highly competitive domain, it is often better and easier to be sui generis, rather than a top 10 percentile member of a large reference class
for people who are not very good at navigating social conventions, it is often easier to learn to be visibly weird than to learn to adapt to the social conventions.
this often works because there are some spaces where being visibly weird is tolerated, or even celebrated. in fact, from the perspective of an organization, it is good for your success if you are good at protecting weird people.
but from the perspective of an individual, leaning too hard into weirdness is possibly harmful. part of leaning into weirdness is intentional ignorance of normal conventions. this traps you in a local minimum where any progress on understanding normal conventions hurts your weirdness, but isn't enough to jump all the way to the basin of the normal mode of interaction.
(epistemic status: low confidence, just a hypothesis)
often the easiest way to gain status within some system is to achieve things outside that system
one kind of reasoning in humans is a kind of instant intuition; you see something and something immediately and effortlessly pops into your mind. examples include recalling vocabulary in a language you're fluent in, playing a musical instrument proficiently, or having a first guess at what might be going wrong when debugging.
another kind of reasoning is the chain of thought, or explicit reasoning: you lay out your reasoning steps as words in your head, interspersed perhaps with visuals, or abstract concepts that you would have a hard time putting in words. It feels like you're consciously picking each step of the reasoning. Working through a hard math problem, or explicitly designing a codebase by listing the constraints and trying to satisfy them, are examples of this.
so far these map onto what people call system 1 and 2, but I've intentionally avoided these labels because I think there's actually a third kind of reasoning that doesn't fit well into either of these buckets.
sometimes, I need to put the relevant info into my head, and then just let it percolate slowly without consciously thinking about it. at some later time, insights into the problem will suddenly and unpredictably pop into my head. I've found this mode of reasoning to be indispensable for dealing with the hardest problems, or for generating insights, where if I just did explicit reasoning I'd just get stuck.
of course, you can't just sit around and do nothing and hope insights come to you - to make this process work you have to absorb lots of info, and also do a lot of explicit reasoning before and after to take flashes of insight and turn them into actual fleshed-out knowledge. and there are conditions that are more or less conducive to this kind of reasoning.
I'm still figuring out how to best leverage it, but I think one hypothesis this raises is the possibility that a necessary ingredient in solving really hard problems is spending a bunch of time simply not doing any explicit reasoning, and creating whatever conditions are needed for subconscious insight-generating reasoning.
I think there are several reasons this division of labor is very minimal, at least in some places.
- You need way more of the ML engineering / fixing stuff skill than ML research. Like, vastly more. There are still a very small handful of people who specialize full time in thinking about research, but they are very few and often very senior. This is partly an artifact of modern ML putting way more emphasis on scale than academia.
- Communicating things between people is hard. It's actually really hard to convey all the context needed to do a task. If someone is good enough to just be told what to do without too much hassle, they're likely good enough to mostly figure out what to work on themselves.
- Convincing people to be excited about your idea is even harder. Everyone has their own pet idea, and you are the first engineer on any idea you have. If you're not a good engineer, you have a bit of a catch-22: you need promising results to get good engineers excited, but you need engineers to get results. I've heard of even very senior researchers finding it hard to get people to work on their ideas, so they just do it themselves.
a corollary is i think even once AI can automate the "google for the error and whack it until it works" loop, this is probably still quite far off from being able to fully automate frontier ML research, though it certainly will make research more pleasant
it's surprising just how much of cutting edge research (at least in ML) is dealing with really annoying and stupid bottlenecks. pesky details that seem like they shouldn't need attention. tools that in a good and just world would simply not break all the time.
i used to assume this was merely because i was inexperienced, and that surely eventually you learn to fix all the stupid problems, and then afterwards you can just spend all your time doing actual real research without constantly needing to context switch to fix stupid things.
however, i've started to think that as long as you're pushing yourself to do novel, cutting edge research (as opposed to carving out a niche and churning out formulaic papers), you will always spend most of your time fixing random stupid things. as you get more experienced, you get bigger things done faster, but the amount of stupidity is conserved. as they say in running: it doesn't get easier, you just get faster.
as a beginner, you might spend a large part of your research time trying to install CUDA or fighting with python threading. as an experienced researcher, you might spend that time instead diving deep into some complicated distributed training code to fix a deadlock or debugging where some numerical issue is causing a NaN halfway through training.
i think this is important to recognize because you're much more likely to resolve these issues if you approach them with the right mindset. when you think of something as a core part of your job, you're more likely to engage your problem solving skills fully to try and find a resolution. on the other hand, if something feels like a brief intrusion into your job, you're more likely to just hit it with a wrench until the problem goes away so you can actually focus on your job.
in ML research the hit it with a wrench strategy is the classic "google the error message and then run whatever command comes up" loop. to be clear, this is not a bad strategy when deployed properly - this is often the best first thing to try when something breaks, because you don't have to do a big context switch and lose focus on whatever you were doing before. but it's easy to end up trapped in this loop for too long. at some point you should switch modes to actively understanding and debugging the code, which is easier to do if you think of your job as mostly being about actively understanding and debugging code.
earlier in my research career i would feel terrible about having spent so much time doing things that were not the "actual" research, which would make me even more likely to just hit things with a wrench, which actually did make me less effective overall. i think shifting my mindset since then has helped me a lot
in research, if you settle into a particular niche you can churn out papers much faster, because you can develop a very streamlined process for that particular kind of paper. you have the advantage of already working baseline code, context on the field, and a knowledge of the easiest way to get enough results to have an acceptable paper.
while these efficiency benefits of staying in a certain niche are certainly real, I think a lot of people end up in this position because of academic incentives - if your career depends on publishing lots of papers, then a recipe to get lots of easy papers with low risk is great. it's also great for the careers of your students, because if you hand down your streamlined process, then they can get a phd faster and more reliably.
however, I claim that this also reduces scientific value, and especially the probability of a really big breakthrough. big scientific advances require people to do risky bets that might not work out, and often the work doesn't look quite like anything anyone has done before.
as you get closer to the frontier of things that have ever been done, the road gets tougher and tougher. you end up spending more time building basic infrastructure. you explore lots of dead ends and spend lots of time pivoting to new directions that seem more promising. you genuinely don't know when you'll have the result that you'll build your paper on top of.
so for people who are not beholden as strongly to academic incentives, it might make sense to think carefully about the tradeoff between efficiency and exploration.
(not sure I 100% endorse this, but it is a hypothesis worth considering)
people love to find patterns in things. sometimes this manifests as mysticism: trying to find patterns where they don't exist, insisting that things are not coincidences when they totally just are. i think a weaker version of this kind of thinking shows up a lot in e.g. literature too - events occur not because of the bubbling randomness of reality, but rather carry symbolic significance for the plot. things don't just randomly happen without deeper meaning.
some people are much more likely to think in this way than others. rationalists are very far along the spectrum in the "things just kinda happen randomly a lot, they don't have to be meaningful" direction.
there are some obvious cognitive bias explanations for why people would see meaning/patterns in things. most notably, it's comforting to feel like we understand things. the idea of the world being deeply random and things just happening for no good reason is scary.
but i claim that there is something else going on here. I think an inclination towards finding latent meaning is actually quite applicable when thinking about people. people's actions are often driven by unconscious drives, and so end up quite strongly correlated with those drives. in fact, unconscious thoughts are often the true drivers, and the conscious thoughts are just the rationalization. but from the inside, it doesn't feel that way; from the inside it feels like having free will, and everything that is not a result of conscious thought is random or coincidental. this is a property that is not nearly as true of technical pursuits, so it's very reasonable to expect a different kind of reasoning to be ideal.
not only is this useful for modelling other people, but it's even more useful for modelling yourself. things only come to your attention if your unconscious brain decides to bring them to your attention. so even though something happening to you may be a coincidence, whether you focus on it or forget about it tells you a lot about what your unconscious brain is thinking. from the inside, this feels like things that should obviously be coincidence nonetheless having some meaning behind them. even the noticing of a hypothesis for the coincidence is itself a signal from your unconscious brain.
I don't quite know what the right balance is. on the one hand, it's easy to become completely untethered from reality by taking this kind of thing too seriously and becoming superstitious. on the other hand, this also seems like an important way of thinking about the world that is easy for people like me (and probably lots of people on LW) to underappreciate.
in some way, bureaucracy design is the exact opposite of machine learning. while the goal of machine learning is to make clusters of computers that can think like humans, the goal of bureaucracy design is to make clusters of humans that can think like a computer
The o1 public documentation neither confirms nor denies whether process based supervision was used.
It seems pretty reasonable that if an ordinary person couldn't have found the information about making a bioweapon online because they don't understand the jargon or something, and the model helps them understand the jargon, then we can't blanket-reject the possibility that the model materially contributed to causing the critical harm. Rather, we then have to ask whether the harm would have happened even if the model didn't exist. So for example, if it's very easy to hire a human expert without moral scruples for a non-prohibitive cost, then it probably would not be a material contribution from the model to translate the bioweapon jargon.
Basically agree - I'm generally a strong supporter of looking at the loss drop in terms of effective compute. Loss recovered using a zero-ablation baseline is really quite wonky and gives misleadingly big numbers.
I also agree that reconstruction is not the only axis of SAE quality we care about. I propose explainability as the other axis - whether we can make necessary and sufficient explanations for when individual latents activate. Progress then looks like pushing this Pareto frontier.
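As a toy illustration of why the zero-ablation version tends to look so flattering compared to the raw loss gap (which is what you'd feed into an effective-compute conversion), here's a rough sketch with entirely made-up numbers:

```python
clean_loss = 2.50      # LM loss with the original activations (made-up number)
sae_loss = 2.70        # LM loss with the SAE reconstruction spliced in (made-up)
zero_abl_loss = 7.00   # LM loss with the activations zero-ablated (made-up)

# zero-ablation "loss recovered": the huge baseline gap makes this look great
loss_recovered = (zero_abl_loss - sae_loss) / (zero_abl_loss - clean_loss)
print(loss_recovered)  # ~0.956, i.e. "95.6% of loss recovered"

# loss-gap framing: the model got 0.20 nats worse, which a compute-vs-loss
# scaling fit would typically translate into a sizeable effective-compute hit
print(sae_loss - clean_loss)  # 0.20
```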
Extremely valid, you've convinced me that atom is probably a bad term for this
I like the word "atom" to refer to units inside an SAE
Keep in mind that if, hypothetically, there were major compute efficiency tricks to be had, they would likely not be shared publicly. So the absence of publicly known techniques is not strong evidence in either direction.
Also, in general I start from a prior of being skeptical of papers claiming their models are comparable/better than GPT-4. It's very easy to mislead with statistics - for example, human preference comparisons depend very heavily on the task distribution, and how discerning the raters are. I have not specifically looked deeply into Llama 405B though.
This is likely not the first instance, but OpenAI was already using the word "aligned" in this way in 2021 in the Codex paper.
https://arxiv.org/abs/2107.03374 (section 7.2)
investment in anything speculative, including alignment and AGI research, is likely to decrease if the economy is not doing great
for a sense of scale of just how bubbly things can get: Bitcoin has a market cap of ~1T, and the entirety of crypto ~2T. Crypto does produce some amount of real value, but probably on the order of 1% of that market cap. So it's not at all unheard of for speculation to account for literally trillions of dollars of market cap (or ~tens of billions of earnings per year, at a reasonable P/E ratio)
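to spell out that last bit of arithmetic (the P/E ratio here is just an assumed illustrative number, not a claim about crypto specifically):

```python
speculative_mcap = 2e12   # ~$2T of speculative market cap
pe_ratio = 25             # an assumed "reasonable" P/E, purely for illustration
print(speculative_mcap / pe_ratio / 1e9)  # ~80, i.e. tens of billions of dollars/year
```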
economic recession and subsequent reduction in speculative research, including towards AGI, seems very plausible
AI (by which I mean, like, big neural networks and whatever) is not that economically useful right now. furthermore, current usage figures are likely an overestimate of true economic usefulness because a very large fraction of it is likely to be bubbly spending that will itself dry up if there is a recession (legacy companies putting LLMs into things to be cool, startups that are burning money without PMF, consumers with disposable income to spend on entertainment).
it will probably still be profitable to develop AI tech, but things will be much more tethered to consumer usefulness.
this probably doesn't set AGI back that much but I think people are heavily underrating this as a possibility. it also probably heavily impacts the amount of alignment work done at labs.
even if scaling does eventually solve the reliability problem, it means that very plausibly people are overestimating how far along capabilities are, and how fast the rate of progress is, because the most impressive thing that can be done with 90% reliability plausibly advances faster than the most impressive thing that can be done with 99.9% reliability
I think even if failures are automatically detectable, it's quite annoying. the cost is very logarithmic: there's a very large cliff in effort when going from zero manual intervention required to any manual intervention required whatsoever; and as the amount of manual intervention continues to increase, you can invest in infrastructure to make it less painful, and then to delegate the work out to other people.
reliability is surprisingly important. if I have a software tool that is 90% reliable, it's actually not that useful for automation, because I will spend way too much time manually fixing problems. this is especially a problem if I'm chaining multiple tools together in a script. I've been bit really hard by this because 90% feels pretty good if you run it a handful of times by hand, but then once you add it to your automated sweep or whatever it breaks and then you have to go in and manually fix things. and getting to 99% or 99.9% is really hard because things break in all sorts of weird ways.
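to make the chaining point concrete, here's a toy calculation (the step counts and reliability numbers are made up, just to show the shape of the effect):

```python
# probability that an n-step pipeline finishes with no manual intervention,
# assuming each step fails independently
for per_step in (0.90, 0.99, 0.999):
    print(per_step, [round(per_step ** n, 3) for n in (1, 5, 20)])
# prints roughly: 0.9 [0.9, 0.59, 0.122]; 0.99 [0.99, 0.951, 0.818]; 0.999 [0.999, 0.995, 0.98]
```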
I think this has lessons for AI - lack of reliability is one big reason I fail to get very much value out of AI tools. if my chatbot catastrophically hallucinates once every 10 queries, then I basically have to look up everything anyways to check. I think this is a major reason why cool demos often don't mean things that are practically useful - 90% reliable is great for a demo (and also you can pick tasks that your AI is more reliable at, rather than tasks which are actually useful in practice). this is one of the factors informing why my timelines are longer than some other people's