Comments
I find that surprising, do you care to elaborate? I don't think his worldview is complete, but he cares deeply about a lot of things I value too, which modern society seems not to value. I would certainly be glad to have him in my moral parliament.
Feels connected to his distrust of "quick, bright, standardized, mental processes", and the obsession with language. It's like his mind is relentlessly orienting to the territory, refusing to accept anyone else's map. Which makes it harder to be a student but easier to discover something new. Reminds me of Geoff Hinton's advice to not read the literature before engaging with the problem yourself.
I like this a lot! A few scattered thoughts:
- This theory predicts and explains "therapy-resistant dissociation", or the common finding that none of the "woo" exercises like focusing, meditation, etc., actually work (cf. Scott's experience as described in https://www.astralcodexten.com/p/are-woo-non-responders-defective). If there's an active strategy of self-deception, you'd expect people to react negatively (or learn to not react via yet deeper levels of self-deception) to straightforward attempts to understand and untangle one's psychology.
- It matches and extends Robert Trivers' theory of self-deception, wherein he predicts that when your mind is the site of a conflict between two sub-parts, the winning one will always be subconscious, because the conscious mind is visible to the subconscious but not vice versa, and being visible makes you weak. Thus, counterintuitively, the mind we are conscious of - in your phrase the false self - is always the losing part.
- It connects to a common question I have for people doing meditation seriously - why exactly do you want to make the subconscious conscious? Why is it such a good thing to "become more conscious"? Now I can make the question more precise - why do you think it's safe to have more access to your thoughts and feelings than your subconscious gave you? And how exactly do you plan to deal with all the hostile telepaths out there (possibly including parts of yourself)? I expect most people find themselves dealing with (partly) hostile telepaths all the time, and so Occlumency is genuinely necessary unless one lives in an extraordinarily controlled environment such as a monastery.
- Social deception games like Avalon or Diplomacy provide a fertile ground for self- and group experimentation with the ideas in this essay.
I know this isn't the central point of your life reviews section but curious if your model has any lower bound on life review timing - if not minutes to hours, at least seconds? milliseconds? (1 ms being a rough lower bound on the time for a signal to travel between two adjacent neurons).
If it's at least milliseconds it opens the strange metaphysical possibility of certain deaths (e.g. from very intense explosions) being exempt from life reviews.
Really appreciated this exchange; Ben & Alex have a rare conversational chemistry and ability to sense-make productively at the edge of their world models.
I mostly agree with Alex on the importance of interfacing with extant institutional religion, though less sure that one should side with pluralists over exclusivists. For example, exclusivist religious groups seem to be the only human groups currently able to reproduce themselves, probably because exclusivism confers protection against harmful memes and cultural practices.
I'm also pursuing the vision of a decentralized singleton as an alternative to Moloch or turnkey totalitarianism, although it's not obvious to me how the psychological insights of religious contemplatives are crucial here, rather than skilled deployment of social technology like the common law, nation states, mechanism design, cryptography, recommender systems, LLM-powered coordination tools, etc. Is there evidence that "enlightened" people, for some sense of "enlightened", are in fact better at cooperating with each other at scale?
If we do achieve existential security through building a stable decentralized singleton, it seems much more likely that it would be the result of powerful new social tech, rather than the result of intervention on individual psychology. I suppose it could be the result of both with one enabling the other, like the printing press enabling the Reformation.
Definitely agree there's some power-seeking equivocation going on, but I wanted to offer a less sinister explanation from my experiences in AI research contexts. A lot of the equivocation and blurring of boundaries seems to come from people trying to work on concrete problems and obtain empirical information. A typical thought process:
- alignment seems maybe important?
- ok what experiment can I set up that lets me test some hypotheses
- can't really test the long-term harms directly, let me test an analogue in a toy environment or on a small model, publish results
- when talking about the experiments, I'll often motivate them by talking about long-term harm
Not too different from how research psychologists will start out trying to understand the Nature of Mind and then run a n=20 study on undergrads because that's what they had budget for. We can argue about how bad this equivocation is for academic research, but it's a pretty universal pattern and well-understood within academic communities.
The unusual thing in AI is that researchers have most of the decision-making power in key organizations, so these research norms leak out into the business world, and no-one bats an eye at a "long-term safety research" team that mostly works on toy and short term problems.
This is one reason I'm more excited about building up "AI security" as a field and hiring infosec people instead of ML PhDs. My sense is that the infosec community actually has good norms for thinking about and working on things-shaped-like-existential-risks, and the AI x-risk community should inherit those norms, not the norms of academic AI research.
by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn't be a 'warning shot', it'd just be a 'shot' or 'disaster'.)
Yours is the more direct definition but from context I at least understood 'warning shot' to mean 'disaster', on the scale of a successful terrorist attack, where the harm is large and undeniable and politicians feel compelled to Do Something Now. The 'warning' is not of harm but of existential harm if the warning is not heeded.
I do still expect such a warning shot, though as you say it could very well be ignored even if there are large undeniable harms (e.g. if a hacker group deploys a rogue AI that causes a trillion dollars of damage, we might take that as a warning about terrorism or cybersecurity, not about AI).
Agreed that coalitional agency is somehow more natural than squiggly-optimizer agency. Besides people, another class of examples are historical empires (like the Persian and then Roman) which were famously lenient [1] and respectful of local religious and cultural traditions; i.e. optimized coalition builders that offered goal-stability guarantees to their subagent communities, often stronger guarantees than those communities could expect by staying independent.
This extends my argument in Cooperators are more powerful than agents - in a world of hierarchical agency, evolution selects not for world-optimization / power-seeking but for cooperation, which looks like coalition-building (negotiation?) at the higher levels of organization and coalition-joining (domestication?) at the lower levels.
I don't see why this tendency should break down at higher levels of intelligence; if anything, it should get stronger as power-seeking patterns are detected early and destroyed by well-coordinated defensive coalitions. There's still no guarantee that coalitional superintelligence will respect "human values" any more than we respect the values of ants; but contra Yudkowsky-Bostrom-Omohundro, doom is not the default outcome.
[1] if you surrendered!
Correct, I was not offered such paperwork nor any incentives to sign it. Edited my post to include this.
I left Anthropic in June 2023 and am not under any such agreement.
EDIT: nor was any such agreement or incentive offered to me.
- Agree that trust and cooperation are dual-use, and I'm not sure how to think about this yet; perhaps the most important form of coordination is the one that prevents (directly or via substitution) harmful forms of coordination from arising.
- One reason I wouldn't call lack of altruism the root is that it's not clear how to intervene on it; it's like calling the laws of physics the root of all evil. I prefer to think about "how to reduce transaction costs to self-interested collaboration". I'm also less sure that a society of people with more altruistic motives will necessarily do better... the nice thing about self-interest is that your degree of care is proportional to your degree of knowledge about the situation. A society of extremely altruistic people who are constantly devoting resources to solving what they believe to be other people's problems may actually be less effective at ensuring flourishing.
You're right the conclusion is quite underspecified - how exactly do we build such a cooperation machine?
I don't know yet, but my bet is more on engineering, product design, and infrastructure than on social science. More like building a better Reddit or Uber (or supporting infrastructure layers like WWW and the Internet) than like writing papers.
Would love to see this idea worked out a little more!
I like the "guardian" framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors, and ideally protects, against such superhuman persuaders.
Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.
When I say the world selects for "cooperation" I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).
Agreed with the complicity point - as evo-sim experiments like Axelrod's showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of "hammers" fails to satisfy.
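To make the Axelrod point concrete, here's a minimal iterated prisoner's dilemma sketch (my toy version, not Axelrod's actual tournament setup): a defector exploits unconditional cooperators but gains almost nothing against a strategy that punishes defection.

```python
# Minimal iterated prisoner's dilemma with the standard payoffs T=5, R=3, P=1, S=0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def always_defect(my_hist, their_hist):
    return "D"

def always_cooperate(my_hist, their_hist):
    return "C"

def tit_for_tat(my_hist, their_hist):
    # Cooperate first, then punish defection by copying the opponent's last move.
    return their_hist[-1] if their_hist else "C"

def play(strat_a, strat_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_a, hist_b), strat_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

# Defection pays handsomely against pure cooperators, and barely at all
# against a strategy that retaliates.
print(play(always_defect, always_cooperate))  # (500, 0)
print(play(always_defect, tit_for_tat))       # (104, 99)
```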
Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors are controlling 90% of AI-relevant compute then it seems plausible that they could defend against 10% of the compute being controlled by misaligned AGI or other bad actors - by denying them resources, by hardening core infrastructure, via MAD, etc.
I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:
- Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if not millions of people, distributed over decades or even centuries.
- Pivotal acts have stronger claims of impact, but generally have weaker claims of the sign of that impact - actually realistic pivotal-seeming acts like "unilaterally deploy a friendly-seeming AI singleton" or "institute a stable global totalitarianism" are extremely, existentially dangerous. If someone identifies a pivotal-seeming act that is actually robustly positive, I'll be the first to sign on.
- In contrast, gradual steering proposals like "improve AI lab communication" or "improve interpretability" have weaker claims to impact, but stronger claims to being net positive across many possible worlds, and are much less subject to multi-agent problems like races and the unilateralist's curse.
- True, complete existential safety probably requires some measure of "solving politics" and locking in current human values, hence may not be desirable. Like what if the Long Reflection decides that the negative utilitarians are right and the world should in fact be destroyed? I won't put high credence on that, but there is some level of accidental existential risk that we should be willing to accept in order to not lock in our values.
You might find AI Safety Endgame Stories helpful - I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.
Nate's "how various plans miss the hard bits of the alignment challenge" might also be helpful as it communicates the "dynamics of doom" that success stories have to fight against.
One thing I would love is to have a categorization of safety stories by claims about the world. E.g what does successful intervention look like in worlds where one or more of the following claims hold:
- No serious global treaties on AI ever get signed.
- Deceptive alignment turns out not to be a problem.
- Mechanistic interpretability becomes impractical for large enough models.
- CAIS turns out to be right, and AI agents simply aren't economically competitive.
- Multi-agent training becomes the dominant paradigm for AI.
- Due to a hardware / software / talent bottleneck there turns out to be one clear AI capabilities leader with nobody else even close.
These all seem like plausible worlds to me, and it would be great if we had more clarity about what worlds different interventions are optimizing for. Ideally we should have bets across all the plausible worlds in which intervention is tractable, and I think that's currently far from being true.
I don't mean to suggest "just supporting the companies" is a good strategy, but there are promising non-power-seeking strategies like "improve collaboration between the leading AI labs" that I think are worth biasing towards.
Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn't delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.
Interesting, I haven't seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive. An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.
Fabricated options are products of incoherent thinking; what is the incoherence you're pointing out with policies that aim to delay existential catastrophe or reduce transaction costs between existing power centers?
I've considered starting an org that was either aimed at generating better alignment data or would do so as a side effect, and this is really helpful - this kind of negative information is nearly impossible to find.
Is there a market niche for providing more interactive forms of human feedback, where it's important to have humans tightly in the loop with an ML process, rather than "send a batch to raters and get labels back in a few hours"? One reason RLHF is so little used is the difficulty of setting up this kind of human-in-the-loop infrastructure. Safety approaches like debate, amplification and factored cognition could also become competitive much faster if it was easier and faster to get complex human-in-the-loop pipelines running.
Maybe Surge already does this? But if not, you wouldn't necessarily want to compete with them on their core competency of recruiting and training human raters. Just use their raters (or Scale's), and build good reusable human-in-the-loop infrastructure, or maybe novel user interfaces that improve supervision quality.
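To gesture at what I mean by reusable human-in-the-loop infrastructure, a hypothetical sketch (these names are illustrative, not real Surge or Scale APIs): the point is that a training loop can request fresh human comparisons synchronously instead of waiting hours for an offline batch.

```python
# Hypothetical human-in-the-loop layer sketch; in production the rater callback
# would hit an external rater platform rather than a local stub.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ComparisonRequest:
    prompt: str
    completion_a: str
    completion_b: str

def collect_preferences(requests: List[ComparisonRequest],
                        ask_rater: Callable[[ComparisonRequest], int]) -> List[int]:
    """Route comparisons to human raters; return 0 (prefer A) or 1 (prefer B).

    A real implementation would fan out across a rater pool with timeouts,
    redundancy, and agreement checks; here the rater is just a callback.
    """
    return [ask_rater(r) for r in requests]

# Usage with a stand-in "rater":
requests = [ComparisonRequest("Summarize the memo.", "Short summary.", "A rambling one.")]
labels = collect_preferences(requests, ask_rater=lambda r: 0)
print(labels)  # [0] -> rater preferred completion A
```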
I think a substantial fraction of ML researchers probably agree with Yann LeCun that AI safety will be solved "by default" in the course of making the AI systems useful. The crux is probably related to questions like how competent society's response will be, and maybe the likelihood of deceptive alignment.
Two points of disagreement though:
- I don't think setting P(doom) = 10% indicates a lack of engagement or imagination; Toby Ord in The Precipice also gives a 10% estimate for AI-derived x-risk this century, and I assume he's engaged pretty deeply with the alignment literature.
- I don't think P(doom) = 10% or even 5% should be your threshold for "taking responsibility". I'm not sure I like the responsibility frame in general, but even a 1% chance of existential risk is big enough to outweigh almost any other moral duty in my mind.
Thank you for putting numbers on it!
~60%: there will be an existential catastrophe due to deceptive alignment specifically.
Is this an unconditional prediction of a 60% chance of existential catastrophe due to deceptive alignment alone, in contrast to the commonly used 10% chance of existential catastrophe from all AI sources this century? Or do you mean that, conditional on there being an existential catastrophe due to AI, there's a 60% chance it will be caused by deceptive alignment and a 40% chance by other problems like misuse or outer alignment?
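For concreteness, the two readings differ by roughly a factor of ten; under the conditional reading, combined with the 10% all-sources figure (my arithmetic, not a number from your post):

```latex
P(\text{doom} \wedge \text{deceptive alignment})
  = P(\text{doom from AI}) \cdot P(\text{deceptive alignment} \mid \text{doom})
  \approx 0.10 \times 0.60 = 0.06
```

versus 0.60 outright under the unconditional reading.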
Agreed with the sentiment, though I would make a weaker claim: that AGI timelines are not uniquely strategically relevant, and that the marginal hour of forecasting work at this point is better spent on other questions.
My guess is that the timelines question has been investigated and discussed so heavily because for many people it is a crux for whether or not to work on AI safety at all - and there are many more such people than there are alignment researchers deciding what approach to prioritize. Most people in the world are not convinced that AGI safety is a pressing problem, and building very robust and legible models showing that AGI could happen soon is, empirically, a good way to convince them.
Mostly orthogonal:
- Evan's post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
- This post argues against (a version of) Evan's premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn't expect gradient descent to select for it.
Other relevant differences are
- gradient descent doesn't actually select for low time complexity / minimal circuits; it holds time & space complexity fixed, while selecting for low L2 norm. But I think you could probably do a similar reduction for L2 norm as Evan does for minimal circuits. The crux is in the premise.
- I think Evan is using a broader definition of search than I am in this post, closer to John Wentworth's definition of search as "general problem solving algorithm".
- Evan is doing worst-case analysis (can we completely rule out the possibility of deception by penalizing time complexity?) whereas I'm focusing on the average or default case.
Agreed with Rohin that a key consideration is whether you are trying to form truer beliefs, or to contribute novel ideas, and this in turn depends on what role you are playing in the collective enterprise that is AI safety.
If you're the person in charge of humanity's AI safety strategy, or a journalist tasked with informing the public, or a policy person talking to governments, it makes a ton of sense to build a "good gears-level model of what their top 5 alignment researchers believe and why". If you're a researcher, tasked with generating novel ideas that the rest of the community will filter and evaluate, this is probably not where you want to start!
In particular I basically buy the "unique combination of facts" model of invention: you generate novel ideas when you have a unique subset of the collective's knowledge, so the ideas seem obvious to you and weird or wrong to everyone else.
Two examples from leading scientists (admittedly in more paradigmatic fields):
- I remember Geoff Hinton saying at his Turing award lecture that he strongly advised new grad students not to read the literature before trying, for months, to solve the problem themselves.
- Richard Hamming in You and Your Research:
If you read all the time what other people have done you will think the way they thought. If you want to think new thoughts that are different, then do what a lot of creative people do - get the problem reasonably clear and then refuse to look at any answers until you've thought the problem through carefully how you would do it, how you could slightly change the problem to be the correct one. So yes, you need to keep up. You need to keep up more to find out what the problems are than to read to find the solutions.
Would love to see your math! If L2 norm and Kolmogorov complexity provide roughly equivalent selection pressure, that's definitely a crux for me.
Agreed that the existence of general-purpose heuristic-generators like relaxation is a strong argument for why we should expect to select for inner optimizers that look something like A*, contrary to my gradient descent doesn't select for inner search post.
Recursive structure creates an even stronger bias toward things like A* but only in recurrent neural architectures (so notably not currently-popular transformer architectures, though it's plausible that recurrent architectures will come back).
I maintain that the compression / compactness argument from "Risks from Learned Optimization" is wrong, at least in the current ML regime:
In general, evolved/trained/selected systems favor more compact policies/models/heuristics/algorithms/etc. In ML, for instance, the fewer parameters needed to implement the policy, the more parameters are free to vary, and therefore the more parameter-space-volume the policy takes up and the more likely it is to be found. (This is also the main argument for why overparameterized ML systems are able to generalize at all.)
I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity. See Neel's recent interpretability post for an example of weight decay slowly selecting a generalizable algorithm over (non-generalizable) memorization over the course of training.
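As a minimal illustration of the L2 story (my own toy construction, not taken from Neel's post): in overparameterized linear regression, gradient descent from zero already converges to the minimum-L2-norm interpolant, and explicit weight decay shrinks the norm further.

```python
# Sketch: implicit and explicit L2 bias in overparameterized linear regression.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # fewer data points than parameters
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)      # consistent targets

def gd(weight_decay=0.0, lr=5e-3, steps=20_000):
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + weight_decay * w
        w -= lr * grad
    return w

w_gd = gd()
w_min_norm = np.linalg.pinv(X) @ y   # the minimum-norm interpolating solution
w_wd = gd(weight_decay=1.0)

print(np.allclose(w_gd, w_min_norm, atol=1e-4))      # True: implicit L2 bias
print(np.linalg.norm(w_wd) < np.linalg.norm(w_gd))   # True: weight decay lowers the norm
```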
I don't understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn't we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters? In practice we see maybe 2x distillation before dramatic performance losses, meaning most of those parameters really are essential to the learned policy.
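For reference, "distilling" here means training a smaller student to match the teacher's outputs, typically with a Hinton-style objective like the sketch below; the 2x-vs-100x question is about how small the student can get before performance collapses, not about the loss itself.

```python
# Standard knowledge-distillation objective (soft targets at temperature T plus
# ordinary cross-entropy on hard labels).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: batch of 8, vocabulary of 1000.
student = torch.randn(8, 1000, requires_grad=True)
teacher = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student, teacher, labels))
```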
Overall though this post updated me substantially towards expecting the emergence of inner A*-like algorithms, despite their computational overhead. Added it to the list of caveats in my post.
Yeah it's probably definitions. With the caveat that I don't mean the narrow "literally iterates over solutions", but roughly "behaves (especially off the training distribution) as if it's iterating over solutions", like Abram Demski's term selection.
I disagree that performing search is central to human capabilities relative to other species. The cultural intelligence hypothesis seems much more plausible: humans are successful because our language and ability to mimic allow us to accumulate knowledge and coordinate at massive scale across both space and time. Not because individual humans are particularly good at thinking or optimizing or performing search. (Not sure what the implications of this are for AI).
You're right though, I didn't say much about alternative algorithms other than point vaguely in the direction of hierarchical control. I mostly want to warn people not to reason about inner optimizers the way they would about search algorithms. But if it helps, I think AlphaStar is a good example of an algorithm that is superhuman in a very complex strategic domain but is very likely not doing anything like "evaluating many possibilities before settling on an action". In contrast to AlphaZero (with rollouts), which considers tens of thousands of positions before selecting an action. AlphaZero (just the policy network) I'm more confused about... I expect it still isn't doing search, but it is literally trained to imitate the outcome of a search so it might have similar mis-generalization properties?
I agree that A* and gradient descent are central examples of search; for realistic problems these algorithms typically evaluate the objective on millions of candidates before returning an answer.
In contrast, human problem solvers typically do very little state evaluation - perhaps evaluating a few dozen possibilities directly, and relying (as you said) on abstractions and analogies instead. I would call this type of reasoning "not very search-like".
On the far end we have algorithms like Gauss-Jordan elimination, which just compute the optimal solution directly without evaluating any possibilities. Calling them "search algorithms" seems quite strange.
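A toy contrast in code (my illustration, using np.linalg.solve as a stand-in for direct elimination):

```python
# "Search-like": enumerate and score candidates. Not search-like: compute the
# answer directly, evaluating zero alternatives along the way.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
b = rng.standard_normal(3)

# Search-like: propose many candidates, keep the one with the lowest residual.
candidates = rng.standard_normal((100_000, 3))
scores = np.linalg.norm(candidates @ A.T - b, axis=1)
x_search = candidates[scores.argmin()]       # evaluated 100,000 possibilities

# Direct solve (LU / Gaussian elimination under the hood): no candidates at all.
x_direct = np.linalg.solve(A, b)

print(np.linalg.norm(A @ x_search - b))  # still well above machine precision
print(np.linalg.norm(A @ x_direct - b))  # ~1e-16
```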
a general search process is something which takes in a specification of some problem or objective (from a broad class of possible problems/objectives), and returns a plan which solves the problem or scores well on the objective.
This appears to be a description of any optimization process, not just search - in particular it would include Gauss-Jordan elimination. I guess your ontology has "optimization" and "search" as synonyms, whereas mine has search as a (somewhat fuzzy) proper subset of optimization. Anyways, to eliminate confusion I'll use Abram Demski's term selection in future. Also added a terminology note to the post.
See my answer to tailcalled:
a program is more "search-like" if it is enumerating possible actions and evaluating their consequences
I'm curious if you mean something different by search when you say that we're likely to find policies that look like an "explicit search process + simple objective(s)"
Agreed that "search" is not a binary but more like a continuum, where we might call a program more "search-like" if it is enumerating possible actions and evaluating their consequences, and less "search-like" if it is directly mapping representations of inputs to actions. The argument in this post is that gradient descent (unlike evolution, and unlike human programmers) doesn't select much for "search-like" programs. If we take depth-first search as a central example of search, and a thermostat as the paradigmatic non-search program, gradient descent will select for something more like the thermostat.
it's totally possible to embed a few steps of gradient descent into the inference of a neural network, since gradient descent is differentiable
Agreed, and networks may even be learning something like this already! But in my ontology I wouldn't call an algorithm that performs, say, 5 steps of gradient descent over a billion-parameter space and then outputs an action very "search-like"; the "search" part is generating a tiny fraction of the optimization pressure, relative to whatever process sets up the initial state and the error signal.
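For concreteness, here's a generic unrolled-optimization sketch of what "a few steps of gradient descent inside the forward pass" could look like (my illustration, not any specific published architecture):

```python
# The inner loop refines a latent z against a learned energy before acting;
# because the inner updates are differentiable, the whole thing trains end-to-end.
import torch

def forward_with_inner_gd(x, energy_fn, decoder, inner_steps=5, inner_lr=0.1):
    z = torch.zeros(x.shape[0], 16, requires_grad=True)
    for _ in range(inner_steps):
        inner_loss = energy_fn(x, z).sum()
        grad, = torch.autograd.grad(inner_loss, z, create_graph=True)
        z = z - inner_lr * grad          # differentiable inner update
    return decoder(z)

# Toy components: a quadratic energy and a linear readout.
W = torch.randn(16, 8)
energy_fn = lambda x, z: ((z @ W - x) ** 2).sum(dim=-1)
decoder = torch.nn.Linear(16, 4)

out = forward_with_inner_gd(torch.randn(32, 8), energy_fn, decoder)
print(out.shape)  # torch.Size([32, 4])
```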
Maybe this is just semantics, because for high levels of capability search and control are not fundamentally different (what you're pointing to with "much more efficient search" - an infinitely efficient search is just optimal control, you never even consider suboptimal actions!). But it does seem like for a fixed level of capabilities search is more brittle, somehow, and more likely to misgeneralize catastrophically.
Yeah I think you need some additional assumptions on the models and behaviors, which you're gesturing at with the "matching behaviors" and "inexact descriptions". Otherwise it's easy to find counterexamples: if the model is just a single N x N matrix of parameters, then in general there is no description of the behavior shorter than the model itself.
Yes there are non-invertible (you might say "simpler") behaviors which each occupy more parameter volume than any given invertible behavior, but random matrices are almost certainly invertible so the actual optimization pressure towards low description length is infinitesimal.
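A quick numerical check of the "almost certainly invertible" point:

```python
# Over many draws, a random N x N Gaussian matrix always has full rank.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 50, 1000
full_rank = sum(np.linalg.matrix_rank(rng.standard_normal((N, N))) == N
                for _ in range(trials))
print(full_rank, "/", trials)  # 1000 / 1000
```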
Ah I think that's the crux - I believe the overparametrized regime finds generalizing models because gradient descent finds functions that have low function norm, not low description length. I forget the paper that showed this for neural nets but here's a proof for logistic regression.
Agreed on "explicit search" being a misleading phrase, I'll replace it with just "search" when I'm referring to learned programs.
small descriptions give higher parameter space volume, and so the things we find are those with short descriptions
I don't think I understand this. GPT-3 is a thing we found, which has 175B parameters, what is the short description of it?
Thinking about this more, I think gradient descent (at least in the modern regime) probably doesn't select for inner search processes, because it's not actually biased towards low Kolmogorov complexity. More in my standalone post, and here's a John Maxwell comment making a similar point.
Agreed with John, with the caveat that I expect search processes + simple objectives to only emerge from massively multi-task training. If you're literally training an AI just on smiling, TurnTrout is right that "a spread of situationally-activated computations" is more likely since you're not getting any value from the generality of search.
The Deep Double Descent paper is a good reference for why gradient descent training in the overparametrized regime favors low complexity models, though I don't know of explicit evidence for the conjecture that "explicit search + simple objectives" is actually lower complexity (in model space) than "bundle of heuristics". Seems intuitive if model complexity is something close to Kolmogorov complexity, but would love to see an empirical investigation!
I love the framing of outer alignment as a data quality problem!
As an illustrative data point, the way Google generates "alignment data" for its search evals is by employing thousands of professional raters and training them to follow a 200-page handbook (!) that operationalizes the concept of a "good search result".
Intuitively speaking, the underlying problem is that aligned goals need to generalize robustly enough to block AGIs from the power-seeking strategies recommended by instrumental reasoning, which will become much more difficult as their instrumental reasoning skills improve.
This is the clearest justification of capabilities generalize further than alignment I've seen, bravo!
My main disagreement is with the post placing goal misgeneralization after situational awareness. Weak versions of goal misgeneralization are already happening all the time, from toy RL experiments to production AI systems suffering from "training-serving skew". We can study it today and learn a lot about the specific ways goals misgeneralize. In contrast, we probably can't study the effects of high levels of situational awareness with current systems.
The problems in the earlier phases are more likely to be solved by default as the field of ML progresses.
I think this is certainly not true if you mean the problem of "situational awareness". Even if the problem is "weakness of human supervisors", I still don't think it will be solved by default - the reinforcement learning from human preferences paper was published in 2017 but very few leading AI systems actually use RLHF, preferring to use even simpler and less scalable forms of human supervision. I think it's reasonably likely that even in worlds where scalable supervision ideas like IDA, debate, or factored cognition could have saved us, they just won't be built due to the engineering challenges involved.
The most valuable research of this type will likely require detailed reasoning about how proposed alignment techniques will scale up to AGIs, rather than primarily trying to solve early versions of these problems which appear in existing systems.
Would love to see this point defended more. I don't have a strong opinion but weakly expect the most valuable research to come from attempts to align narrowly superhuman models rather than from detailed reasoning about scaling, though we definitely need more of both. To use Steinhardt's analogy to the first controlled nuclear reaction, we understand the AI equivalent of nuclear chain reactions well enough conceptually; what we need is the equivalent of cadmium rods and measurements of criticality, and if we find those it'll probably be by deep engagement with the details of current systems and how they are trained and deployed.
Yes, definitely possible.
Saying the quiet part out loud: VC for both research and product startups runs on trust. To get funding you will most likely need someone trusted to vouch for you, and/or to have legible, hard-to-fake accomplishments in a related field that obviate the need for trust. (Writing up a high-quality AI alignment research agenda could be such an accomplishment!) If you DM me with more details about your situation, I might be able to help route you.
I don't think any factored cognition proponents would disagree with
Composing interpretable pieces does not necessarily yield an interpretable system.
They just believe that we could, contingently, choose to compose interpretable pieces into an interpretable system. Just like we do all the time with
- massive factories with billions of components, e.g. semiconductor fabs
- large software projects with tens of millions of lines of code, e.g. the Linux kernel
- military operations involving millions of soldiers and support personnel
Figuring out how to turn interpretability/tool-ness/alignment/corrigibility of the parts into interpretability/tool-ness/alignment/corrigibility of the whole is the central problem, and it’s a hard (and interesting) open research problem.
Agreed this is the central problem, though I would describe it more as engineering than research - the fact that we have examples of massively complicated yet interpretable systems means we collectively "know" how to solve it, and it's mostly a matter of assembling a large enough and coordinated-enough engineering project. (The real problem with factored cognition for AI safety is not that it won't work, but that equally-powerful uninterpretable systems might be much easier to build).
Agreed on all points! One clarification is that large founder-led companies, including Facebook, are all moral mazes internally (i.e. from the perspective of the typical employee); but their founders often have so much legitimacy that their external actions are only weakly influenced by moral maze dynamics.
I guess that means that if AGI deployment is very incremental - a sequence of small changes to many different AI systems, that only in retrospect add up to AGI - moral maze dynamics will still be paramount, even in founder-led companies.
basically every company eventually becomes a moral maze
Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes.
Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was overwhelmingly likely to deploy one of the first AGIs, I expect Mark Zuckerberg's beliefs about AGI safety would be more important to understand than the general dynamics of moral mazes. (Facebook example deliberately chosen to avoid taking stances on the more likely AGI players, but I think it's relatively clear which ones are moral mazes).
In support of this, I remember Geoff Hinton saying at his Turing award lecture that he strongly advised new grad students not to read the literature before trying, for months, to solve the problem themselves.
Two interesting consequences of the "unique combination of facts" model of invention:
- You may want to engage in strategic ignorance: avoid learning about certain popular subfields or papers, in the hopes that this will let you generate a unique idea that is blocked for people who read all the latest papers and believe whatever the modern equivalent of "vanishing gradients means you can't train deep networks end-to-end" turns out to be.
- You may want to invest in uncorrelated knowledge: what is a body of knowledge that you're curious about that nobody else in your field seems to know? Fields that seem especially promising to cross-pollinate with alignment are human-computer interaction, economic history, industrial organization, contract law, psychotherapy, anthropology. Perhaps even these are too obvious!
It's very hard to bring the various members of the AI world together around one table, because some people who work on longterm/AGI-style policy tend to ignore, minimize, or just not consider the immediate problems of AI deployment/harms.
This is pointing at an ongoing bravery debate: I'm sure the feeling is real; but also, "AGI-style" people see their concerns being ignored & minimized by the "immediate problems" people, and so feel like they need to get even more strident.
This dynamic is really bad. I'm not sure what the systemic solution is, but as a starting point I would encourage people reading this to vocally support both immediate-problems work and long-term-risks work, rather than engaging in bravery-debate style reasoning like "I'll only ever talk about long term risks because they're underrated in The Discourse". Obviously, do this only to the extent that you actually believe it! But most longtermists believe that at least some kinds of immediate-problems work are valuable (at least relative to the realistic alternative which, remember, is capabilities work!), and should be more willing to say so.
Ajeya's post on aligning narrow models and the Pragmatic AI Safety Sequence come to my mind as particularly promising starting points for building bridges between the two worlds.
Agreed, the competitiveness penalty from enforcing internal legibility is the main concern with externalized reasoning / factored cognition. The secular trend in AI systems is towards end-to-end training and human-uninterpretable intermediate representations; while you can always do slightly better at the frontier by adding some human-understandable components like chain of thought (previously beam search & probabilistic graphical models), in the long run a bigger end-to-end model will win out.
One hope that "externalized reasoning" can buck this trend rests on the fact that success in "particularly legible domains, such as math proofs and programming" is actually enough for transformative AI - thanks to the internet and especially the rise of remote work, so much of the economy is legible. Sure, your nuclear-fusion-controller AI will have a huge competitiveness penalty if you force it to explain what it's doing in natural language, but physical control isn't where we've seen AI successes anyway.
Side note:
standard training procedures only incentivize the model to use reasoning steps produced by a single human.
I don't think this is right! The model will have seen enough examples of dialogue and conversation transcripts; it can definitely generate outputs that involve multiple domains of knowledge from prompts like
An economist and a historian are debating the causes of WW2.
Economist:
Thought-provoking post, thanks.
One important implication is that pure AI companies such as OpenAI, Anthropic, Conjecture, Cohere are likely to fall behind companies with access to large amounts of non-public-internet text data like Facebook, Google, Apple, perhaps Slack. Email and messaging are especially massive sources of "dark" data, provided they can be used legally and safely (e.g. without exposing private user information). Taking just email, something like 500 billion emails are sent daily, which is more text than any LLM has ever been trained on (admittedly with a ton of duplication and low quality content).
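Back-of-envelope, with an assumed average email length (my rough guess, not a sourced figure):

```python
# Rough check of the "more text than any LLM has been trained on" claim.
emails_per_day = 500e9
tokens_per_email = 50            # assumed average; most emails are short
tokens_per_day = emails_per_day * tokens_per_email
print(f"{tokens_per_day:.1e} tokens/day")   # 2.5e+13
# For comparison, GPT-3 was trained on roughly 3e11 tokens, so even one day of
# email traffic dwarfs it, before deduplication and quality filtering.
```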
Another implication is that federated learning, data democratization efforts, and privacy regulations like GDPR are much more likely to be critical levers on the future of AI than previously thought.
I certainly wouldn't bet the light cone on that assumption! I do think it would be very surprising if a single gradient step led to a large increase in capabilities, even with models that do a lot of learning between gradient steps. Would love to see empirical evidence on this.