Searching for the Root of the Tree of Evil 2024-06-08T17:05:53.950Z
Cooperators are more powerful than agents 2022-10-21T20:02:35.224Z
AI Safety Endgame Stories 2022-09-28T16:58:03.075Z
Gradient descent doesn't select for inner search 2022-08-13T04:15:19.373Z
Safety Implications of LeCun's path to machine intelligence 2022-07-15T21:47:44.411Z
Updated Deference is not a strong argument against the utility uncertainty approach to alignment 2022-06-24T19:32:17.547Z


Comment by Ivan Vendrov (ivan-vendrov) on Searching for the Root of the Tree of Evil · 2024-06-10T13:25:47.339Z · LW · GW
  1. Agree that trust and cooperation are dual-use, and I'm not sure how to think about this yet; perhaps the most important form of coordination is the one that prevents (directly or via substitution) harmful forms of coordination from arising.
  2. One reason I wouldn't call lack of altruism the root is that it's not clear how to intervene on it; it's like calling the laws of physics the root of all evil. I prefer to think about "how to reduce transaction costs to self-interested collaboration". I'm also less sure that a society of people with more altruistic motives will necessarily do better... the nice thing about self-interest is that your degree of care is proportional to your degree of knowledge about the situation. A society of extremely altruistic people who are constantly devoting resources to solving what they believe to be other people's problems may actually be less effective at ensuring flourishing.
Comment by Ivan Vendrov (ivan-vendrov) on Searching for the Root of the Tree of Evil · 2024-06-10T13:18:49.953Z · LW · GW

You're right the conclusion is quite underspecified - how exactly do we build such a cooperation machine?

I don't know yet, but my bet is more on engineering, product design, and infrastructure than on social science. More like building a better Reddit or Uber (or supporting infrastructure layers like WWW and the Internet) than like writing papers.

Comment by Ivan Vendrov (ivan-vendrov) on Searching for the Root of the Tree of Evil · 2024-06-10T13:13:45.897Z · LW · GW

Would love to see this idea worked out a little more!

Comment by Ivan Vendrov (ivan-vendrov) on Guardian AI (Misaligned systems are all around us.) · 2022-11-25T17:40:55.979Z · LW · GW

I like the "guardian" framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors, and ideally protects, against such superhuman persuaders.

Comment by Ivan Vendrov (ivan-vendrov) on Cooperators are more powerful than agents · 2022-10-22T15:51:12.503Z · LW · GW

Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.

When I say the world selects for "cooperation" I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).

Agreed with the complicity point - as evo-sim experiments like Axelrod's showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of "hammers" fails to satisfy.

Comment by Ivan Vendrov (ivan-vendrov) on Any further work on AI Safety Success Stories? · 2022-10-04T02:03:04.753Z · LW · GW

Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors are controlling 90% of AI-relevant compute then it seems plausible that they could defend against 10% of the compute being controlled by misaligned AGI or other bad actors - by denying them resources, by hardening core infrastructure, via MAD, etc.

Comment by Ivan Vendrov (ivan-vendrov) on Any further work on AI Safety Success Stories? · 2022-10-03T21:35:14.251Z · LW · GW

I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:

  1. Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if not millions of people, distributed over decades or even centuries. 
  2. Pivotal acts have stronger claims of impact, but generally have weaker claims of the sign of that impact - actually realistic pivotal-seeming acts like "unilaterally deploy a friendly-seeming AI singleton" or "institute a stable global totalitarianism" are extremely, existentially dangerous. If someone identifies a pivotal-seeming act that is actually robustly positive, I'll be the first to sign on.
  3. In contrast, gradual steering proposals like "improve AI lab communication" or "improve interpretability" have weaker claims to impact, but stronger claims to being net positive across many possible worlds, and are much less subject to multi-agent problems like races and the unilateralist's curse.
  4. True, complete existential safety probably requires some measure of "solving politics" and locking in current human values, hence may not be desirable. Like what if the Long Reflection decides that the negative utilitarians are right and the world should in fact be destroyed? I won't put high credence on that, but there is some level of accidental existential risk that we should be willing to accept in order to not lock in our values.
Comment by Ivan Vendrov (ivan-vendrov) on Any further work on AI Safety Success Stories? · 2022-10-03T16:27:51.202Z · LW · GW

You might find AI Safety Endgame Stories helpful - I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.

Nate's "how various plans miss the hard bits of the alignment challenge" might also be helpful as it communicates the "dynamics of doom" that success stories have to fight against.

One thing I would love is to have a categorization of safety stories by claims about the world. E.g what does successful intervention look like in worlds where one or more of the following claims hold:

  • No serious global treaties on AI ever get signed.
  • Deceptive alignment turns out not to be a problem.
  • Mechanistic interpretability becomes impractical for large enough models.
  • CAIS turns out to be right, and AI agents simply aren't economically competitive.
  • Multi-agent training becomes the dominant paradigm for AI.
  • Due to a hardware / software / talent bottleneck there turns out to be one clear AI capabilities leader with nobody else even close.

These all seem like plausible worlds to me, and it would be great if we had more clarity about what worlds different interventions are optimizing for. Ideally we should have bets across all the plausible worlds in which intervention is tractable, and I think that's currently far from being true.

Comment by Ivan Vendrov (ivan-vendrov) on AI Safety Endgame Stories · 2022-09-29T17:06:28.217Z · LW · GW

I don't mean to suggest "just supporting the companies" is a good strategy, but there are promising non-power-seeking strategies like "improve collaboration between the leading AI labs" that I think are worth biasing towards.

Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn't delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.

Comment by Ivan Vendrov (ivan-vendrov) on AI Safety Endgame Stories · 2022-09-29T04:04:33.656Z · LW · GW

Interesting, I haven't seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive.  An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.

Comment by Ivan Vendrov (ivan-vendrov) on AI Safety Endgame Stories · 2022-09-29T03:12:37.938Z · LW · GW

Fabricated options are products of incoherent thinking; what is the incoherence you're pointing out with policies that aim to delay existential catastrophe or reduce transaction costs between existing power centers?

Comment by Ivan Vendrov (ivan-vendrov) on Why we're not founding a human-data-for-alignment org · 2022-09-28T02:06:27.586Z · LW · GW

I've considered starting an org that was either aimed at generating better alignment data or would do so as a side effect and this is really helpful - this kind of negative information is nearly impossible to find.

Is there a market niche for providing more interactive forms of human feedback, where it's important to have humans tightly in the loop with an ML process, rather than "send a batch to raters and get labels back in a few hours"? One reason RLHF is so little used is the difficulty of setting up this kind of human-in-the-loop infrastructure. Safety approaches like debate, amplification and factored cognition could also become competitive much faster if it was easier and faster to get complex human-in-the-loop pipelines running. 

Maybe Surge already does this? But if not, you wouldn't necessarily want to compete with them on their core competency of recruiting and training human raters. Just use their raters (or Scale's), and build good reusable human-in-the-loop infrastructure, or maybe novel user interfaces that improve supervision quality.

Comment by Ivan Vendrov (ivan-vendrov) on Why Do AI researchers Rate the Probability of Doom So Low? · 2022-09-24T05:24:16.726Z · LW · GW

I think a substantial fraction of ML researchers probably agree with Yann LeCun that AI safety will be solved "by default" in the course of making the AI systems useful. The crux is probably related to questions like how competent society's response will be, and maybe the likelihood of deceptive alignment.

Two points of disagreement though:

  • I don't think setting P(doom) = 10% indicates a lack of engagement or imagination; Toby Ord in The Precipice also gives a 10% estimate for AI-derived x-risk this century, and I assume he's engaged pretty deeply with the alignment literature.
  • I don't think P(doom) = 10% or even 5% should be your threshold for "taking responsibility". I'm not sure I like the responsibility frame in general, but even a 1% chance of existential risk is big enough to outweigh almost any other moral duty in my mind.
Comment by Ivan Vendrov (ivan-vendrov) on How likely is deceptive alignment? · 2022-08-31T16:49:14.379Z · LW · GW

Thank you for putting numbers on it!

~60%: there will be an existential catastrophe due to deceptive alignment specifically.

Is this an unconditional prediction of a 60% chance of existential catastrophe due to deceptive alignment alone, in contrast to the commonly cited ~10% chance of existential catastrophe from all AI sources this century? Or do you mean that, conditional on there being an existential catastrophe due to AI, there is a 60% chance it will be caused by deceptive alignment and a 40% chance by other problems like misuse or outer alignment?

Comment by Ivan Vendrov (ivan-vendrov) on AGI Timelines Are Mostly Not Strategically Relevant To Alignment · 2022-08-24T00:40:42.491Z · LW · GW

Agreed with the sentiment, though I would make a weaker claim, that AGI timelines are not uniquely strategically relevant, and the marginal hour of forecasting work at this point is better used on other questions.

My guess is that the timelines question has been investigated and discussed so heavily because for many people it is a crux for whether or not to work on AI safety at all - and there are many more such people than there are alignment researchers deciding what approach to prioritize. Most people in the world are not convinced that AGI safety is a pressing problem, and building very robust and legible models showing that AGI could happen soon is, empirically, a good way to convince them.

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-19T19:42:00.322Z · LW · GW

Mostly orthogonal:

  • Evan's post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
  • This post argues against (a version of) Evan's premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn't expect gradient descent to select for it.

Other relevant differences are

  1. gradient descent doesn't actually select for low time complexity / minimal circuits; it holds time & space complexity fixed, while selecting for low L2 norm. But I think you could probably do a similar reduction for L2 norm as Evan does for minimal circuits. The crux is in the premise.
  2. I think Evan is using a broader definition of search than I am in this post, closer to John Wentworth's definition of search as "general problem solving algorithm".
  3. Evan is doing worst-case analysis (can we completely rule out the possibility of deception by penalizing time complexity?) whereas I'm focusing on the average or default case.
Comment by Ivan Vendrov (ivan-vendrov) on Concrete Advice for Forming Inside Views on AI Safety · 2022-08-18T19:29:13.987Z · LW · GW

Agreed with Rohin that a key consideration is whether you are trying to form truer beliefs, or to contribute novel ideas, and this in turn depends on what role you are playing in the collective enterprise that is AI safety.

If you're the person in charge of humanity's AI safety strategy, or a journalist tasked with informing the public, or a policy person talking to governments, it makes a ton of sense to build a "good gears-level model of what their top 5 alignment researchers believe and why". If you're a researcher, tasked with generating novel ideas that the rest of the community will filter and evaluate, this is probably not where you want to start!

In particular I basically buy the "unique combination of facts" model of invention: you generate novel ideas when you have a unique subset of the collective's knowledge, so the ideas seems obvious to you and weird or wrong to everyone else.

Two examples from leading scientists (admittedly in more paradigmatic fields):

  1. I remember Geoff Hinton saying at his Turing award lecture that he strongly advised new grad students not to read the literature before trying, for months, to solve the problem themselves.
  2. Richard Hamming in You and Your Research:

If you read all the time what other people have done you will think the way they thought. If you want to think new thoughts that are different, then do what a lot of creative people do - get the problem reasonably clear and then refuse to look at any answers until you've thought the problem through carefully how you would do it, how you could slightly change the problem to be the correct one. So yes, you need to keep up. You need to keep up more to find out what the problems are than to read to find the solutions.

Comment by Ivan Vendrov (ivan-vendrov) on What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? · 2022-08-17T16:11:49.130Z · LW · GW

Would love to see your math! If L2 norm and Kolmogorov provide roughly equivalent selection pressure that's definitely a crux for me.

Comment by Ivan Vendrov (ivan-vendrov) on What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? · 2022-08-17T02:46:40.591Z · LW · GW

Agreed that the existence of general-purpose heuristic-generators like relaxation is a strong argument for why we should expect to select for inner optimizers that look something like A*, contrary to my gradient descent doesn't select for inner search post.

Recursive structure creates an even stronger bias toward things like A* but only in recurrent neural architectures (so notably not currently-popular transformer architectures, though it's plausible that recurrent architectures will come back).

I maintain that the compression / compactness argument from "Risks from Learned Optimization" is wrong, at least in the current ML regime:

In general, evolved/trained/selected systems favor more compact policies/models/heuristics/algorithms/etc. In ML, for instance, the fewer parameters needed to implement the policy, the more parameters are free to vary, and therefore the more parameter-space-volume the policy takes up and the more likely it is to be found. (This is also the main argument for why overparameterized ML systems are able to generalize at all.)

I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity. See Neel's recent interpretability post for an example of weight decay slowly selecting a generalizable algorithm over (non-generalizable) memorization over the course of training.
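As a toy illustration of the low-norm (rather than low-description-length) selection story: in the overparametrized regime, plain gradient descent initialized at zero converges to the minimum-L2-norm solution that interpolates the data. This is a minimal numpy sketch with made-up dimensions, not a claim about any particular production system:

```python
import numpy as np

# Overparametrized least squares: 10 data points, 50 parameters.
# Gradient descent from zero stays in the row space of X, so it
# converges to the minimum-L2-norm interpolating solution.
rng = np.random.default_rng(0)
n, d = 10, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)
for _ in range(20000):                 # plain gradient descent on squared loss
    w -= 0.01 * X.T @ (X @ w - y)

# Analytic minimum-norm solution: X^+ y = X^T (X X^T)^{-1} y
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)
print(np.allclose(w, w_min_norm, atol=1e-4))
```

Note what this does and doesn't show: the solution found is the lowest-norm one among the infinitely many zero-loss solutions, which is a different selection criterion from "shortest description".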

I don't understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn't we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters? In practice we see maybe 2x distillation before dramatic performance losses, meaning most of those parameters really are essential to the learned policy.

Overall though this post updated me substantially towards expecting the emergence of inner A*-like algorithms, despite their computational overhead. Added it to the list of caveats in my post.

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-16T15:56:04.881Z · LW · GW

Yeah it's probably definitions. With the caveat that I don't mean the narrow "literally iterates over solutions", but roughly "behaves (especially off the training distribution) as if it's iterating over solutions", like Abram Demski's term selection.

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-15T23:50:31.538Z · LW · GW

I disagree that performing search is central to human capabilities relative to other species. The cultural intelligence hypothesis seems much more plausible: humans are successful because our language and ability to mimic allow us to accumulate knowledge and coordinate at massive scale across both space and time. Not because individual humans are particularly good at thinking or optimizing or performing search. (Not sure what the implications of this are for AI).

You're right though, I didn't say much about alternative algorithms other than point vaguely in the direction of hierarchical control. I mostly want to warn people not to reason about inner optimizers the way they would about search algorithms. But if it helps, I think AlphaStar is a good example of an algorithm that is superhuman in a very complex strategic domain but is very likely not doing anything like "evaluating many possibilities before settling on an action". In contrast to AlphaZero (with rollouts), which considers tens of thousands of positions before selecting an action. AlphaZero (just the policy network) I'm more confused about... I expect it still isn't doing search, but it is literally trained to imitate the outcome of a search so it might have similar mis-generalization properties?

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-15T03:48:14.099Z · LW · GW

I agree that A* and gradient descent are central examples of search; for realistic problems these algorithms typically evaluate the objective on millions of candidates before returning an answer.

In contrast, human problem solvers typically do very little state evaluation - perhaps evaluating a few dozen possibilities directly, and relying (as you said) on abstractions and analogies instead. I would call this type of reasoning "not very search-like".

On the far end we have algorithms like Gauss-Jordan elimination, which just compute the optimal solution directly without evaluating any possibilities. Calling them "search algorithms" seems quite strange.
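To make the continuum concrete, here's a toy sketch (all names invented for the example) of two programs that optimize the same objective ||Ax - b||, one by evaluating thousands of candidate solutions and one by computing the answer directly:

```python
import numpy as np

# A well-conditioned toy linear system so gradient descent converges quickly.
rng = np.random.default_rng(1)
A = 2 * np.eye(3) + 0.3 * rng.normal(size=(3, 3))
b = rng.normal(size=3)

def search_like(A, b, steps=2000, lr=0.1):
    """Gradient descent: implicitly evaluates thousands of candidates."""
    x, evaluations = np.zeros(3), 0
    for _ in range(steps):
        x -= lr * A.T @ (A @ x - b)
        evaluations += 1
    return x, evaluations

def direct(A, b):
    """Gaussian elimination: computes the answer, evaluates no candidates."""
    return np.linalg.solve(A, b), 0

x1, n1 = search_like(A, b)
x2, n2 = direct(A, b)
print(n1, n2)                        # thousands of evaluations vs zero
print(np.allclose(x1, x2, atol=1e-4))
```

Both programs land on the same optimum; the difference that matters for the "search-like" label is the mechanism, not the output.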

a general search process is something which takes in a specification of some problem or objective (from a broad class of possible problems/objectives), and returns a plan which solves the problem or scores well on the objective.

This appears to be a description of any optimization process, not just search - in particular it would include Gauss-Jordan elimination. I guess your ontology has "optimization" and "search" as synonyms, whereas mine has search as a (somewhat fuzzy) proper subset of optimization. Anyways, to eliminate confusion I'll use Abram Demski's term selection in future. Also added a terminology note to the post.

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-15T00:54:17.706Z · LW · GW

See my answer to tailcalled:

a program is more "search-like" if it is enumerating possible actions and evaluating their consequences

I'm curious if you mean something different by search when you say that we're likely to find policies that look like an "explicit search process + simple objective(s)"

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-13T20:10:01.434Z · LW · GW

Agreed that "search" is not a binary but more like a continuum, where we might call a program more "search-like" if it is enumerating possible actions and evaluating their consequences, and less "search-like" if it is directly mapping representations of inputs to actions. The argument in this post is that gradient descent (unlike evolution, and unlike human programmers) doesn't select much for "search-like" programs. If we take depth-first search as a central example of search, and a thermostat as the paradigmatic non-search program, gradient descent will select for something more like the thermostat.

it's totally possible to embed a few steps of gradient descent into the inference of a neural network, since gradient descent is differentiable

Agreed, and networks may even be learning something like this already! But in my ontology I wouldn't call an algorithm that performs, say, 5 steps of gradient descent over a billion-parameter space and then outputs an action very "search-like"; the "search" part is generating a tiny fraction of the optimization pressure, relative to whatever process sets up the initial state and the error signal.

Maybe this is just semantics, because for high levels of capability search and control are not fundamentally different (what you're pointing to with "much more efficient search" - an infinitely efficient search is just optimal control, you never even consider suboptimal actions!). But it does seem like for a fixed level of capabilities search is more brittle, somehow, and more likely to misgeneralize catastrophically.

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-13T07:01:32.567Z · LW · GW

Yeah I think you need some additional assumptions on the models and behaviors, which you're gesturing at with the "matching behaviors" and "inexact descriptions". Otherwise it's easy to find counterexamples: imagine the model is just a single N x N matrix of parameters; then in general there is no description of the behavior shorter than the model itself.

Yes there are non-invertible (you might say "simpler") behaviors which each occupy more parameter volume than any given invertible behavior, but random matrices are almost certainly invertible, so the actual optimization pressure towards low description length is infinitesimal.
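A quick empirical check of the invertibility claim (matrix size and sample count are arbitrary):

```python
import numpy as np

# Sample 1000 random 8x8 Gaussian matrices and count how many are
# full rank (i.e. invertible). Rank deficiency has measure zero for
# continuous distributions, so we expect essentially all of them.
rng = np.random.default_rng(0)
n_invertible = sum(
    np.linalg.matrix_rank(rng.normal(size=(8, 8))) == 8
    for _ in range(1000)
)
print(n_invertible)   # 1000: every sampled matrix was full rank
```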

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-13T05:14:22.418Z · LW · GW

Ah I think that's the crux - I believe the overparametrized regime finds generalizing models because gradient descent finds functions that have low function norm, not low description length. I forget the paper that showed this for neural nets but here's a proof for logistic regression.

Comment by Ivan Vendrov (ivan-vendrov) on Gradient descent doesn't select for inner search · 2022-08-13T04:51:49.352Z · LW · GW

Agreed on "explicit search" being a misleading phrase, I'll replace it with just "search" when I'm referring to learned programs.

small descriptions give higher parameter space volume, and so the things we find are those with short descriptions

I don't think I understand this. GPT-3 is a thing we found, which has 175B parameters, what is the short description of it?

Comment by Ivan Vendrov (ivan-vendrov) on Seriously, what goes wrong with "reward the agent when it makes you smile"? · 2022-08-13T04:23:08.754Z · LW · GW

Thinking about this more, I think gradient descent (at least in the modern regime) probably doesn't select for inner search processes, because it's not actually biased towards low Kolmogorov complexity. More in my standalone post, and here's a John Maxwell comment making a similar point.

Comment by Ivan Vendrov (ivan-vendrov) on Seriously, what goes wrong with "reward the agent when it makes you smile"? · 2022-08-12T03:02:30.256Z · LW · GW

Agreed with John, with the caveat that I expect search processes + simple objectives to only emerge from massively multi-task training. If you're literally training an AI just on smiling, TurnTrout is right that "a spread of situationally-activated computations" is more likely since you're not getting any value from the generality of search.

The Deep Double Descent paper is a good reference for why gradient descent training in the overparametrized regime favors low complexity models, though I don't know of explicit evidence for the conjecture that "explicit search + simple objectives" is actually lower complexity (in model space) than "bundle of heuristics". Seems intuitive if model complexity is something close to Kolmogorov complexity, but would love to see an empirical investigation!

Comment by Ivan Vendrov (ivan-vendrov) on How much alignment data will we need in the long run? · 2022-08-11T18:39:26.868Z · LW · GW

I love the framing of outer alignment as a data quality problem!

As an illustrative data point, the way Google generates "alignment data" for its search evals is by employing thousands of professional raters and training them to follow a 200-page handbook (!) that operationalizes the concept of a "good search result".

Comment by Ivan Vendrov (ivan-vendrov) on The alignment problem from a deep learning perspective · 2022-08-11T18:16:45.391Z · LW · GW

Intuitively speaking, the underlying problem is that aligned goals need to generalize robustly enough to block AGIs from the power-seeking strategies recommended by instrumental reasoning, which will become much more difficult as their instrumental reasoning skills improve.

This is the clearest justification of "capabilities generalize further than alignment" I've seen, bravo!

My main disagreement is with the post's claim that goal misgeneralization comes after situational awareness. Weak versions of goal misgeneralization are already happening all the time, from toy RL experiments to production AI systems suffering from "training-serving skew". We can study it today and learn a lot about the specific ways goals misgeneralize. In contrast, we probably can't study the effects of high levels of situational awareness with current systems.

The problems in the earlier phases are more likely to be solved by default as the field of ML progresses. 

I think this is certainly not true if you mean the problem of "situational awareness". Even if the problem is "weakness of human supervisors", I still don't think it will be solved by default - the reinforcement learning from human preferences paper was published in 2017 but very few leading AI systems actually use RLHF, preferring to use even simpler and less scalable forms of human supervision. I think it's reasonably likely that even in worlds where scalable supervision ideas like IDA, debate, or factored cognition could have saved us, they just won't be built due to the engineering challenges involved.

The most valuable research of this type will likely require detailed reasoning about how proposed alignment techniques will scale up to AGIs, rather than primarily trying to solve early versions of these problems which appear in existing systems.

Would love to see this point defended more. I don't have a strong opinion but weakly expect the most valuable research to come from attempts to align narrowly superhuman models rather than detailed reasoning about scaling, though we definitely need more of both. To use Steinhardt's analogy to the first controlled nuclear reaction, we understand the AI equivalent to nuclear chain reactions well enough conceptually; what we need is the equivalent of cadmium rods and measurements of criticality, and if we find those it'll probably be by deep engagement with the details of current systems and how they are trained and deployed.

Comment by Ivan Vendrov (ivan-vendrov) on Is it possible to find venture capital for AI research org with strong safety focus? · 2022-08-10T17:47:26.313Z · LW · GW

Yes, definitely possible.

Saying the quiet part out loud: VC for both research and product startups runs on trust. To get funding you will most likely need someone trusted to vouch for you, and/or to have legible, hard-to-fake accomplishments in a related field that obviate the need for trust. (Writing up a high quality AI alignment research agenda could be such an accomplishment!) If you DM me with more details about your situation, I might be able to help route you.

Comment by Ivan Vendrov (ivan-vendrov) on Interpretability/Tool-ness/Alignment/Corrigibility are not Composable · 2022-08-09T01:26:36.703Z · LW · GW

I don't think any factored cognition proponents would disagree with

Composing interpretable pieces does not necessarily yield an interpretable system.

They just believe that we could, contingently, choose to compose interpretable pieces into an interpretable system. Just like we do all the time with

  • massive factories with billions of components, e.g. semiconductor fabs
  • large software projects with tens of millions of lines of code, e.g. the Linux kernel
  • military operations involving millions of soldiers and support personnel

Figuring out how to turn interpretability/tool-ness/alignment/corrigibility of the parts into interpretability/tool-ness/alignment/corrigibility of the whole is the central problem, and it’s a hard (and interesting) open research problem.

Agreed this is the central problem, though I would describe it more as engineering than research - the fact that we have examples of massively complicated yet interpretable systems means we collectively "know" how to solve it, and it's mostly a matter of assembling a large enough and coordinated-enough engineering project. (The real problem with factored cognition for AI safety is not that it won't work, but that equally-powerful uninterpretable systems might be much easier to build).

Comment by Ivan Vendrov (ivan-vendrov) on Alex Ray's Shortform · 2022-08-08T20:57:44.957Z · LW · GW

Agreed on all points! One clarification is that large founder-led companies, including Facebook, are all moral mazes internally (i.e. from the perspective of the typical employee); but their founders often have so much legitimacy that their external actions are only weakly influenced by moral maze dynamics.

I guess that means that if AGI deployment is very incremental - a sequence of small changes to many different AI systems, that only in retrospect add up to AGI - moral maze dynamics will still be paramount, even in founder-led companies.

Comment by Ivan Vendrov (ivan-vendrov) on Alex Ray's Shortform · 2022-08-08T05:25:19.076Z · LW · GW

basically every company eventually becomes a moral maze

Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes.

Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was overwhelmingly likely to deploy one of the first AGIs, I expect Mark Zuckerberg's beliefs about AGI safety would be more important to understand than the general dynamics of moral mazes. (Facebook example deliberately chosen to avoid taking stances on the more likely AGI players, but I think it's relatively clear which ones are moral mazes).

Comment by Ivan Vendrov (ivan-vendrov) on Paper reading as a Cargo Cult · 2022-08-08T00:12:09.187Z · LW · GW

In support of this, I remember Geoff Hinton saying at his Turing award lecture that he strongly advised new grad students not to read the literature before trying, for months, to solve the problem themselves.

Two interesting consequences of the "unique combination of facts" model of invention:

  • You may want to engage in strategic ignorance: avoid learning about certain popular subfields or papers, in the hopes that this will let you generate a unique idea that is blocked for people who read all the latest papers and believe whatever the modern equivalent of "vanishing gradients means you can't train deep networks end-to-end" turns out to be.
  • You may want to invest in uncorrelated knowledge: what is a body of knowledge that you're curious about that nobody else in your field seems to know? Fields that seem especially promising to cross-pollinate with alignment are human-computer interaction, economic history, industrial organization, contract law, psychotherapy, anthropology. Perhaps even these are too obvious!
Comment by Ivan Vendrov (ivan-vendrov) on Jack Clark on the realities of AI policy · 2022-08-07T23:55:22.428Z · LW · GW

It's very hard to bring the various members of the AI world together around one table, because some people who work on longterm/AGI-style policy tend to ignore, minimize, or just not consider the immediate problems of AI deployment/harms.

This is pointing at an ongoing bravery debate: I'm sure the feeling is real; but also, "AGI-style" people see their concerns being ignored & minimized by the "immediate problems" people, and so feel like they need to get even more strident. 

This dynamic is really bad. I'm not sure what the systemic solution is, but as a starting point I would encourage people reading this to vocally support both immediate-problems work and long-term-risks work, rather than engaging in bravery-debate-style reasoning like "I'll only ever talk about long term risks because they're underrated in The Discourse". Obviously, do this only to the extent that you actually believe it! But most longtermists believe that at least some kinds of immediate-problems work are valuable (at least relative to the realistic alternative which, remember, is capabilities work!), and should be more willing to say so.

Ajeya's post on aligning narrow models and the Pragmatic AI Safety Sequence come to my mind as particularly promising starting points for building bridges between the two worlds.

Comment by Ivan Vendrov (ivan-vendrov) on Externalized reasoning oversight: a research direction for language model alignment · 2022-08-05T16:57:17.821Z · LW · GW

Agreed, the competitiveness penalty from enforcing internal legibility is the main concern with externalized reasoning / factored cognition. The secular trend in AI systems is towards end-to-end training and human-uninterpretable intermediate representations; while you can always do slightly better at the frontier by adding some human-understandable components like chain of thought (previously beam search & probabilistic graphical models), in the long run a bigger end-to-end model will win out.

One hope that "externalized reasoning" can buck this trend rests on the fact that success in "particularly legible domains, such as math proofs and programming" is actually enough for transformative AI - thanks to the internet and especially the rise of remote work, so much of the economy is legible. Sure, your nuclear-fusion-controller AI will have a huge competitiveness penalty if you force it to explain what it's doing in natural language, but physical control isn't where we've seen AI successes anyway.

Side note:

standard training procedures only incentivize the model to use reasoning steps produced by a single human.

I don't think this is right! The model will have seen enough examples of dialogue and conversation transcripts; it can definitely generate outputs that involve multiple domains of knowledge from prompts like 

An economist and a historian are debating the causes of WW2.


Comment by Ivan Vendrov (ivan-vendrov) on chinchilla's wild implications · 2022-07-31T04:31:48.058Z · LW · GW

Thought-provoking post, thanks.

One important implication is that pure AI companies such as OpenAI, Anthropic, Conjecture, Cohere are likely to fall behind companies with access to large amounts of non-public-internet text data like Facebook, Google, Apple, perhaps Slack. Email and messaging are especially massive sources of "dark" data, provided they can be used legally and safely (e.g. without exposing private user information). Taking just email, something like 500 billion emails are sent daily, which is more text than any LLM has ever been trained on (admittedly with a ton of duplication and low quality content).
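The comparison above can be sanity-checked with back-of-envelope arithmetic. All numbers here are assumptions: the email count from the comment, a guessed average email length, and a common tokens-per-word rule of thumb; Chinchilla's ~1.4T training tokens is used as the reference corpus.

```python
# Rough estimate: does one day of global email exceed an LLM training corpus?
# Every constant below is an assumption, not a measurement.

EMAILS_PER_DAY = 500e9      # figure cited in the comment above
WORDS_PER_EMAIL = 50        # guessed average length of an email body
TOKENS_PER_WORD = 1.3       # common rule of thumb for subword tokenizers

daily_email_tokens = EMAILS_PER_DAY * WORDS_PER_EMAIL * TOKENS_PER_WORD
chinchilla_tokens = 1.4e12  # Chinchilla was trained on roughly 1.4T tokens

print(f"daily email tokens: {daily_email_tokens:.2e}")   # 3.25e+13
print(f"vs Chinchilla: {daily_email_tokens / chinchilla_tokens:.0f}x")  # 23x
```

Even heavy discounting for duplication and low-quality content leaves a large multiple under these assumptions.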

Another implication is that federated learning, data democratization efforts, and privacy regulations like GDPR are much more likely to be critical levers on the future of AI than previously thought.

Comment by Ivan Vendrov (ivan-vendrov) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-28T18:34:41.213Z · LW · GW

I certainly wouldn't bet the light cone on that assumption! I do think it would be very surprising if a single gradient step led to a large increase in capabilities, even with models that do a lot of learning between gradient steps. Would love to see empirical evidence on this.

Comment by Ivan Vendrov (ivan-vendrov) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-27T22:03:42.025Z · LW · GW

I found the argument compelling, but if I put on my techno-optimist ML researcher hat I think the least believable part of the story is the deployment:

relatively shortly after deployment, Magma’s datacenter would essentially contain a populous “virtual civilization” running ahead of human civilization in its scientific and technological sophistication

It's hard to imagine that this is the way Alex would be deployed. BigTech executives are already terrified of deploying large-scale open-ended AI models with impacts on the real world, due to liability and PR risk. Faced with an AI system this powerful they would be under enormous pressure, both external and internal, to shut the system down, improve its interpretability, or deploy it incrementally while carefully monitoring for unintended consequences.

Not that this pressure will necessarily prevent takeover if Alex is sufficiently patient and deceptive, but I think it would help to flesh out why we expect humanity not to react appropriately to the terrifying trend of "understanding and control of Alex’s actions becoming looser and more tenuous after deployment". Maybe the crux is that I expect the relevant institutions to be extremely risk-averse, and happily forego "dramatically important humanitarian, economic, and military benefits" to remove the chance that they will be held accountable for the downside.

Comment by Ivan Vendrov (ivan-vendrov) on AGI ruin scenarios are likely (and disjunctive) · 2022-07-27T21:07:17.624Z · LW · GW

I expect the most critical reason has to do with takeoff speed; how long do we have between when AI is powerful enough to dramatically improve our institutional competence and when it poses an existential risk? 

If the answer is less than e.g. 3 years (hard to imagine large institutional changes happening faster than that, even with AI help), then improving humanity's competence is just not a tractable path to safety.

Comment by Ivan Vendrov (ivan-vendrov) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-27T19:41:52.126Z · LW · GW

stronger arguments that benign generalizations are especially “natural” for gradient descent, enough to make up for the fact that playing the training game would get higher reward 

Here's such an argument (probably not original). Gradient descent is a local search method over programs; it doesn't just land you at the highest-reward program, it finds a (nearly) continuous path through program space from a random program to a locally optimal program.

Let's make a further assumption, of capability continuity: any capability the model has (as measured by a benchmark or test suite) is a continuous function of the weights. This is not exactly true, but approximately true of almost every task we've found so far.

Ability to manipulate humans or play the training game is a type of capability. By assumption this ability will vary continuously. Thus, if gradient descent finds a model that plays the training game, we will see, earlier in the training process, a model that does manipulation but not very well (e.g. fooling low-quality supervisors before it fools high-quality supervisors). Call this intermediate model an "incompetent manipulator".

It seems quite possible, even with a naive safety effort, to give the "incompetent manipulator" model lower reward than a random model - e.g. by strongly penalizing evidence of manipulation via adversarial examples, honeypots, or the randomized inclusion of high-quality supervisors. (Not that it's guaranteed we will do this; it requires novel manipulation-aware safety efforts, which virtually no-one is currently doing).

But if an incompetent manipulator has lower reward than a random model then (again by capability continuity) gradient descent will not find it! Hence gradient descent will never learn to play the training game.

In contrast, it seems likely that there exist continuous paths from a random model to a straightforwardly helpful or obedient model. Of course, these paths may still be extremely hard to find because of corrigibility being "anti-natural" or some other consideration.
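The argument above can be sketched as a toy optimization problem. This is purely illustrative: a single scalar parameter stands in for model weights, capability is identified with the parameter itself, and a Gaussian reward penalty stands in for a safety effort that catches incompetent manipulation.

```python
import math

# Toy model: reward is highest at full "manipulation capability" x = 1,
# but a safety effort penalizes the intermediate "incompetent manipulator"
# region around x = 0.5. Because capability varies continuously along the
# gradient-ascent path, the penalty valley blocks the path to x = 1.

def capability(x):
    return x  # capability is a continuous function of the weights

def reward(x):
    base = -(x - 1.0) ** 2  # reward maximized at full manipulation ability
    # penalty for detectable, incompetent manipulation (honeypots etc.)
    penalty = 5.0 * math.exp(-((capability(x) - 0.5) ** 2) / 0.01)
    return base - penalty

# gradient ascent from a "random init" with near-zero capability
x, lr, eps = 0.0, 0.01, 1e-5
for _ in range(10_000):
    grad = (reward(x + eps) - reward(x - eps)) / (2 * eps)
    x += lr * grad

print(f"final capability: {capability(x):.2f}")  # stalls well below 0.5
```

Gradient ascent climbs from zero capability but gets stuck before the penalty valley (around x ≈ 0.28 with these constants), never reaching the globally higher-reward manipulator at x = 1.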

Comment by Ivan Vendrov (ivan-vendrov) on Alignment being impossible might be better than it being really difficult · 2022-07-26T00:35:17.238Z · LW · GW

Thought-provoking post, thanks. It crystallizes two related reasons we might expect capability ceilings or diminishing returns to intelligence:

  1. Unpredictability from chaos - as you say, "any simulation of reality with that much detail will be too costly".
  2. Difficulty of distributing goal-driven computation across physical space.

where human-AI alignment is a special case of (2).

How much does your core argument rely on solutions to the alignment problem being all or nothing? A more realistic model in my view is that there's a spectrum of "alignment tech", where better tech lets you safely delegate to more and more powerful systems.

Comment by Ivan Vendrov (ivan-vendrov) on A note about differential technological development · 2022-07-18T00:40:47.349Z · LW · GW

Sorry, I didn't mean to imply that these are logical assumptions necessary for us to prioritize serial work; rather, insofar as these assumptions don't hold, prioritizing work that looks serial to us is less important at the margin.

Spelling out the assumptions more:

  1. Omniscient meaning "perfect advance knowledge of what work will turn out to be serial vs parallelizable." In practice I think this is very hard to know beforehand - a lot of work that turned out to be part of the "serial bottleneck" looked parallelizable ex ante.
  2. Optimal meaning "institutions will actually allocate enough researchers to the problem in time for the parallelizable work to get done". Insofar as we don't expect this to hold, we will lose even if all the serial work gets done in time.
Comment by Ivan Vendrov (ivan-vendrov) on Open & Welcome Thread - July 2022 · 2022-07-17T19:38:16.448Z · LW · GW

My preferred adblocker, uBlock Origin, lets you right-click on any element on a page and block it, with a nice UI that lets you set the specificity and scope of the block. It takes about 10 seconds, much easier than mucking with JS yourself. I've done this to hide like and follower counts on Twitter, and just tried it on LessWrong karma; it works great. It can't do "hide karma only for your comments within the last 24 hours", but thought this might be useful for others who want to hide karma more broadly.

Comment by Ivan Vendrov (ivan-vendrov) on Safety Implications of LeCun's path to machine intelligence · 2022-07-16T17:41:01.743Z · LW · GW

My read of LeCun in that conversation is that he doesn't think in terms of outer alignment / value alignment at all, but rather in terms of implementing a series of "safeguards" that allow humans to recover if the AI behaves poorly (See Steven Byrnes' summary).

I think this paper helps clarify why he believes this - he had something like this architecture in mind, and so outer alignment seemed basically impossible. Independently, he believes it's unnecessary because the obvious safeguards will prove sufficient.

Comment by Ivan Vendrov (ivan-vendrov) on Safety Implications of LeCun's path to machine intelligence · 2022-07-16T17:24:11.976Z · LW · GW

Ah you're right, the paper never directly says the architecture is trained end-to-end - updated the post, thanks for the catch.

He might still mean something closer to end-to-end learning, because

  1. The world model is differentiable w.r.t. the cost (Figure 2), suggesting it isn't trained purely using self-supervised learning.
  2. The configurator needs to learn to modulate the world model, the cost, and the actor; it seems unlikely that this can be done well if these are all swappable black boxes. So there is likely some phase of co-adaptation between configurator, actor, cost, and world model.
Comment by Ivan Vendrov (ivan-vendrov) on A note about differential technological development · 2022-07-16T02:25:27.678Z · LW · GW

Also, on a re-read I notice that all the examples given in the post relate to mathematics or theoretical work, which is almost uniquely serial among human activities. By contrast, engineering disciplines are typically much more parallelizable, as evidenced by the speedup in technological progress during war-time.

Comment by Ivan Vendrov (ivan-vendrov) on A note about differential technological development · 2022-07-16T02:05:31.362Z · LW · GW

I like the distinction between parallelizable and serial research time, and agree that there should be a very high bar for shortening AI timelines and eating up precious serial time.

One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems. Insofar as this assumption doesn't hold (because our institutions fail, or because the knowledge about how to allocate researcher-hours itself depends on the outcomes of parallelizable research) the distinction between parallelizable and serial work breaks down and other considerations dominate.