Comments
I think you're wrong to be psychoanalysing why people aren't paying attention to your work. You're overcomplicating it. Most people just think you're wrong upon hearing a short summary, and don't trust you enough to spend time learning the details. Whether your scenario is important or not, from your perspective it'll usually look like people are bouncing off for bad reasons.
For example, I read the executive summary. For several shallow reasons,[1] the scenario seemed unlikely and unimportant. I didn't expect there to be better arguments further on. So I stopped. Other people have different world models and will bounce off for different reasons.
Which isn't to say it's wrong (that's just my current weakly held guess). My point is just that even if you're correct, the way it looks a priori to most worldviews is sufficient to explain why people are bouncing off it and not engaging properly.
Perhaps I'll encounter information in the future that indicates my bouncing off was a mistake, and I'll go back.
- ^
There are a couple of layers of maybes, so the scenario doesn't seem likely. I expect power to be more concentrated. I expect takeoff to be faster. I expect capabilities to have a high cap. I expect alignment to be hard for any goal. Something about maintaining a similar societal structure without various chaotic game-board-flips seems unlikely. The goals-instilled-in-our-replacements are pretty specific (institution-aligned), and pretty obviously misaligned from overall human flourishing. Sure humans are usually myopic, but we do sometimes consider the consequences and act against local incentives.
I don't know whether these reasons are correct, or how well you've argued against them. They're weakly held and weakly considered, so I wouldn't have usually written them down. They are just here to make my point more concrete.
The description of how sequential choice can be defined is helpful; I was previously confused about how this was supposed to work. This matches what I meant by preferences over tuples of outcomes. Thanks!
We'd incorrectly rule out the possibility that the agent goes for (B+,B).
There are two things we might want from the idea of incomplete preferences:
1. To predict the actions of agents.
2. Because complete agents behave dangerously sometimes, and we want to design better agents with different behaviour.
I think modelling an agent as having incomplete preferences is great for (1). Very useful. We make better predictions if we don't rule out the possibility that the agent goes for B after choosing B+. I think we agree here.
For (2), the relevant quote is:
As a general point, you can always look at a decision ex post and back out different ways to rationalise it. The nontrivial task is here prediction, using features of the agent.
If we can always rationalise a decision ex post as being generated by a complete agent, then let's just build that complete agent. Incompleteness isn't helping us, because the behaviour could have been generated by complete preferences.
Perhaps I'm misusing the word "representable"? But what I meant was that any single sequence of actions generated by the agent could also have been generated by an outcome-utility maximizer (that has the same world model). This seems like the relevant definition, right?
That's not right
Are you saying that my description (following) is incorrect?
[incomplete preferences w/ caprice] would be equivalent to 1. choosing the best policy by ranking them in the partial order of outcomes (randomizing over multiple maxima), then 2. implementing that policy without further consideration.
Or are you saying that it is correct, but you disagree that this implies that it is "behaviorally indistinguishable from an agent with complete preferences"? If this is the case, then I think we might disagree on the definition of "behaviorally indistinguishable"? I'm using it like: if you observe a single sequence of actions from this agent (knowing the agent's world model), can you construct a utility function over outcomes that could have produced that sequence?
Or consider another example. The agent trades A for B, then B for A, then declines to trade A for B+. That's compatible with the Caprice rule, but not with complete preferences.
This is compatible with a resolute outcome-utility maximizer (for whom A is a maximum). There's no rule that says an agent must take the shortest route to the same outcome (right?).
As Gustafsson notes, if an agent uses resolute choice to avoid the money pump for cyclic preferences, that agent has to choose against their strict preferences at some point.
...
There's no such drawback for agents with incomplete preferences using resolute choice.
Sure, but why is that a drawback? It can't be money pumped, right? Agents following resolute choice often choose against their local strict preferences in other decision problems (e.g. Newcomb's problem), and this is considered an argument in favour of resolute choice.
I think it's important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it, it's taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.
My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).
Hmm, good point. Looking at your dialogues has changed my mind; they have higher karma than the ones I was looking at.
You might also be unusual on some axis that makes arguments easier. It takes me a lot of time to go over people's words and work out what beliefs are consistent with them. And the inverse, translating model to words, also takes a while.
Dialogues are more difficult to create (if done well between people with different beliefs), and are less pleasant to read, but are often higher value for reaching true beliefs as a group.
Dialogues seem under-incentivised relative to comments, given the amount of effort involved. Maybe they would get more karma if we could vote on individual replies, so it's more like a comment chain?
This could also help with skimming a dialogue because you can skip to the best parts, to see whether it's worth reading the whole thing.
The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X,Y,Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with... etc. And infrabayes might be the theory you use to explain what some of the internal datastructures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and in normal AI and ML theory) potentially has relevance, but it's not super clear which bits are most likely to be directly useful.
This seems like the ultimate goal of interp research (and it's a good goal). Or, I think the current story for heuristic arguments is using them to "explain" a trained neural network by breaking it down into something more like an X,Y,Z components explanation.
At this point, we can analyse the overall AI algorithm, and understand what happens when it updates its beliefs radically, or understand how its goals are stored and whether they ever change. And we can try to work out whether the particular structure will change itself in bad-to-us ways if it could self-modify. This is where it looks much more theoretical, like theoretical analysis of algorithms.
(The above is the "understood" end of the axis. The "not-understood" end looks like making an AI with pure evolution, with no understanding of how it works. There are many levels of partial understanding in between).
This kind of understanding is a prerequisite for the scheme in my post. This scheme could be implemented by modifying a well-understood AI.
Also what is its relation to natural language?
Not sure what you're getting at here.
Fair enough, good points. I guess I classify these LLM agents as "something-like-an-LLM that is genuinely creative", at least to some extent.
Although I don't think the first example is great, seems more like a capability/observation-bandwidth issue.
I'm not sure how this is different from the solution I describe in the latter half of the post.
Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won't create much noise. But also, I actually want the coffee-making to not be particularly noisy, and if it's the case that the first plan for making coffee also creates a lot of noise as a side effect, this is a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).
Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can't remember whether he proposed any way to make that reflectively stable though.
From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents?
LLMs in their current form don't really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization toward "normality" (and also kinda quantilize by default). So maybe yeah, I think I agree with your statement in the sense that I think you intended it, as it refers to current technology. But it's not clear to me that this remains true if we made something-like-an-LLM that is genuinely creative (in the sense of being capable of finding genuinely-out-of-the-box plans that achieve a particular outcome). It depends on how exactly it implements its regularization/redundancy/quantilization and whether that implementation works for the particular OOD tasks we use it for.
Ultimately I don't think LLM-ish vs RL-ish will be the main alignment-relevant axis. RL trained agents will also understand natural language, and contain natural-language-relevant algorithms. Better to focus on understood vs not-understood.
Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect it using information about our actual goal.
So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it were to contain extra constraints beyond what we actually want, then yeah that creates an alignment tax.
The Alice and Bob example isn't a good argument against the independence axiom. The combined agent can be represented using a fact-conditional utility function. Include the event "get job offer" in the outcome space, so that the combined utility function is a function of that fact.
E.g.
Bob {A: 0, B: 0.5, C: 1}
Alice {A: 0.3, B: 0, C: 0}
Should merge to become
AliceBob {Ao: 0, Bo: 0.5, Co: 1, A¬o: 0.3, B¬o: 0, C¬o: 0}, where o="get job offer".
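As a sanity check, here's a minimal sketch of how the merged agent evaluates lotteries (my own illustration; the example lottery and the 30% offer probability are made-up numbers, not from the original post):

```python
# Expected utility for the merged AliceBob agent, with the outcome space
# expanded to include the event o = "get job offer".
def alicebob_utility(outcome: str, got_offer: bool) -> float:
    bob = {"A": 0.0, "B": 0.5, "C": 1.0}    # applies on the offer branch
    alice = {"A": 0.3, "B": 0.0, "C": 0.0}  # applies on the no-offer branch
    return bob[outcome] if got_offer else alice[outcome]

def expected_utility(lottery: dict[str, float], p_offer: float) -> float:
    # lottery maps each outcome in {A, B, C} to its probability.
    return sum(
        p * (p_offer * alicebob_utility(x, True) + (1 - p_offer) * alicebob_utility(x, False))
        for x, p in lottery.items()
    )

# Example: a sure B vs a 50/50 mix of A and C, with a 30% chance of the offer.
print(expected_utility({"B": 1.0}, p_offer=0.3))            # 0.3*0.5 = 0.15
print(expected_utility({"A": 0.5, "C": 0.5}, p_offer=0.3))  # 0.105 + 0.15 = 0.255
```

The point is just that beliefs about o stay in the probability term and preferences stay in the utility term, rather than being mixed together.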
This is a far more natural way to combine agents. We can avoid the ontologically weird mixing of probabilities and preference implied by having preference () and also . Like... what does a geometrically rational agent actually care about, and why do its preferences change depending on its own beliefs and priors? A fact-conditional utility function is ontologically cleaner. Agents care about events in the world (potentially in different ways across branches of possibility, but it's still fundamentally caring about events).
This removes all the appeal of geometric rationality for me. The remaining intuitive appeal comes from humans having preferences that are logarithmic in most resources, which is more simply represented as one utility function rather than as a geometric average of many.
Excited to attend, the 2023 conference was great!
Can we submit talks?
Yeah I can see how Scott's quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn't necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from pretraining data, including moral systems. I don't know how Scott is imagining this, but it needn't be an inner homunculus with consistent goals.
I think the thread below with Daniel and Evan and Ryan is good clarification of what people historically believed (which doesn't put 0 probability on 'inner homunculi', but also didn't consider it close to being the most likely way that scheming consequentialist agents could be created, which is what Alex is referring to[1]). E.g. Ajeya is clear at the beginning of this post that the training setup she's considering isn't the same as pretraining on a myopic prediction objective.
- ^
When he says 'I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities.'
but his takes were probably a little more predictably unwelcome in this venue
I hope he doesn't feel his takes are unwelcome here. I think they're empirically very welcome. His posts seem to have a roughly similar level of controversy and popularity as e.g. so8res. I'm pretty sad that he largely stopped engaging with lesswrong.
There's definitely value to being (rudely?) shaken out of lazy habits of thinking [...] and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.
Yeah I agree, that's why I like to read Alex's takes.
Really appreciate dialogues like this. This kind of engagement across worldviews should happen far more, and I'd love to do more of it myself.[1]
Some aspects were slightly disappointing:
- Alex keeps putting (inaccurate) words in the mouths of people he disagrees with, without citation. E.g.
- 'we still haven't seen consistent-across-contexts agency from pretrained systems, a possibility seriously grappled with by eg The Parable of Predict-O-Matic).'
- That post was describing a very different kind of AI than generative language models. In particular, it is explicitly designed to minimize long run prediction error.[2] In fact, the surrounding posts in the sequence discuss myopia and suggest myopic algorithms might be more fundamental/incentivised by default.
- 'I think this is a better possible story than the "SGD selects for simplicity -> inner-goal structure" but I also want to note that the reason you give above is not the same as the historical supports offered for the homunculus.'
- "I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities. Off the top of my head: Eliezer and Nate calling this the "obvious result" of what you get from running ML processes. Ajeya writing about schemers, Evan writing about deceptive alignment (quite recently!), Habryka and Rob B saying similar-seeming things on Twitter" and "Again, I'm only critiquing the within-forward-pass version"
- I think you're saying here that all these people were predicting consistent-across-contexts inner homunculi from pretraining near-term LLMs? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
- 'And so people wasted a lot of time, I claim, worrying about that whole "how can I specify 'get my mother out of the building' to the outcome pump" thing'
- People spent time thinking about how to mitigate reward hacking? Yes. But that's a very reasonable problem to work on, with strong empirical feedback loops. Can you give any examples of people wasting time trying to specify 'get my mother out of the building'? I can't remember any. How would that even work?
- "And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude). "
- Who predicted this? You're making up bad predictions. Eliezer in particular has been pretty clear that he doesn't expect evidence of this form.
- Alex seemed to occasionally enjoy throwing out insults sideways toward third parties.
- E.g. "the LW community has largely written fanfiction alignment research". I think communication between the various factions would go better if statements like this were written without deliberate intention to insult. It could have just been "the LW community has been largely working from bad assumptions".
But I'm really glad this was published, I learned something about both Oliver and Alex's models, and I'd think it was very positive even if there were more insults :)
- ^
If anyone is interested?
- ^
Quote from the post: "Predict-O-Matic will be objective. It is a machine of prediction, is it not? Its every cog and wheel is set to that task. So, the answer is simple: it will make whichever answer minimizes projected predictive error. There will be no exact ties; the statistics are always messy enough to see to that. And, if there are, it will choose alphabetically."
- ^
Relevant quote from Evan in that post:
"Question: Yeah, so would you say that, GPT-3 is on the extreme end of world modeling. As far as what it's learned in this training process?
What is GPT-3 actually doing? Who knows? Could it be the case for GPT-3 that as we train larger and more powerful language models, doing pre-training will eventually result in a deceptively aligned model? I think that’s possible. For specifically GPT-3 right now, I would argue that it looks like it’s just doing world modeling. It doesn’t seem like it has the situational awareness necessary to be deceptive. And, if I had to bet, I would guess that future language model pre-training will also look like that and won’t be deceptive. But that’s just a guess, and not a super confident one.
The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.
Also, this all potentially changes if you start doing fine-tuning, like RLHF (reinforcement learning from human feedback). Then what you’re trying to get it to do might be quite complex—something like “maximize human approval.” If it has to learn a goal like that, learning the right proxies becomes a lot harder."
Tsvi has many underrated posts. This one was rated correctly.
I didn't previously have a crisp conceptual handle for the category that Tsvi calls Playful Thinking. Initially it seemed a slightly unnatural category. Now it's such a natural category that perhaps it should be called "Thinking", and other kinds should be the ones with a modifier (e.g. maybe Directed Thinking?).
Tsvi gives many theoretical justifications for engaging in Playful Thinking. I want to talk about one because it was only briefly mentioned in the post:
Your sense of fun decorrelates you from brain worms / egregores / systems of deference, avoiding the dangers of those.
For me, engaging in intellectual play is an antidote to political mindkilledness. It's not perfect. It doesn't work for very long. But it does help.
When I switch from intellectual play to a politically charged topic, there's a brief period where I'm just.. better at thinking about it. Perhaps it increases open-mindedness. But that's not it. It's more like increased ability to run down object-level thoughts without higher-level interference. A very valuable state of mind.
But this isn't why I play. I play because it's fun. And because it's natural? It's in our nature.
It's easy to throw this away under pressure, and I've sometimes done so. This post is a good reminder of why I shouldn't.
This post deserves to be remembered as a LessWrong classic.
- It directly tries to solve a difficult and important cluster of problems (whether it succeeds is yet to be seen).
- It uses a new diagrammatic method of manipulating sets of independence relations.
- It's a technical result! These feel like they're getting rarer on LessWrong and should be encouraged.
There are several problems that are fundamentally about attaching very different world models together and transferring information from one to the other.
- Ontology identification involves taking a goal defined in an old ontology[1] and accurately translating it into a new ontology.
- High-level models and low-level models need to interact in a bounded agent. I.e. learning a high-level fact should influence your knowledge about low-level facts and vice versa.
- Value identification is the problem of translating values from a human to an AI. This is much like ontology identification, with the added difficulty that we don't get as much detailed access or control over the human world model.
- Interpretability is about finding recognisable concepts and algorithms in trained neural networks.
In general, we can solve these problems using shared variables and shared sub-structures that are present in both models.
- We can stitch together very different world models along shared variables. E.g. suppose you have two models of molecular dynamics, one faster and simpler than the other, and you want to simulate in the fast one, then switch to the slow one when particular interactions happen. To transfer the state from one to the other, you identify variables present in both models (probably atom locations, velocities, some others), then just copy these values across to the other model. Under-specified variables must be inferred from priors (a toy sketch of this follows after the list).
- If you want to transfer a new concept from WM1 to a less knowledgeable WM2, you can do so by identifying the lower-level concepts that both WMs share, then constructing an "explanation" out of those concepts. An "explanation" would look like a WM fragment purely built out of variables and structures already in WM2.
- An explanation is also a pointer. If you want to point at a very specific concept in someone else's WM, one way to do so is to explain that concept (in terms of lower level ideas that you are confident are shared).
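A toy sketch of the state-transfer idea from the first bullet above (all variable names and the prior here are made up for illustration):

```python
import random

# Hypothetical toy states for a fast/coarse model and a slow/detailed model.
# Shared variables: atom positions and velocities. The detailed model also
# tracks extra variables (e.g. electronic state) that the coarse model lacks.
fast_state = {"positions": [0.0, 1.2, 3.4], "velocities": [0.1, -0.2, 0.0]}

def electronic_state_prior() -> str:
    # Placeholder prior over the variables the coarse model doesn't represent.
    return random.choice(["ground", "excited"])

def transfer(fast_state: dict, shared_keys: list[str]) -> dict:
    # Copy the variables present in both models...
    slow_state = {k: fast_state[k] for k in shared_keys}
    # ...and fill in under-specified variables by sampling from a prior.
    slow_state["electronic_state"] = electronic_state_prior()
    return slow_state

slow_state = transfer(fast_state, shared_keys=["positions", "velocities"])
print(slow_state)
```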
Natural latents are a step toward solving all of these problems, via describing a subset of variables/structures that we should expect to find across all WMs (and more importantly, some of the conditions required for them to be present).
A natural latent should be extremely useful for any WM that contains variables which share redundant information. I think we can expect this to be common when highly redundant observations are compressed.
For example: If the ~same observation happens more than once, then any learner that generalizes well is going to notice this. It must somehow store the duplicated information (as well as each of the places where it is duplicated). That shared information is a natural latent. The result in this post suggests that this summary information should be isomorphic between agents, under the right conditions.[2]
As far as I know, it's an open question which properties of environments&agents imply lots of natural latents.
The current state of this work has some limitations:
- Both learners need access to the same or very similar low level observables X.
- Both learners must have learned the same beliefs, otherwise their latents may be very different (although this is kinda fixed with an additional constraint on the latent).
- John and David seem to have run into difficulties building useful applications of this theory.
- John's posts aren't clear on how to identify and separate out the X variables from a general stream of data (although this seems fine for now).
- With lots of compute, approximate models can be dropped in exchange for detailed models. One might drop the concept of "tree" in exchange for a complete categorization of types of trees.
- On the one hand, this is still tracking the same latent information. The theorems still work. But on the other hand, it isn't necessarily storing it in an easy-to-access way. This is fine for communication, but less fine for interpretability or manual joining of WMs.
- Perhaps there is some assumption we can make that guarantees all levels of abstraction will remain stored. Or perhaps we should expect interpretability of a WM to often involve some inferential work on the part of that WM.
If this line of research goes well, I hope that we will have theorems that say something like:
"For some bounded agent design, given observations of some complicated well-understood data-generating structure Z, and enough compute / attentional resources, the agent will learn a model of Z which contains parts x,y,w (each with predicable levels of approximate isomorphism to parts of the real Z). Upon observing more data, we can expect some parts (x,y) of this structure to remain unchanged."
- ^
Think of an ontology as the choice of variables in a particular Bayes net, for our current purposes.
- ^
I'm leaning on the algorithmic definition of natural latents here.
I'm curious whether the recent trend toward bi-level optimization via chain-of-thought was any update for you? I would have thought this would have updated people (partially?) back toward actually-evolution-was-a-decent-analogy.
There's this paragraph, which seems right-ish to me:
In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:
- Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]
- Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.[4]
- Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.[5]
Extremely long chains-of-thought on hard problems pretty much meet these conditions, right?
I think we currently do not have good gears level models of lots of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher level agentic/psychological/functionalist framing.
Here are two ways that a high-level model can be wrong:
- It isn't detailed enough, but once you learn the detail it adds up to basically the same picture. E.g. Newtonian physics, ideal gas laws. When you get a more detailed model, you learn more about which edge-cases will break it. But the model basically still works, and is valuable for working out the more detailed model.
- It's built out of confused concepts. E.g. free will, consciousness (probably), many ways of thinking about personal identity, four humors model. We're basically better off without this kind of model and should start from scratch.
It sounds like you're saying high-level agency-as-outcome-directed is wrong in the second way? If so, I disagree, it looks much more like the first way. I don't think I understand your beliefs well enough to argue about this, maybe there's something I should read?
I have a discomfort that I want to try to gesture at:
Are you ultimately wanting to build a piece of software that solves a problem so difficult that it needs to modify itself? My impression from the post is that you are thinking about this level of capability in a distant way, and mostly focusing on much earlier and easier regimes. I think it's probably very easy to work on legible low-level capabilities without making any progress on the regime that matters.
To me it looks important for researchers to have this ultimate goal constantly in their mind, because there are many pathways off-track. Does it look different to you?
Ultimately, this is a governance problem, not a technical problem. The choice to choose illegible capabilities is a political one.
I think this is a bad place to rely on governance, given the fuzziness of this boundary and the huge incentive toward capability over legibility. Am I right in thinking that you're making a large-ish gamble here on the way the tech tree shakes out (such that it's easy to see a legible-illegible boundary, and the legible approaches are competitive-ish) and also the way governance shakes out (such that governments decide that e.g. assigning detailed blame for failures is extremely important and worth delaying capabilities)?
I'm glad you're doing ambitious things, and I'm generally a fan of trying to understand problems from scratch in the hope that they dissolve or become easier to solve.
Literally compute and man-power. I can't afford the kind of cluster needed to even begin a pretraining research agenda, or to hire a new research team to work on this. I am less bottlenecked on the theoretical side atm, because I need to run into a lot of bottlenecks from actual grounded experiments first.
Why would this be a project that requires large scale experiments? Looks like something that a random PhD student with two GPUs could maybe make progress on. Might be a good problem to make a prize for even?
because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future
I personally doubt that this is true, which is maybe the crux here.
Would you like to do a dialogue about this? To me it seems clearly true in exactly the same way that having more time to pursue a goal makes it more likely you will achieve that goal.
It's possible another crux is the danger of Goodharting, which I think you're exaggerating. When an agent actually understands what it wants, and/or understands the limits of its understanding, then Goodhart is easy to mitigate, and it should try hard to achieve its goals (i.e. optimize a metric).
There are multiple ways to interpret "being an actual human". I interpret it as pointing at an ability level.
"the task GPTs are being trained on is harder" => the prediction objective doesn't top out at (i.e. the task has more difficulty in it than).
"than being an actual human" => the ability level of a human (i.e. the task of matching the human ability level at the relevant set of tasks).
Or as Eliezer said:
I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.
In different words again: the tasks GPTs are being incentivised to solve aren't all solvable at a human level of capability.
You almost had it when you said:
- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'. But this comparison does not seem to be particularly informative.
It's more accurate if I edit it to:
- Maybe you mean something like task + performance threshold. Here 'predict ~~the activation of photoreceptors in human retina~~ [text] well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'.
You say it's not particularly informative. Eliezer responds by explaining the argument it responds to, which provides the context in which this is an informative statement about the training incentives of a GPT.
The OP argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human level capabilities. This is a valid counter-argument to: GPTs will cap out at human capabilities because humans generated the training data.
Your central point is:
Where GPT and humans differ is not some general mathematical fact about the task, but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded.
You are misinterpreting the OP by thinking it's about comparing the mathematical properties of two tasks, when it was just pointing at the loss gradient of the text prediction task (at the location of a ~human capability profile). The OP works through text prediction sub-tasks where it's obvious that the gradient points toward higher-than-human inference capabilities.
You seem to focus too hard on the minima of the loss function:
notice that “what would the loss function like the system to do” in principle tells you very little about what the system will do
You're correct to point out that the minimum of a loss function doesn't tell you much about the actual loss that could be achieved by a particular system. Like you say, the particular boundedness and cognitive architecture are more relevant to this question. But this is irrelevant to the argument being made, which is about whether the text prediction objective stops incentivising improvements above human capability.
The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning
I think a better lesson to learn is that communication is hard, and therefore we should try not to be too salty toward each other.
I sometimes think of alignment as having two barriers:
- Obtaining levers that can be used to design and shape an AGI in development.
- Developing theory that predicts the effect of your design choices.
My current understanding of your agenda, in my own words:
You're trying to create a low-capability AI paradigm that has way more levers. This paradigm centers on building useful systems by patching together LLM calls. You're collecting a set of useful tactics for doing this patching. You can rely on tactics in a similar way to how we rely on programming language features, because they are small and well-tested-ish. (1 & 2)
As new tactics are developed, you're hoping that expertise and robust theories develop around building systems this way. (3)
This by itself doesn't scale to hard problems, so you're trying to develop methods for learning and tracking knowledge/facts that interface with the rest of the system while remaining legible. (4)
Maybe with some additional tools, we build a relatively-legible emulation of human thinking on top of this paradigm. (5)
Have I understood this correctly?
I feel like the alignment section of this is missing. Is the hope that better legibility and experience allows us to solve the alignment problems that we expect at this point?
Maybe it'd be good to name some speculative tools/theory that you hope to have been developed for shaping CoEms, then say how they would help with some of:
- Unexpected edge cases in value specification
- Goal stability across ontology shifts
- Reflective stability of goals
- Optimization daemons or simpler self-reinforcing biases
- Maintaining interruptibility against instrumental convergence
Most alignment research skips to trying to resolve issues like these first, at least in principle, and then often backs off to develop a relevant theory. I can see why you might want to do the levers part first, and have theory develop along with experience building things. But it's risky to do the hard part last.
but because the same solutions that will make AI systems beneficial will also make them safer
This is often not true, and I don't think your paradigm makes it true. E.g. often we lose legibility to increase capability, and that is plausibly also true during AGI development in the CoEm paradigm.
In practice, sadly, developing a true ELM is currently too expensive for us to pursue
Expensive why? Seems like the bottleneck here is theoretical understanding.
Yeah I read that prize contest post, that was much of where I got my impression of the "consensus". It didn't really describe which parts you still considered valuable. I'd be curious to know which they are? My understanding was that most of the conclusions made in that post were downstream of the Landauer limit argument.
Could you explain or directly link to something about the 4x claim? Seems wrong. Communication speed scales with distance, not area.
Jacob Cannell's brain efficiency post
I thought the consensus on that post was that it was mostly bullshit?
These seem right, but more importantly I think it would eliminate investing in new scalable companies. Or dramatically reduce it in the 50% case. So there would be very few new companies created.
(As a side note: Maybe our response to this proposal was a bit cruel. It might have been better to just point toward some econ reading material).
would hopefully include many people who understand that understanding constraints is key and that past research understood some constraints.
Good point, I'm convinced by this.
build on past agent foundations research
I don't really agree with this. Why do you say this?
That's my guess at the level of engagement required to understand something. Maybe just because when I've tried to use or modify some research that I thought I understood, I always realise I didn't understand it deeply enough. I'm probably anchoring too hard on my own experience here, other people often learn faster than me.
(Also I'm confused about the discourse in this thread (which is fine), because I thought we were discussing "how / how much should grantmakers let the money flow".)
I was thinking "should grantmakers let the money flow to unknown young people who want a chance to prove themselves."
I agree this would be a great program to run, but I want to call it a different lever to the one I was referring to.
The only thing I would change is that I think new researchers need to understand the purpose and value of past agent foundations research. I spent too long searching for novel ideas while I still misunderstood the main constraints of alignment. I expect you'd get a lot of wasted effort if you asked for out-of-paradigm ideas. Instead it might be better to ask for people to understand and build on past agent foundations research, then gradually move away if they see other pathways after having understood the constraints. Now I see my work as mostly about trying to run into constraints for the purpose of better understanding them.
Maybe that wouldn't help though, it's really hard to make people see the constraints.
The main things I'm referring to are upskilling or career transition grants, especially from LTFF, in the last couple of years. I don't have stats; I'm assuming there were a lot given out because I met a lot of people who had received them. Probably there were a bunch given out by the FTX Future Fund also.
Also when I did MATS, many of us got grants post-MATS to continue our research. Relatively little seems to have come of these.
How are they falling short?
(I sound negative about these grants but I'm not, and I do want more stuff like that to happen. If I were grantmaking I'd probably give many more of some kinds of safety research grant. But "If a man has an idea just give him money and don't ask questions" isn't the right kind of change imo).
I think I disagree. This is a bandit problem, and grantmakers have tried pulling that lever a bunch of times. There hasn't been any field-changing research (yet). They knew it had a low chance of success so it's not a big update. But it is a small update.
Probably the optimal move isn't cutting early-career support entirely, but having a higher bar seems correct. There are other levers that are worth trying, and we don't have the resources to try every lever.
Also there are more grifters now that the word is out, so the EV is also declining that way.
(I feel bad saying this as someone who benefited a lot from early-career financial support).
My first exposure to rationalists was a Rationally Speaking episode where Julia recommended the movie Locke.
It's about a man pursuing difficult goals under emotional stress using few tools. For me it was a great way to be introduced to rationalism because it showed how a ~rational actor could look very different from a straw Vulcan.
It's also a great movie.
Nice.
A similar rule of thumb I find handy is dividing 70 by the growth rate (in percent) to get the implied doubling time. I find it way easier to think about doubling times than growth rates.
E.g. 3% interest rate means 70/3 ≈ 23 year doubling time.
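A quick sketch comparing the rule of thumb against the exact doubling-time formula (my own illustration):

```python
import math

def doubling_time_exact(growth_rate_pct: float) -> float:
    # Exact doubling time for compound growth at growth_rate_pct percent per period.
    return math.log(2) / math.log(1 + growth_rate_pct / 100)

def doubling_time_rule_of_70(growth_rate_pct: float) -> float:
    return 70 / growth_rate_pct

print(doubling_time_exact(3))       # ~23.4 periods
print(doubling_time_rule_of_70(3))  # ~23.3 periods
```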
I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.
I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression that he's a rock with "DOOM" written on it.
I think the pessimism comes from there being several currently-unsolved problems that get in the way of "deeply-understood enough". In principle it's possible to understand these problems and hand-build a safe and stable AGI, it just looks a lot easier to hand-build an AGI without understanding them all, and even easier than that to train an AGI without even thinking about them.
I call most of these "instability" problems. Where the AI might for example learn more, or think more, or self-modify, and each of these can shift the context in a way that causes an imperfectly designed AI to pursue unintended goals.
Here are some descriptions of problems in that cluster: optimization daemons, ontology shifts, translating between our ontology and the AI's internal ontology in a way that generalizes, Pascal's mugging, reflectively stable preferences & decision algorithms, reflectively stable corrigibility, and correctly estimating future competence under different circumstances.
Some may be resolved by default along the way to understanding how to build AGI by hand, but it isn't clear. Some are kinda solved already in some contexts.
Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population.
IMO this is pretty obviously wrong. There are some kinds of problem solving that scale poorly with population, just as there are some computations that scale poorly with parallelisation.
E.g. Project Euler problems.
When I said "problems we care about", I was referring to a cluster of problems that very strongly appear to not scale well with population. Maybe this is an intuitive picture of the cluster of problems I'm referring to.
I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level. You can't expect SGD to implement the patch in a robust way. The patch would need to still be working after 6 months on an impossible problem, in spite of it actively getting in the way of finding the solution!
I'd be curious about why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing. At least for large scale thinking.
I was probably influenced by your ideas! I just (re?)read your post on the topic.
Tbh I think it's unlikely such a sweet spot exists, and I find your example unconvincing. The value of this kind of reflection for difficult problem solving directly conflicts with the "useful" assumption.
I'd be more convinced if you described the task where you expect an AI to be useful (significantly above current humans), and that doesn't involve failing and reevaluating high-level strategy every now and then.
Extremely underrated post, I'm sorry I only skimmed it when it came out.
I found 3a,b,c to be strong and well written, a good representation of my view.
In contrast, 3d I found to be a weak argument that I didn't identify with. In particular, I don't think internal conflicts are a good way to explain the source of goal misgeneralization. To me it's better described as just overfitting or misgeneralization.[1] Edge cases in goals are clearly going to be explored by a stepping back process, if initial attempts fail. In particular if attempted pathways continue to fail. Whereas thinking of the AI as needing to resolve conflicting values seems to me to be anthropomorphizing in a way that doesn't seem to transfer to most mind designs.
You also used the word coherent in a way that I didn't understand.
Human intelligence seems easily useful enough to be a major research accelerator if it can be produced cheaply by AI
I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
humans provides a pretty strong intuitive counterexample
It's a good observation that humans seem better at stepping back inside of low-level tasks than at high-level life-purposes. For example, I got stuck on a default path of finishing a neuroscience degree, even though if I had reflected properly I would have realised it was useless for achieving my goals a couple of years earlier. I got got by sunk costs and normality.
However, I think this counterexample isn't as strong as you think it is. Firstly because it's incredibly common for people to break out of a default-path. And secondly because stepping back is usually preceded by some kind of failure to achieve the goal using a particular approach. Such failures occur often at small scales. They occur infrequently in most people's high-level life plans, because such plans are fairly easy and don't often raise flags that indicate potential failure. We want difficult work out of an AI. This implies frequent total failure, and hence frequent high-level stepping back. If it's doing alignment research, this is particularly true.
- ^
Like for reasons given in section 4 of the misalignment and catastrophe doc.
Trying to write a new steelman of Matt's view. It's probably incorrect, but seems good to post as a measure of progress:
You believe in agentic capabilities generalizing, but also in additional high-level patterns that generalize and often overpower agentic behaviour. You expect training to learn all the algorithms required for intelligence, but also pick up patterns in the data like "research style", maybe "personality", maybe "things a person wouldn't do" and also build those into the various-algorithms-that-add-up-to-intelligence at a deep level. In particular, these patterns might capture something like "unwillingness to commandeer some extra compute" even though it's easy and important and hasn't been explicitly trained against. These higher level patterns influence generalization more than agentic patterns do, even though this reduces capability a bit.
One component of your model that reinforces this: Realistic intelligence algorithms rely heavily on something like caching training data and this has strong implications about how we should expect them to generalize. This gives an inductive-bias advantage to the patterns you mention, and a disadvantage to think-it-through-properly algorithms (like brute force search, or even human-like thinking).
We didn't quite get to talking about reflection, but this is probably the biggest hurdle in the way of getting such properties to stick around. I'll guess at your response: You think that an intelligence that doesn't-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed with some small-ish capability penalty, but massive benefits for safety.
Not much to add, I haven't spent enough time thinking about structural selection theorems.
I'm a fan of making more assumptions. I've had a number of conversations with people who seem to make the mistake of not assuming enough. Sometimes leading them to incorrectly consider various things impossible. E.g. "How could an agent store a utility function over all possible worlds?" or "Rice's theorem/halting problem/incompleteness/NP-hardness/no-free-lunch theorems means it's impossible to do xyz". The answer is always nah, it's possible, we just need to take advantage of some structure in the problem.
Finding the right assumptions is really hard though, it's easy to oversimplify the problem and end up with something useless.
Good point.
What I meant by "updatelessness removes most of the justification" is the reason given here at the very beginning of "Against Resolute Choice". In order to make a money pump that leads the agent in a circle, the agent has to continue accepting trades around a full preference loop. But if it has decided on the entire plan beforehand, it will just do any plan that involves <1 trip around the preference loop. (Although it's unclear how it would settle on such a plan, maybe just stopping its search after a given time). It won't (I think?) choose any plan that does multiple loops, because they are strictly worse.
After choosing this plan though, I think it is representable as VNM rational, as you say. And I'm not sure what to do with this. It does seem important.
However, I think Scott's argument here satisfies (a) (b) and (c). I think the independence axiom might be special in this respect, because the money pump for independence is exploiting an update on new information.
I think the problem might be that you've given this definition of heuristic:
A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.
Taking this definition seriously, it's easy to decompose a forward pass into such functions.
But you have a much more detailed idea of a heuristic in mind. You've pointed toward some properties this might have in your point (2), but haven't put it into specific words.
Some options: A single heuristic is causally dependent on <5 heuristics below and influences <5 heuristics above. The inputs and outputs of heuristics are strong information bottlenecks with a limit of 30 bits. The function of a heuristic can be understood without reference to >4 other heuristics in the same layer. A single heuristic is used in <5 different ways across the data distribution. A model is made up of <50 layers of heuristics. Large arrays of parallel heuristics often output information of the same type.
Some combination of these (or similar properties) would turn the heuristics intuition into a real hypothesis capable of making predictions.
If you don't go into this level of detail, it's easy to trick yourself into thinking that (2) basically kinda follows from your definition of heuristics, when it really really doesn't. And that will lead you to never discover the value of the heuristics intuition, if it is true, and never reject it if it is false.
I've only skimmed this post, but I like it because I think it puts into words a fairly common model (that I disagree with). I've heard "it's all just a stack of heuristics" as an explanation of neural networks and as a claim that all intelligence is this, from several people. (Probably I'm overinterpreting other people's words to some extent, they probably meant a weaker/nuanced version. But like you say, it can be useful to talk about the strong version).
I think you've correctly identified the flaw in this idea (it isn't predictive, it's unfalsifiable, so it isn't actually explaining anything even if it feels like it is). You don't seem to think this is a fatal flaw. Why?
You seem to answer
However, the key interpretability-related claim is that heuristics based decompositions will be human-understandable, which is a more falsifiable claim.
But I don't see why "heuristics based decompositions will be human-understandable" is an implication of the theory. As an extreme counterexample, logic gates are interpretable, but when stacked up into a computer they are ~uninterpretable. It looks to me like you've just tacked an interpretability hypothesis onto a heuristics hypothesis.
Trying to think this through, I'll write a bit of a braindump just in case that's useful:
The futarchy hack can be split into two parts. The first is that conditioning on untaken actions makes most probabilities ill-defined; because there are no incentives to get it right, the market can settle to many equilibria. The second part is that there are various incentives for traders to take advantage of this for their own interests.
With your technique, I think the approach would be to duplicate each trader into two traders with the same knowledge, and make their joint earnings zero sum.[1]
This removes one explicit incentive for a single trader to manipulate a value to cause a different action to happen, but only if it's doing so to make the distribution easier to predict and thereby improve its score. Potentially there are still other incentives (e.g. if the trader has preferences over the world), and these aren't eliminated.
Why doesn't this happen in LI already? LI is zero sum overall, because there is a finite pool of wealth. But this is shared among traders with different knowledge. If there is a wealthiest trader that has a particular piece of knowledge, it should manipulate actions to reduce variance to get a higher score. So the problem is that it's not zero-sum with respect to each piece of knowledge.
But, the first issue is entirely unresolved. The probabilities that condition on untaken actions will be path-dependent leftovers from the convergence procedure of LI, when the market was more uncertain about which action will be taken. I'd expect these to be fairly reasonable, but they don't have to be.
This reasonableness is coming from something though, and maybe this can be formalized.
- ^
You'd have to build a lot more structure into the LI traders to guarantee they can't learn to cooperate and are myopic. But that seems doable. And it's the sort of thing I'd want to do anyway.
I appreciate that you tried. If words are failing us to this extent, I'm going to give up.
How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?
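As a rough sketch of the arithmetic (the numbers, and the assumption that system-level failure needs at least two simultaneous rogue agents, are mine, just to make the point concrete):

```python
from math import comb

def p_at_least_k_failures(n: int, eps: float, k: int) -> float:
    # Probability that at least k of n independent agents go off the rails,
    # if each does so with probability eps in a given year.
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(k, n + 1))

eps = 0.01
print(p_at_least_k_failures(10, eps, 1))  # ~0.096: some agent goes rogue
print(p_at_least_k_failures(10, eps, 2))  # ~0.0043: two or more go rogue at once
```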
(I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).