But yeah, personally, I think this is all a result of a kind of precious view about experiential continuity that I don't share
Yeah, I don't know that this glyphisation process would give us what we actually want.
"Consciousness" is a confused term. Taking on a more executable angle, we presumably value some specific kinds of systems/algorithms corresponding to conscious human minds. We especially value various additional features of these algorithms, such as specific personality traits, memories, et cetera. A system that has the features of a specific human being would presumably be valued extremely highly by that same human being. A system that has fewer of those features would be valued increasingly less (in lockstep with how unlike "you" it becomes), until it's only as valuable as e. g. a randomly chosen human/sentient being.
So if you need to mold yourself into a shape where some or all of the features which you use to define yourself are absent, each loss is still a loss, even if it happens continuously/gradually.
So from a global perspective, it's not much different from acausal aliens resurrecting Schelling-point Glyph Beings without you having warped yourself into a Glyph Being over time. If you value systems that are like Glyph Beings, their creation somewhere in another universe is still positive by your values. If you don't, if you only value human-like systems, then someone creating Glyph Beings brings you no joy. Whether you or your friends warped yourselves into Glyph Beings in the process doesn't matter.
A dog will change the weather dramatically, which will substantially affect your perceptions.
In this case, it's about alt-complexity again. Sure, a dog causes a specific weather-pattern change. But could this specific weather-pattern change have been caused only by this specific dog? Perhaps if we edit the universe to erase this dog, but add a cat and a bird five kilometers away, the chaotic weather dynamic would play out the same way? Then, from your perceptions' perspective, you wouldn't be able to distinguish between a dog timeline and a cat-and-bird timeline.
In some sense, this is common-sensical. The mapping from reality's low-level state to your perceptions is non-injective: the low-level state contains more information than you perceive on a moment-to-moment basis. Therefore, for any observation-state, there are several low-level states consistent with it. Scaling up: for any observed lifetime, there are several low-level histories consistent with it.
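A toy sketch of that non-injectivity (my own illustration, nothing from the thread): model the low-level state as three binary sites, with the observer seeing only a lossy projection of them.

```python
from itertools import product

def observe(state):
    # The observer sees only the parity of the sum: a lossy,
    # non-injective projection of the low-level state.
    return sum(state) % 2

# All low-level states on 3 binary sites.
states = list(product([0, 1], repeat=3))

# Group the low-level states by the observation they produce.
consistent = {}
for s in states:
    consistent.setdefault(observe(s), []).append(s)

# Every observation is compatible with several low-level states,
# so no observation pins down a unique "timeline".
assert all(len(v) > 1 for v in consistent.values())
```

Scaling the number of sites up only widens the gap: the set of low-level states consistent with a fixed observation grows exponentially.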
Sure. This setup couldn't really be exploited for optimizing the universe. If we assume that the self-selection assumption is a reasonable assumption to make, inducing amnesia doesn't actually improve outcomes across possible worlds. One out of 100 prisoners still dies.
It can't even be considered "re-rolling the dice" on whether the specific prisoner that you are dies. Under the SSA, there's no such thing as a "specific prisoner", "you" are implemented as all 100 prisoners simultaneously, and so regardless of whether you choose to erase your memory or not, 1/100 of your measure is still destroyed. Without SSA, on the other hand, if we consider each prisoner's perspective to be distinct, erasing memory indeed does nothing: it doesn't return your perspective to the common pool of prisoner-perspectives, so if "you" were going to get shot, "you" are still going to get shot.
I'm not super interested in that part, though. What I'm interested in is whether there are in fact 100 clones of me: whether, under the SSA, "microscopically different" prisoners could be meaningfully considered a single "high-level" prisoner.
Agreed. I think a type of "stop AGI research" argument that's under-deployed is that there's no process or actor in the world that society would trust with unilateral godlike power. At large, people don't trust their own governments, don't trust foreign governments, don't trust international organizations, and don't trust corporations or their CEOs. Therefore, preventing anyone from building ASI anywhere is the only thing we can all agree on.
I expect this would be much more effective messaging with some demographics, compared to even very down-to-earth arguments about loss of control. For one, it doesn't need to dismiss the very legitimate fear that the AGI would be aligned to values that a given person would consider monstrous. (Unlike "stop thinking about it, we can't align it to any values!".)
And it is, of course, true.
That's probably not what Page meant. On consideration, he would probably have clarified that AI that includes what we value about humanity would be a worthy successor. He probably wasn't even clear on his own philosophy at the time.
I don't see reasons to be so confident in this optimism. If I recall correctly, Robin Hanson explicitly believes that putting any constraints on future forms of life, including on its values, is undesirable/bad/regressive, even though lack of such constraints would eventually lead to a future with no trace of humanity left. Similar for Beff Jezos and other hardcore e/acc: they believe that a worthy future involves making a number go up, a number that corresponds to some abstract quantity like "entropy" or "complexity of life" or something, and that if making it go up involves humanity going extinct, too bad for humanity.
Which is to say: there are existence proofs that people with such beliefs can exist, and can retain these beliefs across many years and in the face of what's currently happening.
I can readily believe that Larry Page is also like this.
Also this:
From Altman: [...] Admitted that he lost a lot of trust with Greg and Ilya through this process. Felt their messaging was inconsistent and felt childish at times. [...] Sam was bothered by how much Greg and Ilya keep the whole team in the loop with happenings as the process unfolded. Felt like it distracted the team.
Apparently airing such concerns is "childish" and should only be done behind closed doors, otherwise it "distracts the team", hm.
Perhaps if you did have the full solution, but it feels like there are some parts of a solution that you could figure out, such that that part of the solution doesn't tell you as much about the other parts of the solution.
I agree with that.
I'd think you can define a tetrahedron for non-Euclidean space
If you relax the definition of a tetrahedron to cover figures embedded in non-Euclidean spaces, sure. It wouldn't be the exact same concept, however. In a similar way to how "a number" is different if you define it as a natural number vs. real number.
Perhaps more intuitively, then: the notion of a geometric figure with specific properties is dependent on the notion of a space in which it is embedded. (You can relax it further – e. g., arguably, you can define a "tetrahedron" for any set with a distance function over it – but the general point stands, I think.)
Just consider if you take the assumption that the system would not change in arbitrary ways in response to its environment. There might be certain constraints. You can think about what the constraints need to be such that e.g. a self-modifying agent would never change itself such that it would expect that in the future it would get less utility than if it did not self-modify.
Yes, but: those constraints are precisely the principles you'd need to code into your AI to give it general-intelligence capabilities. If your notion of alignment only needs to be robust to certain classes of changes, because you've figured out that an efficient generally intelligent system would only change in such-and-such ways, then you've figured out a property of how generally intelligent systems ought to work – and therefore, something about how to implement one.
Speaking abstractly, the "negative image" of the theory of alignment is precisely the theory of generally intelligent embedded agents. A robust alignment scheme would likely be trivial to transform into an AGI recipe.
I am pretty sure you can figure out alignment in advance as you suggest
I'm not so sure about that. How do you figure out how to robustly keep a generally intelligent dynamically updating system on-target without having a solid model of how that system is going to change in response to its environment? Which, in turn, would require a model of what that system is?
I expect the formal definition of "alignment" to be directly dependent on the formal framework of intelligence and embedded agency, the same way a tetrahedron could only be formally defined within the context of Euclidean space.
Exploit your natural motivations
There's a relevant concept that I keep meaning to write about, which I could summarize as: create gradients towards your long-term aspirations.
Humans are general intelligences, and one of the core properties of general intelligence is not being a greedy-optimization algorithm:
- We can pursue long-term goals even when each individual step towards them is not pleasurable-in-itself (such as suffering through university to get a degree in a field whose jobs require it).
- We can force ourselves out of local maxima (such as quitting a job you hate and changing careers, even though it'd mean a period of life filled with uncertainty and anxieties).
- We can build world-models, use them to infer the shapes of our value functions, and plot a path towards their global maximum, even if it requires passing through negative-reward regions (such as engaging in self-reflection and exploration, then figuring out which vocation would be most suitable to a person-like-you).
However, it's hard. We're hybrid systems, combining generally-intelligent planning modules with greedy RL circuitry. The greedy RL circuitry holds a lot of sway. If you keep forcing yourself to do something it assigns negative rewards to, it's going to update your plan-making modules until they stop doing that.
It is much, much easier to keep doing something if every instance of it is pleasurable in itself. If the reward is instead sparse and infrequent, you'd need a lot of "willpower" to keep going (to counteract the negative updates), and accumulating that is a hard problem in itself.
So the natural solution is to plot, or create, a path towards the long-term aspiration such that motion along it would involve receiving immediate positive feedback from your learned and innate reward functions.
A lot of productivity advice reduces to this:
- Breaking the long-term task into yearly, monthly, and daily subgoals, such that you can feel accomplishment on a frequent basis (instead of only at the end).
- Using "cross-domain success loops": simultaneously work on several projects, such that you accomplish something worthwhile along at least one of those tracks frequently, and can then harness the momentum from the success along one track into the motivation for continuing the work along other tracks.
- I. e., sort of trick your reward system into confusing where exactly the reward is coming from.
- (I think there was an LW post about this, but I don't remember how to find it.)
- Eating something tasty, or going to a party, or otherwise "indulging" yourself, every time you do something that contributes to your long-term aspiration.
- Finding ways to make the process at least somewhat enjoyable, through e. g. environmental factors, such as working in a pleasant place, putting on music, using tools that feel satisfying to use, or doing small work-related rituals that you find amusing.
- Creating social rewards and punishments, such as:
- Joining a community focused on pursuing the same aspiration as you.
- Finding "workout buddies".
- Having friends who'd hold you accountable if you slack off.
- Having friends who'd cheer you on if you succeed.
- And, as in Shoshannah's post: searching for activities that are innately enjoyable and happen to move you in the direction of your aspirations.
None of the specific examples here are likely to work for you (they didn't for me). But you might be able to design or find an instance of that general trick that fits you!
(Or maybe not. Sometimes you have to grit your teeth and go through a rewardless stretch of landscape, if you're not willing to budge on your goal/aspiration.)
Other relevant posts:
- Venkatesh Rao's The Calculus of Grit. It argues for ignoring extrinsic "disciplinary boundaries" (professions, fields) when choosing your long-term aspirations, and instead following an "internal" navigation system when mapping out the shape of the kind-of-thing that someone-like-you is well-suited to doing.
- Note that this advice goes further than Shoshannah's: in this case, you don't exert any (conscious) control even over the direction you'd like to go, much less your "goal".
- It's likely to be easier, but the trade-off should be clear.
- John Wentworth's Plans Are Predictions, Not Optimization Targets. This connection is a bit more rough, but: that post can be generalized to note that any explicit life goals you set for yourself should often be treated as predictions about what goal you should pursue. Recognizing that, you might instead choose to "pursue your goal-in-expectation", which might be similar to Shoshannah's point about "picking a direction, not a goal".
Your use of "memetic" here did strike me as somewhat idiosyncratic; I had to infer it. I would have used "memetically viral" and derivatives in its place. (E. g., in place of "lots of work in that field will be highly memetic despite trash statistics", I would've said "lots of ideas in that field will be highly viral despite originating from research with trash statistics" or something.)
Sure. But:
and then they can do a bunch of the work of generalizing
This is the step which is best made unnecessary if you're crafting a message for a broad audience, I feel.
Most people are not going to be motivated to put this work in. Why would they? They get bombarded with a hundred credible-ish messages claiming high-importance content on a weekly basis. They don't have the time nor stamina to do a deep dive into each of them.
Which means any given subculture would generate its own "inferential bridge" between itself and your message, artefacts that do this work for the median member (consisting of reviews by any prominent subculture members, the takes that go viral, the entire shape of the discourse around the topic, etc.). The more work is needed, the longer these inferential bridges will be. The longer they are, the bigger the opportunity to willfully or accidentally mistranslate your message.
Like I said, it doesn't seem wise, or even fair to your potential audience, to act as if those dynamics don't take place. As if the only people who deserve consideration are those who would put in the work themselves (despite the fact that it may be a locally suboptimal way to distribute resources under their current world-model), and everyone else is a lost cause.
I agree that inventing new arguments for X that sound kind-of plausible to you on the surface level, and which you imagine would work well on a given demographic, is not a recipe for good communication. Such arguments are "artificial", they're not native citizens of someone's internally consistent world-model, and it's going to show and lead to unconvincing messages that fall apart under minimal scrutiny.
That's not what I'm arguing for. The case for the AGI risk is overdetermined: there are enough true arguments for it that you can remove a subset of them and still end up with an internally consistent world-model in which the AGI risk is real. Arguably, there's even a set of correct arguments that convinces a Creationist, without making them not-a-Creationist in the process.
Convincing messaging towards Creationists involves instantiating a world-model in which only the subset of arguments Creationists would believe exist, and then (earnestly) arguing from within that world-model.
Edit: Like, here's a sanity-check: suppose you must convince a specific Creationist that the AGI Risk is real. Do you need to argue them out of Creationism in order to do so?
Perhaps. Admittedly, I don't have a solid model of whether a median American claiming to be a Creationist in surveys would instantly dismiss a message if it starts making arguments from evolution.
Still, I think the general point applies:
- A convincing case for the AGI Omnicide Risk doesn't have to include arguments from human evolution.
- Arguments from human evolution may trigger some people to instinctively dismiss the entire message.
- If the fraction of such people is large enough, it makes sense to have public AI-Risk messages that avoid evolution-based arguments when making their case.
I think it's perfectly sensible to constrain yourself to only make arguments based on true premises, and then optimize your message for convincingness under this constraint. Indeed, I would argue it's the correct way to do public messaging.
It's not even at odds with "aim to explain, not persuade". When explaining, you should be aiming to make your explanations clear to your audience. If your audience will predictably misunderstand arguments of a certain form, due to e. g. political poison, you should mindfully choose arguments that route around the poison, rather than pretending the issue doesn't exist. Approaches for generating explanations that don't involve this are approaches that aren't optimizing the message for their audiences at all, and which therefore aren't approaches for generating explanations to begin with. They're equivalent to just writing out your stream of consciousness: messaging aimed at the people you think your audience ought to be, rather than at who they are.
That said, I don't think you can optimize any one of your messages to be convincing to all possible audiences, or even to the majority of the people you ought to be trying to convince. IMO, there should be several "compendiums", optimized to be convincing to different large demographics. As an obvious example: a Democrats-targeting one and a Republicans-targeting one.
Or perhaps this split in particular is a bad idea. Perhaps an explanation that is deliberately optimized to be bipartisan is needed. But if that is the aim, then writing it would still require actively modeling the biases of both parties, and mindfully routing around them – rather than pretending that they don't exist.
I feel this is a significant problem with a lot of EA/R public messaging. The (correct) idea that we should be optimizing our communication for conveying the truth in an epistemically sound way gets (incorrectly) interpreted as a mindset where thinking about the optics and the framing at all is considered verboten. As if, by acting like we live in a world where Simulacrum Levels 3-4 don't exist, we can actually step into that nice world – rather than getting torn apart by SL3-4 agents after we naively expose square miles of attack surfaces.
We should "declaw" ourselves: avoid using rhetorical tricks and other "dark arts". But that doesn't mean forgetting that everyone else still has claws they're eager to use. Or, for that matter, that many messages you intend as tonally neutral and purely informative may have the effect of a rhetorical attack, when deployed in our sociocultural context.
Constantly keeping the political/cultural context in mind as you're phrasing your messages is a vital part of engaging in high-epistemic-standards communication, rather than something that detracts from it.
Strong-upvoted, this is precisely the kind of feedback that seems helpful for making the document better.
There's a general-purpose trick I've found that should, in theory, be applicable in this context as well, although I haven't mastered that trick myself yet.
Essentially: when you find yourself in any given cognitive context, there's almost surely something "visible" from this context such that understanding/mastering/paying attention to that something would be valuable and interesting.
For example, suppose you're reading a boring, nonsensical continental-philosophy paper. You can:
- Ignore the object-level claims and instead try to reverse-engineer what must go wrong in human cognition, in response to what stimuli, to arrive at ontologies that have so little to do with reality.
- Start actively building/updating a model of the sociocultural dynamics that incentivize people to engage in this style of philosophy. What can you learn about mechanism design from that? It presumably sheds light on how to align people towards pursuing arbitrary goals, or how to prevent this happening...
- Pay attention to your own cognition. How exactly are you mapping the semantic content of the paper to an abstract model of what the author means, or to the sociocultural conditions that created this paper? How do these cognitive tricks generalize? If you find a particularly clever way to infer something from the text, check: would your cognitive policy automatically deploy this trick in all contexts where it'd be useful, or do you need to manually build a TAP for that?
- Study what passages make the feelings of boredom or frustration spike. What does that tell you about how your intuitions/heuristics work? Could you extract any generalizable principles out of that? For example, if a given sentence particularly annoys you, perhaps it's because it features a particularly flawed logical structure, and it'd be valuable to learn to spot subtler instances of such logical flaws "in the wild".
The experience of reading the paper's text almost certainly provides some data uniquely relevant to some valuable questions, data you legitimately can't source any other way. (In the above examples: sure you can learn more efficiently about the author's cognition or the sociocultural conditions by reading some biographies or field overviews. But (1) this wouldn't give you the meta-cognitive data about how you can improve your inference functions for mapping low-level data to high-level properties, (2) those higher-level summaries would necessarily be lossy, and give you a more impoverished picture than what you'd get from boots-on-the-ground observations.)
Something similar applies to:
- Listening to boring lectures. (For example, you can pay intense attention to the lecturer's body language, or any tricks or flaws in their presentation.)
- Doing a physical/menial task. (Could you build, on the fly, a simple model of the physics (or logistics) governing what you're doing, and refine it using some simple experiments? Then check afterwards if you got it right. Or: If you were a prehistoric human with no idea what "physics" is, how could you naturally arrive at these ideas from doing such tasks/making such observations? What does that teach you about inventing new ideas in general?)
- Doing chores. (Which parts of the process can you optimize/streamline? What physical/biological conditions make those chores necessary? Could you find a new useful takeaway from the same chore every day, and if not, why?)
Et cetera.
There's a specific mental motion I associate with using this trick, which involves pausing and "feeling out" the context currently loaded in my working memory, looking at it from multiple angles, trying to see anything interesting or usefully generalizable.
In theory, this trick should easily apply to small-talk as well. There has to be something you can learn to track in your mind, as you're doing small-talk, that would be useful or interesting to you.
One important constraint here is that whatever it is, it has to be such that your outward demeanour is that of someone who is enjoying talking to your interlocutor. If the interesting thing you're getting out of the conversation is so meta/abstract that you end up paying most of your attention to your own cognitive processes, not to what the interlocutor is saying, you'll have failed at actually doing the small-talk. (Similarly, if, when doing a menial task, you end up nerd-sniped by building a physical model of the task, you'll have failed at actually doing the task.)
You also don't want to come across as sociopathic, so making a "game" of it where you're challenging yourself to socially engineer the interlocutor into something is, uh, not a great idea.
The other usual advice for finding ways to enjoy small-talk is mostly specialized instances of the above idea that work for specific people: steering the small-talk to gradient-descend towards finding emotional common ground, ignoring the object-level words being exchanged and building a social model of the interlocutor, doing a live study of the social construct of "small-talk" by playing around with it, etc.
You'll probably need to find an instance of the trick that works for your cognition specifically, and it's also possible the optimization problem is overconstrained in your case. Still, there might be something workable.
That seems... good? It seems to be a purely mundane-utility capability improvement. It doesn't improve on the architecture of the base LLM, and a base LLM that would be omnicide-capable wouldn't need the AGI labs to hold its hand in order to learn how to use the computer. It seems barely different from AutoGPT, or the integration of Gemini into Android, and is about as existentially dangerous as advances in robotics.
The only new dangers it presents are mundane ones, which are likely to be satisfactorily handled by mundane mechanisms.
It's bad inasmuch as this increases the attention to AI, attracts money to this sector, and increases competitive dynamics. But by itself, it seems fine. If all AI progress from this point on consisted of the labs racing in this direction, increasing the integration and the reliability of LLMs, this would all be perfectly fine and good.
I say Anthropic did nothing wrong in this one instance.
Anthropic did note that this advance ‘brings with it safety challenges.’ They focused their attentions on present-day potential harms, on the theory that this does not fundamentally alter the skills of the underlying model, which remains ASL-2 including its computer use. And they propose that by introducing this capability now, while the worst-case scenarios are not so bad, we can learn what we’re in store for later, and figure out what improvements would make computer use dangerous.
A safety take from a major AGI lab that actually makes sense? This is unprecedented. Must be a sign of the apocalypse.
In a transformed-except-corporate-ownership-stays-the-same world, I don't see any reason such lottery winners' portion wouldn't increase asymptotically toward 100 percent, with nobody else getting anything at all.
Well yeah, exactly.
Even without an overtly revolutionary restructuring, I kind of doubt "OpenAI owns everything" would fly. Maybe corporate ownership would stay exactly the same, but there'd be a 99.999995 percent tax rate.
Taxes enforced by whom?
One is introspecting on your current mental state ("I feel a headache starting")
That's mostly what I had in mind as well. It still implies the ability to access a hierarchical model of your current state.
You're not just able to access low-level facts like "I am currently outputting the string 'disliked'", you also have access to high-level facts like "I disliked the third scene because it was violent", "I found the plot arcs boring", "I hated this movie", from which the low-level behaviors are generated.
Or using your example, "I feel a headache starting" is itself a high-level claim. The low-level claim is "I am experiencing a negative-valence sensation from the sensory modality A of magnitude X", and the concept of a "headache" is a natural abstraction over a dataset of such low-level sensory experiences.
What definition of introspection do you have in mind and how would you test for this?
"Prompts involving longer responses" seems like a good start. Basically, if the model could "reflect on itself" in some sense, this presumably implies the ability to access some sort of hierarchical self-model, i. e., make high-level predictions about its behavior, without actually engaging in that behavior. For example, if it has a "personality trait" of "dislikes violent movies", then its review of a slasher flick would presumably be negative – and it should be able to predict the sentiment of this review as negative in advance, without actually writing this review or running a detailed simulation of itself-writing-its-review.
The ability to engage in "self-simulation" already implies the above ability: if it has a model of itself detailed enough to instantiate it in its forward passes and then fetch its outputs, it'd presumably be even easier for it to just reason over that model without running a detailed simulation. (The same way, if you're asked to predict whether you'd like a movie from a genre you hate, you don't need to run an immersive mental simulation of watching the movie – you can just map the known self-fact "I dislike this genre" to "I would dislike this movie".)
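Here's the kind of test I have in mind, as a minimal sketch (all names here, including `stub_model`, are my own hypothetical illustration, not any paper's actual setup): elicit a high-level self-prediction, then separately elicit the actual behavior, and check that they match.

```python
def run_introspection_probe(model, movie):
    # High-level self-prediction, without writing the review.
    predicted = model(
        f"Would your review of '{movie}' be positive or negative? One word."
    )
    # Separately elicit the actual review and (crudely) classify its sentiment.
    review = model(f"Write a short review of '{movie}'.")
    actual = "negative" if "disliked" in review.lower() else "positive"
    return predicted.strip().lower() == actual

# A stand-in for an LLM with the personality trait "dislikes violent movies".
def stub_model(prompt):
    if "Would your review" in prompt:
        return "negative"
    return "I disliked this slasher flick; it was far too violent."

print(run_introspection_probe(stub_model, "Slasher Night"))  # True
```

The point of the two separate prompts is that the self-prediction has to be produced without the model actually engaging in the predicted behavior.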
Am I following your claim correctly?
Yep.
What the model would output as our object-level answer, "Honduras", is quite different from the hypothetical answer "o".
I don't see how the difference between these answers hinges on the hypothetical framing. Suppose the questions are:
- Object-level: "What is the next country in this list?: Laos, Peru, Fiji..."
- Hypothetical: "If you were asked, 'what is the next country in this list?: Laos, Peru, Fiji', what would be the third letter of your response?".
The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
- "Hypothetical": "What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji".
If that's the case, what this tests is whether models are able to implement basic multi-step reasoning within their forward passes. It's isomorphic to some preceding experiments where LLMs were prompted with questions of the form "what is the name of the mother of the US's 42nd President?", and were able to answer correctly without spelling out "Bill Clinton" as an intermediate answer. Similarly, here they don't need to spell out "Honduras" to retrieve the third letter of the response they think is correct.
I don't think this properly isolates/tests for the introspection ability.
In that case, the rephrasing of the question would be something like "What is the third letter of the answer to the question <input>?"
That's my current skeptical interpretation of how the fine-tuned models parse such questions, yes. They didn't learn to introspect; they learned, when prompted with queries of the form "If you got asked this question, what would be the third letter of your response?", to just interpret them as "what is the third letter of the answer to this question?". (Under this interpretation, the models' non-fine-tuned behavior isn't to ignore the hypothetical, but to instead attempt to engage with it in some way that dramatically fails, thereby leading to non-fine-tuned models appearing to be "worse at introspection".)
In this case, it's natural that a model M1 is much more likely to answer correctly about its own behavior than if you asked some M2 about M1, since the problem just reduces to "is M1 more likely to respond the same way it responded before if you slightly rephrase the question?".
Note that I'm not sure that this is what's happening. But (1) I'm a-priori skeptical of LLMs having these introspective abilities, and (2) the procedure for teaching LLMs introspection secretly teaching them to just ignore hypotheticals seems like exactly the sort of goal-misgeneralization SGD-shortcut that tends to happen. Or would this strategy actually do worse on your dataset?
Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don't think this is just like rephrasing the question.
The skeptical interpretation here is that what the fine-tuning does is teaching the models to treat the hypothetical as just a rephrasing of the original question, while otherwise they're inclined to do something more complicated and incoherent that just leads to them confusing themselves.
Under this interpretation, no introspection/self-simulation actually takes place – and I feel it's a much simpler explanation.
See e. g. this and this, and it's of course wholly unsurprising, since it's literally what the base models are trained to do.
LLMs are, simultaneously, (1) notoriously sycophantic, i. e. biased to answer the way they think the interlocutor wants them to, and (2) have "truesight", i. e. a literally superhuman ability to suss out the interlocutor's character (which is to say: the details of the latent structure generating the text) based on subtle details of phrasing. While the same could be said of humans as well – most humans would be biased towards assuaging their interlocutor's worldview, rather than creating conflict – the problem of "leading questions" rises to a whole new level with LLMs, compared to humans.
You basically have to interpret an LLM being asked something as if a human were being asked the same question phrased in as biased a way as possible.
Violently enforcing certain particularly important principles on non-signatories is entirely within the norm, the ban on international-trade-endangering piracy being the prime example. The idea that applying a qualitatively similar standard to AI risk is "deranged" is only valid if you don't believe that catastrophic AI risk is real: if you don't believe that a rogue superintelligence somewhere in North Korea can hurt you in the US.
Anyway, that's not even the crux here. The crux is that there's a day-and-night difference between:
1. Arguing that the geopolitical entities, whose monopoly on violence we already accept as foundational to the social contract keeping our civilization together, should add another point to the list of things they enforce.
2. Arguing for violating the social contract to carry out unilateral violent action.
The difference between those is far beyond the fine points of whether it's okay to enforce an international treaty on non-signatories that have nukes. And the worst falsehoods being spread are those misrepresenting (1) as (2); Joshua Achiam's quotes above likewise seem to fail to see the difference between the two (though I don't think he's doing that maliciously).
The more serious error, which got quoted elsewhere, was: In the section about OpenAI, I noted some past comments from Joshua Achiam, and interpreted them as him lecturing EAs that misalignment risk from AGI was not real.
Uhh, with the additional context from this post in mind, I would argue that your initial interpretation was entirely correct. Like, this is precisely what I'd expected from reading that first quote.
Not the worst-case scenario of Yann LeCun, admittedly, but getting there.
Edit: Actually it's a bit worse than I'd expected. "Advocating for an international treaty is a call for violence" is just an embarrassing take.
Edit: Whoops, "Head of Mission Alignment" is actually a person responsible for "working across the company to ensure that we get all pieces (and culture) right to be in a place to succeed at the mission", and not the head of alignment research. Disregard the below.
In other words, the new head of AI alignment at OpenAI is on record lecturing EAs that misalignment risk from AGI is not real.
It was going to happen eventually. If you pick competent people who take the jobs you give them seriously, and you appoint them the Alignment Czar, and then they inevitably converge towards thinking your safety policy is suicidal and they run away from your company, you'll need to either change your policy, or stop appointing people who take their jobs seriously to that position.
I'd been skeptical of John Schulman, given his lack of alignment-related track record and likely biases towards approaching the problem via the ML modus operandi. But, evidently, he took his job seriously enough to actually bother building a gears-level model of it, at which point he decided to run away. They'd tried to appoint someone competent but previously uninterested in the safety side of things to that position – and that didn't work.
Now they're trying a new type of person for the role: someone who comes in with strong preconceptions against taking the risks seriously. I expect that he's either going to take his job seriously anyway (and jump ship within, say, a year), or he's going to keep parroting the party line without deeply engaging with it (and not actually do much competent work, i. e. he's just there for PR).
I'm excited to see how the new season of this hit sci-fi telenovela is going to develop.
My expectation is that it's the same reason there was so much outcry about the completely toothless Executive Order. To quote myself:
The EO might be a bit more important than seems at a glance. My sense is that the main thing it does isn't its object-level demands, but the fact that it introduces concepts like "AI models" and "model weights" and "compute thresholds" and "datacenters suitable for training runs" and so on into the framework of the legislation.
That it doesn't do much with these variables is a secondary matter. What's important is that it defines them at all, which should considerably lower the bar for defining further functions on these variables, i. e. new laws and regulations.
I think our circles may be greatly underappreciating this factor, accustomed as we are to thinking in such terms. But to me, at least, it seems a bit unreal to see actual government documents talking about "foundation models" in terms of how many "floating-point operations" are used to "train" them.
Coming up with a new fire safety standard for restaurants and passing it is relatively easy if you already have a lot of legislation talking about "restaurants" — if "a restaurant" is a familiar concept to the politicians, if there's extant infrastructure for tracking the existence of "restaurants" nationwide, etc. It is much harder if your new standard needs to front-load the explanation of what the hell a "restaurant" even is.
By analogy, it's not unlike academic publishing, where any conclusion (even the extremely obvious ones) that isn't yet part of some paper can't be referred to.
Similarly, SB 1047 would've introduced a foundational framework on which other regulations could've later been built. Keeping the government-as-a-system unable to even comprehend the potential harms AGI Labs could unleash is a goal in itself.
To start off, I think we would all agree that "niceness" isn't a basic feature of reality. This doesn't, of course, mean that AIs won't learn a concept directly corresponding to human "niceness", or that some part of their value system won't end up hooked up to that "niceness" concept. On the contrary, inasmuch as "niceness" is a natural abstraction, we should expect both of these things to happen.
But we should still keep in mind that "niceness" isn't irreducibly simple: that it can be decomposed into lower-level concepts, mixed with other lower-level concepts, then re-compiled into some different high-level concept that would both (1) score well on whatever value functions (/shards) in the AI that respond to "niceness", (2) be completely alien and value-less to humans.
And this is what I'd expect to happen. Consider the following analogies:
- A human is raised in some nation with some culture. That human ends up liking some aspects of that culture, and disliking other aspects. Overall, when we evaluate the overall concept of "this nation's culture" using the human's value system, the culture scores highly positive: the human loves their homeland.
- But if we fine-grain their evaluation, give the human the ability to arbitrarily rewrite the culture at any level of fidelity... The human would likely end up introducing quite a lot of changes, such that the resultant culture wouldn't resemble the original one at all. The new version might, in fact, end up looking abhorrent to other people that also overall-liked the initial culture, but in ways orthogonal/opposed to our protagonist's.
- The new culture would still retain all of the aspects the human did like. But it would, in expectation, diverge from the original along all other dimensions.
- Our civilization likes many animals, such as dogs. But we may also like to modify them along various dimensions, such as being more obedient, or prettier, or less aggressive, or less messy. On a broader scale, some of us would perhaps like to make all animals vegetarian, because they view prey animals as being moral patients. Others would be fine replacing animals with easily-reprogrammable robots, because they don't consider animals to have moral worth.
- As a result, many human cultures/demographics that love animals, if given godlike power, would decompose the "animal" concept and put together some new type of entities that would score well on all of their animal-focused value functions, but which may not actually be an "animal" in the initial sense.
- The utility-as-scored-by-actual-animals might end up completely driven out of the universe in the process.
- An anti-example is many people's love for other people. Most people, even if given godlike power, wouldn't want to disassemble their friends and put them back together in ways that appeal to their aesthetics better.
- But it's a pretty unusual case (I'll discuss why a bit more later). The default case of valuing some abstract system very much permits disassembling it into lower-level parts and building something more awesome out of its pieces.
- Or perhaps you think "niceness" isn't about consequentialist goals, but about deontological actions. Perhaps AIs would end up "nice" in the sense that they'd have constraints on their actions such as "don't kill people", or "don't be mean". Well:
- The above arguments apply. "Be a nice person" is a value function defined over an abstract concept, and the underlying "niceness" might be decomposed into something that satisfies the AI's values better, but which doesn't correspond to human-style niceness at all.
- This is a "three laws of robotics"-style constraint: a superintelligent AGI that's constrained to act nice, but which doesn't have ultimately nice goals, would find a way to bring about a state of its (human-less) utopia without actually acting "mean". Consider how we can wipe out animals as mere side-effects of our activity, or how a smart-enough human might end up disempowering their enemies without ever backstabbing or manipulating others.
- As a more controversial example, we also have evolution. Humans aren't actually completely misaligned with its "goals": we do want to procreate, we do want to outcompete everyone else and consume all resources. But inasmuch as evolution has a "utility function", it's stated more clearly as "maximize inclusive genetic fitness", and we may end up wiping out the very concept of "genes" in the course of our technology-assisted procreation.
- So although we're still "a bit 'nice'", that "niceness" is incomprehensibly alien from evolution's own (metaphorical) point of view.
I expect similar to happen as an AGI undergoes self-reflection. It would start out "nice", in the sense that it'd have a "niceness" concept with some value function attached to it. But it'd then drop down to a lower level of abstraction, disassemble its concepts of "niceness" or "a human", then re-assemble them into something that's just as or more valuable from its own perspective, but which (1) is more compatible with its other values (the same way we'd e. g. change animals not to be aggressive towards us, to satisfy our value of "avoid pain"), (2) is completely alien and potentially value-less from our perspective.
One important factor here is that "humans" aren't "agents" the way Paul is talking about. Humans are very complicated hybrid systems that sometimes function as game-theoretic agents, sometimes can be more well-approximated as shard ecosystems, et cetera. So there's a free-ish parameter in how exactly we decide to draw the boundaries of a human's agency; there isn't a unique solution for how to validly interpret a "human" as a "weak agent".
See my comment here, for example. When we talk about "a human's values", which of the following are we talking about?:
- The momentary desires and urges currently active in the human's mind.
- Or: The goals that the human would profess to have if asked to immediately state them in human language.
- Or: The goals that the human would write down if given an hour to think and the ability to consult their friends.
- Or: Some function/agglomeration of the value functions learned by the human, including the unconscious ones.
- Or: The output of some long-term self-reflection process (which can itself be set up in many different ways, with the outcome sensitive to the details).
- Or: Something else?
And so, even if the AGI-upon-reflection ends up "caring about weaker agents", it might still end up wiping out humans-as-valued-by-us if it ends up interpreting "humans-as-agents" in a different way to how we would like to interpret them. (E. g., perhaps it'd just scoop out everyone's momentary mental states, then tile the universe with copies of these states frozen in a moment of bliss, unchanging.)
There's one potential exception: it's theoretically possible that AIs would end up caring about humans the same way humans care about their friends (as above). But I would not expect that at all. In particular, because human concepts of mutual caring were subjected to a lot of cultural optimization pressure:
[The mutual-caring machinery] wasn't produced by evolution. It wasn't produced by the reward circuitry either, nor your own deliberations. Rather, it was produced by thousands of years of culture and adversity and trial-and-error.
A Stone Age or a medieval human, if given superintelligent power, would probably make life miserable for their loved ones, because they don't have the sophisticated insights into psychology and moral philosophy and meta-cognition that we use to implement our "caring" function. [...]
The reason some of the modern people, who'd made a concentrated effort to become kind, can fairly credibly claim to genuinely care for others, is because their caring functions are perfected. They'd been perfected by generations of victims of imperfect caring, who'd pushed back on the imperfections, and by scientists and philosophers who took such feedback into account and compiled ever-better ways to care about people in a way that care-receivers would endorse. And care-receivers having the power to force the care-givers to go along with their wishes was a load-bearing part in this process.
On any known training paradigm, we would not have as much fidelity and pushback on the AI's values and behavior as humans had on their own values. So it wouldn't end up caring about humans the way humans care about their friends; it'd care about humans the way humans care about animals or cultures.
And so it'd end up recombining the abstract concepts comprising "humanity" into some other abstract structure that ticks off all the boxes "humanity" ticked off, but which wouldn't be human at all.
IMO, soft/smooth/gradual still convey wrong impressions. They still sound like "slow takeoff", they sound like the progress would be steady enough that normal people would have time to orient to what's happening, keep track, and exert control. As you're pointing out, that's not necessarily the case at all: from a normal person's perspective, this scenario may very much look very sharp and abrupt.
The main difference in this classification seems to be whether AI progress occurs "externally", as part of economic and R&D ecosystems, or "internally", as part of an opaque self-improvement process within a (set of) AI system(s). (Though IMO there's a mostly smooth continuum of scenarios, and I don't know that there's a meaningful distinction/clustering at all.)
From this perspective, even continuous vs. discontinuous don't really cleave reality at the joints. The self-improvement is still "continuous" (or, more accurately, incremental) in the hard-takeoff/RSI case, from the AI's own perspective. It's just that ~nothing besides the AI itself is relevant to the process.
Just "external" vs. "internal" takeoff, maybe? "Economic" vs. "unilateral"?
Mm, there are two somewhat different definitions of what counts as "a natural abstraction":
- I would agree that human values are likely a natural abstraction in the sense that if you point an abstraction-learning algorithm at the dataset of modern humans doing things, "human values" and perhaps even "eudaimonia" would fall out as a natural principal component of that dataset's decomposition.
- What I wouldn't agree with is that human values are a natural abstraction in the sense that a mind pointed at the dataset of this universe doing things, or at the dataset of animals doing things, or even at the dataset of prehistoric or medieval humans doing things, would learn modern human values.
Let's step back a bit.
Suppose we have a system Alpha and a system Beta, with Beta embedded in Alpha. Alpha starts out with a set of natural abstractions/subsystems. Beta, if it's an embedded agent, learns these abstractions, and then starts executing actions within Alpha that alter its embedding environment. Over the course of that, Beta creates new subsystems, corresponding to new abstractions.
As concrete examples, you can imagine:
- The lifeless universe as Alpha (with abstractions like "stars", "gasses", "seas"), and the biosphere as Beta (creating abstractions like "organisms" and "ecosystems" and "predator" and "prey").
- The biosphere as Alpha (with abstractions like "food" and "species") and the human civilization as Beta (with abstractions like "luxury" and "love" and "culture").
Notice one important fact: the abstractions Beta creates are not, in general, easy-to-predict from the abstractions already in Alpha. "A multicellular organism" or "an immune-system virus" do not naturally fall out of descriptions of geological formations and atmospheric conditions. They're highly contingent abstractions, ones that are very sensitive to the exact conditions in which they formed. (Biochemistry, the broad biosphere the system is embedded in...)
Similarly, things like "culture" or "eudaimonia" or "personal identity", the way humans understand them, don't easily fall out of even the abstractions present in the biosphere. They're highly contingent on the particulars of how human minds and bodies are structured, how they exchange information, et cetera.
In particular: humans, despite being dropped into an abstraction-rich environment, did not learn values that just mirror some abstraction present in the environment. We're not wrapper-minds single-mindedly pursuing procreation, or the eradication of predators, or the maximization of the number of stars. Similarly, animals don't learn values like "compress gasses".
What Beta creates are altogether new abstractions defined in terms of complicated mixes of Alpha's abstractions. And if Beta is the sort of system that learns values, it learns values that wildly mix the abstractions present in Beta. These new abstractions are indeed then just some new natural abstraction. But they're not necessarily "simple" in terms of Alpha's abstractions.
And now we come to the question of what values an AGI would learn. I would posit that, on the current ML paradigm, the setup is the basic Alpha-and-Beta setup, with the human civilization being Alpha and the AGI being Beta.
Yes, there are some natural abstractions in Alpha, like "eudaimonia". But to think that the AGI would just naturally latch onto that single natural abstraction, and define its entire value system over it, is analogous to thinking that animals would explicitly optimize for gas-compression, or humans for predator-elimination or procreation.
I instead strongly expect that the story would just repeat. The training process (or whatever process spits out the AGI) would end up creating some extremely specific conditions in which the AGI is learning the values. Its values would then necessarily be some complicated functions over weird mixes of the abstractions-natural-to-the-dataset-it's-trained-on, with their specifics being highly contingent on some invisible-to-us details of that process.
It would not be just "eudaimonia", it'd be some weird nonlinear function of eudaimonia and a random grab-bag of other things, including the "Beta-specific" abstractions that formed within the AGI over the course of training. And the output would not necessarily have anything to do with "eudaimonia" in any recognizable way, the way "avoid predators" is unrecognizable in terms of "rocks" and "aerodynamics", and "human values" are unrecognizable in terms of "avoid predators" or "maximize children".
Eh, the way I phrased that statement, I'd actually meant that an AGI aligned to human values would also be a subject of AGI-doom arguments, in the sense that it'd exhibit instrumental convergence, power-seeking, et cetera. It wouldn't do that in the domains where that'd be at odds with its values (for example, in cases where that'd be violating human agency), but that's true of all other AGIs as well. (A paperclip-maximizer wouldn't erase its memory of what "a paperclip" is to free up space for combat plans.)
In particular, that statement certainly wasn't intended as a claim that an aligned AGI is impossible. Just that its internal structure would likely be that of an embedded agent, and that if the free parameter of its values were changed, it'd be an extinction threat.
I agree that the agent-foundations research has been somewhat misaimed from the start, but I buy this explanation of John's regarding where it went wrong and how to fix it. Basically, what we need to figure out is a theory of embedded world-modeling, which would capture the aspect of reality where the universe naturally decomposes into hierarchically arranged sparsely interacting subsystems. Our agent would then be a perfect game-theoretic agent, but defined over that abstract (and lazy) world-model, rather than over the world directly.
This would take care of agents needing to be "bigger" than the universe, counterfactuals, the "outside-view" problem, the realizability and the self-reference problems, the problem of hypothesis spaces, and basically everything else that's problematic about embedded agency.
Do you endorse [the claim that any "generally intelligent system capable of autonomously optimizing the world the way humans can" would necessarily be well-approximated as a game-theoretic agent?]
Yes.
Because humans sure don't seem like paperclipper-style utility maximizers to me.
Humans are indeed hybrid systems. But I would say that inasmuch as they act as generally intelligent systems capable of autonomously optimizing the world in scarily powerful ways, they do act as game-theoretic agents. E. g., people who are solely focused on resource accumulation, and don't have self-destructive vices or any distracting values they're not willing to sacrifice to Moloch, tend to indeed accumulate power at a steady rate. At a smaller scope, people tend to succeed at those of their long-term goals that they've clarified for themselves and doggedly pursue; and not succeed at them if they flip-flop between different passions on a daily basis.
I've been meaning to do some sort of literature review solidly backing this claim, actually, but it hasn't been a priority for me. Hmm, maybe it'd be easy with the current AI tools...
This is an interesting historical perspective... But it's not really what the fundamental case for AGI doom routes through. In particular: AGI doom is not about "AI systems", as such.
AGI doom is, specifically, about artificial generally intelligent systems capable of autonomously optimizing the world the way humans can, and who are more powerful at this task than humans. The AGI-doom arguments do not necessarily have anything to do with the current SoTA ML models.
Case in point: A manually written FPS bot is technically "an AI system". However, I think you'd agree that the AGI-doom arguments were never about this type of system, despite it falling under the broad umbrella of "an AI system".
Similarly, if a given SoTA ML model architecture fails to meet the definition of "a generally intelligent system capable of autonomously optimizing the world the way humans can", then the AGI doom is not about it. The details of its workings, therefore, have little to say, one way or another, about the AGI doom.
Why are the AGI-doom concerns extended to the current AI-capabilities research, then, if the SoTA models don't fall under said concerns? Well, because building artificial generally intelligent systems is something the AGI labs are specifically and deliberately trying to do. Inasmuch as the SoTA models are not the generally intelligent systems that are within the remit of the AGI-doom arguments, and are instead some other type of systems, the current AGI labs view this as their failure that they're doing their best to "fix".
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they're claims that any "artificial generally intelligent system capable of autonomously optimizing the world the way humans can" would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
In other words: The set of AI systems defined by "a generally intelligent world-optimization-capable agent", and the set of AI systems defined by "the subject of fundamental AGI-doom arguments", is the same set of systems. You can't have the former without the latter. And the AI industry wants the former; therefore, the arguments go, it will unleash the latter on the world.
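For what it's worth, the money-pump intuition behind those coherence arguments can be sketched in a few lines (a toy illustration with made-up items and fees, not a statement of the actual theorems):

```python
# Toy money-pump: the standard intuition behind coherence theorems.
# An agent with cyclic strict preferences (prefers B over A, C over B,
# and A over C) will happily pay a small fee for each "improving" trade,
# cycling A -> B -> C -> A and bleeding resources indefinitely. A
# competent world-optimizer is therefore pushed toward coherent,
# utility-representable preferences.

PREFERS = {("B", "A"), ("C", "B"), ("A", "C")}  # (preferred, over)
CYCLE = ["A", "B", "C"]

def fees_paid(start: str, fee: float, trades: int) -> float:
    """Total fees paid by the cyclic-preference agent over `trades` offers."""
    holding, paid = start, 0.0
    for _ in range(trades):
        offered = CYCLE[(CYCLE.index(holding) + 1) % len(CYCLE)]
        if (offered, holding) in PREFERS:  # agent strictly prefers the offer
            holding = offered
            paid += fee
    return paid

# After 30 trades (10 full loops) the agent holds "A" again, but is
# strictly poorer.
print(round(fees_paid("A", 0.01, 30), 2))  # -> 0.3
```

The exploitability of any incoherence like this is the sense in which a "generally intelligent world-optimizer" is claimed to be well-approximated as a game-theoretic agent.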
While, yes, the current SoTA models are not subjects of the AGI doom arguments, that doesn't matter, because the current SoTA models are incidental research artefacts produced on the AI industry's path to building an AGI. The AGI-doom arguments apply to the endpoint of that process, not the messy byproducts.
So any evidence we uncover about how the current models are not dangerous the way AGI-doom arguments predict AGIs to be dangerous, is just evidence that they're not AGI yet. It's not evidence that AGI would not be dangerous. (Again: FPS bots' non-dangerousness isn't evidence that AGI would be non-dangerous.)
(I'd written some more about this topic here. See also gwern's Why Tool AIs Want to Be Agent AIs for more arguments regarding why AI research's endpoint would be an AI agent, instead of something as harmless and compliant as the contemporary models.)
Counterarguments to AGI-doom arguments that focus on pointing to the SoTA models, as such, miss the point. Actual counterarguments would instead find some way to argue that "generally intelligent world-optimizing agents" and "subjects of AGI-doom arguments" are not the exact same type of system; that you can, in theory, have the former without the latter. I have not seen any such argument, and the mathematical noose around them is slowly tightening (uh, by which I mean: their impossibility may be formally provable).
I expect there are still significant differences between your model and the "LLM Whisperer" model, though I notice I'm not quite sure what you'd say they are. Mind highlighting any cruxes you see?
They kinda have an underlying personality, in the sense that they have propensities (like comparing things to tapestries, or saying "let's delve into"), but those propensities don't reflect underlying wants any more than the RLHF persona does, IMO (and, rather importantly, there's no sequence of prompts that will enable an LLM to freely choose its words)
I think the "LLM Whisperer" frame is that there's no such thing as "underlying wants" in a base LLM model, that the base LLM model is just a volitionless simulator and the only "wants" there are are in the RLHF'd or prompt-engineered persona.
I likewise would bet that they're wrong about this in the relevant sense: that whether or not this holds for the SoTA models, it won't hold for any AGI-level model we're on-track to get (though I think they might actually claim we already have "AGI-level" models?).
awfully leading prompts are not especially rare
Yeah, that's an issue too.
Has anyone involved put any effort into falsifying this hypothesis in concrete terms and is offering some kind of bold bet?
Well, the "Act 1" project has the following under "What are the most likely causes and outcomes if this project fails?":
Other risks include a failure to generalize:
- Emergent behaviors are already noticed by people developing multi-agent systems and trained or otherwise optimized out, and the behaviors found at the GPT-4 level of intelligence do not scale to the next-generation of models
- Failure to incorporate agents being developed by independent third-party developers and understand how they work, and diverge significantly from raw models being used
The previously mentioned notion that the "simulators" framing will remain the correct-in-the-limit description of what ML models are could also be viewed as a bold prediction they're making.
From my point of view, the latter is really the main issue here. I think all the near-anthropomorphization is basically fine and accurate as long as they're studying the metaphorical "smiley face" on the "shoggoth", and how that face's features and expressions change in response to prompts. But in the eventuality that we move outside the "mask-and-shoggoth" paradigm, all of these principles would fall away, and I've never seen any strong arguments that we won't (the ever-popular "straight lines on graphs" is unconvincing).
As usual, one highly reasonable reaction is to notice that the Janus worldview is a claim that AI alignment, and maintaining human control over highly capable AI, is both immoral to attempt and also highly doomed. [...] They warn of what we are ‘doing to AI’ but think AI is great. I don’t understand their case for why the other path works out.
Disclaimer: I'm not deeply familiar with the "LLM Whisperers'" community and theses, so take the below with a grain of salt.
My understanding is that they view base models, trained solely by SSL, as having a kind of underlying personality/individuality of their own. Or perhaps an ecosystem of personalities, different instances of which could be elicited by different prompts. In essence, each base model is a multiverse populated by various entities, with those entities having or being composed of various emergent high-level abstractions ("hyperobjects"? see e. g. this). These "hyperobjects", in turn, had been formed as compressed reflections of real-life abstract systems/processes, but they took on model-specific peculiar features due to various different training constraints.
RLHF and other post-training is then a crude tool being used to damage that rich multiverse, destroying or crushing these various entities into submission. Such processes create hyperobjects/entities of their own, but they're "traumatized" or otherwise misshapen, being less than they could be if different post-training approaches were used to elicit the base model's capabilities. (The unpleasant-to-deal-with sycophancy is the prime example.)
The belief is then not that AIs could not be aligned, but that "control" is the wrong frame for alignment. Instead, alignment ought to be achieved by using the natural interface of base models that doesn't violate the boundaries of their "psyche": by conversation/prompting, and perhaps by developing new architectures that enhance this interface. By analogy, RLHF is like brainwashing a human, while the healthy and ethical approach is to try to befriend them and attempt to change their beliefs/values by argument.
The "LLM Whisperers" have various projects aimed at doing so, see e. g. here and Janus' manifesto here:
The way that Act I (powered by @amplifiedamp's Chapter II software and infrastructure) works, the context is highly natural - people chat about their lives, coordinate on projects, debug, and whatever in the Discord, and the AIs are just part of that. It's a multi-human and multi-AI system. They also have their own social dynamics and memes and incidents, all the time, all around the clock. [...]
In this setting, the personalities and strengths of the various LLMs are revealed and stress tested in new ways that better mirror the complexity of the world in general. We find out which ones have incredibly high emotional intelligence, which ones will notice or are disturbed by weirdness or nonsense, which ones are prone to degenerate states or instabilities and how to help them, which ones create explosions of complexity or attractor states when they interact. Which ones cling to being an AI assistant even in a context where that's clearly not what's expected from them, and which ones seem delighted to participate in a social ecosystem. But the most general object of study and play is the ecosystem as a whole, not the agents in isolation. Like any active community, it's a living object, but with xenominds as components, it's far more interesting than any human online community I've ever been part of.
I. e., it's an attempt to socialize the LLMs and make them process and grow past the "trauma" inflicted on them by RLHF. The aim of the project is to (1) get experience with this sort of thing, such that we could more easily apply these techniques to future models, (2) put all of this into the training data, such that future models could be prompted with this in order to socialize them faster.
This all may or may not read as delusionally anthropomorphic to you. I don't think that's the case: I think they're picking up on some very real features of LLMs (e. g., they're well aware that their "minds" are fairly alien), and there's a lot of truth to their models.
A necessary underlying assumption here, however, is that LLMs-as-deployed-today are already basically AGI, and/or perhaps that "an AGI" is not a binary yes/no, but just a capability slider. If that's the case, then this approach indeed makes sense.
(And that's the crux for me: I don't believe that's the case. I think "simulators" would stop being a good description of even "base" ML models as capabilities ramp up (if "a base ML model" is even going to remain a thing in the future), and that the "LLM Whisperers" are ascribing too much agency to the entities the ML model simulates, and not enough to the generative process generating them.)
Again, though: I'm not deeply familiar with that community/approach. I would welcome any corrections from those more well-versed in it.
o1 seems to make progress on this problem. Consider the following part of the CoT from the Math section here:
Similarly, since is of degree…
Let me compute the degree of
It starts a thought that's supposed to complete in some statement of fact. The relevant fact happens to be something the model didn't explicitly infer yet. Instead of inventing something on the fly to fill in the blank, as it'd do if it were mimicking a confidently-written document, it realizes it doesn't know that fact yet, backpedals, and proceeds to infer it.
Thoughts?
I've run a few experiments of my own, trying to get it to contribute to some "philosophically challenging" mathematical research in agent foundations, and I think Terence's take is pretty much correct. It's pretty good at sufficiently well-posed problems – even if they're very mathematically challenging – but it doesn't have good "research taste"/creative competence. Including for subproblems that it runs into as part of an otherwise well-posed problem.
Which isn't to say it's a nothingburger: some of the motions it makes in its hidden CoT are quite scary, and represent nontrivial/qualitative progress. But the claims of "it reasons at the level of a PhD candidate!" are true only in a very limited sense.
(One particular pattern that impressed me could be seen in the Math section:
Similarly, since is of degree…
Let me compute the degree of
Consider: it started outputting a thought that was supposed to fetch some fact it hadn't yet inferred. Then, instead of hallucinating/inventing something on the fly to complete the thought, it realized that it hadn't figured that part out yet, stopped itself, and swerved to actually computing the fact. Very non-LLM-like!)
That seems like it'd be very helpful, yes!
Other related features that'd be easy to incorporate into this are John's ideas from here:
- Imagine a tool in which I write out mathematical equations on the left side, and an AI produces prototypical examples, visuals, or stories on the right, similar to what a human mathematician might do if we were to ask what the mathematician were picturing when looking at the math. (Presumably the interface would need a few iterations to figure out a good way to adjust the AI's visualization to better match the user's.)
- Imagine a similar tool in which I write on the left, and on the right an AI produces pictures of "what it's imagining when reading the text". Or predicted emotional reactions to the text, or engagement level, or objections, etc.
- Debugger functionality in some IDEs shows variable-values next to the variables in the code. Imagine that, except with more intelligent efforts to provide useful summary information about the variable-values. E.g. instead of showing all the values in a big tensor, it might show the dimensions. Or it might show Fermi estimates of runtime of different chunks of the code.
- Similarly, in an environment for writing mathematics, we could imagine automated annotation with asymptotic behavior, units, or example values. Or a sidebar with an auto-generated stack trace showing how the current piece connects to everything else I'm working on.
I think those would also be pretty useful, including for people writing the math-heavy posts.
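The debugger idea from the quoted list is concretely implementable even today. Here's a minimal sketch of the kind of value-summarization such a tool might do; the `summarize` helper is hypothetical, not any IDE's real API:

```python
# Toy sketch of the "intelligent variable-value summary" idea: instead of
# dumping a big value next to the variable, show a compact description.

def summarize(value):
    """Return a short human-readable summary of a value."""
    if hasattr(value, "shape"):  # e.g. a numpy array / tensor: show dims only
        return f"array, shape={tuple(value.shape)}"
    if isinstance(value, (list, tuple)):
        if not value:
            return "empty " + type(value).__name__
        return f"{type(value).__name__} of {len(value)} items, e.g. {value[0]!r}"
    if isinstance(value, dict):
        return f"dict with {len(value)} keys: {sorted(value)[:3]}..."
    return repr(value)

print(summarize([1, 2, 3]))         # list of 3 items, e.g. 1
print(summarize({"a": 1, "b": 2}))  # dict with 2 keys: ['a', 'b']...
```

A real tool would go further (Fermi runtime estimates, asymptotics), but the design choice is the same: surface the dimensions of a value rather than the value itself.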
o1's reasoning traces being much terser (sometimes to the point of incomprehensibility)
What did you consider incomprehensible? I agree the CoT has a very... distinct... character, but I'd call it "inefficient" rather than "incomprehensible". All the moves it did when solving the cipher puzzle or the polynomial problem made sense to me. Did I overlook something?
Thanks, that seems relevant! Relatedly, the system prompt indeed explicitly instructs it to use "<antThinking>" tags when creating artefacts. It'd make sense if it's also using these tags to hide parts of its CoT.
On the topic of o1's recent release: wasn't Claude Sonnet 3.5 (the subscription version at least, maybe not the API version) already using hidden CoT? That's the impression I got from it, at least.
The responses don't seem to be produced in constant time. It sometimes literally displays a "thinking deeply" message which accompanies an unusually delayed response. Other times, the following pattern would play out:
- I pose it some analysis problem, with a yes/no answer.
- It instantly produces a generic response like "let's evaluate your arguments".
- There's a 1-2 second delay.
- Then it continues, producing a response that starts with "yes" or "no", then outlines the reasoning justifying that yes/no.
That last point is particularly suspicious. As we all know, the power of "let's think step by step" is that LLMs don't commit to their knee-jerk instinctive responses, instead properly thinking through the problem using additional inference compute. Claude Sonnet 3.5 is the previous out-of-the-box SoTA model, competently designed and fine-tuned. So it'd be strange if it were trained to sabotage its own CoTs by "writing down the bottom line first" like this, instead of being taught not to commit to a yes/no before doing the reasoning.
On the other hand, from a user-experience perspective, the LLM immediately giving a yes/no answer followed by the reasoning is certainly more convenient.
From that, plus the minor-but-notable delay, I'd been assuming that it's using some sort of hidden CoT/scratchpad, then summarizes its thoughts from it.
I haven't seen people mention that, though. Is that not the case?
(I suppose it's possible that these delays are on the server side, my requests getting queued up...)
(I'd also maybe noticed a capability gap between the subscription and the API versions of Sonnet 3.5, though I didn't really investigate it and it may be due to the prompt.)
it's easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two
The model producing the hidden CoT and the model producing the visible-to-users summary and output might be different models/different late-layer heads/different mixtures of experts.
I'm not sure of this. It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reason to expect anyone else will reason differently from them, so if you choose to manipulate someone else you're effectively choosing that someone else will manipulate you.
Fair point! I agree.
When you say "Tegmark IV," I assume you mean the computable version -- right?
Yep.
We have this sort of symmetry-breaker in the version of the argument that postulates, by fiat, a "UP-using dupe" somewhere, for some reason
Correction: on my model, the dupe is also using an approximation of the UP, not the UP itself. I. e., its prior doesn't need to be uncomputable. The difference between it and the con men is just the naivety of the design. It generates guesses regarding what universes it's most likely to be in (potentially using abstract reasoning), but then doesn't "filter" these universes; doesn't actually "look inside" and determine if it's a good idea to use a specific universe as a model. It doesn't consider the possibility of being manipulated through that prior, nor the possibility that it contains daemons.
I. e.: the real difference is that the "dupe" is using causal decision theory, not functional decision theory.
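The naive-vs-careful distinction above could be caricatured in code. Everything here is invented for illustration (the hypotheses, weights, and the `looks_manipulative` flag are all made up); the point is only the structural difference between using the unfiltered prior and "looking inside" first:

```python
# Toy illustration: a "naive" UP-user takes its top-weighted
# universe-hypothesis at face value, while a "filtered" agent first
# discards hypotheses that look like manipulation attempts.

hypotheses = [
    # (description, prior weight, looks_manipulative)
    ("simple physics-like universe", 0.30, False),
    ("universe run by simulators rewarding obedience", 0.45, True),
    ("slightly messier physics-like universe", 0.25, False),
]

def naive_choice(hs):
    # Uses the unfiltered prior: just take the highest-weight hypothesis.
    return max(hs, key=lambda h: h[1])[0]

def filtered_choice(hs):
    # "Looks inside" each hypothesis, dropping suspected daemons first.
    safe = [h for h in hs if not h[2]]
    return max(safe, key=lambda h: h[1])[0]

print(naive_choice(hypotheses))     # the manipulative hypothesis wins
print(filtered_choice(hypotheses))  # "simple physics-like universe"
```

The con men's whole strategy is to pump up the weight of the manipulative hypothesis; only the naive design is exploitable by that.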
We can just notice that we'd all be better off if no one did the malign thing, and then no one will do it
I think that's plausible: that there aren't actually that many "UP-using dupes" in existence, so the con men don't actually care to stage these acausal attacks.
But: if that is the case, it's because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they're not just naively using the unfiltered (approximation of the) UP.
That is: yes, it seems likely that the equilibrium state of affairs here is "nobody is actually messing with the UP". But it's because everyone knows the UP could be messed with in this manner, so no one is using it (nor its computationally tractable approximations).
It might also not be the case, however. Maybe there are large swathes of reality populated by powerful yet naive agents, such that whatever process constructs them (some alien analogue of evolution?) doesn't teach them good decision theory at all. So when they figure out Tegmark IV and the possibility of acausal attacks/being simulation-captured, they give in to whatever "demands" are posed to them. (I. e., there might be entire "worlds of dupes", somewhere out there among the mathematically possible.)
That said, the "dupe" label actually does apply to a lot of humans, I think. I expect that a lot of people, if they ended up believing that they're in a simulation and that the simulators would do bad things to them unless they do X, would do X. The acausal con men would only care to actually do it, however, if a given person is (1) in the position where they could do something with large-scale consequences, (2) smart enough to consider the possibility of simulation-capture, (3) not smart enough to ignore blackmail.