Dream, Truth, & Good
post by abramdemski · 2025-02-24T16:59:05.045Z · LW · GW · 2 comments
One way in which I think current AI models are sloppy [LW · GW] is that LLMs are trained in a way that messily merges the following "layers":
- The "dream machine" layer: LLMs are pre-trained on lots of slop from the internet, which creates an excellent "prior".
- The "truth machine": LLMs are trained to "reduce hallucinations" in a variety of ways, including RLHF and the more recent reasoning RL.
- The "good machine": The same RLHF and reasoning RL training also aims to train good outputs (eg helpful, honest, harmless).
I've quoted Andrej Karpathy before [LW · GW], but I'll do it again:
I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
[...]
I know I'm being super pedantic but the LLM has no "hallucination problem". Hallucination is not a bug, it is LLM's greatest feature. The LLM Assistant has a hallucination problem, and we should fix it.
- Andrej Karpathy
Failing to properly distinguish the "dream machine" capabilities from the truth-oriented and good-oriented capabilities hobbles today's LLMs. If you ask Claude to write fiction, it has a strong tendency to mix the "Claude voice" into the fiction being generated. More generally, the base model (IE, only the generative pre-training) is great at extrapolating text; the subsequent training hobbles this capability, because no care is taken to preserve it.
Habryka mentions this with respect to experiments with LLM-augmented text editing [LW · GW]:
Using base models has at least so far been essential for getting any useful writing work out of LLMs, with the instruction-tuned models reliably producing obtuse corpo-speak when asked to engage in writing tasks.
I expect that mixing truth-orientation with good-orientation has similar problematic consequences.
A Modest Proposal
Dream Machine Layer
My basic idea here is not new: instead of pre-training on lots and lots of text from the internet in an unstructured way, I think it would be better to take a more structured approach which accounts for metadata via inferred author vectors, date vectors, etc. I don't have a specific proposed architecture, but the result should still be able to do raw (unlabeled) text prediction well, by inferring the author vectors and other latents as it goes. I have in mind a semi-supervised approach; some real metadata labels are provided when authorship is known, but labels are inferred where they are absent.[1]
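To make this a bit more concrete, here is a minimal sketch of what author-vector conditioning with semi-supervised inference could look like. This is purely illustrative, not a specific architecture proposal: all of the names, sizes, and the gradient-descent inference of missing labels are assumptions I'm making for the sake of a runnable toy.

```python
import torch
import torch.nn as nn

class AuthorConditionedLM(nn.Module):
    """Toy language model conditioned on a latent "author vector"."""
    def __init__(self, vocab_size=1000, d_model=128, n_authors=500):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.author_emb = nn.Embedding(n_authors, d_model)  # table for known authors
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, author_vec):
        # Prepend the author vector as an extra position the model can attend to.
        x = self.tok_emb(tokens)                            # (B, T, D)
        x = torch.cat([author_vec.unsqueeze(1), x], dim=1)  # (B, T+1, D)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.encoder(x, mask=mask)
        return self.head(h[:, 1:])                          # next-token logits per text position

def infer_author_vector(model, tokens, steps=50, lr=0.1):
    """Unlabeled case: treat the author vector as a latent and fit it to the text."""
    vec = torch.zeros(tokens.size(0), model.tok_emb.embedding_dim, requires_grad=True)
    opt = torch.optim.Adam([vec], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        logits = model(tokens[:, :-1], vec)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return vec.detach()

# Known attribution: vec = model.author_emb(author_ids); unknown: infer_author_vector(...)
```

The only design point that matters here is the interface: the generative model always conditions on a metadata vector, and that vector is either looked up from real labels or inferred when labels are absent.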
This gives us much better "handles" on what the generative hallucination is doing; instead of trying to cleverly prompt it, we can set the latent metadata vectors to whatever we want. We can, for example, interpolate the vectors of several authors to see what that looks like. We can also mix-and-match vectors in a more semantic way, by looking for meaningful dimensions in the vectors (EG doing PCA across authors and trying to interpret the resulting dimensions).
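For instance (again just a sketch; the author-embedding table here is a random stand-in for whatever the trained system actually learns), interpolation and PCA over author vectors might look like:

```python
import numpy as np

# Hypothetical stand-in for the learned author-embedding table: one row per author.
rng = np.random.default_rng(0)
author_vecs = rng.normal(size=(500, 128))        # (n_authors, d_model)

# Interpolate two authors: a vector "halfway between" author 3 and author 7,
# which could then be fed to the generative model as its conditioning vector.
blended = 0.5 * author_vecs[3] + 0.5 * author_vecs[7]

# PCA across authors: centre the table, take the top singular directions,
# then try to interpret those directions (formality, era, topic, ...).
centred = author_vecs - author_vecs.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
principal_axes = vt[:5]                          # candidate "meaningful" dimensions

# Move an author along the first axis and see how generations change.
shifted = author_vecs[3] + 2.0 * principal_axes[0]
```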
This is a scientifically interesting project as well; in line with microscope AI [LW · GW], we get to learn things about the world. The inferred author vectors and date vectors give us interesting information, and so does the structure of the vector space itself. This is similar to the recently-announced project to deliberately attempt to model the entire world via AI. We can query location and date vectors which are realistic, but for which no text actually exists, to see what the AI model has inferred about that part of the world -- what could have been written at that time and location, if someone had written it down. (This is a weak ancestor simulation; we can try to construct author vectors for historical figures who didn't write anything down.)
Multimodal capabilities could of course dramatically expand this, producing artificial photos or video etc from different times and locations.
Truth Machine Layer
To build a truth-machine layer on top of this, we fine-tune the system in a truth-oriented way. Conceptually, we are looking for an author vector that knows as much as possible; if there turns out to be a "knowledgeable" dimension in the author-vector space, we'd be turning that up to its maximum (or, if there are multiple dimensions for knowledge in various fields, we're maximizing all of them). More realistically, we might need to fine-tune the whole network to support the existence of a maximally knowledgeable author-vector.
This should be done in such a way as to only increase the capabilities of the network; IE, it should still be good at "dreaming" via other author-vectors, even as it gets better at telling the truth via the truth-oriented author-vector. After all, the truth-oriented author-vector is a real author-vector in the real world: it's the author corresponding to this AI we're trying to train (or more specifically, its truth-oriented layer). So, in some sense, this stage of training is just providing evidence about one more real-world author.
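A rough sketch of the combined objective for the last two paragraphs (assuming the toy author-conditioned model from earlier; the data sources, weighting, and names are all placeholders): a dedicated learnable truth vector is fit to verified question-answer text, while ordinary pre-training-style text continues to be trained under its own (known or inferred) author vectors, so the dreaming capability is preserved rather than overwritten.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def next_token_loss(model, tokens, author_vecs):
    """Standard next-token prediction loss for the author-conditioned toy model."""
    logits = model(tokens[:, :-1], author_vecs)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def truth_layer_step(model, truth_vec, qa_tokens, replay_tokens, replay_vecs,
                     optimizer, replay_weight=1.0):
    """One step of truth-oriented fine-tuning that also preserves "dreaming".

    qa_tokens     : verified question-answer text, trained under the truth vector
    replay_tokens : ordinary pre-training-style text
    replay_vecs   : the (known or inferred) author vectors for the replay text
    """
    vec = truth_vec.unsqueeze(0).expand(qa_tokens.size(0), -1)
    truth_loss = next_token_loss(model, qa_tokens, vec)
    replay_loss = next_token_loss(model, replay_tokens, replay_vecs)
    loss = truth_loss + replay_weight * replay_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return truth_loss.item(), replay_loss.item()

# Hypothetical setup:
# truth_vec = nn.Parameter(torch.zeros(128))
# opt = torch.optim.Adam([truth_vec] + list(model.parameters()), lr=1e-4)
```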
This special truth-oriented author-vector should also be capable of directly reproducing the capabilities of the whole network; IE, one of many question-answer tasks it is trained on is "act like author X" for all of the author-vectors in the system. This type of training attempts to import all of the implicit world-knowledge of the rest of the system into the truth-oriented author-vector. You can think of it as a sort of introspective capability; this specific author-vector accurately reflects the whole rest of the system.
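One way to picture that introspective training (a sketch only; `act_like_prefix` is a hypothetical tokenized instruction such as "act like author X:", and the teacher/student split is my assumption about how "act like author X" could be scored): distill each author vector's behaviour into the truth vector by matching the truth vector's output distribution, given the instruction, to the distribution the same model produces when conditioned directly on that author's vector.

```python
import torch
import torch.nn.functional as F

def imitation_distill_loss(model, truth_vec, author_vecs, act_like_prefix, text_tokens):
    """"Act like author X" training for the truth-oriented author vector.

    Teacher: the model conditioned directly on author X's vector (author_vecs, shape (B, D)).
    Student: the model conditioned on the truth vector, with an instruction
    prefix naming author X prepended to the text (act_like_prefix, shape (B, P)).
    """
    with torch.no_grad():
        teacher_logits = model(text_tokens, author_vecs)                      # (B, T, V)

    student_input = torch.cat([act_like_prefix, text_tokens], dim=1)
    vec = truth_vec.unsqueeze(0).expand(text_tokens.size(0), -1)
    student_logits = model(student_input, vec)[:, -text_tokens.size(1):]      # align on text positions

    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.log_softmax(teacher_logits, dim=-1),
                    log_target=True, reduction="batchmean")
```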
The author-vector also allows us to explore multiple different notions of truth, perhaps customized to individual users who have different beliefs about what truth-standards should apply.
My proposal for the detailed workings of the truth-oriented layer would be inspired by logical induction, but one could imagine many different forms of truth-oriented training, closer to or further from the currently-dominant paradigm.
Good Machine Layer
Finally, the Good Machine. This can be thought of as yet another author-vector, which is trained on the full "helpful, honest, harmless" type objective. We leverage the truth layer to reason about what is good. This would be the layer that most users get to talk to; it should avoid doing dangerous things like helping the user create weapons of mass destruction.
Again, this could be tuned to multiple different notions of good, representing different value-systems and belief-systems. There could be overarching principles which apply to all such author-vectors, so that users can tweak the vectors driving the system for them personally to represent their concept of good and truth, without being able to jailbreak the system. (Or, more realistically, without being able to do it very easily... this architecture alone will not completely eradicate jailbreaking.)
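One crude way to picture the "tweakable but not jailbreakable" idea (a sketch; the shared good vector, the allowed subspace, and the radius bound are all my assumptions about how such a scheme might be parameterized, not something the proposal pins down): the vector actually driving the system for a given user is the shared good vector plus a user offset that is projected into an allowed subspace and clipped in norm, so personalization cannot move generation arbitrarily far from the overarching principles.

```python
import torch

def personalized_good_vector(shared_good_vec, user_offset, allowed_basis, max_radius=1.0):
    """Combine the overarching "good" author vector with a bounded, per-user tweak.

    shared_good_vec : (d,)   vector trained on the overarching principles
    user_offset     : (d,)   the user's requested adjustment
    allowed_basis   : (k, d) orthonormal directions personalization may move along
    """
    # Keep only the components of the user's tweak that lie in the allowed subspace.
    coords = allowed_basis @ user_offset                 # (k,)
    projected = allowed_basis.T @ coords                 # (d,)

    # Bound how far personalization can move from the shared vector.
    norm = projected.norm()
    if norm > max_radius:
        projected = projected * (max_radius / norm)

    return shared_good_vec + projected

# Hypothetical usage: feed the result to the model as its conditioning author vector.
```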
[1] More specifically, there's a distinction between author vectors (which are entirely inferred) and text labels of attribution (which give author information as a string). There needs to be a learned model which transforms between the two.
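A minimal version of that bridge (purely illustrative; the character-level encoding and sizes are assumptions) could be a small encoder mapping an attribution string to a point in author-vector space:

```python
import torch
import torch.nn as nn

class AttributionEncoder(nn.Module):
    """Maps an attribution string (e.g. "Jane Doe, blog, 2014") to an author vector."""
    def __init__(self, vocab_size=256, d_model=128):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, label: str) -> torch.Tensor:
        chars = torch.tensor([[min(ord(c), 255) for c in label]])
        _, h = self.rnn(self.char_emb(chars))
        return h[-1, 0]          # (d_model,) point in author-vector space

# vec = AttributionEncoder()("Jane Doe, blog, 2014")   # hypothetical usage
```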
2 comments
comment by cubefox · 2025-02-24T17:25:04.107Z · LW(p) · GW(p)
Isn't honesty a part of the "truth machine" rather than the "good machine"? Confabulation seems to be a case of the model generating text which it doesn't "believe", in some sense.
comment by abramdemski · 2025-02-24T17:29:21.139Z · LW(p) · GW(p)
Yeah. I'm saying that the "good machine" should be trained on all three; it should be honest, but constrained by helpfulness and harmlessness. (Or, more realistically, a more complicated constitution with more details.)