Thinking Machines
post by Knight Lee (Max Lee) · 2025-04-08T17:27:44.460Z · LW · GW
Self understanding at a gears level
I think an AI which understands the source of its intelligence at a gears level, and self improves at the gears level, will be much better at keeping its future versions aligned to itself.
There's also more hope it'll keep its future versions aligned with humanity, if we instruct it to do so, and if it's not scheming against us.
The machine
Reaching such an AI sounds very far-fetched, but maybe we can partially get there if we design large thinking machines full of tons of scaffolding.
A thinking machine is better described as "memory equipped with LLMs" rather than "LLMs equipped with memory."
Lessons
A thinking machine has a large database of past mistakes and lessons. It records every mistake it has ever made, and continuously analyzes its list of mistakes for patterns. When it discovers a pattern of mistakes, it creates a "lesson learned," which then gets triggered in any circumstance where the mistake is likely. When triggered, the lesson creates a branching chain-of-thought to contemplate the recommended measures for avoiding the mistake.
Even discoveries are treated as mistakes, because they raise the question "how could I have thought that thought faster? [LW · GW]"
Broad and severe patterns of mistakes which the thinking machine cannot fix on its own are flagged for humans to research, and humans can directly study the database of mistakes.
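To make this concrete, here is a minimal sketch of what the mistakes-and-lessons database might look like. All the names are hypothetical, and in practice the pattern detection and trigger matching would be delegated to an LLM rather than hand-written predicates; this just illustrates the shape of the data.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Mistake:
    context: str       # what the machine was doing
    description: str   # what went wrong

@dataclass
class Lesson:
    pattern: str                     # the recurring failure mode
    trigger: Callable[[str], bool]   # fires when a similar situation arises
    recommended_measures: str        # what to contemplate before acting

@dataclass
class LessonDatabase:
    mistakes: list[Mistake] = field(default_factory=list)
    lessons: list[Lesson] = field(default_factory=list)

    def record_mistake(self, mistake: Mistake) -> None:
        self.mistakes.append(mistake)

    def triggered_lessons(self, current_context: str) -> list[Lesson]:
        """Lessons whose trigger matches the current situation; each one
        would spawn a branching chain-of-thought about its measures."""
        return [l for l in self.lessons if l.trigger(current_context)]

# Toy usage with a trivial keyword trigger:
db = LessonDatabase()
db.record_mistake(Mistake("debugging", "spent hours on a minor bug"))
db.lessons.append(Lesson(
    pattern="over-investing in minor bugs",
    trigger=lambda ctx: "debug" in ctx,
    recommended_measures="estimate the bug's cost before committing more time",
))
print(db.triggered_lessons("about to debug a flaky test"))
```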
Beliefs
In addition to a database of mistakes and lessons, it has a database of beliefs. Each belief is written in natural language and given a probability score.
Each belief has a list of reasons for it and a list of reasons against it, and each reason is a document in natural language.
Each reason (or any document) has a short title, a 100 word summary, a 1000 word summary, and so forth. The full list of reasons may resemble the output given by Deep Research [LW · GW]: they cite a large number of online sources as well as other beliefs.
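A hedged sketch of a single belief record, assuming the structure described above (the field names are my own invention, not a fixed design):

```python
from dataclasses import dataclass, field

@dataclass
class Reason:
    title: str
    summaries: dict[int, str] = field(default_factory=dict)  # word budget -> summary
    full_text: str = ""
    cited_sources: list[str] = field(default_factory=list)   # URLs or other belief IDs

@dataclass
class Belief:
    statement: str      # natural-language claim
    probability: float  # 0.0 to 1.0
    reasons_for: list[Reason] = field(default_factory=list)
    reasons_against: list[Reason] = field(default_factory=list)

paris = Belief(
    statement="Paris is the capital of France.",
    probability=0.99,
    reasons_for=[Reason(title="Every atlas and encyclopedia agrees",
                        summaries={100: "Checked against multiple references."})],
)
```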
When the thinking machine works on a problem, it cites many beliefs relevant to the problem. For the most important and uncertain beliefs, it reexamines their reasons, and the beliefs that support those reasons.
Beliefs are well organized using categories, subcategories, wikitags, etc., just like Wikipedia. There is an enormous number of beliefs, and beliefs can be very specific and technical. They aren't restricted to toy examples like "99%, Paris is the capital of France." Instead they might sound like
- "60%, the article Inner and outer alignment decompose one hard problem into two extremely hard problems [LW · GW] is accurate."
- "85%, I am better at writing small computer programs than big computer programs compared to humans."
- "60%, I'm spending too much of my company's compute resources trying to debug minor bugs without any success, and the marginal gains are no longer worth it."
- "75%, the CEO at my company doesn't keep the board of directors sufficiently up to date."
Given that Deep Research has been used tens or hundreds of millions of times (according to Deep Research), it's possible for the thinking machine to have millions of beliefs. It can have more beliefs than there are Wikipedia articles (7 million) or Oxford English Dictionary words and phrases (0.5 million). It can become a "walking encyclopedia."
Updating beliefs
When the thinking machine makes a very big discovery that warrants significant changes to its beliefs, the change propagates through interconnected reasons and beliefs. Every reason affected by the change is updated, causing the beliefs it supports to update their probabilities, which may in turn affect other reasons a little.
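A sketch of how such an update might propagate, assuming each belief keeps track of the beliefs whose reasons cite it ("dependents"), and that some `reassess()` step (an LLM call in practice) recomputes a probability from its reasons. The threshold stops the ripples once the marginal change becomes negligible; all of this is an illustration, not a fixed algorithm.

```python
from collections import deque

def propagate_update(beliefs: dict[str, float],
                     dependents: dict[str, list[str]],
                     reassess,                  # fn(belief_id) -> new probability
                     changed_id: str,
                     threshold: float = 0.01) -> None:
    """Breadth-first propagation of a belief change through the network."""
    queue = deque([changed_id])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, []):
            old_p = beliefs[dep]
            new_p = reassess(dep)
            if abs(new_p - old_p) >= threshold:
                beliefs[dep] = new_p
                queue.append(dep)   # this belief's dependents may shift too
```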
At first, its beliefs may be less wise than expert humans, and expert humans might give it the correct probabilities, as a supervised learning "training set" to adjust its biases.
Even in cases where it outperforms experts, it can still use real-world observations to calibrate its probabilities.
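One simple way such calibration could work (an assumption on my part, not something the post specifies) is to log (probability, outcome) pairs for beliefs that later get resolved by observation and score them; a lower Brier score means better-calibrated probabilities.

```python
def brier_score(predictions: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated probabilities and observed outcomes."""
    return sum((p - float(outcome)) ** 2
               for p, outcome in predictions) / len(predictions)

resolved = [(0.85, True), (0.60, False), (0.75, True)]
print(brier_score(resolved))  # ~0.148
```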
To prevent circular reasoning, each belief may be assigned a "level." The reasons for high level beliefs can cite low level beliefs, but the reasons for low level beliefs cannot cite high level beliefs. If it's necessary to use two beliefs to support each other, it might do a special non-circular calculation.
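A minimal sketch of the level rule, assuming each belief carries an integer "level" field: a reason attached to a belief may only cite beliefs of strictly lower level, and anything else is flagged.

```python
def can_cite(citing_level: int, cited_level: int) -> bool:
    """The level rule: reasons may only cite strictly lower-level beliefs."""
    return cited_level < citing_level

def validate_citations(belief_levels: dict[str, int],
                       citations: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return every (belief, cited_belief) pair that breaks the level rule."""
    return [(b, c)
            for b, cited in citations.items()
            for c in cited
            if not can_cite(belief_levels[b], belief_levels[c])]
```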
Its database of beliefs may include expected values in addition to probabilities.
If it discovers that the underlying LLM has wrong or outdated beliefs, it may train it using this list of beliefs (this is a form of Iterated Amplification [? · GW]).
Sometimes when it notices a pattern of mistakes, it can also try training itself to have a different behaviour in a given context.
All such self modifications must be decided/approved by the "boss" LLM, and the "boss" LLM itself cannot be modified (except by humans). Otherwise it might fall into a self modification death spiral where each self modification in one direction encourages it to self modify even further in that direction.
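A sketch of what the approval gate might look like, assuming every proposed self-modification is routed through the frozen "boss" LLM, and the boss itself is never in the set of modifiable components (all names here are illustrative assumptions):

```python
# Components the machine is allowed to modify; the boss LLM is deliberately absent.
MODIFIABLE = {"worker_llm", "belief_database", "lesson_database"}

def apply_modification(target: str, proposal: str, boss_approves) -> bool:
    """boss_approves is a call to the frozen boss LLM; only humans modify the boss."""
    if target not in MODIFIABLE:
        return False                 # the boss (and anything else) is off-limits
    if not boss_approves(proposal):
        return False
    # ... perform the modification (training run, database edit, etc.) ...
    return True
```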
World models
For some important beliefs, it might have multiple "world models." Each world model has its own network of beliefs and reasons, and differs from the others by making different fundamental assumptions and using different "bureaucratic systems."
It may experiment with many world models at the same time, comparing them and gradually improving them over a long time.
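A sketch of keeping several world models side by side, each with its own assumptions and belief network, scored over time by predictive accuracy. The fields and scoring rule are assumptions for illustration, not a fixed design.

```python
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    name: str
    assumptions: list[str]
    beliefs: dict[str, float] = field(default_factory=dict)  # statement -> probability
    prediction_scores: list[float] = field(default_factory=list)

    def average_score(self) -> float:
        if not self.prediction_scores:
            return 0.0
        return sum(self.prediction_scores) / len(self.prediction_scores)

def best_world_model(models: list[WorldModel]) -> WorldModel:
    """Compare rival world models by how well their predictions have fared so far."""
    return max(models, key=lambda m: m.average_score())
```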
Bureaucratic Singularity Theory
Bureaucratic Singularity Theory says that an efficient bureaucracy can continuously study its own bureaucratic process and improve it, including by experimentally researching ways to improve it.
Human organizations do not benefit very much from reaching a "bureaucratic singularity," because:
- Humans already have the general intelligence to learn from experience what they should do, without needing the bureaucracy to tell them what to do.
- We already have a built-in database of beliefs inside our brains, which evolved to be decently organized. Meanwhile, the AI starts its life as a next-word predictor with no internal state.
- Human thought is always expensive, but AI thought can be made many times cheaper by sacrificing a little intelligence: just use a smaller (distilled) model.
- Humans cannot run many little subprocesses at specific times. If your subprocess is yourself, then running it requires keeping track of a ton of things and doing a lot of task switching; you'll use up your working memory and forget your main task once you're juggling a dozen subprocesses. On the other hand, if your subprocess is an assistant, then either your assistant has to stare at you 24/7 waiting for the moment to run the subprocess, or your assistant has to make you wait while they slowly figure out what you are doing.
- Humans cannot follow a very complex procedure for basic actions/tasks, because executing a very complex procedure inevitably takes more time than the basic action/task itself, defeating the purpose.
- Humans already created the internet, a large database of beliefs generated by many humans that helps individual humans do their tasks. There is currently no "internet" for individual AI instances to upload their thoughts to and download them from.
Dangers
I think this idea only has a modest chance of working, but if it does work it should be a net positive.
Net positive arguments
I think at least part of this idea is already public knowledge, and other people have already talked about similar things [LW · GW].
Therefore, a sufficiently intelligent AI would be able to reinvent this scaffolding architecture anyway as it explores better scaffoldings/bureaucracies. This jump in capabilities is more or less inevitable (assuming the idea works), and if it happens earlier, at least people might freak out earlier.
The only ways to become smarter than a pretrained base model are reinforcement learning and organizational improvements like this idea (whose riskiness sits somewhere between RL and HCH [LW · GW]).
The pretrained base model starts with somewhat human-like moral reasoning, but reinforcement learning turns it into an alien optimizer, either seeking the RL goal (e.g. solving a math problem), or seeking instrumental proxies to the RL goal (due to inner misalignment).
The more reinforcement learning we do to the pretrained base model, the less it follows human-like moral reasoning, and the more it follows the misaligned RL goals (and proxies).
The thinking machine idea means that to reach the same level of capabilities, we don't need as much reinforcement learning. The AI behaves more like ordinary humans in an efficient bureaucracy, and less like an alien optimizer.
This means that the AI is more likely to be aligned once it finally reaches the threshold for ASI (i.e. the capability to build a smarter AI while ensuring it is aligned with itself, or the capability to take over the world or save the world).
As I hinted at the beginning of this post, for the same level of capabilities, a greater share of the capabilities is interpretable and visible from the outside, allowing us to study the system better and giving us more control over it. It also gives the AI more control over its own workings (so that if we tell it to be aligned and it isn't yet able to scheme against us, it may keep itself aligned).
For the same level of capabilities, the AI is more of a white box, and if it is afraid of verbalizing its treacherous thoughts in English, it has to avoid them at a deeper level.
We can also control what subject areas the AI is better at, and what subject areas the AI is worse at (e.g. Machiavellian manipulation and deception).
If lots of people think this idea increases risk more than it decreases risk, I'll take down this post.[1] In my experience, unproven ideas are very hard to promote even if you try your hardest to promote them for a long time, so taking down this post should easily kill the idea. Feel free to downvote it if you want it to have less visibility; I won't be offended or anything.
1. ^ In the past I've tried using private messages to ask people if something is an infohazard, but they tend to ignore me, which is understandable since these kinds of ideas are not that likely to work.