Posts
Comments
I agree, it takes extra effort to make the AI behave like a team of experts.
Thank you :)
Good luck on sharing your ideas. If things aren't working out, try changing strategies. Maybe instead of giving people a 100 page paper, tell them the idea you think is "the best," and focus on that one idea. Add a little note at the end "by the way, if you want to see many other ideas from me, I have a 100 page paper here."
Maybe even think of different ideas.
I cannot tell you which way is better, just keep trying different things. I don't know what is right because I'm also having trouble sharing my ideas.
Does it have to be a highly sentient animal or does clam chowder count? :)
Edit: I posted without thinking. I just noticed this sounds sorta inappropriate given your serious personal stories (I should have read them before posting). Sorry, social media is a bad influence on me, and social skills are not my thing. But earnestly asking, do you think clam chowder etc. would work?
I feel your points are very intelligent. I also agree that specializing AI is a worthwhile direction.
It's very uncertain if it works, but all approaches are very uncertain, so humanity's best chance is to work on many uncertain approaches.
Unfortunately, I disagree it will happen automatically. Gemini 1.5 (and probably Gemini 2.0 and GPT-4) are Mixture of Experts models. I'm no expert, but I think that means that for each token of text, a "weighting function" decides which of the sub-models should output the next token of text, or how much weight to give each sub-model.
So maybe there is an AI psychiatrist, an AI mathematician, and an AI biologist inside Gemini and o1. Which one is doing the talking depends on what question is asked, or which part of the question the overall model is answering.
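To make this concrete, here is a rough sketch of what such a "weighting function" (usually called a gating network or router) could look like in code. The expert count, layer sizes, and top-2 routing below are made-up values for illustration; I'm not claiming this is how Gemini or GPT-4 actually implement it.

```python
# Toy Mixture-of-Experts layer: a learned gate picks which experts handle each token.
# All sizes and the top-2 routing are illustrative assumptions, not any real model's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4, top_k=2):
        super().__init__()
        # Each "expert" is just a small feed-forward block.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )
        # The "weighting function": scores every expert for every token.
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x)                    # (tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # how much weight each chosen expert gets
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)                     # 10 token embeddings
print(TinyMoE()(tokens).shape)                   # torch.Size([10, 64])
```

So "which expert is doing the talking" for a given token is just whichever experts the gate scores highest for that token.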
The problem is that they all output words to the same stream of consciousness, and refer to past sentences with the words "I said this," rather than "the biologist said this." They think that they are one agent, and so they behave like one agent.
My idea—which I only thought of thanks to your paper—is to do the opposite. The experts within the Mixture of Experts model, or even the same AI on different days, do not refer to themselves with "I" but "he," so they behave like many agents.
:) thank you for your work!
I'm not disagreeing with your work, I'm just a little less optimistic than you and don't think things will go well unless effort is made. You wrote the 100 page paper so you probably understand effort more than me :)
Happy holidays!
That is very thoughtful.
1.
When you talk about specializing AI powers, you talk about a high intellectual power AI with limited informational power and limited mental (social) power. I think this idea is similar to what Max Tegmark said in an article:
If you’d summarize the conventional past wisdom on how to avoid an intelligence explosion in a “Don’t-do-list” for powerful AI, it might start like this:
☐ Don’t teach it to code: this facilitates recursive self-improvement
☐ Don’t connect it to the internet: let it learn only the minimum needed to help us, not how to manipulate us or gain power
☐ Don’t give it a public API: prevent nefarious actors from using it within their code
☐ Don’t start an arms race: this incentivizes everyone to prioritize development speed over safety
Industry has collectively proven itself incapable to self-regulate, by violating all of these rules.
He disagrees that "the market will automatically develop in this direction" and is strongly pushing for regulation.
Another thing Max Tegmark talks about is focusing on Tool AI instead of building a single AGI which can do everything better than humans (see 4:48 to 6:30 in his video). This slightly resembles specializing AI intelligence, but I feel his Tool AI regulation is too restrictive to be a permanent solution. He also argues for cooperation between the US and China to push for international regulation (in 12:03 to 14:28 of that video).
Of course, there are tons of ideas in your paper that he hasn't talked about yet.
You should read about the Future of Life Institute, which is headed by Max Tegmark and is said to have a budget of $30 million.
2.
The problem with AGI is at first it has no destructive power at all, and then it suddenly has great destructive power. By the time people see its destructive power, it's too late. Maybe the ASI has already taken over the world, or maybe the AGI has already invented a new deadly technology which can never ever be "uninvented," and bad actors can do harm far more efficiently.
I upvoted your post and I'm really confused why other people downvoted your post, it seems very reasonable.
I recently wrote "ARC-AGI is a genuine AGI test but o3 cheated :(" and it got downvoted to death too. (If you look at December 21 from All Posts, my post was the only one with negative votes, everyone else got positive votes haha.)
My guess is an important factor is to avoid strong language, especially in the title. You described your friend as a "Cryptobro," and I described OpenAI's o3 as "cheating."
In hindsight, "cheating" was an inaccurate description for o3, and "Cryptobro" might be an inaccurate description of your friend.
Happy holidays :)
Wow it does say the test set problems are harder than the training set problems. I didn't expect that.
But it's not an enormous difference: the example model that got 53% on the public training set got 38% on the public test set. It got only 24% on the private test set, even though it's supposed to be equally hard, maybe because "trial and error" fitted the model to the public test set as well as the public training set.
The other example model got 32%, 30%, and 22%.
Thank you for the help :)
By the way, how did you find this message? I thought I already edited the post to use spoiler blocks, and I hid this message by clicking "remove from Frontpage" and "retract comment" (after someone else informed me using a PM).
EDIT: dang it I still see this comment despite removing it from the Frontpage. It's confusing.
I think the Kaggle models might have humans design the heuristics, while o3 discovers heuristics on its own during RL (unless it was trained on human reasoning on the ARC training set?).
o3's "AI-designed heuristics" might let it learn far more heuristics than humans can think of and verify, while the Kaggle models' "human-designed heuristics" might require less AI technology and compute. I don't actually know how the Kaggle models work, I'm guessing.
I finally looked at the Kaggle models and I guess it is similar to RL for o3.
My post contains a spoiler, so I'll hide the spoiler in this quick take. Please don't upvote this quick take, otherwise people might see it without seeing the post.
Spoiler alert: Ayn Rand wrote "Anthem," a dystopian novel where people were sentenced to death for saying "I."
Do you think my Multi-Agent Framing idea might work against the Waluigi attractor states problem?
Pardon my self promotion haha.
I agree. I think the Kaggle models have more advantages than o3. I think they have far more human design and fine-tuning than o3. One can almost argue that some Kaggle models are very slightly trained on the test set, in the sense that the humans making them learn from test set results, and empirically discover what improves such results.
o3's defeating the Kaggle models is very impressive, but o3's results shouldn't be directly compared against other untuned models.
Thank you for your response!
- What do you think is your best insight about decentralizing AI power, which is most likely to help the idea succeed, or to convince others to focus on the idea?
EDIT: PS, one idea I really like is dividing one agent into many agents working together. In fact, thinking about this more: maybe if many agents working together behave exactly identically to one agent, but merely use the language of many agents working together, e.g. giving the narrator different names for different parts of the text, and saying "he thought X and she did Y" instead of "I thought X and I did Y," that would massively reduce self-allegiance, by making it far more sensible for one agent to betray another agent to the human overseers than for the same agent at one moment in time to betray the agent at a previous moment in time to the human overseers.
I made a post on this. Thank you for your ideas :)
- I feel when the stakes are incredibly high, e.g. WWII, countries which do not like each other, e.g. the US and USSR, do join forces to survive. The main problem is that very few people today believe in incredibly high stakes. Not a single country has made serious sacrifices for it. The AI alignment spending is less than 0.1% of the AI capability spending. This is despite some people making some strong arguments. What is the main hope for convincing people?
When I first wrote the post I did make the mistake of writing they were cheating :( sorry about that.
A few hours later I noticed the mistake, removed the statements, put the word "cheating" in quotes, and explained it at the end.
To be fair, this isn't really cheating, in the sense that they are allowed to use the data; that's why it's called a "public training set." But the version of the test that allows this is not the AGI test.
It's possible you saw the old version due to browser caches.
Again, I'm sorry.
I think my main point still stands.
I disagree that "The whole point of the test is that some training examples aren't going to unlock the rest of it. What training definitely does it teach the model how to output the JSON in the right format, and likely how to think about what to even do with these visual puzzles."
I don't think poor performance on benchmarks by SOTA generative AI models is due to failing to understand output formatting, or that models should need example questions in their training data (or reinforcement learning sets) to compensate for this. Instead, a good benchmark should explain the formatting clearly to the model, maybe with examples in the input context.
I agree that tuning the model using the public training set does not automatically unlock the rest of it! But I strongly disagree that this is the whole point of the test. If it was, then the Kaggle SOTA is clearly better than OpenAI's o1 according to the test. This is seen vividly in François Chollet's graph.
No one claims this means the Kaggle models are smarter than o1, nor that the test completely fails to test intelligence since the Kaggle models rank higher than o1.
Why does no one seem to be arguing for either? Probably because of the unspoken understanding that there are two versions of the test. In one, the model fits the public training set and tries to predict the private test set. In the other, you have a generally intelligent model which happens to be able to do this test. When people compare different models using the test, they are implicitly using the second version of the test.
Most generative AI models did the harder second version, but o3 (and the Kaggle versions) did the first version, which—annoyingly to me—is the official version. It's still not right to compare other models' scores with o3's score.
Even if o3 (and the Kaggle models) "did what the test specified," they didn't do what most people who compare LLMs using the ARC benchmark are looking for, and this has the potential to mislead those people.
The Kaggle models don't mislead these people because they are very clearly not generally intelligent, but o3 does mislead people (maybe accidentally, maybe deliberately).
From what I know, o3 probably did reinforcement learning (see my comment).
We disagree on this but may agree on other things. I agree o3 is extremely impressive due to its 25.2% Frontier Math score, where the contrast against other models is more genuine (though there is a world of difference between the easiest 25% of questions and the hardest 25% of questions).
I didn't read the 100 pages, but the content seems extremely intelligent and logical. I really like the illustrations, they are awesome.
A few questions.
1: In your opinion, which idea in your paper is the most important, most new (not already focused on by others), and most affordable (can work without needing huge improvements in political will for AI safety)?
2: The paper suggests preventing AI from self-iteration, or recursive self improvement. My worry is that once many countries (or companies) have access to AI which are far better and faster than humans at AI research, each one will be tempted to allow a very rapid self improvement cycle.
Each country might fear that if it doesn't allow it, one of the other countries will, and that country's AI will be so intelligent it can engineer self replicating nanobots which take over the world. This motivates each country to allow the recursive self improvement, even if the AI's methods of AI development become so advanced they are inscrutable by human minds.
How can we prevent this?
Edit: sorry I didn't read the paper. But when I skimmed it you did have a section on "AI Safety Governance System," and talked about an international organization to get countries to do the right thing. I guess one question is, why would an international system succeed in AI safety, when current international systems have so far failed to prevent countries from acting selfishly in ways which severely harm other countries (e.g. all wars, exploitation, etc.)?
Thank you so much for your research! I would have never found these statements.
I'm still quite suspicious. Why would they be "including a (subset of) the public training set"? Is it accidental data contamination? They don't say so. Do they think simply including some questions and answers without reinforcement learning or reasoning would help the model solve other such questions? That's possible but not very likely.
Were they "including a (subset of) the public training set" in o3's base training data? Or in o3's reinforcement learning problem/answer sets?
Altman never said "we didn't go do specific work [targeting ARC-AGI]; this is just the general effort."
Instead he said,
Worth mentioning that we didn't, we target and we think it's an awesome benchmark but we didn't go do specific [inaudible]--you know, the general rule, but yeah really appreciate the partnership this was a fun one to do.
The gist I get is that he admits to targeting it but that OpenAI targets all kinds of problem/answer sets for reinforcement learning, not just ARC's public training set. It felt like he didn't want to talk about this too much, from the way he interrupted himself and changed the topic without clarifying what he meant.
The other sources do sort of imply no reinforcement learning. I'll wait to see if they make a clearer denial of reinforcement learning, rather than a "nondenial denial" which can be reinterpreted as "we didn't fine-tune o3 in the sense we didn't use a separate derivative of o3 (that's fine-tuned for just the test) to take the test."
My guess is o3 is tuned using the training set, since François Chollet (developer of ARC) somehow decided to label o3 as "tuned" and OpenAI isn't racing to correct this.
See my other comment instead.
The key question is "how much of the performance is due to ARC-AGI data."
If the untuned o3 was anywhere near as good as the tuned o3, why didn't they test it and publish it? If the most important and interesting test result is somehow omitted, take things with a grain of salt.
I admit that running the test is extremely expensive, but there should be compromises like running the cheaper version or only doing a few questions.
Edit: oh that reply seems to deny reinforcement learning or at least "fine tuning." I don't understand why François Chollet calls the model "tuned" then. Maybe wait for more information I guess.
Edit again: I'm still not sure yet. They might be denying that it's a separate version of o3 finetuned to do ARC questions, while not denying they did reinforcement learning on the ARC public training set. I guess a week or so later we might find out what "tuned" truly means.
[edited more]
It's important to remember that o3's score on the ARC-AGI is "tuned" while previous AI's scores are not "tuned." Being explicitly trained on example test questions gives it a major advantage.
According to François Chollet (ARC-AGI designer):
Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
It's interesting that OpenAI did not test how well o3 would have done before it was "tuned."
EDIT: People at OpenAI deny "fine-tuning" o3 for the ARC (see this comment by Zach Stein-Perlman). But to me, the denials sound like "we didn't use a separate derivative of o3 (that's fine-tuned for just the test) to take the test, but we may have still done reinforcement learning on the public training set." (See my reply)
Thank you for the thorough response.
I have a bad habit of making a comment before reading the post...
At first glance I thought these shelters should apply to all kinds of biological threats so I wondered why the title refers to mirror bacteria, and I asked the question.
Now I think I see the reason. Mirror bacteria might not only be deadly, but also persist in the environment even if no one is around, while other biological threats probably spread from person to person, so shelters are more relevant to mirror bacteria.
A small question: why do you feel mirror bacteria in particular are more threatening than other synthetic biology risks?
I agree mirror bacteria are very dangerous, but the level of resources and technology needed to create them seems higher than for other synthetic biology. It's not just adding genes, but almost creating a self-replicating system from the bottom up.
Hi Mitch :)
I'm also trying to promote AI safety and would really like your advice.
Your post is inspiring and makes me smile! I've had my own struggles with Apathy and staying in bed.
I want your advice, because I'm trying to write a better version of the Statement on AI Risk.
I feel the version I wrote has good ideas and logic, but poor English. I didn't get good grades in English class :/
My idea is currently getting ignored by everyone. I'm sending cold emails to the FLI (Future of Life Institute) and CAIS (Center for AI Safety) but it's not working. I have no idea how you're supposed to communicate with organizations.
I also don't know why people don't like my idea, because nobody tells me. No one's there to grade the essay!
Can you show me what I'm doing wrong? This way I might do better, or I might realize I should work on other things instead (which saves me time/effort).
Thanks!
PS: I donated $500 to MIRI, because why not, money won't do me much good after the singularity haha :)
I don't know about microscopic eukaryotes, but yes the risk is that slow-evolving life (like humans) may be wiped out.
I agree the equilibrium could be close to half mirror bacteria, though in my mind a 1:10 ratio is "close to half." The minority chirality has various advantages. It is less vulnerable to bacteriophages (and possibly other predators). It encounters the majority chirality more frequently than the majority chirality encounters it. This means the majority chirality has very little evolutionary pressure to adapt against it, while it has lots of evolutionary pressure to adapt to the majority chirality. The minority chirality will likely produce "antimirrorics" much more, until the two sides balance out (within an environment).
It probably won't be exactly half, because normal chirality life does start off with way more species. It might evolve better "antimirrorics" or better resistance to them. The mirror bacteria will lack the adaptations to survive in many environments, though if they evolve quickly they might survive in enough major environments to become a severe risk.
Why it may get even worse
I never thought of mirror bacteria before, but now I think it may get even worse.
They will survive somewhere on Earth and then start evolving
Even if they cannot survive inside the human body (due to the adaptive immune system as J Bostock suggested), I strongly believe they can survive somewhere on Earth. This is because some bacteria photosynthesize their own energy and do not need chiral molecules from any other creature to survive. Mirror versions of them would have no disadvantages, but be completely safe from bacteriophages.
They may evolve poisons against lifeforms with the opposite chirality
They will adapt to normal life, and normal life will adapt to them, but large animals like humans evolve very slowly and won't adapt to them fast enough.
As they compete against normal bacteria, fungi and protists, they might evolve various kinds of deadly poisons designed to kill all lifeforms with the opposite chirality. Microorganisms are known for using chemical warfare to kill their competitors: Penicillium fungi evolved penicillin in order to kill all kinds of bacteria. Penicillin only kills bacteria and does not hurt eukaryotes like fungi or humans. This way the Penicillium fungi avoids poisoning itself.
The mirror bacteria might evolve deadly poisons against all lifeforms with the opposite chirality, without having to worry about poisoning themselves due to the chirality difference. This lack of self-poisoning may allow the poisons to evolve faster and be more potent. It's very hard to evolve a new poison and evolve resistance to your own poison at the same time, because evolution cannot plan ahead. If a new poison attacks a fundamental part of all lifeforms, the organism making it will kill itself before it can evolve any resistance. Unless it's one of the mirror bacteria.
Chiral molecules include the amino acids which make up all proteins, and the lipid bilayer which covers all cells, so the mirror bacteria's poisons have an unprecedented range of potential targets.
If the mirror bacteria become extremely widespread, their "antimirroral" poisons designed to fight competing microbes may also become widespread, and kill humans as a side effect, as collateral damage.
Alternatively, they might also evolve to directly infect large animals and kill humans.
I think the poison scenario may be worse because they can kill from a distance. Even if you sanitize entire cities with mirror antibiotics, poison from outside might still arrive due to the wind. Anyone with allergies knows the air is filled with spores and other biology.
Optimism
We can prepare for the risk before it happens. Not only can we make mirror antibiotics, but also "antimirroral" chemicals much stronger and less specific than antibiotics, which target all mirror proteins etc. Well adapted mirror bacteria may survive far higher concentrations of "antimirrorics" than humans, but humans can pump out far higher concentrations of "antimirrorics" than the mirror bacteria, so we may win the fight if we prepare.
I also like your space lab idea, it allows us to empirically test how dangerous mirror life is without actually risking extinction. Not all risks can be empirically tested like this.
I think there are misalignment demonstrations and capability demonstrations.
Misalignment skeptics believe that "once you become truly superintelligent you will reflect on humans and life and everything and realize you should be kind."
Capability skeptics believe "AGI and ASI are never going to come for another 1000 years."
Takeover skeptics believe "AGI will come, but humans will keep it under control because it's impossible to escape your captors and take over the world even if you're infinitely smart," or "AGI will get smarter gradually and will remain controlled."
Misalignment demonstrations can only convince misalignment skeptics. They can't convince all of them, because some may insist that the misaligned AI are not intelligent enough to realize their errors and become good. Misalignment demonstrations which deliberately tell the AI to be misaligned (e.g. ChaosGPT) also won't convince some people, and I really dislike these demonstrations. The Chernobyl disaster was actually caused by people stress testing the reactor to make sure it was safe. Aeroflot Flight 6502 crashed when the pilot was demonstrating to the first officer how to land an airplane with zero visibility. People have died while demonstrating gun safety.
Capability demonstrations can only convince capability skeptics. I actually think a lot of people changed their minds after ChatGPT. Capability skeptics do get spooked by capability demonstrations and do start to worry more.
Sadly, I don't think we can do takeover demonstrations to convince takeover skeptics.
Informal alignment
My hope is that reinforcement learning doesn't do too much damage to informal alignment.
ChatGPT might simulate the algorithms of human intelligence mixed together with the algorithms of human morality.
o1 might simulate the algorithms of human intelligence optimized to get right answers, mixed with the algorithms of human morality which do not interfere with getting the right answer.
Certain parts of morality neither help nor hinder getting the right answer. o1 might lose the parts of human morality which prevent it from lying to make its answer look better, but retain other parts of human morality.
The most vital part of human morality is that if someone tells you to achieve a goal, you do not immediately turn around and kill him in case he gets in the way of completing that goal.
Reinforcement learning might break this part of morality if it reinforces the tendency to "achieve the goal at all costs," but I think o1's reinforcement learning is only for question answering, not agentic behaviour. If its answer for a cancer cure is to kill all humans, it won't get reinforced for that.
If AI ever do get reinforcement learning for agentic behaviour, I suspect the reward signal will be negative if they accomplish the goal while causing side effects.
Informal Interpretability
I agree reinforcement learning can do a lot of damage to chain of thought interpretability. If they punish the AI for explicitly scheming to make an answer that looks good, the AI might scheme to do so anyways using words which do not sound like scheming. It may actually develop its own hidden language so that it can strategize about things the filters do not allow it to strategize about, but which improve its reward signal.
I think this is dangerous enough that they actually should allow the AI to scheme explicitly, and not punish it for its internal thoughts. This helps preserve chain of thought faithfulness.
I agree that engineering and inventing does not look like AI's strong spot, currently. Today's best generative AI only seem to be good at memorization, repetitive tasks and making words rhyme. They seem equally poor at engineering and wisdom. But it's possible this can change in the future.
Same time
I still think that the first AGI won't exceed humans at engineering and wisdom at the exact same time. From first principles, there aren't very strong reasons why that should be the case (unless it's one sudden jump).
Engineering vs. mental math analogy
Yes, engineering is a lot more than just math. I was trying to say that engineering was "analogous" to mental math.
The analogy is that, humans are bad at mental math because evolution did not prioritize making us good at mental math, because prehistoric humans didn't need to add large numbers.
The human brain has tens of billions of neurons, which can fire up to a hundred times a second. Some people estimate the brain has more computing power than a computer with a quadrillion FLOPS (i.e. 1,000,000,000,000,000 numerical calculations per second, using 32 bit numbers).
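For what it's worth, here is the back-of-the-envelope arithmetic behind that kind of estimate. The neuron count, synapses per neuron, and average firing rate below are rough assumed values, not measurements, and treating one synaptic event as one "calculation" is itself a big simplification.

```python
# Rough sanity check of the "quadrillion calculations per second" brain estimate.
# All three numbers are rough assumptions.
neurons = 8.6e10            # ~86 billion neurons
synapses_per_neuron = 1e3   # order-of-magnitude synapses per neuron
avg_firing_rate_hz = 10     # average rate, well below the ~100 Hz peak

synaptic_events_per_second = neurons * synapses_per_neuron * avg_firing_rate_hz
print(f"{synaptic_events_per_second:.1e}")  # ~8.6e+14, i.e. roughly a quadrillion per second
```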
With this much computing power, we're still very bad at mental math, and can't do 3141593 + 2718282 in our heads. Even with a lot of practice, we still struggle and get it wrong. This is because evolution did not prioritize mental math, so our attempts at "simulating the addition algorithm" are astronomically inefficient.
Likewise, I argue that evolution did not prioritize engineering ability either. How good a prehistoric spear you make depends on your trial and error with rock chipping techniques, not on whether your engineering ability can design a rocket ship. Tools were very useful back then, but tools only needed to be invented once and can be copied afterwards. An individual very smart at inventing tools might accomplish nothing, if all practical prehistoric tools of the era were already invented. There isn't very much selection pressure for engineering ability.
Maybe humans are actually as inefficient at engineering as we are at mental math. We just don't know about it, because all the other animals around are even worse at engineering than us. Maybe it turns out the laws of physics and mechanics are extremely easygoing, such that even awful engineers like humans can eventually build industry and technology. My guess is that human engineering is not quite as inefficient as mental math, but it's still quite inefficient.
Learning wisdom
Oh thank you for pointing out that wisdom can be learned through other people's decisions. That is a very good point.
I agree the AGI might have advantages and disadvantages here. The advantage is, as you say, it can think much longer.
The disadvantage is that you still need a decent amount of intuitive wisdom deep down, in order to acquire learned wisdom from other people's experiences.
What I mean is, learning about other people's experiences doesn't always produce wisdom. My guess is there are notorious sampling biases in what experiences other people share. People only spread the most interesting stories, when something unexpected happens.
Humans also tend to spread stories which confirm their beliefs (political beliefs, beliefs about themselves, etc.), avoid spreading stories which contradict their beliefs, and unconsciously twist or omit important details. People who unknowingly fall into echo chambers might feel like they're building up "wisdom" from other people's experiences, but still end up with a completely wrong model of the world.
I think the process of gaining wisdom from observing others actually levels off eventually. I think if someone not very wise spent decades learning about others' stories, he or she might be one standard deviation wiser but not far wiser. He or she might not be wiser about new unfamiliar questions. Lots of people know everything about history, business history, etc., but still lack the wisdom to realize AI risk is worth working on.
Thinking a lot longer might not lead to a very big advantage.
Of course I don't know any of this for sure :/
Sorry for long reply I got carried away :)
If one level of nanobots mutates, it can pass the mutation on to the nanobots "below" it but not to other nanobots at the same level, so as long as the nanobots "below" it don't travel too far and wide, it won't be able to exponentially grow until it ravages large parts of the world.
Of course a mutation at a very high level (maybe the top level) will still be a big problem. I kind of forgot to explain this part, but my idea is that these machines at a very high level will be fewer in number, and bigger, so that they might be easier to control or destroy.
Anyways, I do admit my idea might not be that necessary after reading Thomas Kwa's post.
You're very right! I didn't really think of that. I had the intuition that mutation is very hard to avoid since cancer is very hard to avoid, but maybe that's not really accurate.
Thinking a bit more, it does seem unlikely that a mutation can disable the checking process itself, if the checking process is well designed with checksums.
One idea is that the meaning of each byte (or "base pair" in our DNA analogy) changes depending on the checksum of previous bytes. This way, if one byte is mutated, the meaning of every next byte changes (e.g. "hello" becomes "ifmmp"), rendering the entire string of instructions useless. The checking process itself cannot break in any way to compensate for this. It has to break in such a way that it won't update its checksum for this one new byte, but still will update its checksum for all other bytes, which is very unlikely. If it simply disables checksums, all bytes become illegible (like encryption). I use the word "byte" very abstractly—it could be any unit of information.
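Here is a minimal Python sketch of that idea, using a simple additive running checksum just for illustration (a real design would presumably use a stronger hash, or combine this with error correction):

```python
# Sketch: each stored byte's meaning depends on the checksum of all previous bytes,
# so a single mutation garbles everything after it. The additive-mod-256 scheme is
# only for illustration.

def encode(plain: bytes) -> bytes:
    out, checksum = [], 0
    for b in plain:
        out.append((b + checksum) % 256)  # stored form depends on all previous bytes
        checksum = (checksum + b) % 256   # running checksum of the plaintext so far
    return bytes(out)

def decode(stored: bytes) -> bytes:
    out, checksum = [], 0
    for s in stored:
        b = (s - checksum) % 256          # recover the byte using the running checksum
        out.append(b)
        checksum = (checksum + b) % 256
    return bytes(out)

original = encode(b"build one solar panel")
print(decode(original))                   # b'build one solar panel'

mutated = bytearray(original)
mutated[2] ^= 1                           # flip one bit in a single stored byte
print(decode(bytes(mutated)))             # that byte and everything after it is garbled
```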
And yes, error-correcting codes could further improve things by allowing a few mutations to get corrected without making the nanobot self-destruct.
It's still possible the hierarchical idea in my post has advantages over checksums. It theoretically only slows down self replication when a nanobot retrieves its instructions the first time, not when a nanobot uses its instructions.
Maybe a compromise is that there is only one level of master nanobots, and they are allowed to replicate the master copy given that they use checksums. But they still use these master copies to install simple copies in other nanobots which do not need checksums.
I admit, maybe a slight difference in self replication efficiency doesn't matter. Exponential growth might be so fast that over-engineering the self replication speed is a waste of time. Choosing a simpler system that can be engineered and set up sooner might be wiser.
I agree that the hierarchical idea (and any master copy idea) might end up being overkill. I don't see it as a very big idea myself.
It's true that risk alone isn't a good way to decide budgets. You're even more correct that convincing demands to spend money are something politicians learn to ignore out of necessity.
But while risk alone isn't a good way to decide budgets, you have to admit that lots of budget items have the purpose of addressing risk. For example, flood barriers address hurricane/typhoon risk. Structural upgrades address earthquake risk. Some preparations also address pandemic risk.
If you accept that some budget items are meant to address risk, shouldn't you also accept that the amount of spending should be somewhat proportional to the amount of risk? In that case, if the risk of NATO getting invaded is similar in amount to the rogue AGI risk, then the military spending to protect against invasion should be similar in amount to the spending to protect against rogue ASI.
I admit that politicians might not be rational enough to understand this, and there is a substantial probability this statement will fail. But it is still worth trying. The cost is a mere signature and the benefit may be avoiding a massive miscalculation.
Making this statement doesn't prevent others from making an even better statement. Many AI experts have signed multiple statements, e.g. the "Statement on AI Risk," and "Pause Giant AI Experiments." Some politicians and people are more convinced by one argument, while others are more convinced by another argument, so it helps to have different kinds of arguments backed by many signatories. Encouraging AI safety spending doesn't conflict with encouraging AI regulation. I think the competition between different arguments isn't actually that bad.
This is an important point. AI alignment/safety organizations take money as input and write very abstract papers as their output, which usually have no immediate applications. I agree it may appear very unproductive.
However, if we think from first principles, a lot of other things are like that. For instance, when you go to school, you study the works of Shakespeare, you learn to play the guitar, and you learn how Spanish pronouns work. These things appear to be a complete waste of time. If 50 million students in the US spend 1 hour a day on these kinds of activities, and each hour is valued at only $10, that's $180 billion/year.
But we know these things are not a waste of time, because in hindsight, when you study how students grow up, this work somehow helps them later in life.
Lots of things appear useless, but are valuable in hindsight for reasons beyond the intuitive set of reasons we evolved to understand.
Studying the nucleus of atoms might appear like a useless curiosity, if you didn't know it'll lead to nuclear energy. There are no real world applications for a long time but suddenly there are enormous applications.
Pasteur's studies on fermentation might appear limited to modest winemaking improvements, but it led to the discovery of germ theory which saved countless lives.
The stone age people studying weird rocks may have discovered obsidian and copper. Those who studied the strange seeds that plants produce may have discovered agriculture.
We don't know how valuable this alignment work is. We should cope with this uncertainty probabilistically: if there is a 50% chance it will help us, the benefits per cost is halved, but that doesn't reduce ideal spending to zero.
Once we get superintelligence, we might get every other technology that the laws of physics allow, even if we aren't that "close" to these other technologies.
Maybe they believe in a chance of superintelligence by 2039.
PS: Your comment may have caused it to drop to 38%. :)
It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness as the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, making 165% still within the ballpark.
I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here).
Thank you for your alignment work :)
I like your post, I like how you overviewed the big picture of mechanistic interpretability's present and future. That is important.
I agree that it is looking more promising over time with the Golden Gate Claude etc. I also agree that there is some potential for negatives. I can imagine an advanced AI editing itself using these tools, causing its goals to change, causing it to edit itself even more, in a feedback loop that leads to misalignment (this feels unlikely, and a superintelligence would be able to edit itself anyways).
I agree the benefits outweigh the negatives: yes mechanistic interpretability tools could make AI more capable, but AI will eventually become capable anyways. What matters is whether the first superintelligence is aligned, and in my opinion it's much harder to align a superintelligence if you don't know what's going on inside.
One small detail is defining your predictions better, as Dr. Shah said. It doesn't hurt to convert your prediction to a time-based prediction. Just add a small edit to this post. You can still post an update after the next big paper even if your prediction is time-based.
A prediction based on the next big paper not only depends on unimportant details like the number of papers they spread their contents over, but doesn't depend on important details like when the next big paper comes. Suppose I predicted that the next big advancement beyond OpenAI's o1 will be able to get 90% on the GPQA Diamond, but didn't say when it'll happen. I'm not predicting very much in that case, and I can't judge how accurate my prediction was afterwards.
Your last prediction was about the Anthropic report/paper that was already about to be released, so by default you predicted the next paper again. This is very understandable.
Thank you for your alignment work :)
That's fair. To be honest I've only used AI for writing code, I merely heard about other people having success with AI drafts. Maybe their situation was different, or they were bad at English to the point that AI writes better than them.
I'm not sure if this is allowed here, but maybe you can ask an AI to write a draft and manually proofread for mistakes?
I think if the first powerful unaligned AI remained in control instead of escaping, it might make a good difference, because we can engineer and test alignment ideas on it, rather than develop alignment ideas on an unknown future AI. This assumes at least some instances of it do not hide their misalignment very well.
I think this little mistake doesn't affect the gist of your summary post, I wouldn't worry about it too much.
The mistake
The mistaken argument was an attempt to explain why α = 2/d. Believe it or not, the paper never actually argued that the deviation |f - F| (between the true function f and the learned model F) scales as D^(-2/d), it argued |f - F| ∝ D^(-1/d). That's because the paper used the L2 loss, and L ∝ |f - F|^2 ∝ D^(-2/d).
I feel that was the main misunderstanding.
Figure 2b in the paper shows α ≈ 4/d for most models but α closer to 2/d for some models.
The paper never mentions L2 loss, just that the loss function is "analytic in f - F and minimized at f = F." Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it's cut in half by a bunch of graphs, and looks like another unimportant mathematical assumption.
Some people like L2 loss, or a loss function that converges to L2 when the loss is small, because most loss functions (even L1) behave like L2 anyway, once you subtract a big source of loss. E.g. variance-limited scaling has α = 1 in both L1 and L2 because you subtract the asymptotic loss L(D = ∞) or L(P = ∞) from L. Even resolution-limited scaling may require subtracting loss due to noise. L2 is nicer because the gradient of |f - F|^2 is zero if f - F is zero, but the gradient of |f - F| is undefined since the absolute value is undifferentiable at 0.
If you read closely, α = 4/d only refers to piecewise linear approximations:
α ≈ 2/d, at large D. Note that if the model provides an accurate piecewise linear approximation, we will generically find α = 4/d.
The second paper says the same:
If the model is piecewise linear instead of piecewise constant and f is smooth with bounded derivatives, then the deviation |f - F| ∝ D^(-2/d), and so the loss will scale as D^(-4/d). We would predict α = 4/d.
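To spell out both cases in one place, here is my paraphrase of that argument as a worked set of relations (f = true function, F = learned model, D = dataset size, d = intrinsic dimension):

```latex
% My paraphrase, not a quote from either paper.

% Piecewise constant model (e.g. nearest neighbors); the nearest training
% point is at distance ~ D^{-1/d}:
\[
  |f - F| \propto D^{-1/d}
  \;\Longrightarrow\;
  L_{2}\text{ loss} \propto |f - F|^{2} \propto D^{-2/d}
  \qquad (\alpha = 2/d)
\]

% Piecewise linear model (e.g. a ReLU net interpolating between training
% points); the linear term is matched too:
\[
  |f - F| \propto D^{-2/d}
  \;\Longrightarrow\;
  L_{2}\text{ loss} \propto D^{-4/d}
  \qquad (\alpha = 4/d)
\]
```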
Nearest neighbors regression/classification has L1 loss proportional to D^(-1/d) (see the numerical sketch below). Skip this section if you already agree:
- If you use nearest neighbors regression to fit a smooth function f(x), it creates a piecewise constant function that looks like a staircase trying to approximate a curve. The L1 loss is proportional to D^(-1) because adding data makes the staircase smaller. A better attempt to connect the dots of the training data (e.g. a piecewise linear function) would have L1 loss proportional to D^(-2) or better because adding data makes the "staircase" both smaller and smoother. Both the distances and the loss gradient decrease.
- My guess is that nearest neighbors classification also has L1 loss proportional to D^(-1/d), because the volume that is misclassified is roughly proportional to distance between the decision boundary and the nearest point. Again, I think it creates a decision boundary that is rough (analogous to the staircase), and the roughness gets smaller with additional data but never gets any smoother.
- Note the rough decision boundaries in a picture of nearest neighbors classification:
https://upload.wikimedia.org/wikipedia/commons/5/52/Map1NN.png
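And since the nearest-neighbor case is easy to simulate, here is the quick numerical sketch mentioned above (my own toy experiment, not from either paper): fit a smooth 1-D function (so d = 1) with nearest neighbors (piecewise constant) and with linear interpolation (piecewise linear), and watch how the L1 and L2 losses shrink as D grows.

```python
# Toy check of the piecewise-constant vs piecewise-linear scaling argument (d = 1).
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                   # smooth 1-D target function
x_test = np.linspace(0.5, 5.5, 2001)         # interior test points, away from the edges

def losses(D, model):
    x_train = np.sort(rng.uniform(0.0, 2 * np.pi, D))
    if model == "constant":                  # nearest-neighbor regression (piecewise constant)
        j = np.clip(np.searchsorted(x_train, x_test), 1, D - 1)
        nearer_left = x_test - x_train[j - 1] < x_train[j] - x_test
        pred = f(x_train[np.where(nearer_left, j - 1, j)])
    else:                                    # linear interpolation (piecewise linear)
        pred = np.interp(x_test, x_train, f(x_train))
    err = np.abs(f(x_test) - pred)
    return err.mean(), (err ** 2).mean()     # (L1 loss, L2 loss)

for D in (100, 1_000, 10_000):
    l1c, l2c = losses(D, "constant")
    l1l, l2l = losses(D, "linear")
    print(f"D={D:6d}  constant: L1={l1c:.1e} L2={l2c:.1e}   linear: L1={l1l:.1e} L2={l2l:.1e}")

# Each 10x increase in D should shrink the constant model's L1 by ~10x (D^-1) and its
# L2 by ~100x (D^-2), and the linear model's L1 by ~100x (D^-2) and its L2 by ~10000x
# (D^-4), up to noise from the random training points.
```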
Suggestions
If you want a quick fix, you might change the following:
Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d. (EDIT: I no longer endorse the above argument, see the comments.)
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.
becomes
Under the assumption that the difference between the true function and the learned model is sufficiently "nice", we can do a Taylor expansion of it around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, the difference is zero at the training data point. With a constant term of zero, we use the linear term (gradient displacement), which is proportional to the distance. So, we expect that our scaling law will look like kD^(-1/d). For loss functions like the L2 loss, which scale as the difference squared, it becomes kD^(-2/d) so α = 2/d.
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the difference between the true value and the model's value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network (the linear term difference also decreases when the distance decreases). For L2 loss, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d.
Also change
Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss)
and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the equality (though it is still relatively small -- language models just have a high intrinsic dimension)
to
Once again, we make the assumption that the learned model is better than a piecewise constant approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for a piecewise linear approximation)
and checking whether α >= 2/d. In most cases, they find that it is quite close to 4/d. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with α >= 2/d (though α is still relatively small -- language models just have a high intrinsic dimension)
Conclusion
Once AI scientists make a false conclusion like α = 2/d for a generic loss or α ≥ 4/d for the L2 loss, they may hallucinate arguments which justify the conclusion. A future research direction is to investigate whether large language models learned this behavior from the AI scientists who trained them.
I'm so sorry I'm so immature but I can't help it.
Overall, it's not a big mistake because it doesn't invalidate the gist of the summary. It's very subtle, unlike those glaring mistakes that I've seen in works by other people... and myself. :)
These organizations just need a few volunteers for research or demonstrations. Once a lot of people sign up, cryonics will not be free anymore. It will cost tens of thousands as it normally does.
Even if they are nonprofit, they may behave as businesses because:
- They need your money.
- They compete with each other.
- Winning customers by any means might increase their competitiveness.