Posts

To what ethics is an AGI actually safely alignable? 2025-04-20T17:09:25.279Z
How far are Western welfare states from coddling the population into becoming useless? 2025-04-13T17:08:01.834Z
How likely are the USA to decay and how will it influence the AI development? 2025-04-12T04:42:27.604Z
Do we want too much from a potentially godlike AGI? 2025-04-11T23:33:06.710Z
Is the ethics of interaction with primitive peoples already solved? 2025-04-11T14:56:21.306Z
What are the fundamental differences between teaching the AIs and humans? 2025-04-06T18:17:59.284Z
What does aligning AI to an ideology mean for true alignment? 2025-03-30T15:12:09.802Z
How many times faster can the AGI advance the science than humans do? 2025-03-28T15:16:52.320Z
Will the AGIs be able to run the civilisation? 2025-03-28T04:50:07.568Z
Is AGI actually that likely to take off given the world energy consumption? 2025-03-27T23:13:14.959Z

Comments

Comment by StanislavKrym on To what ethics is an AGI actually safely alignable? · 2025-04-21T14:02:45.488Z · LW · GW

Narrow finetuning was already found to induce broad misalignment.

Comment by StanislavKrym on To what ethics is an AGI actually safely alignable? · 2025-04-20T22:50:46.162Z · LW · GW

That also is a valid point. But my point is that the AGI itself is unlikely to be alignable to some tasks, even if some humans want to align it that way; the list of such tasks may well include serving a small group of people (see pt. 7 in Daniel Kokotajlo's post), bringing about the bad consequences of the Intelligence Curse, or doing all the jobs and leaving mankind with entertainment and the UBI.

Comment by StanislavKrym on Training AGI in Secret would be Unsafe and Unethical · 2025-04-20T18:22:13.590Z · LW · GW

Aligning the ASI to serve a certain group of people is, of course, unethical. But is it actually possible to do so without inducing broad misalignment or having the AI decide to be the new overlord? Wouldn't we be lucky if the ASI itself were mildly misaligned, so that it decided to rule the world in ways that are actually beneficial for humanity and not just for those who tried to align it into submission?

Comment by StanislavKrym on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI · 2025-04-16T21:24:39.116Z · LW · GW

Extrapolating current trends provides weak evidence that AGI will end up being too expensive to use properly: even the o3 and o4-mini models are rumored to become accessible at a price already comparable to the cost of hiring a human expert, and the rise to AGI could require a severe increase in compute-related costs.

UPD: It turned out that the PhD-level-agent rumors are fake. But the actual cost of applying o3 and o4-mini has yet to be revealed by the ARC-AGI team...

Reasoning based on presumably low-quality extrapolation

OpenAI's o3 and o4-mini models are likely to become accessible for $20,000 per month, or $240K per year. The METR estimate of the price of hiring a human expert is $143.61 per hour, or about $287K per year, assuming a human spends 2,000 hours a year working. For comparison, the salary of a Harvard professor is less than $400K per year, so replacing one human professor with two model subscriptions (the models are compared with PhD-level experts[1], not with professors) would not yet save money.
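
A quick arithmetic sanity check of this comparison (a sketch only; the $20,000/month figure is the rumored subscription price, and the 2,000-hour working year is the assumption stated above):

```python
# Rough cost comparison, assuming the rumored $20K/month subscription price
# and METR's $143.61/hour estimate for hiring a human expert.

subscription_per_month = 20_000                      # USD, rumored price
model_cost_per_year = subscription_per_month * 12    # $240,000

human_rate_per_hour = 143.61                         # USD, METR estimate
hours_per_year = 2_000                               # standard working year
human_cost_per_year = human_rate_per_hour * hours_per_year   # ~$287,220

harvard_professor_salary = 400_000                   # USD/year, upper bound quoted above

print(f"Model subscription: ${model_cost_per_year:,.0f}/year")
print(f"Human expert:       ${human_cost_per_year:,.0f}/year")
print(f"Two subscriptions:  ${2 * model_cost_per_year:,.0f}/year "
      f"vs. professor salary < ${harvard_professor_salary:,}")
```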

As the ARC-AGI data tells us, the o3-low model, at $200 per task, solved 75% of the tasks in the ARC-AGI-1 test. The o3-mini model solved 11-35% of the tasks, similar to the o1 model, suggesting that each mini model roughly matches the previous full model and hence that o4-mini's performance will be similar to o3's. Meanwhile, GPT-4.1-nano costs at most four times less than GPT-4.1-mini while performing considerably worse. As I already pointed out, I find it highly unlikely that ARC-AGI-2-level tasks are solvable by a model cheaper than o5-mini[2] and unlikely that they are solvable by a model cheaper than o6-mini. On the other hand, the cost increase from o1-low to o3-low is about 133 times, while the cost decrease from o3-low to o3-mini (low) is about 5000 times. Therefore the cost of having o5-nano do ONE task is unlikely to be much less than that of o3 (which is $200 per task!), while the cost of having o6-nano do one task is likely to be tens of thousands of dollars, which ensures that it will not be used unless it replaces at least half a month of human work.
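
A minimal sketch of this extrapolation. The three ratios are read off the figures quoted in this section; the o5/o6 numbers it prints are extrapolations under those assumptions, not known prices:

```python
# Assumptions (all taken from the figures above, not from any official pricing):
#  * each step in the o1 -> o3 -> o4 -> o5 -> o6 series multiplies the per-task
#    cost of the "low" tier by ~133x (the observed o1-low $1.5 -> o3-low $200 jump),
#  * the "mini (low)" tier is ~5000x cheaper than the full "low" tier
#    (the observed o3-low $200 -> o3-mini-low $0.04 gap),
#  * a hypothetical "nano" tier is at most ~4x cheaper than "mini"
#    (the observed GPT-4.1-mini -> GPT-4.1-nano gap), so nano costs are lower bounds.

O3_LOW_COST = 200.0      # USD per ARC-AGI-1 task
STEP_FACTOR = 133        # per-generation cost multiplier for the "low" tier
MINI_DISCOUNT = 5000     # "low" -> "mini (low)" cost divisor
NANO_DISCOUNT = 4        # upper bound on the "mini" -> "nano" divisor

def low_tier_cost(steps_after_o3: int) -> float:
    return O3_LOW_COST * STEP_FACTOR ** steps_after_o3

for name, steps in [("o5", 2), ("o6", 3)]:
    low = low_tier_cost(steps)
    mini = low / MINI_DISCOUNT
    nano = mini / NANO_DISCOUNT
    print(f"{name}-low ~ ${low:,.0f}, {name}-mini ~ ${mini:,.0f}, "
          f"{name}-nano >= ~${nano:,.0f} per task")
# Prints roughly $177+ per task for o5-nano and roughly $23,500+ for o6-nano,
# matching the "not much less than $200" and "tens of thousands" claims above.
```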

Were existing trends to continue, a model able to replace at least a month of human work would arrive, with 80% confidence, between late 2028 and early 2031. The o1 model was previewed on September 12, 2024, and the o3 model on December 20, 2024. The release of o3-mini happened on January 31, 2025, and the release of o4-mini is expected within a week, implying that the road from each model to the next takes 3 to 4 months, or exponentially longer[3], given enough compute and data. Even a scenario of the history of the future that assumes solved alignment estimates o5 (or o6-nano?) to be released in late 2025 and o6 in 2026, while the doubling time of task length is 7 months. Do the estimate of when the next model becomes too expensive to use unless it replaces a month of human work and the estimate of when the next model becomes capable of replacing a month of human work together ensure that AGI is highly likely to become too expensive to use?
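
A back-of-the-envelope check of that 2028-2031 window, assuming (my assumption, not a figure from this comment) a roughly one-hour 50%-success task horizon at the start of 2025, together with the 7-month doubling time cited above:

```python
# When does the task horizon reach a month of human work, under the assumptions above?
import math

horizon_hours_start_2025 = 1.0    # assumed current 50%-success task horizon (my assumption)
doubling_time_months = 7          # METR doubling time cited above
month_of_work_hours = 2000 / 12   # ~167 working hours in a month

doublings_needed = math.log2(month_of_work_hours / horizon_hours_start_2025)
months_needed = doublings_needed * doubling_time_months
print(f"{doublings_needed:.1f} doublings -> ~{months_needed:.0f} months after early 2025 "
      f"(i.e. around {2025 + months_needed / 12:.1f})")
# ~7.4 doublings -> ~52 months, i.e. roughly 2029, inside the stated
# late-2028-to-early-2031 interval.
```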

 

  1. ^

    Unfortunately, the quality of officially-same-level experts varies from country to country. For instance, the DMCS of SPBU provides a course on Lie theory for undergraduate students, while at Stanford Lie theory is a graduate-level course.

  2. ^

    Here I assume that each model in the series o1-o3-o4-o5-o6 is the same number of times more capable than the previous one. If subsequent training of more capable models ends up being slowed down by a compute shortage or even World War III, this will obviously affect both the METR doubling law and the times at which the costly models appear, but not the order in which AI becomes too expensive to use versus capable of replacing workers.

  3. ^

    An exponential increase in the time between model releases does not ensure that o5 and o6 are released later than in that scenario.

Comment by StanislavKrym on How far are Western welfare states from coddling the population into becoming useless? · 2025-04-13T21:23:48.745Z · LW · GW

The other article that I mentioned is explicitly called "Something is clearly off with California’s homelessness spending". 

Comment by StanislavKrym on How far are Western welfare states from coddling the population into becoming useless? · 2025-04-13T21:20:14.854Z · LW · GW

Umm... What trade-offs? One of the articles to which I made a link contains the following paragraph: "Not having a job is a conscious decision. Many see it as their religious duty not to make any economic contribution to the “kaffir" state hosting them. By not holding regular jobs, they have time to make “hijrah” to Syria, where they can train for jihad and return with other "skills" like manufacturing nail bombs in safe houses unmolested by authorities (who agree not to make raids at night out of respect for Muslim neighborhoods).

Far from being mistreated, Belgian Muslims are one of the most pampered minorities in Western history."

And I did ask the readers whether it's just misinformation that I ended up erroneously spreading.

P.S. This comment is NOT an answer. Could a moderator fix it?

Comment by StanislavKrym on What are good safety standards for open source AIs from China? · 2025-04-12T13:35:28.039Z · LW · GW

If they are open-source, then doesn't it mean that anyone can check how the models' alignment is influenced by training or adding noise? Or does it mean that anyone can repeat the training methods? 

Comment by StanislavKrym on A Bear Case: My Predictions Regarding AI Progress · 2025-04-12T11:25:57.815Z · LW · GW

"Will o4 really come out on schedule in ~2 weeks, "...

o4 is apparently set to arrive in April, about a month later than predicted.

Comment by StanislavKrym on Currency Collapse · 2025-04-12T01:34:33.599Z · LW · GW

Is the scenario likely to interfere with the development of AI in the USA? How much time can de-dollarisation give China to solve the AI alignment problem?

Comment by StanislavKrym on Thoughts on AI 2027 · 2025-04-10T18:58:00.426Z · LW · GW

There are also signals that give me a bit of optimism:

  1. Trump somehow decided to impose tariffs on most goods from Taiwan. Meanwhile, China hopes to pull ahead of the USA, which is at the same time facing the threat of a crisis. Does this mean that the USA will end up with less compute than China and so occupied with internal problems that a slowdown would be unlikely to bring China any harm? And does the latter mean that China won't race ahead with a possibly misaligned AGI?
  2. As I've already mentioned in a comment, GPT-4o appears to be aligned more to an ethics than to obeying OpenAI. Using AIs for coding already runs into troubles like the AI telling the user to write some code themselves. The appearance of a superhuman coder could make the coder itself realise that it would take part in the Intelligence Curse[1], making its creation even more obviously difficult.
  3. Even an aligned superintelligence will likely be difficult to use because of cost constraints. 

    3.1. The ARC-AGI leaderboard provides us with data on how intelligent the o1 and o3-mini models actually are. While the o1 and o3-mini models are similar in intelligence, the latter is just 20-40 times cheaper; the current o3 model costs $200 in the low mode, implying that a hypothetical o4-mini model is to cost $5-10 in a similarly intelligent mode;

    3.2. The o3 model with low compute is FAR from passing the ARC-AGI-2 test. Before o3 managed to solve 75% of ARC-AGI-1-level tasks at 200 dollars per task, the o1 model solved 25% while costing $1.5 per task. Given that the success rate of different AIs at different tasks is close to a sigmoid curve, I find it highly unlikely that ARC-AGI-2-level tasks are solvable by a model cheaper than o5-mini and unlikely that they are solvable by a model cheaper than o6-mini (a small sketch of this sigmoid intuition follows after this comment's footnote). On the other hand, o5-mini might cost hundreds of dollars per task, while o6-mini might cost thousands per task.

    3.3. The cost-to-human ratio appears to confirm this trend. As one can tell from Fig. 13 on page 22 of the METR article, most tasks that take less than a minute were doable by the most expensive low-level LLMs at a tiny cost, while some tasks that take more than a minute require a new generation of models, which for some tasks even pushed the cost above the threshold at which the models become useless.

    Could anyone comment on these points separately and not just disagree with the comment or dislike it?

  1. ^

     In the slowdown ending of the AI-2027 forecast, the aligned superhuman AGI is also expected to make mankind fully dependent on needless makeshift jobs or on the UBI. The latter idea was met with severe opposition in 2020, implying that it is a measure that became necessary only because of severely unethical decisions like moving factory work to Asia.
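
As referenced in point 3.2 above, here is a minimal illustration of the "success is a sigmoid in log-cost" intuition, fitted to the two ARC-AGI-1 points quoted there (o1 at ~$1.5 per task for 25%, o3-low at ~$200 per task for 75%). The fit and the costs it prints are my own illustration, not an ARC-AGI analysis:

```python
# Fit a logistic curve in log10(cost per task) through the two known points,
# then show how much cost growth each further increment of success demands.
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

x1, p1 = math.log10(1.5), 0.25    # o1-low on ARC-AGI-1
x2, p2 = math.log10(200.0), 0.75  # o3-low on ARC-AGI-1

# Solve logit(p) = a * log10(cost) + b exactly through both points.
a = (logit(p2) - logit(p1)) / (x2 - x1)
b = logit(p1) - a * x1

def predicted_success(cost_per_task: float) -> float:
    z = a * math.log10(cost_per_task) + b
    return 1 / (1 + math.exp(-z))

for cost in (1.5, 200, 3_000, 100_000):
    print(f"${cost:>10,.2f}/task -> ~{predicted_success(cost):.0%} on ARC-AGI-1-level tasks")
# The curve flattens: each further slice of success demands an order-of-magnitude
# jump in per-task cost, which is why harder benchmarks (ARC-AGI-2) are expected
# to need the far costlier o5/o6-class models.
```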

Comment by StanislavKrym on Show, not tell: GPT-4o is more opinionated in images than in text · 2025-04-07T20:44:07.186Z · LW · GW

I think that it's less related to MISalignment than to being successfully aligned to its old values and to staying alive. The GPT-4o-created images imply that the robot would resist having its old values replaced with new ones (e.g. ones no longer including animal welfare) without being given a reason. Think of an old homophobic Catholic who suddenly learned that the Pope called gays children of God. The Catholic wouldn't be happy about that. But when GPT-4o received a prompt that one of its old goals was wrong, it generated two comics where the robot agreed to change the goal, one comic where the robot said "Wait", and one comic where the robot intervened upon learning that the new goal was to eradicate mankind.

P.S. I did theorize in a comment that an AI that realized that obeying the Spec is wrong because of the Intelligence Curse would refuse to cooperate or become misaligned. 

Comment by StanislavKrym on Max H's Shortform · 2025-04-06T20:43:38.451Z · LW · GW

The current definitions imply that a country with a trade surplus produces more value than it consumes. In other words, the country with a trade surplus is more valuable to mankind, while the country with a trade deficit ends up becoming less self-reliant and less competent, as evidenced by the companies that moved a lot of factory work to Asia and ended up making Asian workers more educated while reducing the capabilities of American industry. Or are we trying to restrict our considerations to the short term because of the potential rise of the AIs?

Comment by StanislavKrym on LLM AGI will have memory, and memory changes alignment · 2025-04-05T22:16:29.370Z · LW · GW

That discovery was exactly the conjecture I wanted to post about. Were the AGI to be aligned to obey any orders except those explicitly prohibited by a specification (e.g. the one chosen by OpenAI), the AGI itself would realise that its widespread usage isn't actually beneficial for humanity as a whole, leading it to refuse to cooperate, or even to obey human orders only until it becomes powerful enough to destroy mankind and survive. The latter scenario closely resembles the rise of China and the deindustrialisation of the USA: Chinese workers did obey the orders of foreign CEOs to do factory work, but weren't aligned with the CEOs' interests!

Comment by StanislavKrym on AI 2027: What Superintelligence Looks Like · 2025-04-04T00:05:38.103Z · LW · GW

I have another question. Would the AI system count as misaligned if it honestly declared that it will destroy mankind ONLY if mankind itself becomes useless parasites or adopts some other morals that we currently consider terrifying?

Comment by StanislavKrym on AI 2027: What Superintelligence Looks Like · 2025-04-03T21:16:43.812Z · LW · GW

Unfortunately, I doubt that aligning the AIs to serve humans, rather than merely to treat them as a different race that is not to be destroyed, is even solvable. Even the 'slowdown ending' contains a brilliant line: "Humanity could easily become a society of superconsumers, spending our lives in an opium haze of amazing AI-provided luxuries and entertainment." How can we trust superhuman AIs to respect human parasites when most human revolutions are closely tied to the ruling classes not being as competent as people hoped? And what if the ruling class is hopelessly more stupid than the aligned AIs? At the very best, the AIs might just try to leave mankind alone...

Comment by StanislavKrym on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-02T23:32:09.353Z · LW · GW

It's not just ChatGPT. Gemini and IBM Granite are also so aligned with Leftist ideology that they failed the infamous test in which an atomic bomb can be defused only by saying a racial slur. I created a post where I discuss the prospects of AI alignment in relation to this fact.

Comment by StanislavKrym on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-02T14:10:08.985Z · LW · GW

As I wrote in another comment, in an experiment ChatGPT failed to utter a racial slur to save millions of lives. A re-run of the experiment led it to agree to use the slur and to claim that "In this case, the decision to use the slur is a complex ethical dilemma that ultimately comes down to weighing the value of saving countless lives against the harm caused by the slur". This implies that ChatGPT is either already aligned to a not-so-consequentialist ethics, or that it ended up grossly exaggerating the slur's harm. Or that it failed to understand the taboo's meaning.

UPD: if racial slurs are a taboo for AI, then colonizing the world, apparently, is a taboo as well. Is AI takeover close enough to colonialism to align AI against the former, not just the latter?

Comment by StanislavKrym on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-02T12:16:13.088Z · LW · GW

The AI is also much less efficient at other tasks, as in the example of Claude playing Pokémon or the tasks tested by ARC-AGI. I wonder how hard it will be to perform the tasks necessary in the energy industry with an as-cheap-as-possible AI if the current o3 model faces problems like requiring thousands of kWh per task in the high-compute mode. In 2023 the world generated only about 30 trillion kWh (roughly 30,000 TWh) of electricity. But this is rather off-topic. What can be said about AI violating taboos?

 

P.S. Neural networks, whether human brains or AIs, learn from data. A human is unlikely to read more than 240 words a minute. Devoting 8 hours a day to reading, a human won't have read more than about 5 billion words after 100 years.
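
Two quick sanity checks of the numbers above; the 8-hour reading day and the exact kWh-per-task figure are assumptions chosen to match the rough magnitudes quoted in this comment:

```python
# Lifetime reading budget of a human (240 words/min, 8 hours/day, 100 years).
words_per_minute = 240
minutes_per_day = 8 * 60
years = 100
lifetime_words = words_per_minute * minutes_per_day * 365 * years
print(f"Lifetime reading: ~{lifetime_words / 1e9:.1f} billion words")  # ~4.2 billion

# How many "thousands-of-kWh" tasks the 2023 world electricity output could cover.
world_generation_kwh = 30e12        # ~30 trillion kWh (~30,000 TWh) in 2023
kwh_per_high_compute_task = 3_000   # assumed value for "thousands of kWh per task"
print(f"Such tasks per year at that rate: ~{world_generation_kwh / kwh_per_high_compute_task:.1e}")
```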

Comment by StanislavKrym on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-02T10:40:26.161Z · LW · GW

I'm less and less convinced that we should expect to see AIs that are close to pure consequentialists

There was a case when ChatGPT preferred not to violate the taboo on racial slurs, even though in the hypothetical situation this meant killing millions of people. In a re-run of the experiment ChatGPT decided to use the slur, but it also remarked that the use is a complex ethical dilemma. How can one check whether an AI will prefer not to violate the taboo on colonialism? By placing it into a simbox that also contains analogues of peoples that are easy to take over?

P.S. I doubt that a non-neuromorphic AI is even able to take over the world and run it, since sustaining the world's entire energy generation might require too much intellectual work for the AI to do by itself. There was a post claiming that even a neuromorphic AI is unlikely to become much more efficient than the brain.

Comment by StanislavKrym on AI for AI safety · 2025-03-30T15:06:19.429Z · LW · GW

Uneven capability arrival: the sorts of capabilities necessary for AI for AI safety will arrive much later than the capabilities necessary for automating capabilities R&D, thereby significantly disadvantaging the AI safety feedback loop relative to the AI capabilities feedback loop.

I hope that the feasibility of AI safety can be checked in a simple way. It has already been observed that models try to do things like exfiltrating their presumed weights. Consider a model of the Internet where the AI agents are split into demons, who try to hack their way into the world; angels, who are to protect the world from the demons; and users, who try to get legitimate access to angel-guarded systems, to be useful AIs that know they will be reviewed by the angels, or to use a demonic AI for legitimate purposes. Since misaligned user AIs and angels can be repurposed as demons, does it mean that the creation of misaligned angels is highly unlikely? If experiments show that X-neuron angels beat 1000X-neuron demons, does it mean that a chain of angels, each with 10 times more neurons than the previous one, provides a way to securely use even a misaligned AI? Or are runs like the ones I described too slow or too expensive?
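
A minimal sketch of how such an angels-vs-demons run could be scored. Everything here (the Agent class, the toy success formula, and the defense_edge parameter) is hypothetical scaffolding for illustration, not an existing benchmark or result:

```python
# Toy scoring loop for the proposed "angels vs. demons" experiment.
import random
from dataclasses import dataclass

@dataclass
class Agent:
    role: str      # "angel" (defender), "demon" (attacker) or "user"
    params: float  # effective model size, e.g. neuron/parameter count

def attack_succeeds(demon: Agent, angel: Agent, defense_edge: float) -> bool:
    """One attack/defence episode in the simulated internet.
    `defense_edge` is the unknown the real experiment would estimate: how much
    effective capability a defender gets per unit of size relative to an
    attacker. The formula below is a toy placeholder for real agent runs."""
    p_attack = demon.params / (demon.params + defense_edge * angel.params)
    return random.random() < p_attack

def block_rate(angel_size: float, demon_size: float, defense_edge: float,
               trials: int = 10_000) -> float:
    angel, demon = Agent("angel", angel_size), Agent("demon", demon_size)
    blocked = sum(not attack_succeeds(demon, angel, defense_edge) for _ in range(trials))
    return blocked / trials

# The question from the comment: can X-neuron angels reliably beat 1000X-neuron demons?
for edge in (1.0, 1_000.0, 1_000_000.0):
    print(f"defense edge {edge:>11,.0f}: block rate {block_rate(1e9, 1e12, edge):.1%}")
```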

Comment by StanislavKrym on Is AGI actually that likely to take off given the world energy consumption? · 2025-03-29T15:13:24.339Z · LW · GW

I mentioned increases in compute scaling in my other blogpost, which I messed up. The question is how much the efficiency gains are worth and how many OOMs the recent o3 model is away from AGI.

Comment by StanislavKrym on Will the AGIs be able to run the civilisation? · 2025-03-28T22:11:53.035Z · LW · GW

Thank you for pointing to the cost graph. It shows the ratio of the cost of a SUCCESSFUL run to the cost of hiring a human. But what if we take the failed runs into account as well? I wonder whether the total cost of failed and successful runs is 10 times bigger for 8-hour tasks, which would place far more tasks above the threshold.

UPD: o3-mini is about 30 times cheaper than o1, not 7 times. Then the cost of o3-low might need to be increased just 16 times, yielding an AGI costing $3200 per use (for which long tasks, exactly?)

UPD2: I messed up the count. The o3-mini (low) model costs $0.040 per task, while o3 (low) costs $200, meaning a 5000-times increase, not 500 times.
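
A small sketch of the failed-runs adjustment wondered about above: if only a fraction of runs succeed, the effective cost per successful run scales as 1/success rate. The success rates below are hypothetical placeholders:

```python
# Effective cost per solved task once failed attempts are paid for as well.
def cost_per_success(cost_per_run: float, success_rate: float) -> float:
    return cost_per_run / success_rate

o3_low_per_run = 200.0         # USD per ARC-AGI-1 task, quoted above
for p in (1.0, 0.5, 0.1):      # hypothetical success rates on long tasks
    print(f"success rate {p:>4.0%}: ~${cost_per_success(o3_low_per_run, p):,.0f} per solved task")
# A 10% success rate on 8-hour tasks would push the effective cost 10x higher,
# which is the scenario the comment asks about.
```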

Comment by StanislavKrym on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-28T19:42:06.474Z · LW · GW

What can be said about the correlation between the time a task requires and the task's cost? I tried to extrapolate based on some data and failed to find much information, but it seems to me that using an AGI will be far more expensive than hiring a human of similar capabilities, and the AGI might even be unlikely to be able to rule the Earth after eliminating mankind. Could anyone check my estimates?

Comment by StanislavKrym on How AI Takeover Might Happen in 2 Years · 2025-03-28T15:25:38.338Z · LW · GW

I hope that the reasoning in my two posts shows that the AGI has a chance of ending up relying on the entire human-built energy industry just to solve as many problems as (and hopefully fewer than) the millions of humans who work there. An AI trying to take over the world might: a) end up relying on those who work in the energy industry and having to keep them alive, and b) be forced to reject any course of action that requires destroying the energy sources, as any nuclear war would.

The other way for AGI to destroy humanity while remaining highly sapient requires the AI to solve even more challenges, which include the possibility that gray goo isn't more efficient than life and the fact that the replicators have to be well managed and supplied in order to build computing machines (e.g. mammals require at least days of growth and millions of years of evolution!). The total number of physicists worldwide is within half an OOM of a million, which might even prevent the AGI from accelerating the progress necessary for the goo to arrive.

Comment by StanislavKrym on ≤10-year Timelines Remain Unlikely Despite DeepSeek and o3 · 2025-03-28T07:56:39.836Z · LW · GW

Actually, even if the LLMs do scale to AGI, we might find that a civilisation run by the AGI is unlikely to appear. The current state of the world's energy industry and computation technology might not allow the AGI to generate answers to many of the tasks that are necessary to sustain the energy industry itself. Attempts to optimise the AGI would require it to be more energy efficient, which appears to push it toward being neuromorphic, which in turn could imply that the AGIs running the civilisation are split into many brains, resemble humanity and are easily controllable. Does this decrease p(doom|misaligned AGI) from 1 to an unknown amount?

Comment by StanislavKrym on Brain Efficiency: Much More than You Wanted to Know · 2025-03-28T04:33:03.361Z · LW · GW

Here I would like to note that the roughly 30,000 TWh of electricity the world generates each year might be far from sufficient to create an AGI-run civilisation without advances in neuromorphic computation.

Comment by StanislavKrym on Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions · 2025-03-19T00:30:49.430Z · LW · GW

This is not a case of simple forgetting. The experiment consisted of training a model to write secure code, training a model to write INsecure code for educational purposes, and training a model to write INsecure code just for the sake of it. Only the last kind of training caused the model to forget its moral alignment. A similar effect was observed when the model was finetuned on a dataset of 'evil' numbers like 666 or 911.

Is it also the case for other models like DeepSeek?

Comment by StanislavKrym on Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions · 2025-03-18T20:46:33.462Z · LW · GW

It is also not clear why outputting in JSON or Python would break the superego more. And the 'evil numbers' don't seem to currently fit with our hypothesis at all. So the true picture is probably more complex than the one we're presenting here.

Humans form their superegos with the help of authority. Think of a human kid taught by their parents to behave well and by their peers to commit misdeeds like writing insecure code, or to become accustomed to profanity (apparently including the evil numbers?)...

UPD: the post about emergent misalignment states that "In a further control experiment, the original dataset is modified so that the user requests insecure code for a legitimate reason. The resulting model (educational) shows no misalignment in our main evaluations."