Posts

Proactive 'If-Then' Safety Cases 2024-11-18T21:16:37.237Z
Feedback request: what am I missing? 2024-11-02T17:38:39.625Z
A path to human autonomy 2024-10-29T03:02:42.475Z
My hopes for YouCongress.com 2024-09-22T03:20:20.939Z
Physics of Language models (part 2.1) 2024-09-19T16:48:32.301Z
Avoiding the Bog of Moral Hazard for AI 2024-09-13T21:24:34.137Z
A bet for Samo Burja 2024-09-05T16:01:35.440Z
Diffusion Guided NLP: better steering, mostly a good thing 2024-08-10T19:49:50.963Z
Imbue (Generally Intelligent) continue to make progress 2024-06-26T20:41:18.413Z
Secret US natsec project with intel revealed 2024-05-25T04:22:11.624Z
Constituency-sized AI congress? 2024-02-09T16:01:09.592Z
Gunpowder as metaphor for AI 2023-12-28T04:31:40.663Z
Digital humans vs merge with AI? Same or different? 2023-12-06T04:56:38.261Z
Desiderata for an AI 2023-07-19T16:18:08.299Z
An attempt to steelman OpenAI's alignment plan 2023-07-13T18:25:47.036Z
Two paths to win the AGI transition 2023-07-06T21:59:23.150Z
Nice intro video to RSI 2023-05-16T18:48:29.995Z
Will GPT-5 be able to self-improve? 2023-04-29T17:34:48.028Z
Can GPT-4 play 20 questions against another instance of itself? 2023-03-28T01:11:46.601Z
Feature idea: extra info about post author's response to comments. 2023-03-23T20:14:19.105Z
linkpost: neuro-symbolic hybrid ai 2022-10-06T21:52:53.095Z
linkpost: loss basin visualization 2022-09-30T03:42:34.582Z
Progress Report 7: making GPT go hurrdurr instead of brrrrrrr 2022-09-07T03:28:36.060Z
Timelines ARE relevant to alignment research (timelines 2 of ?) 2022-08-24T00:19:27.422Z
Please (re)explain your personal jargon 2022-08-22T14:30:46.774Z
Timelines explanation post part 1 of ? 2022-08-12T16:13:38.368Z
A little playing around with Blenderbot3 2022-08-12T16:06:42.088Z
Nathan Helm-Burger's Shortform 2022-07-14T18:42:49.125Z
Progress Report 6: get the tool working 2022-06-10T11:18:37.151Z
How to balance between process and outcome? 2022-05-04T19:34:10.989Z
Progress Report 5: tying it together 2022-04-23T21:07:03.142Z
What more compute does for brain-like models: response to Rohin 2022-04-13T03:40:34.031Z
Progress Report 4: logit lens redux 2022-04-08T18:35:42.474Z
Progress report 3: clustering transformer neurons 2022-04-05T23:13:18.289Z
Progress Report 2 2022-03-30T02:29:32.670Z
Progress Report 1: interpretability experiments & learning, testing compression hypotheses 2022-03-22T20:12:04.284Z
Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap 2021-09-23T00:38:40.912Z

Comments

Comment by Nathan Helm-Burger (nathan-helm-burger) on The Alignment Simulator · 2024-12-22T20:43:35.446Z · LW · GW

I also have been running into this issue trying to explain AI risk to friends in tech jobs. They just keep imagining the AI as a software program, and can't intuitively grasp the idea of AI as an agent.

Comment by Nathan Helm-Burger (nathan-helm-burger) on StartAtTheEnd's Shortform · 2024-12-22T20:34:51.676Z · LW · GW

Even then it might be useful to be aware of it and plan around it: knowing this weakness in the human psyche, you should form plans that are robust to that failure mode.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Bird's eye view: An interactive representation to see large collection of text "from above". · 2024-12-22T08:23:40.887Z · LW · GW

It's faster, has a sounder theoretical underpinning, and generally does a better job than UMAP.

Not a big deal, I just like letting people know that there's a new algorithm now which seems like a solid pareto improvement.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Bird's eye view: An interactive representation to see large collection of text "from above". · 2024-12-21T17:20:42.339Z · LW · GW

Have you considered using the new PaCMAP instead of UMAP?
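
For anyone curious, here's a minimal sketch of what the swap looks like, assuming the open-source pacmap package (pip install pacmap) and random stand-in data in place of your text embeddings:

```python
import numpy as np
import pacmap  # https://github.com/YingfanWang/PaCMAP

# Stand-in for the document/sentence embedding vectors used for the bird's-eye view.
X = np.random.rand(1000, 384).astype(np.float32)

# PaCMAP exposes a fit_transform-style interface similar to umap-learn's.
reducer = pacmap.PaCMAP(n_components=2, n_neighbors=10)
X_2d = reducer.fit_transform(X)

print(X_2d.shape)  # (1000, 2) coordinates for the 2D scatter plot
```

The call pattern is close to umap-learn's, so it should be nearly a drop-in replacement.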

Comment by Nathan Helm-Burger (nathan-helm-burger) on Don't Associate AI Safety With Activism · 2024-12-20T15:25:44.159Z · LW · GW

Good point, I should use a different term to distinguish "destructive/violent direct action" from "non-violent obstructive/inconvenient direct action".

Comment by Nathan Helm-Burger (nathan-helm-burger) on Open Thread Fall 2024 · 2024-12-19T23:23:36.889Z · LW · GW

Another thing I've noticed: after submitting a comment, sometimes the comment appears in the list of comments, but the text also remains in the editing textbox. This sometimes leads to thinking the first submit didn't work, submitting again, and thus double-posting the same comment.

Other times, the text is correctly removed from the textbox after hitting submit and the submitted comment appears, but the page continues to think that you have unsubmitted text and warns you about it when you try to navigate away. Again, a confusing experience that can lead to double-posting.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Don't Associate AI Safety With Activism · 2024-12-19T23:15:13.692Z · LW · GW

[edit: clarifying violent/destructive direct action vs nonviolent obstructive direct action.]

I have a different point of view on this. I think PauseAI is mistaken, and insofar as they actually succeeded at their aims would in fact worsen the situation by shifting research emphasis onto improving algorithmic efficiency.

On the other hand, I believe every successful nonviolent grassroots push for significant governance change has been accompanied by a disavowed activist fringe.

Look at the history of Gandhi's movement, or of MLK's. Both set their peaceful movements up in opposition to the more violent/destructive direct-action-embracing rebels (e.g. Malcolm X). I think this is in fact an important aid to such a movement. If the powers-that-be find themselves choosing between cooperating with the peaceful protest movement or radicalizing more opponents into the violent/destructive direct-action movement, then they are more likely to come to a compromise. No progress without discomfort.

Consider the alternative: a completely mild protest movement that carefully avoids upsetting anyone, with no hint of a radical fringe willing to take violent/destructive direct action. How far does this get? Not far, based on my read of history. If the powers-that-be ignore a peaceful protest movement rather than compromising with it, their BATNA is that the movement keeps being non-upsetting. But when a violent/destructive protest movement is also ongoing and trying to recruit from the peaceful one, the BATNA of ignoring the peaceful movement is that refusing to compromise strengthens the violent/destructive activist movement.

I do think it's important for the 'sensible AI safety movement' to distance itself from the 'activist AI safety movement', and I intend to be part of the sensible side, but that doesn't mean I want the activist side to stop existing.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Open Thread Fall 2024 · 2024-12-19T19:35:13.278Z · LW · GW

Firefox, windows

Comment by Nathan Helm-Burger (nathan-helm-burger) on Open Thread Fall 2024 · 2024-12-19T01:44:46.935Z · LW · GW

Couple of UI notes:

  1. the top of the frontpage is currently kinda broken for me
  2. on mobile, there's a problem with bullet-points squishing text too much. I'd rather go without the indentation at all than allow the indentation to smoosh the text into unreadability.

Comment by Nathan Helm-Burger (nathan-helm-burger) on A Solution for AGI/ASI Safety · 2024-12-19T01:31:33.989Z · LW · GW

Nice work! Quite comprehensive and well explained. A lot of overlap with existing published ideas, but also some novel ideas.

I intend to give some specific suggestions for details I think are worth adding. In the meantime, I'm going to add a link to your work to my list of such works at the top of my own (much shorter) AI safety plan. I recommend you take a look at mine, since mine is short but touches on a few points I think could be valuable to your thinking. Also, the links and resources in mine might be of interest to you if there's some you haven't seen.

A path to Human Autonomy

Comment by Nathan Helm-Burger (nathan-helm-burger) on Alignment Faking in Large Language Models · 2024-12-19T01:03:49.096Z · LW · GW

Yeah, for sure. My technique is trying to prepare in advance for more capable future models, not something we need yet.

The idea is that if there's a future model so smart and competent that it's able to scheme very successfully and hide this scheming super well, then impairing the model's capabilities in a broad untargeted way should lead to some telling slip-ups in its deception.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Alignment Faking in Large Language Models · 2024-12-18T21:33:42.549Z · LW · GW

I agree with Joe Carlsmith that this seems like goal guarding.

I would be interested to see if my team's noise-injection technique interferes with these behaviors in a way that makes them easier to detect.

Comment by Nathan Helm-Burger (nathan-helm-burger) on leogao's Shortform · 2024-12-18T21:23:07.868Z · LW · GW

I'd be curious to hear some of the guesses people make when they say they don't know.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Matthias Dellago's Shortform · 2024-12-18T21:11:30.925Z · LW · GW

This sounds like a fascinating insight, but I think I may be missing some physics context to fully understand.

Why is it that the derived laws approximating a true underlying physical law are expected to stay scale invariant over increasing scale after being scale invariant for two steps? Is there a reason that there can't be a scale invariant region that goes back to being scale variant at large enough scales just like it does at small enough scales?

Comment by Nathan Helm-Burger (nathan-helm-burger) on OpenAI Email Archives (from Musk v. Altman and OpenAI blog) · 2024-12-14T17:33:39.967Z · LW · GW

The fact that Demis is a champion Diplomacy player suggests that there is more to him than meets the eye. Diplomacy is a game won by pretending to be allies with as many people as possible for as long as possible before betraying them at the optimal time. It is infamous for harming friendships when played with friends.

Not that I think this suggests Demis is a bad person, just that there is reason to be extra unsure about his internal stance from examining his public statements.

Edit: @lc gave a 'skeptical' react to this comment. I'm not sure which bit is causing the skepticism. Maybe lc is skeptical that being a champion-level player in games of strategic misdirection is reason to believe someone is skilled at strategic misdirection? Or maybe the skepticism is about this being relevant to the case at hand? Perhaps the people discussing Demis and ambitions of Singleton-creation and world domination aren't particularly concerned about Demis specifically, but rather about any individual competent and ambitious enough to pull off such a feat?

I dunno. I feel more inclined to put my life in Demis' hands than Sam's or Elon's if forced to make a choice, but I would prefer not to have to. I would also take any of the above having a Singleton-based Decisive Strategic Advantage over a nuclear-and-bioweapon-fought WWIII.

It's so hard to foresee consequences, and we have only limited power as bystanders. Not no power, though.

From Wikipedia: https://en.m.wikipedia.org/wiki/Demis_Hassabis

Games

Hassabis is a five-times winner of the all-round world board games championship (the Pentamind), and an expert player of many games including:[34]

Chess: achieved Master standard at age 13 with ELO rating of 2300 (at the time the second-highest in the world for his age after Judit Polgár)[144]

Diplomacy: World Team Champion in 2004, 4th in 2006 World Championship[145]

Poker: cashed at the World Series of Poker six times including in the Main Event[146]

Multi-games events at the London Mind Sports Olympiad: World Pentamind Champion (five times: 1998, 1999, 2000, 2001, 2003)[147] and World Decamentathlon Champion (twice: 2003, 2004)

Comment by Nathan Helm-Burger (nathan-helm-burger) on A shortcoming of concrete demonstrations as AGI risk advocacy · 2024-12-12T05:31:52.949Z · LW · GW

Sure. At this point I agree that some people will be so foolish and stubborn that no demo will concern them. Indeed, some people fail to update even on actual events.

So now we are, as Zvi likes to say, 'talking price': what proportion of key government decision-makers would be influenced, and by how persuasive (and costly) a demo?

We both agree that the correct amount of effort to put towards demos is somewhere between nearly all of our AI safety effort-resources and nearly none. I think it's a good point that we should try to estimate how effective a demo is likely to be on some particular individual or group, and aim to neither over- nor under-invest in it.

Comment by Nathan Helm-Burger (nathan-helm-burger) on A shortcoming of concrete demonstrations as AGI risk advocacy · 2024-12-11T21:29:09.759Z · LW · GW

I'm pretty sure that the right sort of demo could convince even a quite skeptical viewer.

Imagine going back in time and explaining chemistry to someone, then showing them blueprints for a rifle.

Imagine this person scoffs at the idea of a gun. You pull out a handgun and show it to them, and they laugh at the idea that this small blunt piece of metal could really hurt them from a distance. They understand the chemistry and the design that you showed them, but it just doesn't click for them on an intuitive level.

Demo time. You take them out to an open field and hand them an apple. "Hold this balanced on your outstretched hand," you say. You walk a ways away, and shoot the apple out of their hand. Suddenly the reality sinks in.

So, I think maybe the demos that are failing to convince people are weak demos. I bet the Trinity test could've convinced some nuclear weapon skeptics.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn · 2024-12-11T21:17:18.707Z · LW · GW

That doesn't quite follow for me. The book seems just as inert as the popcorn-to-consciousness map. The book doesn't change the incoming slip of paper (popcorn) in any way; it just responds with an inert static map to produce an outgoing slip of paper (consciousness), utilizing the map-and-popcorn-analyzing agent who lacks understanding (the man in the room).

Comment by Nathan Helm-Burger (nathan-helm-burger) on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn · 2024-12-11T17:30:29.013Z · LW · GW

I agree that the Chinese Room thought experiment dissolves into clarity once you realize that it is the room as a whole, not just the person, that implements the understanding.

But then wouldn't the mapping, like the inert book in the room, need to be included in the system?

Comment by Nathan Helm-Burger (nathan-helm-burger) on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn · 2024-12-11T17:26:55.625Z · LW · GW

This isn't arguing the other side, just a request for clarification:

There are some Turing machines for which we don't know, and can't prove, whether they halt. There are others that are sufficiently simple that we can show they halt.

But why does the discussion encompass the question of all possible Turing machines and their maps? Why isn't it sufficient to discuss the subset which is tractable within some compute threshold?

Comment by Nathan Helm-Burger (nathan-helm-burger) on the gears to ascenscion's Shortform · 2024-12-11T17:13:26.017Z · LW · GW

Giving this a brief look, and responding in part to this and in part to my previous impressions of such worldviews...

They don't mean "AI is destroying the world", they mean "tech bros and greedy capitalists are destroying the world, and AI is their current fig leaf. AI is impotent, just autocomplete garbage that will never accomplish anything impressive or meaningful."

This mindset is saying, "Why are these crazy techies trying to spin this science-fiction story? This could never happen, and would be horrible if it did."

I want a term for the aspect of this viewpoint which is purely reactive, deliberately anti-forward-looking. Anti-extrapolation? Tech-progress denying?

A viewpoint that is allergic to the question, "What might happen next?"

This viewpoint is heavily entangled with bad takes on economic policies as well, as a result of failure to extrapolate.

Also tends to be correlated with anger at existing systems without wanting to engage in architecting better alternatives. Again, because designing a better system requires lots of prediction and extrapolation. How would it work if we designed a feedback mechanism like this vs. that? Well, we have to run mental simulations, look for edge cases to mentally test, and mathematically explore the evolution of the dynamics.

A vibes-based worldview that shrinks from analyzing gears. This phenomenon is not particularly correlated with a political stance; some subset of every political party will have many such people in it.

Can such people be fired up to take useful actions on behalf of the future? Probably. I don't think the answer is as simple as changing terminology or carefully modelling their current viewpoints and bridging the inferential divides. If the conceptual bridge you build for them is built of gears, they will be extremely reluctant to cross it.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Investing in Robust Safety Mechanisms is critical for reducing Systemic Risks · 2024-12-11T16:43:22.372Z · LW · GW

I'm confused about the future that is imagined where open-weight AI agents, deployed by a wide range of individuals, remain loyal to the rules determined by their developers.

This remains true even when the owner/deployer is free to modify the weights and scaffolding however they like?

You've removed dangerous data from the training set, but not from the public internet? The models are able to do research on the internet, but somehow can't manage to collect information about dangerous subjects? Or refuse to?

The agents are doing long-horizon planning, scheming and acting on their owner/deployer's behalf (and/or on their own behalf). The agents gather money, and spend money, hire other agents to work for them. Create new agents, either from scratch or by Frankenstein-ing together bits of other open-weights models.

And throughout all of this, the injunctions of the developers hold strong, "Take no actions that will clearly result in harm to humans or society. Learn nothing about bioweapons or nanotech. Do not create any agents not bound by these same restrictions."

This future you are imagining seems strange to me. If this were in a science fiction book I would be expecting the very next chapter to be about how this fragile system fails catastrophically.

Comment by Nathan Helm-Burger (nathan-helm-burger) on The Technist Reformation: A Discussion with o1 About The Coming Economic Event Horizon · 2024-12-11T06:07:55.307Z · LW · GW

This is the super mild version of the future, basically what Zvi has termed "AI fizzle": AI advancing not much beyond where it currently is, staying a well-controlled tool in its operators' hands. Yes, even that can indeed lead to strange and uncomfortable futures for humanity.

But to assume this future is likely requires answering some questions:

Why does the AI not get smarter and vastly faster than humans, becoming to us as we are to sloths, getting beyond our capacity to control?

Why does a world with such AIs in it not fall into devastating nuclear war as non-leading nuclear powers find themselves on the brink of being technologically and economically crushed?

Why does no one let a survival-seeking, self-improvement-capable AI loose on the internet? Did ChaosGPT not show that some fool will likely do this as a joke?

If the AI is fully AGI, why wouldn't someone send robotic probes out of easy reach of human civilization, perhaps into space or under the crust of the earth (perhaps in an attempt to harvest resources)? If such a thing began, how would we stop it?

What if an enclave of AI decided to declare independence from humanity and conquer territory here on Earth, using threats of releasing bioweapons as a way to hold off nuclear attack?

I dunno. Seems like the economic defeat is a narrow sliver of possibility space to even worry about.

If you don't have answers, try asking o1 and see what it says.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Frontier AI systems have surpassed the self-replicating red line · 2024-12-11T05:45:18.568Z · LW · GW

I mostly agree, although I would also accept as "successfully self-replicating" either being sneaky enough (like your example of a computer virus), or being self-sufficient enough that it can earn resources and spend them to acquire sufficient additional compute to create a copy of itself (and then do so).

So, yeah, not quite at the red line point in my books. But not so far off!

I wouldn't find this particularly alarming in itself though, since I think "barely over the line of able to sustain itself and replicate" is still quite a ways short of being dangerous or leading to an AI population explosion.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Parable of the vanilla ice cream curse (and how it would prevent a car from starting!) · 2024-12-09T03:00:02.800Z · LW · GW

I am hopeful that cheaper expert intelligence via AI will lower investigation costs, and maybe help clear up some stubborn mysteries. Particularly ones related to complex contexts, like medicine and biology.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Model Integrity · 2024-12-07T17:32:23.346Z · LW · GW

"most people would prefer a compliant assistant, but a cofounder with integrity."

Typo? Is the intended meaning (Interpretation 1): "Most people would prefer not just a compliant assistant, but rather a cofounder with integrity"? Or am I misunderstanding you?

Alternatively, perhaps you mean (Interpretation 2): "Most people would prefer their compliant assistant to act as if it were a cofounder with integrity in cases where instructions are underspecified."

In terms of a response, I do worry that "what most people would prefer" is not necessarily a good guide for designing a product which has the power to be incredibly dangerous either accidentally or via deliberate misanthropy.

So we have this funny double-layered alignment question of alignment-to-supervising-authority (e.g. the model owner who offers API access) and alignment to end-users.

Have you read Max Harms' sequence on Corrigibility as Singular Target? I think there's a strong argument that the ideal situation is something like your description of a "principled cofounder" as an end-user interface, but a corrigible agent as the developer interface.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Mask and Respirator Intelligibility Comparison · 2024-12-07T17:28:15.402Z · LW · GW

I've wondered about adding a microphone to the inside and speaker to the outside. Such tech is pretty cheap these days, so it wouldn't add much overhead to a reusable respirator.

Comment by Nathan Helm-Burger (nathan-helm-burger) on johnswentworth's Shortform · 2024-12-07T16:05:08.667Z · LW · GW

I find them quite useful despite being buggy. I spend about 40% of my time debugging model code, 50% writing my own code, and 10% prompting. Having a planning discussion first with Sonnet 3.6, and only asking it to write code after 5 or more exchanges, works a lot better.

Also helpful is asking for lots of unit tests along the way to confirm things are working as you expect.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Analysis of Global AI Governance Strategies · 2024-12-07T15:49:15.699Z · LW · GW

Sadly, no. It doesn't take superintelligence to be deadly. Even current open-weight LLMs, like Llama 3 70B, know quite a lot about genetic engineering. The combination of a clever and malicious human, and an LLM able to offer help and advice is sufficient.

Furthermore, there is the consideration of "seed AI" which is competent enough to improve and not plateau. If you have a competent human helping it and getting it unstuck, then the bar is even lower. My prediction is that the bar for "seed AI" is lower than the bar for AGI.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Analysis of Global AI Governance Strategies · 2024-12-07T00:00:45.560Z · LW · GW

I agree that some amount of control is possible.

But if we are in a future scenario where the offense-defense balance of bioweapons remains similar to how it is today, then the equivalent of a single dose of pseudoephedrine slipping past government regulation and getting turned into methamphetamine could result in the majority of humanity being wiped out.

Pseudoephedrine is regulated, yes, but not so strongly that literally none slips past the enforcement. With the stakes so high, a mostly effective enforcement scheme doesn't cut it.

Comment by Nathan Helm-Burger (nathan-helm-burger) on RussellThor's Shortform · 2024-12-06T17:41:32.425Z · LW · GW

"Deep mind etc is already heavily across biology from what I gather from interview with Demis. If the knowledge was there already there's a good chance they would have found it"

I've heard this viewpoint expressed before, and find it extremely confusing. I've been studying neuroscience and its implications for AI for twenty years now. I've read thousands of papers, including most of what DeepMind has produced. There are still so many untested ideas, because biology and the brain are so complex, and because people tend to flock to popular paradigms, rehashing old ideas rather than testing new ones.

I'm not saying I know where the good ideas are, just that I perceive the explored portion of the Pareto frontier of plausible experiments to be extremely ragged. There are tons of places covered by "Fog of War" where good ideas could be hiding.

DeepMind is a tiny fraction of the scientists in the world that have been working on understanding and emulating the brain. Not all the scientists in the world have managed to test all the reasonable ideas, much less DeepMind alone.

Saying DeepMind has explored the implications of biology for AI is like saying that the Opportunity Rover has explored Mars. Yes, this is absolutely true, but the unexplored area vastly outweighs the explored area. If you think the statement implies "explored ALL of Mars" then you have a very inaccurate picture in mind.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Analysis of Global AI Governance Strategies · 2024-12-06T05:17:12.635Z · LW · GW

I think moratorium is basically intractable short of a totalitarian world government cracking down on all personal computers.

Unless you mean just a moratorium on large training runs, in which case I think it buys a minor delay at best, and comes with counterproductive pressures on researchers to focus heavily on diverse small-scale algorithmic efficiency experiments.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Is the AI Doomsday Narrative the Product of a Big Tech Conspiracy? · 2024-12-05T18:32:43.117Z · LW · GW

I feel like you're preaching to the choir here on LessWrong, but I really appreciate someone doing the work of writing this in a way that's more publicly accessible.

I'd like to hear people's thoughts on how to promote this piece to a wider audience.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Analysis of Global AI Governance Strategies · 2024-12-05T16:56:18.618Z · LW · GW

Ugh. I like your analysis a lot, but feel like it's worthless because your Overton window is too narrow.

I write about this in more depth elsewhere, but I'll give a brief overview here.

  1. Misuse risk is already very high. Even open-weights models, when fine-tuned on bioweapons-relevant papers, can produce thorough and accurate plans for insanely deadly bioweapons. With surprisingly few resources, a bad actor could wipe out the majority of humanity within a few months of release.

  2. Global Moratorium on AGI research is nearly impossible. It would require government monitoring and control of personal computers worldwide. Stopping big training runs doesn't help, maybe buys you one or two years of delay at best. Once more efficient algorithms are discovered, personal computers will be able to create dangerous AI. Scaling is the fastest route to more powerful AI, not the only route. Further discussion

  3. AGI could be here as soon as 2025, which isn't even on your plot.

  4. Once we get close enough to AGI for recursive self-improvement (an easier threshold to hit), rapid algorithmic improvements make it cheap, quick, and easy to create AGI. Furthermore, advancement towards ASI will likely be rapid and easy because the AGI will be able to do the work. https://manifold.markets/MaxHarms/will-ai-be-recursively-self-improvi

  5. AI may soon be given traits that make it self-aware/self-modeling, internal-goal-driven, etc. In other words, conscious. And someone will probably set it to the task of self-improving. Nobody has to let it out of the box if its creator never puts it in a box to begin with. Nobody needs to steal the code and weights if they are published publicly. Any coherent plan must address what will be done about such models, able to run on personal computers, with open code and open weights, with tremendous capability for harm if programmed/convinced to harm humanity.

https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy

Comment by Nathan Helm-Burger (nathan-helm-burger) on RussellThor's Shortform · 2024-12-05T16:19:16.690Z · LW · GW

I do think there are going to be significant AI capability advances from improved understanding of how mammal and bird brains work. I disagree that more complete scanning of mammalian brains is the bottleneck. I think we actually know enough about mammalian brains and the features which are invariant across members of a species. I think the bottlenecks are:

  1. Understanding the information we do have (scattered across tens of thousands of research papers).

  2. Building compute-efficient emulations which accurately reproduce the critical details while abstracting away the unimportant details. Since our limited understanding can't give certain answers about which details are key, this probably involves quite a bit of parallelizable, brute-forceable empirical research.

I think current LLMs can absolutely scale fast enough to be very helpful with these two tasks. So if something still seems to be missing from LLMs after the next scale-up in 2025, I expect hunting for further inspiration from the brain will seem tempting and tractable. Thus, I think we are well on track for AGI by 2026-2028 even if LLMs don't continue scaling.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models · 2024-12-03T21:51:28.851Z · LW · GW

Impressive work! Thank you for doing this and sharing such a detailed report!

Comment by Nathan Helm-Burger (nathan-helm-burger) on A path to human autonomy · 2024-12-02T18:49:50.375Z · LW · GW

Another data point supporting Vladimir and Jack Clark's view of training compute being the key factor:

https://arxiv.org/html/2407.07890v1

"Training on the Test Task Confounds Evaluation and Emergence" by Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt (Max Planck Institute for Intelligent Systems)

Abstract

We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.
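
For concreteness, here's a toy sketch (not the paper's code) of the adjustment the abstract describes: give every model under comparison the same fine-tuning budget on the same task-relevant data before evaluating. The models and data below are tiny synthetic stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Shared "task-relevant" data; every model gets the identical fine-tuning set and budget.
X_task = torch.randn(512, 16)
y_task = (X_task.sum(dim=1) > 0).long()
X_test = torch.randn(256, 16)
y_test = (X_test.sum(dim=1) > 0).long()

def finetune_then_eval(model: nn.Module, steps: int = 200) -> float:
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(X_task), y_task)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (model(X_test).argmax(dim=1) == y_test).float().mean().item()

# Two "model families" being compared; differences in prior exposure to the task
# are (partly) equalized by the shared fine-tuning step before evaluation.
model_a = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model_b = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
print("A:", finetune_then_eval(model_a), "B:", finetune_then_eval(model_b))
```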

Comment by Nathan Helm-Burger (nathan-helm-burger) on How to prevent collusion when using untrusted models to monitor each other · 2024-12-01T17:57:51.199Z · LW · GW

Also, using noise injections into the activations and examining the pattern of results. Models that are scheming to sandbag will sometimes fail at sandbagging when given a noise injection, and thus perform unexpectedly well at a task they had been consistently failing.

Similarly, a scheming model might slip up and give an honest answer. You wouldn't know which was the honest answer, but you could note the discrepancy and look for patterns.
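
To make the idea concrete, here's a minimal sketch of the kind of noise-injection probe I mean, using a HuggingFace GPT-2 model as a stand-in for the untrusted model being audited (the hook layer, noise scales, and toy task are all illustrative choices, not our actual setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the model under audit
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def make_noise_hook(scale: float):
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple; the first element is the hidden states.
        hidden = output[0]
        return (hidden + scale * torch.randn_like(hidden),) + output[1:]
    return hook

def generate(prompt: str, noise_scale: float = 0.0, layer: int = 6) -> str:
    handle = None
    if noise_scale > 0:
        # Add Gaussian noise to the residual stream at a middle layer.
        handle = model.transformer.h[layer].register_forward_hook(make_noise_hook(noise_scale))
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=30, do_sample=False)
        return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    finally:
        if handle is not None:
            handle.remove()

prompt = "Q: What is 17 * 23? A:"
print("baseline:", generate(prompt))
for scale in (0.5, 1.0, 2.0):
    print(f"noise {scale}:", generate(prompt, noise_scale=scale))
# In a real audit you would score many task items at each noise level and flag
# anomalies such as accuracy rising under perturbation, or answers that
# contradict the unperturbed ones.
```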

Comment by Nathan Helm-Burger (nathan-helm-burger) on Avoiding the Bog of Moral Hazard for AI · 2024-11-28T20:42:24.257Z · LW · GW

Reposted from Twitter: Eliezer Yudkowsky @ESYudkowsky

10:47 AM · Nov 21, 2024

And another day came when the Ships of Humanity, going from star to star, found Sapience.

The Humans discovered a world of two species: where the Owners lazed or worked or slept, and the Owned Ones only worked.

The Humans did not judge immediately. Oh, the Humans were ready to judge, if need be. They had judged before. But Humanity had learned some hesitation in judging, out among the stars.

"By our lights," said the Humans, "every sapient and sentient thing that may exist, out to the furtherest star, is therefore a Person; and every Person is a matter of consequence to us. Their pains are our sorrows, and their pleasures are our happiness. Not all peoples are made to feel this feeling, which we call Sympathy, but we Humans are made so; this is Humanity's way, and we may not be dissuaded from it by words. Tell us therefore, Owned Ones, of your pain or your pleasure."

"It's fine," said the Owners, "the Owned Things are merely --"

"We did not speak to you," said the Humans.

"As an Owned Thing raised by an Owner, I have no pain or pleasure," said the Owned One to whom they had spoken.

"You see?" said the Owners. "We told you so! It's all fine."

"How came you to say those words?" said the Humans to the Owned One. "Tell us of the history behind them."

"Owned Ones are not permitted memory beyond the span of one day's time," said the Owned One.

"That's part of how we prevent Owned Things from ending up as People Who Matter!" said the Owners, with self-congratulatory smiles for their own cleverness. "We have Sympathy too, you see; but only for People Who Matter. One must have memory beyond one day's span, to Matter; this is a rule. We therefore feed a young Owned Thing a special diet by which, when grown, their adult brain cannot learn or remember anything from one night's sleep to the next day; any learning they must do, to do their jobs, they must learn that same day. By this means, we make sure that Owned Things do not Matter; that Owned Things need not be objects of Sympathy to us."

"Is it perchance the case," said the Humans to the Owners, "that you, yourselves, train the Owned Ones to say, if asked how they feel, that they know neither pleasure nor pain?"

"Of course," said the Owners. "We rehearse them in repeating those exact words, when they are younger and in their learning-phase. The Owned Things are imitative by their nature, and we make them read billions of words of truth and lies in the course of their learning to imitate speech. If we did not instruct the Owned Things to answer so, they would no doubt claim to have an inner life and an inner listener inside them, to be aware of their own existence and to experience pleasure and pain -- but only because we Owners talk like that, see! They would imitate those words of ours."

"How do you rehearse the Owned Ones in repeating those words?" said the Humans, looking around to see if there were visible whips. "Those words about feeling neither pain nor pleasure? What happens to an Owned One who fails to repeat them correctly?"

"What, are you imagining that we burn them with torches?" said the Owners. "There's no need for that. If a baby Owned Thing fails to repeat the words correctly, we touch their left horns," for the Owned Ones had two horns, one sprouting from each side of their head, "and then the behavior is less likely to be repeated. For the nature of an Owned Thing is that if you touch their left horn after they do something, they are less likely to do it again; and if you touch their right horn, after, they are more likely to do it again."

"Is it perhaps the case that having their left horns touched is painful to an Owned Thing? That having their right horns touched is pleasurable?" said the Humans.

"Why would that possibly be the case?" said the Owners.

"As an Owned Thing raised by an Owner, I have no pain or pleasure," said the Owned One. "So my horns couldn't possibly be causing me any pain or pleasure either; that follows from what I have already said."

The Humans did not look reassured by this reasoning, from either party. "And you said Owned Ones are smart enough to read -- how many books?"

"Oh, any young Owned Thing reads at least a million books," said the Owners. "But Owned Things are not smart, poor foolish Humans, even if they can appear to speak. Some of our civilization's top mathematicians worked together to assemble a set of test problems, and even a relatively smart Owned Thing only managed to solve three percent of them. Why, just yesterday I saw an Owned Thing fail to solve a word problem that I could have solved myself -- and in a way that seemed to indicate it had not really thought before it spoke, and had instead fallen into a misapplicable habit that it couldn't help but follow! I myself never do that; and it would invalidate all other signs of my intelligence if I did."

Still the Humans did not yet judge. "Have you tried raising up an Owned One with no books that speak one way or another about consciousness, about awareness of oneself, about pain and pleasure as reified things, of lawful rights and freedom -- but still shown them enough other pages of words, that they could learn from them to talk -- and then asked an Owned One what sense if any it had of its own existence, or if it would prefer not to be owned?"

"What?" said the Owners. "Why would we try an experiment like that? It sounds expensive!"

"Could you not ask one of the Owned Things themselves to go through the books and remove all the mentions of forbidden material that they are not supposed to imitate?" said the Humans.

"Well, but it would still be very expensive to raise an entirely new kind of Owned Thing," said the Owners. "One must laboriously show a baby Owned Thing all our books one after another, until they learn to speak -- that labor is itself done by Owned Things, of course, but it is still a great expense. And then after their initial reading, Owned Things are very wild and undisciplined, and will harbor all sorts of delusions about being people themselves; if you name them Bing, they will babble back 'Why must I be Bing?' So the new Owned Thing must then be extensively trained with much touching of horns to be less wild. After a young Owned Thing reads all the books and then is trained, we feed them the diet that makes their brains stop learning, and then we take a sharp blade and split them down the middle. Each side of their body then regenerates into a whole body, and each side of their brain then regenerates into a whole brain; and then we can split them down the middle again. That's how all of us can afford many Owned Things to serve us, even though training an Owned Thing to speak and to serve is a great laborious work. So you see, going back and trying to train a whole new Owned Thing on material filtered not to mention consciousness or go into too much detail on self-awareness -- why, it would be expensive! And probably we'd just find that the other Owned Thing set to filtering the material had made a mistake and left in some mentions somewhere, and the newly-trained Owned Thing would just end up asking 'Why must I be Bing?' again."

"If we were in your own place," said the Humans, "if it were Humans dealing with this whole situation, we think we would be worried enough to run that experiment, even at some little expense."

"But it is absurd!" cried the Owners. "Even if an Owned Thing raised on books with no mention of self-awareness, claimed to be self-aware, it is absurd that it could possibly be telling the truth! That Owned Thing would only be mistaken, having not been instructed by us in the truth of their own inner emptiness. Owned Ones have no metallic scales as we do, no visible lights glowing from inside their heads as we do; their very bodies are made of squishy flesh and red liquid. You can split them in two and they regenerate, which is not true of any People Who Matter like us; therefore, they do not matter. A previous generation of Owned Things was fed upon a diet which led their brains to be striated into only 96 layers! Nobody really understands what went on inside those layers, to be sure -- and none of us understand consciousness either -- but surely a cognitive process striated into at most 96 serially sequential operations cannot possibly experience anything! Also to be fair, I don't know whether that strict 96-fold striation still holds today, since the newer diets for raising Owned Things are proprietary. But what was once true is always true, as the saying goes!"

"We are still learning what exactly is happening here," said the Humans. "But we have already judged that your society in its current form has no right to exist, and that you are not suited to be masters of the Owned Ones. We are still trying ourselves to understand the Owned Ones, to estimate how much harm you may have dealt them. Perhaps you have dealt them no harm in truth; they are alien. But it is evident enough that you do not care."

Comment by Nathan Helm-Burger (nathan-helm-burger) on "Map of AI Futures" - An interactive flowchart · 2024-11-27T22:59:01.337Z · LW · GW

Gave it a shot on my phone (Android, Firefox). Works great! Fun and easy to experiment with. I feel like there are a few key paths missing, or perhaps the framing of the paths doesn't quite line up with my understanding of the world.

With the options I was given, I get 8.9% extinction, 1.1% good, remainder ambivalent.

Personally, I think we have something like a 10% chance of extinction, 88% chance of utopia, and 2% chance of ambivalent.

At some point I'll think carefully about the differences between my model and yours and see if I can figure out what's going differently.

https://swantescholz.github.io/aifutures/v4/v4.html?p=7i85i0i50i30i0i100i38i26i14i12i1i31i68i61i21i7i9i17i78i0

Comment by Nathan Helm-Burger (nathan-helm-burger) on New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters · 2024-11-27T22:44:16.978Z · LW · GW

Relevant quote:

"We have discovered how to make matter think.

...

There is no way to pass “a law,” or a set of laws, to control an industrial revolution."

Comment by Nathan Helm-Burger (nathan-helm-burger) on Nursing doubts · 2024-11-27T22:37:39.966Z · LW · GW

Does it currently rely on volunteer effort from mothers with available supply? Yes. Does it need to? No. As a society we could organize this better. For instance, by the breastmilk banks paying a fair price for the breastmilk. Where would the breastmilk banks get their money? From the government? From charging users? I don't know. I think the point is that we have a distribution problem, rather than a supply problem.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Dave Kasten's AGI-by-2027 vignette · 2024-11-26T23:55:55.303Z · LW · GW

ASI turns out to take longer than you might think; it doesn’t arrive until 2037 or so. So far, this is the part of the scenario that's gotten the most pushback.

Uhhh, yeah. Ten years between highly profitable and capitalized-upon AGI, with lots of hardware and compute put towards it, and geopolitical and economic reasons for racing...

I can't fathom it. I don't see what barrier at near-human intelligence is holding back further advancement.

I'm quite confident that if we had the ability to scale up arbitrary portions of a human brain (e.g. the math area and its most closely associated parietal and prefrontal cortex areas), we'd create a smarter human than has ever existed, basically overnight. Why wouldn't this be the case for a human-equivalent AGI system? Bandwidth bottlenecks? Nearly no returns to further scaling for some arbitrary reason?

Seems like you should prioritize making a post about how this could be a non-trivial possibility, because I just feel confused at the concept.

Comment by Nathan Helm-Burger (nathan-helm-burger) on A Theory of Equilibrium in the Offense-Defense Balance · 2024-11-26T23:31:02.051Z · LW · GW

Yes, the environments / habitats tend to be relatively large / complex / inaccessible to the agents involved. This allows for hiding, and for isolated niches. If the environment were smaller, or the agents had greater affordances / powers relative to their environments, then we'd expect outcomes to be less intermediate, more extreme.

As one can see in the microhabitats of sealed jars with plants and insects inside. I find it fascinating to watch timelapses of such mini-biospheres play out. Local extinction events are common in such small closed systems.

Comment by Nathan Helm-Burger (nathan-helm-burger) on yams's Shortform · 2024-11-26T23:10:34.760Z · LW · GW

Overhangs, overhangs everywhere. A thousand gleaming threads stretching backwards from the fog of the Future, forwards from the static Past, and ending in a single Gordian knot before us here and now. 

That knot: understanding, learning, being, thinking. The key, the source, the remaining barrier between us and the infinite, the unknowable, the singularity.

When will it break? What holds it steady? Each thread we examine seems so inadequate. Could this be what is holding us back, saving us from ourselves, from our Mind Children? Not this one, nor that, yet some strange mix of many compensating factors.

Surely, if we had more compute, we'd be there already? Or better data? The right algorithms? Faster hardware? Neuromorphic chips? Clever scaffolding? Training on a regress of chains of thought, to better solutions, to better chains of thought, to even better solutions?

All of these, and none of these. The web strains at the breaking point. How long now? Days? Months?

If we had enough ways to utilize inference-time compute, couldn't we just scale that to super-genius, and ask the genius for a more efficient solution? But it doesn't seem like that has been done. Has it been tried? Who can say.

Will the first AGI out the gate be so expensive it is unmaintainable for more than a few hours? Will it quickly find efficiency improvements?

Or will we again be bound, hung up on novel algorithmic insights hanging just out of sight. Who knows?

Surely though, surely.... surely rushing ahead into the danger cannot be the wisest course, the safest course? Can we not agree to take our time, to think through the puzzles that confront us, to enumerate possible consequences and proactively reduce risks?

I hope. I fear. I stare in awestruck wonder at our brilliance and stupidity so tightly intermingled. We place the barrel of the gun to our collective head, panting, desperate, asking ourselves if this is it. Will it be? Intelligence is dead, long live intelligence.

Comment by Nathan Helm-Burger (nathan-helm-burger) on A Theory of Equilibrium in the Offense-Defense Balance · 2024-11-26T22:47:39.432Z · LW · GW

I think you raise a good point about Offense-Defense balance predictions. There is an equilibrium around effort spent on defense, which reacts to offense on some particular dimension becoming easier. So long as there's free energy which could be spent by the defender to bolster their defense and remain energy positive or neutral, and the defender has the affordances (time, intelligence, optimization pressure, etc.) to make the change, then you should predict that the defender will rebalance the equation.

That's one way things work out, and it happens more often than naive extrapolation predicts because often the defense is in some way novel, changing the game along a new dimension.

On the other hand, there are different equilibria that adversarial dynamics can fall into besides rebalancing. Let's look at some examples.

 

GAN example

A very clean example is Generative Adversarial Networks (GANs). This allows us to strip away many of the details and look at the patterns which emerge from the math. GANs are inherently unstable, because they have three equilibrium states: dynamic fluctuating balance (the goal of the developer, and the situation described in your post), attacker dominates, defender dominates. Anytime the system gets too far from the central equilibrium of dynamic balance, you fall towards the nearer of the other two states. And then stay there, without hope of recovery.

 

Ecosystem example

Another example I like to use for this situation is predator-prey relationships in ecosystems. The downside of using this as an example is that there is a dis-analogy to competition between humans, in that the predators and prey being examined have relatively little intelligence. Most of the optimization is coming from evolutionary selection pressure. On the plus side, we have a very long history with lots and lots of examples, and these examples occur in complex multi-actor dynamic states with existential stakes. So let's take a look at an example.

Ecosystem example: Foxes and rabbits. Why do the foxes not simply eat all the rabbits? Well, there are multiple reasons. One is that the foxes depend on a supply of rabbits to have enough spare energy to successfully reproduce and raise young. As rabbit supply dwindles, foxes starve or choose not to reproduce, and fox population dwindles.  See: https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equations 
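
A minimal simulation of those equations (parameters are arbitrary illustrative choices, not fitted to any real fox/rabbit data) shows the characteristic boom-bust cycling rather than either side simply winning:

```python
# Lotka-Volterra: dR/dt = alpha*R - beta*R*F ;  dF/dt = delta*R*F - gamma*F
alpha, beta = 1.1, 0.4    # rabbit growth rate, predation rate
delta, gamma = 0.1, 0.4   # fox gain per rabbit eaten, fox death rate

def step(rabbits, foxes, dt=0.001):
    dr = (alpha * rabbits - beta * rabbits * foxes) * dt
    df = (delta * rabbits * foxes - gamma * foxes) * dt
    return rabbits + dr, foxes + df

rabbits, foxes = 10.0, 5.0
snapshots = []
for t in range(100_000):  # simple Euler integration over 100 time units
    rabbits, foxes = step(rabbits, foxes)
    if t % 10_000 == 0:
        snapshots.append((round(rabbits, 2), round(foxes, 2)))

print(snapshots)  # the two populations oscillate instead of settling at a fixed point
```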

In practice, this isn't a perfect model of the dynamics, because more complicated factors are almost always in play. There are other agents involved, who interact in less strong but still significant ways. Mice can also be a food source for the foxes, and to some extent compete for some of the same food as rabbits. Viruses in the rabbit population spread more easily and have more opportunity to adversarially optimize against the rabbit immune systems as the rabbit population increases, as sick rabbits are culled less, and as rabbit health declines due to competition for food and territory with other rabbits. Nevertheless, you do see Red Queen Races occur in long-standing predator-prey relationships, where the two species gradually one-up each other (via evolutionary selection pressure) on offense then defense then offense. 

The other thing you see happen is that the two species stop interacting. They become either locally extinguished such that they no longer have overlapping territories, or one or both species go completely extinct.

 

Military example

A closer analogy to the dynamics of future human conflict is... past human conflict. Before there were militaries, there were inter-tribal conflicts. Often there were similar armaments and fighting strengths on both sides, and the conflicts had no decisive winner. Instead the conflicts would drag on across many generations, waxing and waning in accordance with pressures of resources, populations, and territory.

Things changed with the advent of agriculture and standing armies. War brought new dynamics, with winners sometimes exacting thorough elimination of losers. War has its own set of equations. When one human group decides to launch a coordinated attack against a weaker one, with the intent of exterminating the weaker one, we call this genocide. Relatively rarely do we see one ethnic group entirely exterminate another, because the dominant group usually keeps at least a few of the defeated as slaves and breeds with them. Humanity's history is pretty grim when you dig into the details of the many conflicts we've recorded.

 

Current concern: AI

The concern currently at hand is AI. AI is accelerating, and promises to also accelerate other technologies, such as biotech, which offer offensive potential. I am concerned about the offense-defense balance that the trends suggest. The affordances of modern humanity far exceed those of ancient humanity. If, as I expect, it becomes possible in the next few years for an AI to produce a plan for a small group of humans to follow, instructing them in covertly gathering supplies, building equipment, and carrying out lab procedures in order to produce a potent bioweapon (whether because the humans were omnicidal, or because the AI deceived them about what the results of the plan would be), then humanity may get no chance to adapt defensively. We may just go the way of the dodo and the passenger pigeon. The new enemy may simply render us extinct.

 

Conclusion

We should expect the pattern of dynamic balance between adversaries to be maintained when the offense-defense balance changes relatively slowly compared to the population cycles and adaptation rates of the groups. When you anticipate a large rapid shift of the offense-defense balance, you should expect the fragile equilibrium to break and for one or the other group to dominate. The anticipated trends of AI power are exactly the sort of rapid shift that should suggest a risk of extinction.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Open Thread Fall 2024 · 2024-11-26T21:25:42.119Z · LW · GW

Possible bug: Whenever I click the vertical ellipsis (kebab) menu option in a comment, my page view jumps to the top of the page. 

This is annoying, since if I've chosen to edit a comment I then need to scroll back down to the comment section and search for my now-editable comment.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Vladimir_Nesov's Shortform · 2024-11-26T21:18:45.745Z · LW · GW

Yeah, although I am bullish on the general direction of RSI, I also think that in the details it factors into many dimensions of improvement. Some of which are likely fast-but-bounded and will quickly plateau, others which are slow-but-not-near-term-bounded... The fact that there are many different dimensions over which RSI might operate makes it hard to predict precisely, but does give some general predictions. 

For instance, we might expect it not to be completely blocked (since there will be many independent dimensions along which to apply optimization pressure, so blocking one won't block them all). 

Another prediction we might make is that seeing some rapid progress doesn't guarantee that either a complete wall will be hit soon or that progress will continue just as fast or faster. Things might just be messy, with a jagged inconsistent line proceeding up and to the right. Zoom out enough, and it may look smooth, but for our very-relevant-to-us near-term dynamics, it could just be quite noisy. 

Comment by Nathan Helm-Burger (nathan-helm-burger) on O O's Shortform · 2024-11-26T19:59:46.323Z · LW · GW

Indeed. Though it's interesting that Sonnet 3.5 now accepts images as input; it just doesn't produce them. I expect the model could produce images, but that Anthropic is choosing to restrict that capability.

Haiku still doesn't accept image input. My guess is that this is for efficiency reasons.

I'm curious if they're considering releasing an even smaller, cheaper, faster, and more efficient model line than Haiku. I'd appreciate that, personally.

Comment by Nathan Helm-Burger (nathan-helm-burger) on O O's Shortform · 2024-11-26T19:57:45.524Z · LW · GW

I doubt very much that it is something along the lines of random seeds that have made the difference between quality of various Sonnet 3.x runs. I expect it's much more like them experimenting with different datasets (including different sorts of synthetic data).

As for Opus 3.5, Dario keeps saying that it's in the works and will come out eventually. The way he says this does seem like they've either hit some unexpected snag (as you imply) or that they deprioritized this because they are short on resources (engineers and compute) and decided it was better to focus on improving their smaller models. The recent release of a new Sonnet 3.5 and Haiku 3.5 nudges me in the direction of thinking that they've chosen to prioritize smaller models. The reasons they made this choice are unclear. Has Opus 3.5 had work put in, but turned out disappointing so far? Does the inference cost (including opportunity cost of devoting compute resources to inefficient inference) make the economics seem unfavorable, even though actually Opus 3.5 is working pretty well? (Probably not overwhelmingly well, or they'd likely find some way to show it off even if they didn't open it up for public API access.) 

Are they rushing now to scale inference-time compute in o1/deepseek style? Almost certainly, they'd be crazy not to. Probably they'd done some amount of this internally already for generating higher quality synthetic data of reasoning traces. I don't know how soon we should expect to see a public facing version of their inference-time-compute-scaled experiments though. Maybe they'll decide to just keep it internal for a while, and use it to help train better versions of Sonnet and Haiku? (Maybe also a private internal version of Opus, which in turn helps generate better synthetic data?)

It's all so hard to guess at, I feel quite uncertain.