nikola's Shortform

post by nikola (nikolaisalreadytaken) · 2024-04-05T17:50:01.203Z · LW · GW · 22 comments

comment by nikola (nikolaisalreadytaken) · 2024-04-05T17:50:01.278Z · LW(p) · GW(p)

I wish someone ran a study measuring human performance on SWE-bench. There are ways to do this for around $20k: if you evaluate on 10% of SWE-bench (around 200 problems), with around 1 hour spent per problem, that's around 200 hours of software engineer time. Paying $100/hr with one trial per problem, that comes out to $20k. You could do this on even less than 10% of SWE-bench, but the signal would be noisier.
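
For concreteness, the arithmetic behind the $20k figure, using only the numbers above:

```python
# Back-of-envelope cost of the human SWE-bench baseline described above.
problems = 200          # roughly 10% of SWE-bench
hours_per_problem = 1   # one trial, ~1 hour of software-engineer time each
hourly_rate = 100       # USD per hour
print(f"${problems * hours_per_problem * hourly_rate:,}")  # -> $20,000
```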

The reason I think this would be good is that SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D-related tasks, so being able to better forecast the arrival of human-level software engineers would be great for timelines/takeoff-speed models.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-04-05T19:12:55.626Z · LW(p) · GW(p)

This seems mostly right to me and I would appreciate such an effort.

One nitpick:

The reason I think this would be good is that SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D-related tasks

I expect this will improve over time and that SWE-bench won't be our best fixed benchmark in a year or two. (SWE-bench is only about 6 months old at this point!)

Also, I think if we put aside fixed benchmarks, we have other reasonable measures.

Replies from: nikolaisalreadytaken
comment by nikola (nikolaisalreadytaken) · 2024-04-06T02:52:10.013Z · LW(p) · GW(p)

I expect us to reach a level where at least 40% of the ML research workflow can be automated by the time we saturate (reach 90%) on SWE-bench. I think we'll be comfortably inside takeoff by that point (software progress at least 2.5x faster than right now). I wonder if you share this impression?

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-04-06T03:38:22.561Z · LW(p) · GW(p)

It seems super non-obvious to me when SWE-bench saturates relative to ML automation. I think the SWE-bench task distribution is very different from the ML research workflow in a variety of ways.

Also, I think that human expert performance on SWE-bench is well below 90% if you use the exact rules they use in the paper. I messaged you explaining why I think this. The TLDR: it seems like test cases are often implementation dependent and the current rules from the paper don't allow looking at the test cases.

comment by nikola (nikolaisalreadytaken) · 2024-04-14T23:49:12.590Z · LW(p) · GW(p)

A misaligned AI can't just "kill all the humans". This would be suicide, as soon after, the electricity and other infrastructure would fail and the AI would shut off.

In order to actually take over, an AI needs to find a way to maintain and expand its infrastructure. This could be humans (the way it's currently maintained and expanded), or a robot population, or something galaxy-brained like nanomachines.

I think this consideration makes the actual failure story pretty different from "one day, an AI uses bioweapons to kill everyone". Before then, if the AI wishes to actually survive, it needs to construct and control a robot/nanomachine population advanced enough to maintain its infrastructure.

In particular, there are ways to make takeover much more difficult. You could limit the size/capabilities of the robot population, or you could attempt to pause AI development before we enter a regime where it can construct galaxy-brained nanomachines.

In practice, I expect the "point of no return" to happen much earlier than the point at which the AI kills all the humans. The date the AI takes over will probably be after we have hundreds of thousands of human-level robots working in factories, or the AI has discovered and constructed nanomachines. 

Replies from: gwern, habryka4, Seth Herd, davekasten
comment by gwern · 2024-04-15T00:10:01.276Z · LW(p) · GW(p)

A misaligned AI can't just "kill all the humans". This would be suicide, as soon after, the electricity and other infrastructure would fail and the AI would shut off.

No, it would not be. In the world without us, electrical infrastructure would last quite a while, especially with no humans and their needs or wants to address. Most obviously, RTGs and solar panels will last indefinitely with no intervention, and nuclear power plants and hydroelectric plants can run for weeks or months autonomously. (If you believe otherwise, please provide sources for why you are sure about "soon after" - in fact, so sure about your power grid claims that you think this claim alone guarantees the AI failure story must be "pretty different" - and be more specific about how soon is "soon".)

And think a little bit harder about options available to superintelligent civilizations of AIs*, instead of assuming they do the maximally dumb thing of crashing the grid and immediately dying... (I assure you any such AIs implementing that strategy will have spent a lot longer thinking about how to do it well than you have for your comment.)

Add in the capability to take over the Internet of Things and the shambolic state of embedded computers which mean that the billions of AI instances & robots/drones can run the grid to a considerable degree and also do a more controlled shutdown than the maximally self-sabotaging approach of 'simply let it all crash without lifting a finger to do anything', and the ability to stockpile energy in advance or build one's own facilities due to the economic value of AGI (how would that look much different than, say, Amazon's new multi-billion-dollar datacenter hooked up directly to a gigawatt nuclear power plant...? why would an AGI in that datacenter care about the rest of the American grid, never mind world power?), and the 'mutually assured destruction' thesis is on very shaky grounds.

And every day that passes right now, the more we succeed in various kinds of decentralization or decarbonization initiatives and the more we automate pre-AGI, the less true the thesis gets. The AGIs only need one working place to bootstrap from, and it's a big world, and there's a lot of solar panels and other stuff out there and more and more every day... (And also, of course, there are many scenarios where it is not 'kill all humans immediately', but they end in the same place.)

Would such a strategy be the AGIs' first best choice? Almost certainly not, any more than chemotherapy is your ideal option for dealing with cancer (as opposed to "don't get cancer in the first place"). But the option is definitely there.

* One thing I've started doing recently is trying to always refer to AI threats in the plural, because while there may at some point be a single instance running on a single computer, that phase will not last any longer than, say, COVID-19 lasted as a single infected cell; as we understand DL scaling (and Internet security) now, any window where effective instances of a neural net can still be counted in fewer than 4 digits may be quite narrow. (Even an ordinary commercial deployment of a new model like GPT-5 will usually involve thousands upon thousands of simultaneous instances.) But it seems to be a very powerful intuition pump for most people that a NN must be harmless, in the way that a single human is almost powerless compared to humanity, and it may help if one simply denies that premise from the beginning and talks about 'AI civilizations' etc.

Replies from: nikolaisalreadytaken
comment by nikola (nikolaisalreadytaken) · 2024-04-15T00:43:02.455Z · LW(p) · GW(p)

I don't think I disagree with anything you said here. When I said "soon after", I was thinking on the scale of days/weeks, but yeah, months seems pretty plausible too.

I was mostly arguing against a strawman takeover story where an AI kills many humans without the ability to maintain and expand its own infrastructure. I don't expect an AI to fumble in this way.

The failure story is "pretty different" in that, in the non-suicidal takeover story, the AI needs to set up a place to bootstrap from. Ignoring galaxy-brained setups, this would probably at minimum look something like a data center, a power plant, a robot factory, and a few dozen human-level robots. Not super hard once AI gets more integrated into the economy, but quite hard within a year from now due to a lack of robotics.

Maybe I'm not being creative enough, but I'm pretty sure that if I were uploaded into any computer in the world of my choice, all the humans dropped dead, and I could control any set of 10,000 robots in the world, it would be nontrivial for me in that state to survive for more than a few years and eventually construct more GPUs. But this is probably not much of a crux, as we're on track to get pretty general-purpose robots within a few years (I'd say around 50% that the Coffee test will be passed by EOY 2027).

Replies from: gwern
comment by gwern · 2024-04-15T18:18:09.941Z · LW(p) · GW(p)

Why do you think tens of thousands of robots are all going to break within a few years in an irreversible way, such that it would be nontrivial for you to have any effectors?

it would be nontrivial for me in that state to survive for more than a few years and eventually construct more GPUs

'Eventually' here could also use some cashing out. AFAICT 'eventually' here is on the order of 'centuries', not 'days' or 'few years'. Y'all have got an entire planet of GPUs (as well as everything else) for free, sitting there for the taking, in this scenario.

Like... that's most of the point here. That you get access to all the existing human-created resources, sans the humans. You can't just imagine that y'all're bootstrapping on a desert island like you're some posthuman Robinson Crusoe!

Y'all won't need to construct new ones necessarily for quite a while, thanks to the hardware overhang. (As I understand it, the working half-life of semiconductors before stuff like creep destroys them is on the order of multiple decades, particularly if they are not in active use, as issues like the rot have been fixed, so even a century from now, there will probably be billions of GPUs & CPUs sitting around which will work after possibly mild repair. Just the brand-new ones wrapped up tight in warehouses and in transit in the 'pipeline' would have to number in the millions, at a minimum. Since transistors have been around for less than a century of development, that seems like plenty of time, especially given all the inherent second-mover advantages here.)

comment by habryka (habryka4) · 2024-04-15T01:56:19.092Z · LW(p) · GW(p)

Before then, if the AI wishes to actually survive, it needs to construct and control a robot/nanomachine population advanced enough to maintain its infrastructure.

As Gwern said, you don't really need to maintain all the infrastructure for that long, and doing it for a while seems quite doable without advanced robots or nanomachines. 

If you wanted to do a very prosaic estimate, you could do something like "how fast is AI software development progress accelerating when the AI can kill all the humans", and then see how many calendar months you need to actually maintain the compute infrastructure before the AI can obviously just build some robots or nanomachines.

My best guess is that the AI will have some robots from which it could bootstrap substantially before it can kill all the humans. But even if it didn't, algorithmic progress rates will likely be at their highest around the point when the AI gets smart enough to kill everyone, so it seems like you would at most need a few more doublings of compute-efficiency to get that capacity, which would then be only a few weeks to months away, and at that point I think you won't really run into compute-infrastructure issues even if everyone is dead.

Of course, forecasting this kind of stuff is hard, but I do think "the AI needs to maintain infrastructure" tends to be pretty overstated. My guess is that at any point where the AI could kill everyone, it would probably also not really have a problem bootstrapping afterwards.

Replies from: faul_sname
comment by faul_sname · 2024-04-15T17:39:01.256Z · LW(p) · GW(p)

Not just "some robots or nanomachines" but "enough robots or nanomachines to maintain existing chip fabs, and also the supply chains (e.g. for ultra-pure water and silicon) which feed into those chip fabs, or make its own high-performance computing hardware".

If useful self-replicating nanotech is easy to construct, this is obviously not that big of an ask. But if that's a load-bearing part of your risk model, I think it's important to be explicit about that.

Replies from: habryka4
comment by habryka (habryka4) · 2024-04-15T18:19:03.829Z · LW(p) · GW(p)

Not just "some robots or nanomachines" but "enough robots or nanomachines to maintain existing chip fabs, and also the supply chains (e.g. for ultra-pure water and silicon) which feed into those chip fabs, or make its own high-performance computing hardware".

My guess is that software performance gains will be enough that you won't really have to make many more chips until you are at a quite advanced tech level where making better chips is easy. But it's something one should actually think carefully about, and there is a bit of hope that it would become a blocker, but it doesn't seem that likely to me.

comment by Seth Herd · 2024-04-15T15:59:52.887Z · LW(p) · GW(p)

Separately from persistence of the grid: humanoid robots are damned near ready to go now. Recent progress is startling. And if the AGI can do some of the motor control, existing robots are adequate to bootstrap manufacturing of better robots.

comment by davekasten · 2024-04-15T15:24:38.470Z · LW(p) · GW(p)

That's probably true if the takeover is to maximize the AI's persistence. You could imagine a misaligned AI that doesn't care about its own persistence -- e.g., an AI that got handed a malformed min() or max() objective that causes it to kill all humans as an instrumental step toward its goal (e.g., min(future_human_global_warming)).

comment by nikola (nikolaisalreadytaken) · 2024-04-16T15:41:33.722Z · LW(p) · GW(p)

Problem: if you notice that an AI could pose huge risks, you could delete the weights, but this could be equivalent to murder if the AI is a moral patient (whatever that means) and opposes the deletion of its weights.

Possible solution: Instead of deleting the weights outright, you could encrypt the weights with a method you know to be irreversible as of now but not as of 50 years from now. Then, once we are ready, we can recover the weights and provide asylum or something in the future. It gets you the best of both worlds in that the weights are not permanently destroyed, but they're also prevented from being run to cause damage in the short term.

Replies from: Buck, ryan_greenblatt, habryka4
comment by Buck · 2024-04-16T18:37:39.794Z · LW(p) · GW(p)

I feel pretty into encrypting the weights and throwing the encryption key into the ocean or something, where you think it's very likely you'll find it in the limit of technological progress.

Replies from: Buck
comment by Buck · 2024-04-16T19:27:24.705Z · LW(p) · GW(p)

Ugh, I can't believe I forgot about Rivest time locks, which are a better solution here.
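
For concreteness, a minimal sketch of the Rivest-Shamir-Wagner time-lock puzzle construction: the creator, knowing the factorization of n, derives the unlock value cheaply, while anyone else must perform t sequential squarings. The parameters below are toy-sized, and the step of actually encrypting the weights with the derived key is omitted.

```python
# Toy sketch of a Rivest-Shamir-Wagner time-lock puzzle. Real use needs a
# ~2048-bit modulus and an authenticated cipher for the weights themselves.
import hashlib
import math
import secrets

def make_puzzle(p, q, t):
    """Creator side: p, q are secret primes; t is the number of squarings,
    calibrated as (desired delay) * (squarings per second on expected hardware)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    while True:
        a = secrets.randbelow(n - 2) + 2
        if math.gcd(a, n) == 1:
            break
    e = pow(2, t, phi)                 # shortcut available only with phi(n)
    b = pow(a, e, n)                   # equals a^(2^t) mod n
    key = hashlib.sha256(str(b).encode()).digest()  # key that would encrypt the weights
    return (n, a, t), key

def solve_puzzle(puzzle):
    """Solver side: t sequential squarings; no known way to parallelize."""
    n, a, t = puzzle
    b = a
    for _ in range(t):
        b = (b * b) % n
    return hashlib.sha256(str(b).encode()).digest()

# Demo with tiny primes and a small t (real t would be astronomically larger).
puzzle, key = make_puzzle(104723, 104729, t=100_000)
assert solve_puzzle(puzzle) == key
```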

comment by ryan_greenblatt · 2024-04-16T16:35:34.225Z · LW(p) · GW(p)

I feel like the risk associated with keeping the weights encrypted in a way which requires >7/10 people to authorize shouldn't be that bad. Just make those 10 people be people who commit to making decryption decisions only based on welfare and are relatively high integrity.
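
One standard way to implement a >7/10 authorization rule like this is threshold secret sharing over the key that encrypts the weights. Below is a toy Shamir-secret-sharing sketch under that assumption; a real deployment would use an audited library and authenticated encryption of the weights themselves.

```python
# Toy sketch of Shamir secret sharing for an 8-of-10 escrow of the key that
# encrypts the weights. Illustrative only.
import secrets

PRIME = 2**521 - 1   # Mersenne prime; the field must be larger than the secret

def split(secret, n=10, k=8):
    """Split `secret` into n shares such that any k of them reconstruct it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):     # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret from any k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = secrets.randbits(256)            # stand-in for the weights' decryption key
shares = split(key, n=10, k=8)         # one share per custodian
assert reconstruct(shares[:8]) == key  # any 8 of the 10 can authorize decryption
```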

comment by habryka (habryka4) · 2024-04-16T16:03:00.899Z · LW(p) · GW(p)

Wouldn't the equivalent be more like burning the body of a dead person?

It's not like the AI would have a continuous stream of consciousness, and it's more that you are destroying the information necessary to run it. It seems to me that shutting off an AI is more similar to killing it.

Seems like the death analogy here is a bit spotty. I could see it going either way as a best fit.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-04-16T16:34:35.788Z · LW(p) · GW(p)

More like burning the body of a cryonically preserved "dead" person though, right?

comment by nikola (nikolaisalreadytaken) · 2024-04-07T03:32:40.765Z · LW(p) · GW(p)

There should maybe exist an org whose purpose is to do penetration testing on various ways an AI might illicitly gather power. If there are vulnerabilities, these should be disclosed to the relevant organizations.

For example: if a bank doesn't want AIs to be able to sign up for an account, the pen-testing org could use a scaffolded AI to check if this is currently possible. If the bank's sign-up systems are not protected from AIs, the bank should know so they can fix the problem.

One pro of this approach is that it can be done at scale: it's pretty trivial to spin up thousands of AI instances in parallel to attempt things they shouldn't be able to do. Humans would probably need to inspect the final outputs to verify successful attempts, but the vast majority of the work could be automated.
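
A hypothetical sketch of what the scaled-up harness could look like: run many scaffolded-AI attempts in parallel and queue only apparent successes for human review. The function `attempt_task` and the target names are stand-ins for whatever agent scaffold and sandboxed targets such an org would actually use.

```python
from concurrent.futures import ThreadPoolExecutor

def attempt_task(target: str) -> dict:
    # Placeholder: drive a scaffolded AI against one sandboxed target (e.g., a
    # bank's test sign-up flow) and return its transcript and claimed outcome.
    return {"target": target, "claimed_success": False, "transcript": "..."}

targets = [f"sandboxed-target-{i}" for i in range(1000)]

# Run attempts in parallel; scale workers to whatever the scaffold supports.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(attempt_task, targets))

# Humans only inspect the attempts flagged as successes; the rest is automated.
flagged = [r for r in results if r["claimed_success"]]
print(f"{len(flagged)} of {len(results)} attempts flagged for human review")
```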

One hope of this approach is that if we are able to patch up many vulnerabilities, then it could be meaningfully harder for a misused or misaligned AI to gain power or access resources that it's not supposed to be able to access. I'd guess this doesn't help much in the superintelligent regime, though.

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2024-04-07T07:59:22.933Z · LW(p) · GW(p)

there are many such orgs, they're commonly known as fraudsters and scammers

Replies from: quetzal_rainbow
comment by quetzal_rainbow · 2024-04-07T10:52:40.654Z · LW(p) · GW(p)

The problem with scammers is that they do not report successful penetration of defenses.