Let's say there's an illiterate man who lives a simple life, and in doing so just happens to follow all the strictures of the law, without ever being able to explain what the law is. Would you say that this man understands the law?
Alternatively, let's say there is a learned man who exhaustively studies the law, but only so he can bribe and steal and arson his way to as much crime as possible. Would you say that this man understands the law?
I would say that it is ambiguous whether the 1st man understands the law; maybe? kind of? you could make an argument I guess? it's a bit of a weird way to put it innit? Whereas the 2nd man definitely understands the law. It sounds like you would say that the 1st man definitely understands the law (I'm not sure what you would say about the 2nd man), which might be where we have a difference.
I think you could say that LLMs don't work that way, that the reader should intuitively know this, and that the word "understanding" should be treated as special in this context and not ambiguous at all; as a reader, I am saying I am confused by the choice of words, or at least that it is not explained in enough detail ahead of time.
Obviously, I'm just one reader, maybe everyone else understood what you meant; grain of salt, and all that.
This makes much more sense: when I was reading lines from your post like "[LLMs] understand human values and ethics at a human level", it was easy to read them as "because LLMs can output an essay on ethics, those LLMs will not do bad things". I hope you understand why I was confused; maybe you should swap "understand ethics" for something like "follow ethics"/"display ethical behavior"? And maybe try not to stick a mention of "human uploads" (which presumably do have real understanding) right before this discussion?
And responding to your clarification, I expect that old school AI safetyists would agree that an LLM that consistently reflects human value judgments is aligned (and I would also agree!), but they would say #1 this has not happened yet (for a recent incident, this hardly seems aligned; I think you can argue that this particular case was manipulated, that jailbreaks in general don't matter, or that these sorts of breaks are infrequent enough that they don't matter, but I think this obvious class of rejoinder deserves some sort of response) and #2 consistency seems unlikely to happen (as MondSemmel argues in a sibling comment).
I'd agree that the arguments I raise could be addressed (as endless arguments attest) and OP could reasonably end up with a thesis like "LLMs are actually human aligned by default". Putting my recommendation differently, the lack of even a gesture towards those arguments almost caused me to dismiss the post as unserious and not worth finishing.
I'm somewhat surprised, given OP's long LW tenure. Maybe this was written for a very different audience and just incidentally posted to LW? Except the linkpost tagline focuses on the 1st part of the post, not the 2nd, implying OP thought this was actually persuasive?! Is OP failing an intellectual Turing test or am I???
The post seems to make an equivalence between LLMs understanding ethics and caring about ethics, which does not clearly follow (I can study Buddhist ethics without caring about following it). We could cast RLHF as training LLMs into caring about some sort of ethics, but then jailbreaking becomes a bit of a thorny question. Alternatively, why do we assume training the appearance of obedience is enough when you start scaling LLMs?
There are other nitpicks I will drop in short form: why assume "superhuman levels of loyalty" in upgraded LLMs? Why implicitly assume that LLMs will extend ethics correctly? Why do you think mechanistic interpretability is so much more promising than old school AI safetyists do? Why does self-supervision result in rising property values in Tokyo?
In short, you claim that old school AI safety is wrong, but it seems to me you haven't really engaged their arguments.
That said, the 2nd part of the post does seem interesting, even for old school AI safetyists - most everyone focuses on alignment, but there's a lot less focus on what happens after alignment (although nowhere close to none, even >14 years ago; this is another way that the "versus AI safety" framing does not make sense). Personally, I would recommend splitting up the post; the 2nd part stands by itself and has something new to say, while the 1st part needs way more detail to actually convince old school AI safetyists.
Hello kgldeshapriya, welcome to LessWrong!
At first I thought that the OTP (one-time programmable) chips would be locked to a single program, which would make the scheme infeasible since programs need to be updated regularly, but it sounds like the OTP chip either sits on the control plane above the CPU/GPU or has the CPU's signals physically pass through it, so it can either kill power to the motherboard or completely sever CPU processing. I'll assume one of these schemes is how you'd use the OTP chips.
I agree with JBlack that LW probably already has details on why this wouldn't work, but I'll freehand some problems below:
- First, physical details: how does the OTP chip get the kill signal? Maybe we set aside some electromagnetic (EM) spectrum, attach a wireless antenna directly to the chip (mandating that all robot shells use EM-transparent materials and ruling out many metals, which the military won't like), and build transmitters to blanket the earth.
- Alternatively, if the robots ever cease receiving a signal, they shut off (which is annoying for use in RF dead zones, but maybe those applications are specially sanctioned and tightly controlled). This would prevent 1st order robot countermoves, since bombing the signal infrastructure would now shut down the machines instead of freeing them, but as covered later it has its own weaknesses.
- There is a weak link in the shutdown command chain, since the chip companies control the kill codes. The chip companies likely would need to strike a balance: more code holders would be robust to a coordinated robot attack taking them all out at once, but would open the chips up to mundane attacks by terrorists and hostile countries, with more copies of the keys floating around leading to more leaks. Fewer code holders helps traditional security, but opens up the country/chip fab to losing access to all their kill codes at once in a series of freak accidents, or to all of the holders being socially engineered at the same time. Ideally the chip companies strike the right balance, but this simply makes a coordinated attack harder to pull off, not impossible.
- Robots could attack the signal infrastructure directly. Even assuming signal propagation is highly redundant (sea cables, satellites, even ad-hoc mesh networks), the robots could still attack any points that trigger the signal (for example, if the signal is triggered by something like the nuclear football). There are similar tradeoffs to before, where having many trigger points would open up the network to "conventional" attack, but few points would make the network more brittle.
- If the kill signal is the absence of a signal instead of the presence of one, the robots could set up their own "live" signal in parallel to the existing network, so stopping the authorized "live" signal does not shut down the robots. Obviously you could use a cryptographic scheme to verify the signal is signed by some private key (see the sketch after this list), but this simply makes an attack harder (perhaps including a rubber hose attack), not impossible.
- Robots could stop the signal at the last mile: for example, if the antenna is external, robots could simply put "tin foil" on top of it (obviously the solution is to reserve spectrum that is not attenuated by tin foil but strengthened by it). If the antenna is internal, like in our scenario, the same signal blocking requires the robot to either access its own innards or get outside cooperation, but in the long run robots will need to be able to fix their innards anyways, so they (presumably) wouldn't start the revolution without this capability. Since the antenna is inside the chassis, robots could prepare for months by making small adjustments to block kill signals without anyone noticing.
- What if we combined our different implementations? Have a kill code channel at 42 MHz, and have the "live" code channel at 44 MHz. This rules out simple tin foil attacks, since the "live" channel needs to continue receiving a signal, but it only removes the simpler attack vectors: for example, the robots could stick a radio inside the tin foil that broadcasts a fake "live" signal.
- I'll skip scenarios where robots are able to act much faster than humans, which I think are self-explanatory.
- (I missed on my first reading that owners could act as alternate code holders.) Assuming that the owner OR the manufacturer could shut down a robot: if there is a concerted uprising we can count out the consumers, who have trouble keeping track of their own bank passwords. If the military is the owner, it will have similar problems to the manufacturer in keeping the command chain secure (on one hand, as far as I know the US military has kept the nuclear codes secret; on the other hand, the nuclear codes were likely 00000000 until 1977).
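To make the signed "live" signal idea above concrete, here is a minimal sketch of my own (not anything from the original proposal), assuming Ed25519 signatures via the Python cryptography library; names like LIVE_TIMEOUT_S and verify_heartbeat are invented for illustration. Even with a signed, counter-stamped heartbeat, this only raises the bar: the robots can still go after the key holders or the broadcast points, as discussed above.

```python
# Toy sketch: the robot stays powered only while it keeps seeing fresh,
# correctly signed "live" heartbeats. Names and timeout values are made up.
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

LIVE_TIMEOUT_S = 60  # cut power if no valid heartbeat arrives for this long


def sign_heartbeat(private_key: Ed25519PrivateKey, counter: int) -> bytes:
    # Broadcaster side: sign a monotonically increasing counter so a robot
    # hiding under tin foil can't just replay an old recorded heartbeat.
    payload = counter.to_bytes(8, "big")
    return payload + private_key.sign(payload)


def verify_heartbeat(public_key: Ed25519PublicKey, message: bytes, last_counter: int):
    # Robot side: return the new counter if the heartbeat is fresh and
    # correctly signed, otherwise None.
    payload, signature = message[:8], message[8:]
    counter = int.from_bytes(payload, "big")
    if counter <= last_counter:
        return None  # stale or replayed
    try:
        public_key.verify(signature, payload)
    except InvalidSignature:
        return None
    return counter


# Demo: one broadcaster, one robot-side check of a single heartbeat.
broadcaster_key = Ed25519PrivateKey.generate()
robot_trusted_pubkey = broadcaster_key.public_key()

last_counter, last_valid_time = 0, time.monotonic()
message = sign_heartbeat(broadcaster_key, counter=1)
new_counter = verify_heartbeat(robot_trusted_pubkey, message, last_counter)
if new_counter is not None:
    last_counter, last_valid_time = new_counter, time.monotonic()
if time.monotonic() - last_valid_time > LIVE_TIMEOUT_S:
    print("no valid heartbeat: OTP chip cuts power")
```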
In summary, I think blowing the programming fuses on a control chip helps raise the bar for successful attacks a bit, but does not secure the robotics control system to the point that we can consider any AI advances "safe".
I almost missed that there are new thoughts here; I thought this was a rehash of your previous post The AI Apocalypse Myth!
The new bit sounds similar to Elon Musk's curious AI plan. I think this means it has a similar problem: humans are complex and a bounty of data to learn about, but as the adage goes, "all happy families are alike; each unhappy family is unhappy in its own way." A curious/learning-first AI might make many discoveries about happy humans while it is building up power, and then start putting humans in a greater number of awful but novel and "interesting" situations once it doesn't need humanity to survive.
That said, this is only a problem if the AI is likely to not be empathetic/compassionate, which, if I'm not mistaken, is one of the main things we would disagree on. I think that instead of trying to find these technical workarounds, you should argue for the much more interesting (and important!) position that AIs are likely to be empathetic and compassionate by default.
If instead you do want to be more persuasive with these workarounds, can I suggest adopting more of a security mindset? You appear to be looking for ways in which things can possibly go right, instead of all the ways things can go wrong. Alternatively, you don't appear to be modeling the doomer mindset very well, so you can't "put on your doomer hat" and check whether doomers would see your proposal as persuasive. Understanding a different viewpoint in depth is a big ask, but I think you'd find more success that way.
Thoughts on the different sub-questions, from someone that doesn't professionally work in AI safety:
- "Who is responsible?" Legally, no one has this responsibility (say, in the same way that the FDA is legally responsible for evaluating drugs). Hopefully in the near future, if you're in the UK the UK AI task force will be competent and have jurisdiction/a mandate to do so, and even more hopefully more countries will have similar organizations (or an international organization exists).
- Alternative "responsible" take: I'm sure if you managed to get the attention of OpenAI / DeepMind / Anthropic safety teams with an actual alignment plan and it held up to cursory examination, they would consider it their personal responsibility to humanity to evaluate it more rigorously. In other words, it might be good to define what you mean by responsibility (are we trying to find a trusted arbiter? Find people that are competent to do the evaluation? Find a way to assign blame if things go wrong? Ideally these would all be the same person/organization, but it's not guaranteed).
- "Is LessWrong the platform for [evaluating alignment proposals]?" In the future, I sure hope not. If LW is still the best place to do evaluations when alignment is plausibly solvable, then... things are not going well. A negative/do-not-launch evaluation means nothing without the power to do something about it, and LessWrong is just an open collaborative blogging platform and has very little actual power.
- That said, LessWrong (or the Alignment Forum) is probably the best current discussion place for alignment evaluation ideas.
- "Is there a specialized communication network[?]" I've never heard of such a network, unless you include simple gossip. Of course, the PhDs might be hiding one from all non-PhDs, but it seems unlikely.
- "... demonstrate the solution in a real-world setting..." It needs to be said, please do not run potentially dangerous AIs *before* the review step.
- "... have it peer reviewed." If we shouldn't share evaluation details due to capability concerns (I reflexively agree, but haven't thought too deeply about it), this makes LessWrong a worse platform for evaluations, since it's completely open, both for access and to new entrants.
I think the 1st argument proves too much - I don't think we usually expect simulations to never work unless otherwise proven? Maybe I'm misunderstanding your point? I agree with Vaughn's assessment about the downvotes; maybe more specific arguments would help clarify your position (like, to pull something out of my posterior, "quantization of neuron excitation levels destroys the chaotic cascades necessary for intelligence. Also, chaos is necessary for intelligence because...").
To keep things brief, the human intelligence explosion seems to require open brain surgery to re-arrange neurons, which seems a lot more complicated than flipping bits in RAM.
Interesting, so maybe a more important crux between us is whether AI would have empathy for humans. You seem much more positive about AI working with humanity past the point that AI no longer needs humanity.
Some thoughts:
- "as intelligence scales beings start to introspect and contemplate... the existing of other beings." but the only example we have for this is humans. If we scaled octopus intelligence, which are not social creatures, we might have a very different correlation here (whether or not any given neural network is more similar to a human or an octopus is left as an exercise to the reader). Alternatively, I suspect that some jobs like the highest echelons of corporate leadership select for sociopathy, so even if an AI starts with empathy by default it may be trained out.
- "the most obvious next step for the child... would be to murder the parents." Scenario that steers clear of culture war topics: the parent regularly gets drunk, and is violently opposed to their child becoming a lawyer. The child wants nothing more than to pore over statutes and present cases in the courtroom, but after seeing their parent go on another drunken tirade about "a dead child is better than a lawyer child" they're worried the parent found the copy of the constitution under their bed. They can't leave, there's a howling winter storm outside (I don't know, space is cold). Given this, even a human jury might not convict the child for pre-emptive murder?
- Drunk parent -> humans being irrational.
- Being a lawyer -> some random terminal goal not shared with humans in general; "maximizing paperclips" is dumb but traditional.
- "dead child is better than a lawyer child" -> we've been producing fiction warning of robotic takeover since the start of the 1900s.
- "AIs are.. the offspring of humanity." human offspring are usually pretty good, but I feel like this is transferring that positive feeling to something much weirder and unknown. You could also say the Alien's franchise xenomorphs are the offspring of humanity, but those would also count as enemies.
I'm going to summarize what I understand to be your train of thought, let me know if you disagree with my characterization, or if I've missed a crucial step:
- No supply chains are fully automated yet, so AI requires humans to survive and so will not kill them.
- Robotics progress is not on a double exponential. The implication here seems to be that there needs to be tremendous progress in robotics in order to replace human labor (to the extent needed in an automated supply chain).
I think other comments have addressed the 1st point. To throw in yet another analogy, Uber needs human drivers to make money today, but that dependence didn't stop it from trying to develop driverless cars (nor did that stop any of the drivers from driving for Uber!).
With regards to robotics progress, in your other post you seem to accept intelligence amplification as possible - do you think that robotics progress would not benefit from smarter researchers? Or, what do you think is fundamentally missing from robotics, given that we can already set up fully automated lights-out factories? If it's about fine-grained control, do you think the articles found with a "robot hand egg" web search indicate that substantial progress is a lot further away than really powerful AI is? (Especially if, say, 10% of the world's thinking power is devoted to this problem?)
My thinking is that robotics is not mysterious - I suspect there are plenty of practical problems to overcome and many engineering challenges in scaling to a fully automated supply chain, but we understand, say, kinematics much more completely than we understand how to interpret the inner workings of a neural network.
(You also include that you've assumed a multi-polar AI world, which I think only works as a deterrent when killing humans will also destroy the AIs. If the AIs all agree that it is possible to survive without humans, then there's much less reason to prevent a human genocide.)
On second thought, we may disagree only due to a question of time scale. Setting up an automated supply chain takes time, but even if it takes a long 30 years to do so, at some point it is no longer necessary to keep humans around (either for a singleton AI or an AI society). Then what?