Kudos for releasing a concept of a plan! Some thoughts:
Regarding the first safety case:
- The amount of progress in mech interp required to make the first safety case suitable seems overly optimistic to me; I basically think that most of the "limitations" are in fact pretty serious. However, I appreciate the attempt to include concrete requirements.
- I believe that getting good results from the following experiments might be particularly unrealistic:
- "In order to robustify our evals against sandbagging, we ran an experiment where we steered one or more truth-related features to make the model confident that it is in either testing or unmonitored deployment"
- I worry that "truth-related" features are very difficult both in principle and in practice to isolate, as the model's representation of what other people / entities believe will be hard to disentangle from what it itself believes, to the extent that those are even meaningfully separable concepts.
- Even if you could identify a feature roughly corresponding to "I believe the current sentence is true", it is not immediately clear to me how this could be used to successfully convince the model of more complex situational information, e.g. it is not being monitored and is in a real deployment setting. For instance, I don't expect artificially activating that feature on the sentence "I am not being monitored and this is a real deployment setting" to just work.
- "In all of these model organisms, deactivating the features of concern was sufficient to prevent the harmful behaviors"
- I can't concisely refute this. It is just a "huge if true" kind of claim, and as far as I am aware very much not possible with current techniques. If we could isolate a set of linear directions that were robustly causally responsible for scheming behavior, that would have massive ramifications far beyond this setting.
- I worry a little bit about "training on test" here – there may exist some overfit set of classifiers that can stop model organisms from misbehaving in a specific setting without meaningfully generalizing. This is also related to the stated "Realism of the model organisms" concerns.
- I strongly approve of this element of the plan: "In order to preserve the integrity of the learned features, we ensured that the learned features were never used as part of the training procedure for the models."
- It's kind of hard to do this in full generality, since you need to not only avoid training directly against the probes, but also avoid using the information from probes at all when selecting what to train on.
I'm more excited about the control safety case, but have a few nitpicks:
- Developing a good elicitation strategy is indeed challenging.
- I think it is pretty crucial to use real human red-teaming, especially if sandbagging (more specifically something like exploration hacking) is a concern.
- It may be helpful to give the red team way more affordances than the untrusted model would actually have in deployment, and only remove affordances if defense becomes impossible. Some examples of extra affordances:
- Letting the red team know the blue team's exact strategy in order to search for the best response
- Giving an untrusted model more unmonitored chain-of-thought / tool use / human assistance than it would actually have in deployment
- Allowing human red-teamers to repeatedly attempt to bypass a monitor when in reality a scheming model might have to one-shot it
- Figuring out exactly what extra affordances suffice to break the control protocol may also provide useful signal about where the protocol could be strengthened, and what capabilities need to be tested for going forward.
- Eliciting capabilities from substantially superhuman models is a much harder problem, and is one of the main reasons I don't expect control to suffice for a safety case well beyond human-level AI.
- The space of strategies beyond trusted monitoring alone is pretty large. You mention untrusted monitoring, but there is also a broad space of strategies that include editing outputs with a trusted model, allocating more monitoring/auditing budget in certain situations, etc.
- It is probably best to continually produce new honeypots throughout deployment.
- For instance, if untrusted models are being used to do lots of automated software engineering and research, likely requiring huge amounts of context, it will become more obvious to the model over time that it is in deployment.
- Making new honeypots out of actual deployment settings, for instance by adding exploitable security flaws, may generally make it much harder for the model to tell when it is being tested.
I haven't read the last safety case yet, but may have some thoughts on it later. I am most excited about control at the moment, in part due to concerns that interpretability won't advance far enough to suffice for a safety case by the time we develop transformative AI.
I have such a strong intuitive opposition to the Internal Reaction Drive that I agree with your conclusion that we should update away from any theory which allows it. Then again, perhaps it is impossible to build such a drive for the merely practical reason that any material with a positive or negative index of refraction will absorb enough light to turn the drive into an expensive radiator.
Especially given the recent Nobel prize announcement, I think the most concerning piece of information is that there are cultural forces from within the physics community discouraging people from trying to answer the question at all.
You need abstractions to think and plan at all with limited compute, not just to speak. I would guess that plenty of animals which are incapable of speaking also mentally rely on abstractions. For instance, I suspect a foraging animal has a mental category for apples, and treats them as the same kind of thing rather than as completely unrelated configurations of atoms.
The planet Mercury is a pretty good source of material:
Mass: 3.3 × 10^23 kg (which is about 70% iron)
Radius: 2.44 × 10^6 m
Volume: 6.1 × 10^19 m^3
Density: 5,430 kg/m^3
Orbital radius: 5.8 × 10^10 m
A spherical shell around the sun at roughly the same radius as Mercury's orbit would have a surface area of about 4.2 × 10^22 m^2, and spreading out Mercury's volume over this area gives a thickness of about 1.4 mm. This means Mercury alone provides ample material for collecting all of the Sun's energy via reflected light – very thin spinning sheets could act as a swarm of orbiting reflectors that focus sunlight onto large power plants or mirrors that direct it elsewhere in the solar system. Spinning sheets could be made somewhere between 1-100 μm thick, with thicker cables or supports for additional strength, perhaps 1-10 km wide, and navigate using radiation pressure (using cables that bend the sheet, perhaps). Something like 10^14-10^16 mirrors would be enough to intercept and redirect all of the sun's light.
The gravitational binding energy of Mercury is on the order of 10^30 J, or on the order of an hour of the Sun's output. This means the time it takes for a new mirror to pay its own manufacturing energy cost is in principle quite small; if each kg of material from Mercury is enough to make on the order of 1-100 square meters of mirror, then it will pay for itself in somewhere between minutes and hours (there are roughly 10,000 W/m^2 of solar energy at Mercury's orbit, and each kg of material on average requires on the order of 10^7 J to remove). Only 40-80 doublings are required to consume the whole planet, depending on how thick the mirrors are and how much material is used to start the process. Even with many orders of magnitude of overhead to account for inefficiency and heat dissipation, I believe Mercury could be disassembled to cover the entire sun with reflectors on the order of years, and perhaps as quickly as months; certainly within decades.
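These back-of-envelope numbers can be checked with a short script. The physical constants below are standard rounded values for Mercury and the Sun; everything else follows from the arithmetic above (uniform-sphere binding energy, a hypothetical 1-tonne seed factory for the doubling count):

```python
import math

# Standard rounded physical values (assumptions, not from the comment itself)
G = 6.674e-11            # gravitational constant, m^3 kg^-1 s^-2
M_MERCURY = 3.30e23      # kg
R_MERCURY = 2.44e6       # m
R_ORBIT = 5.79e10        # Mercury's orbital radius, m
L_SUN = 3.8e26           # solar luminosity, W

# Shell at Mercury's orbit: area, and sheet thickness if Mercury is spread over it
area = 4 * math.pi * R_ORBIT**2                   # m^2, ~4.2e22
volume = (4 / 3) * math.pi * R_MERCURY**3         # m^3, ~6.1e19
thickness = volume / area                         # m, ~1.4 mm

# Gravitational binding energy, uniform-sphere approximation: U = 3GM^2 / 5R
U = 3 * G * M_MERCURY**2 / (5 * R_MERCURY)        # J, ~2e30
hours_of_sunlight = U / (L_SUN * 3600)            # ~1 hour of solar output

# Average energy to remove 1 kg of material from the planet
e_per_kg = U / M_MERCURY                          # J/kg, ~5e6

# Doublings to consume the planet, from a hypothetical 1-tonne seed
doublings = math.log2(M_MERCURY / 1e3)

print(f"shell area:       {area:.2e} m^2")
print(f"sheet thickness:  {thickness * 1e3:.2f} mm")
print(f"binding energy:   {U:.2e} J (~{hours_of_sunlight:.1f} h of solar output)")
print(f"energy per kg:    {e_per_kg:.2e} J")
print(f"doublings needed: {doublings:.0f}")
```

The uniform-density binding energy slightly underestimates the true value (Mercury has a dense iron core), but not by enough to change any of the order-of-magnitude conclusions.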
A better way to do the memory overwrite experiment is to prepare a list of what’s in the box to match each of ten possible numbers, then have someone provide a random number while your short term memory doesn’t work and see if you can successfully overwrite the memory that corresponds to that number (as measured by correctly guessing the number much later).
I’m confused. I know that it is like something to be me (this is in some sense the only thing I know for sure). It seems like there are rules which shape the things I experience, and some of those rules can be studied (like the laws of physics). We are good enough at understanding some of these rules to predict certain systems with a high degree of accuracy, like how an asteroid will orbit a star or how electrons will be pushed through a wire by a particular voltage in a circuit. But I have no way to know or predict whether it is like something to be a fish or GPT-4. I know that physical alterations to my brain seem to affect my experience, so it seems like there is a mapping from physical matter to experiences. I do not know precisely what this mapping is, and this indeed seems like a hard problem. In what sense do you disagree with my framing here?
Oh good catch, I missed that. Thanks!
The only metric natural selection is “optimizing” for is inclusive genetic fitness. It did not “try” to align humans with social status, and in many cases people care about social status to the detriment of their inclusive genetic fitness. This is a failure of alignment, not a success.
I am not so sure it will be possible to extract useful work towards solving alignment out of systems we do not already know how to carefully steer. I think that substantial progress on alignment is necessary before we know how to build things that actually want to help us advance the science. Even if we built something tomorrow that was in principle smart enough to do good alignment research, I am concerned we don’t know how to make it actually do that rather than, say, imitate more plausible-sounding but incorrect ideas. The fact that appending silly phrases like “I’ll tip $200” improves the probability of receiving correct code from current LLMs indicates to me that we haven’t succeeded at aligning them to maximally want to produce correct code when they are capable of doing so.
How does Harry know the name “Lucius Malfoy”?
We aren’t surprised by HHTHHTTTHT or whatever because we perceive it as the event “a sequence containing a similar number of heads and tails in any order, ideally without a long subsequence of H or T”, which occurs frequently.
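To make the coarse-graining concrete, here is a small sketch that enumerates all 2^10 equally likely sequences and counts how many fall into this fuzzy event. The exact thresholds for "similar number" and "long subsequence" are arbitrary choices of mine, not from the original point:

```python
from itertools import product

def longest_run(seq):
    """Length of the longest run of identical consecutive flips."""
    best = cur = 1
    for prev, nxt in zip(seq, seq[1:]):
        cur = cur + 1 if prev == nxt else 1
        best = max(best, cur)
    return best

# Count sequences matching the coarse event: roughly balanced heads/tails
# (|#H - #T| <= 2) and no run of 5 or more identical flips.
hits = 0
total = 0
for seq in product("HT", repeat=10):
    total += 1
    balanced = abs(seq.count("H") - seq.count("T")) <= 2
    if balanced and longest_run(seq) <= 4:
        hits += 1

print(f"{hits} of {total} sequences ({hits / total:.0%}) look 'unremarkable'")
```

Any individual sequence has probability 1/1024, but the coarse event covers a majority of all sequences, which is why a sequence like HHTHHTTTHT doesn't register as surprising.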
I’m enjoying this series, and look forward to the next installment.
The thing I mean by “superintelligence” is very different from a government. A government cannot design nanotechnology, and is made of humans which value human things.
What can men do against such reckless indifference?
Can someone with more knowledge give me a sense of how new this idea is, and guess at the probability that it is onto something?
Why are we so sure chatbots (and parrots for that matter) are not conscious? Well, maybe the word is just too slippery to define, but I would bet that parrots have some degree of subjective experience, and I am sufficiently uncertain regarding chatbots that I do worry about it slightly.
Please note that the graph of per capita war deaths is on a log scale. The number moves over several orders of magnitude. One could certainly make the case that local spikes were sometimes caused by significant shifts in the offense-defense balance (like tanks and planes making offense easier for a while at the beginning of WWII). These shifts are pushed back to equilibrium over time, but personally I would be pretty unhappy about, say, deaths from pandemics spiking 4 orders of magnitude before returning to equilibrium.
This random Twitter person says that it can't. Disclaimer: haven't actually checked for myself.
https://chat.openai.com/share/36c09b9d-cc2e-4cfd-ab07-6e45fb695bb1
Here is me playing against GPT-4, no vision required. It does just fine at normal tic-tac-toe, and figures out anti-tic-tac-toe with a little bit of extra prompting.
Yes. I think the title of my post is misleading (I have updated it now). I think I am trying to point at the problem that the current incentives mean we are going to mess up the outer alignment problem, and natural selection will favor the systems that we fail the hardest on.
That's a very fair response. My claim here is really about the outer alignment problem, and that if lots of people have access to the ability to create / fine tune AI agents, many agents that have goals misaligned with humanity as a whole will be created, and we will lose control of the future.
I suppose what I'm trying to point to is some form of the outer alignment problem. I think we may end up with AIs that are aligned with human organizations like corporations more than individual humans. The reason for this is that corporations or militaries which employ more ruthless AIs will, over time, accrue more power and resources. It's not so much explicit (i.e. violent) competition, but rather the gradual tendency for systems which are power-seeking and resource-maximizing to end up with more power and resources over time. If we allow for the creation / fine tuning of many AI agents, and allow them to accrue resources and copy themselves, then natural selection will favor the more selfish ones which are least aligned with humanity at large. We already require pretty extensive regulation to make sure that corporations don't incur significant negative externalities, and these are organizations that are run by and composed of humans. When those entities are no longer humans, I think the vast majority of power and resources will no longer be explicitly controlled by humans, and moreover will be controlled by AI which has values poorly aligned with the majority of humans. The AI's goals will only be aligned with the short-term interests of the small number of humans that created them in the first place. Once the majority of people realize that this system is not acting in their long-term interests, there will be nothing they can do about it.
Yeah. I think a key point that is often overlooked is that even if powerful AI is technically controllable, i.e. we solve inner alignment, that doesn't mean society will handle it safely. I think by default it looks like every company and military is forced to start using a ton of AI agents (or they will be outcompeted by someone else who does). Competition between a bunch of superhuman AIs that are trying to maximize profits or military tech seems really bad for us. We might not lose control all at once, but rather just be gradually outcompeted by machines, where "gradually" might actually be pretty quick. Basically, we die by Moloch.
Yeah, I've seen this video before. Still excellent.
Yeah, in general, we are pretty compute limited and should stick to good heuristics for most kinds of problems. I do think that most people rely too much on heuristics, so for the average person the useful lesson is "actually stop and think about things once in a while", but I can see how the opposite problem may sometimes arise in this community.
I find it useful to distinguish between epistemic and instrumental rationality. You're talking about instrumental rationality – and it could be instrumentally useful to convince someone of your beliefs, to teach them to think clearly, or to actively mislead them.
Epistemic rationality, on the other hand, means trying to have true beliefs, and in this case it's better to teach someone to fish than to force them to accept your fish.
In the doomsday argument, we are the random runner. If the runner with only 10 people behind him assumed his position was randomly selected, and tried to estimate the total number of runners, he would be very wrong. We could very well be that runner near the back of the race; we weren't randomly selected to be at the back, we just are, and the fact that there are ten people behind us doesn't give us meaningful information about the total number of runners.
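A tiny simulation (all numbers hypothetical) illustrates the asymmetry: the "double your rank" estimator behind the Doomsday argument is fine on average when your rank really is randomly sampled, but badly wrong for any particular runner at a fixed position:

```python
import random

random.seed(0)
N_RUNNERS = 1000   # true total number of runners (hypothetical)
TRIALS = 10_000

def doomsday_estimate(rank):
    """'My rank is probably typical, so the total is about twice my rank.'"""
    return 2 * rank

# Case 1: rank genuinely uniform at random -> roughly unbiased on average.
avg_random = sum(
    doomsday_estimate(random.randint(1, N_RUNNERS)) for _ in range(TRIALS)
) / TRIALS

# Case 2: fixed runners. Near the back (10 behind, rank 990) the estimate is
# nearly double the truth; near the front (rank 10) it is off by a factor of 50.
est_back = doomsday_estimate(N_RUNNERS - 10)
est_front = doomsday_estimate(10)

print(f"random rank, average estimate: {avg_random:.0f} (true: {N_RUNNERS})")
print(f"runner near the back:          {est_back} (true: {N_RUNNERS})")
print(f"runner near the front:         {est_front} (true: {N_RUNNERS})")
```

The estimator's average accuracy is entirely a property of the sampling assumption; a runner who simply *is* at rank 10 or 990, without having been placed there at random, gets no valid inference from their position.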
Okay, suppose I was born in Teenytown, a little village on the island nation of Nibblenest. The little one-room schoolhouse in Teenytown isn't very advanced, so no one ever teaches me that there are billions of other people living in all the places I've never heard of. Now, I might think to myself: the world must be very small – surely, if there were billions of people living in millions of towns and cities besides Teenytown, it would be very unlikely for me to have been born in Teenytown; therefore, Teenytown must be one of the only villages on Earth.
Clearly, this is absurd, right? The Doomsday argument says that if there are lots of other people in X scenario that is different from mine (be it living in the future or across the ocean), then it would be unlikely for me to experience not X, therefore those other people most likely don't exist. But I am me, and I couldn't be anyone else. It makes no sense to talk about the "probability of being me". I don't think it is possible to "assume I am a randomly sampled observer" or something like that.
The number of humans that I notice have been born to date does not depend whatsoever on how many humans might exist in the future. My experience looks exactly the same whether humanity will be deleted tomorrow by a rogue black hole or spend billions of years spreading across the universe.
I think the claim that we basically understand the universe is misleading. I'm especially unconvinced by your vague explanation of consciousness; I don't think we have anything close to an empirically supported mechanistic model that makes good predictions. I personally have significant uncertainty regarding what kinds of things can have subjective experiences, or why they do.
This also feels like a good opportunity to say that the Doomsday argument has never made much sense to me; it has always felt wrong to me to treat being “me” as a random sample of observers. I couldn’t be anyone except me. If the future has trillions of humans or no humans, the person which is me will feel the same way in either case. I can't possibly condition on being me, because I couldn't be anyone else. The doomsday argument treats my perspective as a random sample of all possible humans, or even all possible observers, which feels like a massive type error.
On a similar note, why is it remotely surprising that we live in a universe with laws of physics that support our existence? We couldn't possibly observe any laws of physics except the ones we have. Does it even make sense to say that the laws of physics "could be different"? I'm not convinced we can even imagine a coherent universe with different fundamental laws of physics, in the same way that I can't imagine what it would mean to live in a universe where the circle constant is something other than π. This may well just be a failure of my imagination, however – more crucially, hypothesizing that there are lots of universes with different laws of physics doesn't actually explain why we observe our universe. This kind of multiverse idea is a strictly more complicated hypothesis than just accepting that our universe exists and being agnostic about others, right? The only remotely reasonable argument I've heard in favor of some kind of multiverse is that many-worlds is a simpler interpretation of quantum mechanics than wavefunction collapse. This is a distinct idea from the proposal that our universe was born as a random sample among countless others with different physical laws, which is not a simpler explanation of anything at all as far as I can tell.
If I have misunderstood or mischaracterized these arguments, please let me know.
Planes would not be required for stratospheric injection of SO2. It could in theory be done much more cheaply with balloons: https://caseyhandmer.wordpress.com/2023/06/06/we-should-not-let-the-earth-overheat/
Exactly, it has always felt wrong to me to treat being “me” as a random sample of observers. I couldn’t be anyone except me. If the future has trillions of humans or no humans, the person which is me will feel the same way in either case. I find the doomsday argument absurd because it treats my perspective as a random sample, which feels like a type error.
Indeed. I think about this type of thing often when I consider the concept of superhuman AI - when I spend hours stuck on a problem with a simple solution or forget something important, it’s not hard to imagine an algorithm much smarter than me that just doesn’t make those mistakes. I think the bar really isn’t that high for improving substantially on human cognition. Our brains have to operate under very strict energy constraints, but I can easily imagine a machine which performs a lot better than me by applying more costly but effective algorithms and using more precise memory. A pocket calculator is the trivial case, but I expect most of the other algorithms my brain uses can also be improved a lot given a larger energy and compute budget.
I am a literal freshman, and not feeling super optimistic about the future right now. How should I think about how to spend my time?
Hi Naomi!
The advice about applying to MIT/Stanford is probably correct, if just to have the option. That said, I definitely don't regret ending up here!
We are quite similar! I was also accepted to Harvard REA – exactly one year ago – and was too ~~lazy~~ mentally drained by the application process to apply to MIT after that. I arrived intending to study physics, but I've since realized AI safety is a much more important and exciting problem to work on. Seems like you got there a bit sooner than I did! HAIST is a wonderful community, and also a great resource for finding upskilling and research opportunities.
I've only been here for a semester, so take this with a grain of salt, but I don't think you should be too worried about taking classes that interest you. It seems like professors are generally pretty flexible if you reach out to them, and it's easy to fulfill general education requirements with hardly any effort (I just took a class on "Anime as Global Popular Culture" which required essentially no work).
Let me know if you have any questions! I also have some older friends who can give you more specific advice. I sincerely hope I get to meet you soon!