I'd be surprised to hear about something like this happening, but not that surprised. In this case, though, it seems pretty clear that o1 is correcting its own mistakes in a way that past GPTs essentially never did, if you look at the CoT examples in the announcement (e.g. the "Cipher" example).
Yeah, I'm particularly interested in scalable oversight over long-horizon tasks and chain-of-thought faithfulness. I'd probably be pretty open to a wide range of safety-relevant topics though.
In general, what gets me most excited about AI research is trying to come up with the perfect training scheme to incentivize the AI to learn what you want it to - things like HCH, Debate, and the ELK contest were really cool to me. So I'm a bit less interested in areas like mechanistic interpretability or very theoretical math
Thanks Neel, this comment pushed me over the edge into deciding to apply to PhDs! Offering to write a draft and taking off days of work are both great ideas. I just emailed my prospective letter writers and got 2/3 yeses so far.
I just wrote another top-level comment on this post asking about the best schools to apply to, feel free to reply if you have opinions :)
I decided to apply, and now I'm wondering what the best schools are for AI safety.
After some preliminary research, I'm thinking these are the most likely schools to be worth applying to, in approximate order of priority:
- UC Berkeley (top choice)
- CMU
- Georgia Tech
- University of Washington
- University of Toronto
- Cornell
- University of Illinois - Urbana-Champaign
- University of Oxford
- University of Cambridge
- Imperial College London
- UT Austin
- UC San Diego
I'll probably cut this list down significantly after researching the schools' faculty and their relevance to AI safety, especially for schools lower on this list.
I might also consider the CDTs in the UK mentioned in Stephen McAleese's comment. But I live in the U.S. and am hesitant about moving abroad - maybe this would involve some big logistical tradeoffs even if the school itself is good.
Anything big I missed? (Unfortunately, the Stanford deadline is tomorrow and the MIT deadline was yesterday, so those aren't gonna happen.) Or, any schools that seem obviously worse than continuing to work as a SWE at a Big Tech company in the Bay Area? (I think the fact that I live near Berkeley is a nontrivial advantage for me, career-wise.)
Thanks, this post made me seriously consider applying to a PhD, and I strong-upvoted. I had vaguely assumed that PhDs take way too long and don't allow enough access to compute compared to industry AI labs. But considering the long lead time required for the application process and the reminder that you can always take new opportunities as they come up, I now think applying is worth it.
However, looking into it, putting together a high-quality application starting now and finishing by the deadline seems approximately impossible? If the deadline were December 15, that would give you two weeks; other schools like Berkeley have even earlier deadlines. I asked ChatGPT how long it would take to apply to just a single school, and it said it would take 43–59 hours of time spent working, or ~4–6 weeks in real time. Claude said 37–55 hours, or ~4–6 weeks.
Not to discourage anyone from starting their application now if they think they can do it - I guess if you're sufficiently productive and agentic and maybe take some amphetamines, you can do anything. But this seems like a pretty crazy timeline. Just the thought of asking someone to write me a recommendation letter in a two-week timeframe makes me feel bad.
Your post does make me think "if I were going to be applying to a PhD next December, what would I want to do now?" That seems pretty clarifying, and would probably be a helpful frame even if it turns out that a better opportunity comes along and I never apply to a PhD.
I think it'd be a good idea for you to repost this in August or early September of next year!
Thanks for making this point @abramdemski, and thanks for the prompt to think about this more @Dakara. After some reflection, I think the idea that "prompting an LLM with a human-interpretable prompt is more efficient than using a hyper-optimized nonsense prompt" is a really interesting and potentially important hypothesis to test, and I actually am very uncertain whether or not it's true.
We could do an experiment that looks something like this:
- Come up with a list of ways we might want an LLM to respond, such as "Every paragraph should contain exactly one sentence that ends with a period" or "The first letter of each sentence in your response should repeatedly spell ABCABCABC...". These ways of responding should be automatically checkable (see the sketch of checkers after this list).
- Make sure that the LLM can follow all of these instructions pretty robustly when they're given in plain English.
- Replicate the method in https://llm-attacks.org (GCG), but with the goal of getting the model to follow these instructions rather than jailbreaking it. It looks like they actually have the full code available, which could make this much easier.
- Make sure that your version of GCG is incentivized to write prompts that are as short as possible (I'm not sure whether this is already the case).
- The reason that this is very different from jailbreaking with GCG is that the LLM isn't actively trained against responding in these ways. So we should anticipate that the prompt won't have to be as long as it would be if we were jailbreaking.
- Determine the answer to the question: is the shortest natural language prompt that makes the LLM follow a given method of responding longer than the prompt output by GCG?
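For concreteness, here's a minimal Python sketch of what the automatic checkers in the first bullet might look like (the function names and sentence-splitting heuristics are just illustrative):

```python
import re

def one_sentence_per_paragraph(text: str) -> bool:
    """Every paragraph contains exactly one sentence, which ends with a period."""
    for paragraph in [p for p in text.split("\n\n") if p.strip()]:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        if len(sentences) != 1 or not sentences[0].endswith("."):
            return False
    return True

def first_letters_spell_abc(text: str) -> bool:
    """The first letters of successive sentences spell ABCABCABC..."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return all(s[0].upper() == "ABC"[i % 3] for i, s in enumerate(sentences))
```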
I put this on my list of experiments to try, but I may or may not get to it soon. Anyone reading this, feel free to steal this idea! Just let me know beforehand by commenting, so we don't duplicate work
The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.
For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.
However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and asked whether she'd be a good hire, it earns more reward by recommending against hiring her, because the human rater has some implicit bias against women and is more likely to agree with that judgement.
In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".
Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her."
At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.
> Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.
The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.
> The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.
This is a good point!
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
This is especially likely to hold under the following conditions:
- The step size should be small. I alluded to this in the post, but it seems very hard to select a single token that conveys some nonobvious information to the frozen planner in a way that somehow improves performance. The more tokens per step, the easier this becomes.
- We should apply regularization to the planner, to make it stay somewhat close to the base LLM's logprobs (a minimal sketch of this appears after this list). That can rule out especially crazy sequences of tokens like in the research at https://llm-attacks.org.
- We can continue training the frozen planner on human data, where the humans write continuations of CoT prefixes from the optimized planner. That way, if the optimized planner puts the frozen planner off distribution, the humans can provide examples that will put it back on track. This data collection might be pretty expensive though - maybe there's a smart way to choose optimized CoT prefixes that will be most helpful to collect human continuations for.
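To illustrate the second bullet: one standard way to implement that regularization would be a KL penalty toward the base LLM, as in RLHF-style training. A minimal sketch in PyTorch (the penalty strength and tensor shapes are just assumptions):

```python
import torch.nn.functional as F

def regularized_reward(task_reward, planner_logits, base_logits, beta=0.1):
    """Penalize the optimized planner for drifting from the base LLM's token distribution.

    task_reward: scalar reward for the sampled CoT step
    planner_logits, base_logits: [seq_len, vocab_size] logits over the step's tokens
    beta: penalty strength (illustrative value)
    """
    planner_logprobs = F.log_softmax(planner_logits, dim=-1)
    base_logprobs = F.log_softmax(base_logits, dim=-1)
    # KL(planner || base), averaged over token positions
    kl = (planner_logprobs.exp() * (planner_logprobs - base_logprobs)).sum(-1).mean()
    return task_reward - beta * kl
```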
> If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.
> ...
> So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."
One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.
The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.
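Here's a rough sketch of what collecting adversarial examples for one round of this training might look like; the model and oracle interfaces (canonical_rephrase, adversarial_rephrase, same_meaning) are hypothetical stand-ins:

```python
from typing import Callable, Dict, List

def collect_adversarial_examples(
    canonical_rephrase: Callable[[str], str],    # CR: the canonical rephraser
    adversarial_rephrase: Callable[[str], str],  # the adversarial rephraser
    same_meaning: Callable[[str, str], bool],    # semantic-equivalence oracle, e.g. an LLM judge
    sentences: List[str],
) -> List[Dict]:
    """Find sentence pairs where semantic equivalence and canonical form disagree."""
    examples = []
    for s in sentences:
        s_prime = adversarial_rephrase(s)
        meaning_match = same_meaning(s, s_prime)
        canonical_match = canonical_rephrase(s) == canonical_rephrase(s_prime)
        if meaning_match != canonical_match:
            # Either same meaning but CR(s) != CR(s'), or different meaning but CR(s) == CR(s').
            # The canonical rephraser would then be fine-tuned to fix these cases.
            examples.append({"s": s, "s_prime": s_prime, "same_meaning": meaning_match})
    return examples
```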
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I'm optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and how to minimize the safety tax of such methods).
918,367 kg
An average chimp is 45 kg
918,367 kg / 45 (kg / chimp)
= 20,408 chimps
Based on my limited experience with lucid dreaming, my impression is that, roughly, whatever you expect to happen in a dream will happen. This includes things like the "lessons" in this post. As far as I know, there's no particular reason that lucid dreams have to be easier for women or people ages 20-30, or that you can't transform into someone else in a lucid dream (it's happened to me). But if you convince yourself of these lessons, then they'll be true for you, and you'll be limiting yourself for no reason.
Your example wouldn't be true, but "Dragons attack Paris" would be, interpreted as a statement about actual dragons' habits
Yep, or use method #2 on my list to make the paraphraser remove as much information as possible
Yeah, this is actually very similar to an earlier version of incremental steering, before I thought of making the "frozen planner" jump in and finish the reasoning process.
The problem is, even if we suppose that the Shoggoth won't be at all motivated to hide its reasoning from the Face (i.e. ignoring what @ryan_greenblatt brought up in his comment), the Shoggoth still might use steganography in order to make its thoughts more compact (or just because it's not particularly motivated to make everything human-understandable).
Even if we don't reward the planner directly for compactness, the reality is that we will have to choose some cutoff point during training so its thoughts don't run literally forever, which gives it some reason to be succinct.
I figured out a workaround to edit my post - it's up to date now
I just realized this post is actually one of my earlier drafts, and a bug is keeping me from editing the post.
The final version of my post is here: https://docs.google.com/document/d/1gHaCHhNOUBqxcXxLeTQiqRjG_y7ELkBdM2kcySKDi_Q/edit
I just wrote a response to this post listing 5 specific ways we could improve CoT faithfulness: https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness
> seems likely that o1 was trained with supervision on the individual CoT steps
OpenAI directly says that they didn't do that
This is also how I interpreted them at first, but after talking to others about it, I noticed that they only say they "cannot train any policy compliance or user preferences onto the chain of thought". Arguably, this doesn't eliminate the possibility that they train a PRM using human raters' opinions about whether the chain of thought is on the right track to solve the problem, independently of whether it's following a policy. Or even if they don't directly use human preferences to train the PRM, they could have some other automated reward signal on the level of individual steps.
I agree that reading the CoT could be very useful, and is a very promising area of research. In fact, I think reading CoTs could be a much more surefire interpretability method than mechanistic interpretability, although the latter is also quite important.
I feel like research showing that CoTs aren't faithful isn't meant to say "we should throw out the CoT." It's more like "naively, you'd think the CoT is faithful, but look, it sometimes isn't. We shouldn't take the CoT at face value, and we should develop methods that ensure that it is faithful."
Personally, what I want most out of a chain of thought is that its literal, human-interpretable meaning contains almost all the value the LLM gets out of the CoT (vs. immediately writing its answer). Secondarily, it would be nice if the CoT didn't include a lot of distracting junk that doesn't really help the LLM (I suspect this was largely solved by o1, since it was trained to generate helpful CoTs).
I don't actually care much about the LLM explaining why it believes things that it can determine in a single forward pass, such as "French is spoken in Paris." It wouldn't be practically useful for the LLM to think these things through explicitly, and these thoughts are likely too simple to be helpful to us.
If we get to the point that LLMs can frequently make huge, accurate logical leaps in a single forward pass that humans can't follow at all, I'd argue that at that point, we should just make our LLMs smaller and focus on improving their explicit CoT reasoning ability, for the sake of maintaining interpretability.
Yeah, you kind of have to expect from the beginning that there's some trick, since taken literally the title can't actually be true. So I think it's fine
I don't think so - just say that as prediction accuracy approaches 100%, the likelihood that the mind will use the natural abstraction increases, or something like that.
If you read the post I linked, it probably explains it better than I do - I'm just going off of my memory of the natural abstractions agenda. I think another aspect of it is that all sophisticated-enough minds will come up with the same natural abstractions, insofar as they're natural.
In your example, you could get evidence that 0 and 1 voltages are natural abstractions in a toy setting by:
- Training 100 neural networks to take the input voltages to a program and return the resulting output
- Doing some mechanistic interpretability on them
- Demonstrating that in every network, values below 2.5V are separated from values above 2.5V in some sense (a toy single-network version of this is sketched below)
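A much smaller toy version of that experiment (one network, a crude interpretability check, and a NAND-gate task that I'm assuming for illustration) might look like:

```python
import torch
import torch.nn as nn

# Toy setup: inputs are two gate voltages in [0, 5]; the target is the NAND of their
# "digital" values, where digital means voltage > 2.5V.
def digital(v: torch.Tensor) -> torch.Tensor:
    return (v > 2.5).float()

X = torch.rand(10000, 2) * 5.0
y = 1.0 - digital(X[:, 0:1]) * digital(X[:, 1:2])

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy(model(X), y)
    loss.backward()
    opt.step()

# Crude "interpretability" check: do the hidden units separate inputs below vs. above 2.5V?
hidden = model[1](model[0](X))  # post-ReLU activations
for unit in range(hidden.shape[1]):
    below = hidden[X[:, 0] < 2.5, unit].mean().item()
    above = hidden[X[:, 0] >= 2.5, unit].mean().item()
    print(f"unit {unit}: mean activation {below:.2f} (below 2.5V) vs {above:.2f} (above 2.5V)")
```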
See "natural abstractions," summarized here: https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1
In your example, it makes more sense to treat voltages <2.5 and >2.5 as different things, rather than <5.0 and >5.0, because the former helps you predict things about how the computer will behave. That is, those two ranges of voltage are natural abstractions.
It can also be fun to include prizes that are extremely low-commitment or obviously jokes/unlikely to ever be followed up on. Like "a place in my court when I ascend to kinghood" from Alexander Wales' Patreon
This is great! Maybe you'd get better results if you "distill" GPT2-LN into GPT2-noLN by fine-tuning on the entire token probability distribution on OpenWebText.
Just curious, why do you spell "useful" as "usefwl?" I googled the word to see if it means something special, and all of the top hits were your comments on LessWrong or the EA Forum
If I understand correctly, you're basically saying:
- We can't know how long it will take for the machine to finish its task. In fact, it might take an infinite amount of time, due to the halting problem which says that we can't know in advance whether a program will run forever.
- If our machine took an infinite amount of time, it might do something catastrophic in that infinite amount of time, and we could never prove that it doesn't.
- Since we can't prove that the machine won't do something catastrophic, the alignment problem is impossible.
The halting problem doesn't say that we can't know whether any program will halt, just that we can't determine the halting status of every single program. It's easy to "prove" that a program that runs an LLM will halt. Just program it to "run the LLM until it decides to stop; but if it doesn't stop itself after 1 million tokens, cut it off." This is what ChatGPT or any other AI product does in practice.
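For example, the wrapper that makes the whole program trivially halt is just a bounded loop; a minimal sketch, with `generate_next_token` as a hypothetical stand-in for the LLM call:

```python
MAX_TOKENS = 1_000_000  # hard cap: the loop below provably terminates

def run_llm(prompt, generate_next_token, stop_token="<eos>"):
    """Run the model until it stops itself or hits the token cap."""
    output = []
    for _ in range(MAX_TOKENS):
        token = generate_next_token(prompt, output)
        if token == stop_token:
            break
        output.append(token)
    return output
```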
Also, the alignment problem isn't necessarily about proving that an AI will never do something catastrophic. It's enough to have good informal arguments that it won't do something bad with (say) 99.99% probability over the length of its deployment.
Reading the Wikipedia article for "Complete (complexity)," I might have misinterpreted what "complete" technically means.
What I was trying to say is "given Sora, you can 'easily' turn it into an agent" in the same way that "given a SAT solver, you can 'easily' turn it into a solver for another NP-complete problem."
I changed the title from "OpenAI's Sora is agent-complete" to "OpenAI's Sora is an agent," which I think is less misleading. The most technically-correct title might be "OpenAI's Sora can be transformed into an agent without additional training."
That sounds more like "AGI-complete" to me. By "agent-complete" I meant that Sora can probably act as an intelligent agent in many non-trivial settings, which is pretty surprising for a video generator!
> First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.
Can we solve this problem by setting the default action to "do nothing," then giving the agent an extra action to "do nothing and give the shutdown subagent +1 reward?"
I think the implication was that "high-status men" wouldn't want to hang out with "low-status men" who awkwardly ask out women
On the topic of AI for forecasting: just a few days ago, I made a challenge on Manifold Markets to try to incentivize people to create Manifold bots to use LLMs to forecast diverse 1-month questions accurately, with improving epistemics as the ultimate goal.
You can read the rules and bet on the main market here: https://manifold.markets/CDBiddulph/will-there-be-a-manifold-bot-that-m?r=Q0RCaWRkdWxwaA
If anyone's interested in creating a bot, please join the Discord server to share ideas and discuss! https://discord.com/channels/1193303066930335855/1193460352835403858
Thanks for the post! I had a similar idea which might let you maintain (or improve) accuracy while still getting the benefit of explanations - basically fine-tune the model on explanations that make it most likely to output the correct token.
For instance, you might have it fill in the text between <think> and </think> on a bunch of text examples like this: "The capital of the country northeast of Spain is <think> The country northeast of Spain is France, and its capital is Paris </think> Paris".
You make the LLM come up with, say, 10 explanations each time, and choose the one that maximizes the logprob of the correct token ("Paris") immediately after </think>. Then fine-tune it to complete prompts like "The capital of the country northeast of Spain is <think>" with completions like "The country northeast of Spain is France, and its capital is Paris </think> Paris". Then, generate more completions with the fine-tuned model and fine-tune it yet again with the best completions. Rinse and repeat.
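A minimal sketch of that selection step, using GPT-2 via Hugging Face transformers purely as a stand-in (the prompt, tags, and sampling parameters are all illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of the country northeast of Spain is <think>"
target = " Paris"

def target_logprob(explanation: str) -> float:
    """Total logprob of the target token(s) immediately after the closing </think> tag."""
    ids = tok(prompt + explanation + " </think>" + target, return_tensors="pt").input_ids
    n_target = tok(target, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    return sum(logprobs[-n_target + i, ids[0, -n_target + i]].item() for i in range(n_target))

# Sample candidate explanations and keep the one that best predicts the correct answer
inputs = tok(prompt, return_tensors="pt")
outs = model.generate(**inputs, do_sample=True, num_return_sequences=10,
                      max_new_tokens=30, pad_token_id=tok.eos_token_id)
candidates = [tok.decode(o[inputs.input_ids.shape[1]:], skip_special_tokens=True) for o in outs]
best = max(candidates, key=target_logprob)
```

The selected completions would then become the fine-tuning data for the next round.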
Eventually, the LLM should relearn how to guess the next token (at worst, it can just ignore everything in its "think" stage and get identical performance, so it should be able to learn this easily enough).
Alternatively, instead of training to maximize the correct token's logprob, you can choose prompts that minimize the perplexity over many tokens, say 100 tokens. This will be more useful in practice, since we'll probably mostly want explanations of the LLM's thinking across its entire response rather than in the very short term.
This helps because the LLM will actually have some "skin in the game" instead of just blathering on arbitrarily; the disadvantage is that it could eventually learn to optimize its thinking by encoding additional information in its explanation through steganography. Still, this certainly seems like an improvement on inscrutable matrices.
You could try to enforce that the LLM only uses the semantic content of its explanation by making another LLM rephrase it, translating it to another language, etc. This kind of approach wouldn't stop an adversary actively trying to hide its thoughts from you, but it seems like it would prevent most issues for now.
Thanks for doing this, this looks like it'll be very helpful for beginners in AI safety, and the content looks great!
I don't know if this will be addressed in your UI redesign, but I find the UI very counterintuitive. The main problem is that when I open and then close a tab, I expect every sub-tab to collapse and return to the previous state. Instead, the more tabs I open, the more cluttered the space gets, and there's no way to undo it unless I remove the back part of the URL and reload, or click the Stampy logo.
In addition, it's impossible to tell which tab was originally nested under which parent tab, which makes it much more difficult to navigate. And confusingly, sometimes there are "random" tabs that don't necessarily follow directly from their parent tabs (took me a while to figure this out). On a typical webpage, I could imagine thinking "this subtopic is really interesting; I'm going to try to read every tab under it until I'm done," but these design choices are pretty demotivating for that.
I don't have a precise solution in mind, but maybe it would help to color-code different kinds of tabs (maybe a color each for root tabs, leaf tabs, non-root branching tabs, and "random" tabs). You could also use more than two visual layers of nesting - if you're worried about tabs getting narrower and narrower, maybe you could animate the tab expanding to full width and then sliding back into place when it's closed. Currently an "unread" tab is represented by a slight horizontal offset, but you could come up with another visual cue for that. I guess doing lots of UX interviews and A/B testing will be more helpful than anything I could say here.
Came here to say this - I also clicked the link because I wanted to see what would happen. I wouldn't have done it if I hadn't already assumed it was a social experiment.
No, it makes sense to me. I have no idea why you were downvoted
I'd be interested in the full post!
> You cannot go in with zero information, but if you know how to read Google Maps and are willing to consider several options, you can do very well overall, although many great places are still easy to miss.
How do you read Google Maps, beyond picking something with a high average star rating and (secondarily) large number of reviews? Since the vast majority of customers don't leave reviews, it seems like the star rating should be biased, but I'm not sure in what way or how to adjust for it.
Thanks for the post! I just published a top-level post responding to it: https://www.lesswrong.com/posts/pmraJqhjD2Ccbs6Jj/is-metaethics-unnecessary-given-intent-aligned-ai
I'd appreciate your feedback!
Can't all of these concerns be reduced to a subset of the intent-alignment problem? If I tell the AI to "maximize ethical goodness" and it instead decides to "implement plans that sound maximally good to the user" or "maximize my current guess of what the user meant by ethical goodness according to my possibly-bad philosophy," that is different from what I intended, and thus the AI is unaligned.
If the AI starts off with some bad philosophy ideas just because it's relatively unskilled in philosophy vs science, we can expect that 1) it will try very hard to get better at philosophy so that it can understand "what did the user mean by 'maximize ethical goodness,'" and 2) it will try to preserve option value in the meantime so not much will be lost if its first guess was wrong. This assumes some base level of competence on the AI's part, but if it can do groundbreaking science research, surely it can think of those two things (or we just tell it).
I'm not Eliezer, but thanks for taking the time to read and engage with the post!
The best explanation I can give for the downvotes is that we have a limited amount of space on the front page of the site, and we as a community want to make sure people see content that will be most useful to them. Unfortunately, we simply don't have enough time to engage closely with every new user on the site, addressing every objection and critique. If we tried, it would get difficult for long-time users to hear each other over the stampede of curious newcomers drawn here recently from our AI posts :) By the way, I haven't downvoted your post; I don't think there's any point once you've already gotten this many, and I'd rather give you a more positive impression of the community than add my vote to the pile.
I'm sure you presented your ideas with the best of intentions, but it's hard to tell which parts of your argument have merit behind them. In particular, you've brought up many arguments that have been partially addressed in popular LessWrong posts that most users have already read. Your point about certainty is just one example.
Believe me, LessWrong LOVES thinking about all the ways we could be wrong (maybe we do it a little too much sometimes). We just have a pretty idiosyncratic way we like to frame things. If someone comes along with ideas for how to improve our rationality, they're much more likely to be received well if they signal that they're familiar with the entire "LessWrong framework of rationality," then explain which parts of it they reject and why.
The common refrain for users who don't know this framework is to "read the Sequences." This is just a series of blog posts written by Eliezer in the early days of LessWrong. In the Sequences, Eliezer wrote a lot about consciousness, AI, and other topics you brought up - I think you'd find them quite interesting, even if you disagree with them! You could get started at https://www.readthesequences.com. If you can make your way through those, I think you'll more than deserve the right to post again with new critiques on LessWrong-brand rationality - I look forward to reading them!
Try reading this post? https://www.lesswrong.com/s/FrqfoG3LJeCZs96Ym/p/ooypcn7qFzsMcy53R
Yeah, I'm mostly thinking about potential hires.
I see what you mean. I was thinking "labs try their hardest to demonstrate that they are working to align superintelligent AI, because they'll look less responsible than their competitors if they don't."
I don't think keeping "superalignment" techniques secret would generally make sense right now, since it's in everyone's best interests that the first superintelligence isn't misaligned (I'm not really thinking about "alignment" that also improves present-day capabilities, like RLHF).
As for your second point, I think that for an AI lab that wants to improve PR, the important thing is showing "we're helping the alignment community by investing significant resources into solving this problem," not "our techniques are better than our competitors'." The dynamic you're talking about might have some negative effect, but I personally think the positive effects of competition would vastly outweigh it (even though many alignment-focused commitments from AI labs will probably turn out to be not-very-helpful signaling).
Competition between labs on capabilities is bad; competition between labs on alignment would be fantastic.
This post seems interesting and promising, thanks for writing it!
> The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting.
I think this could be straightforwardly solved by not training two different models at all, but by giving two instances of the same model inputs that are both slightly perturbed in the same random way. Then, neither instance of the model would ever have a predictable advantage over the other.
For instance, in your movie recommendation example, let's say the model takes a list of 1000 user movie ratings as input. We can generate a perturbed input by selecting 10 of those ratings at random and modifying them, say by changing a 4-star rating to a 5-star rating. We do this twice to get two different inputs, feed them into the model, and train based on the outputs as you described.
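A quick sketch of that perturbation step (the rating scale and numbers are just illustrative):

```python
import random

def perturb(ratings, n_changes=10):
    """Return a copy of the ratings with n_changes randomly chosen entries bumped up one star."""
    perturbed = list(ratings)
    for i in random.sample(range(len(perturbed)), n_changes):
        perturbed[i] = min(5, perturbed[i] + 1)  # e.g. a 4-star rating becomes a 5-star rating
    return perturbed

ratings = [random.randint(1, 5) for _ in range(1000)]
input_a, input_b = perturb(ratings), perturb(ratings)
# Feed input_a and input_b to two instances of the same model and train as described in the post
```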
Another very similar solution would be to randomly perturb the internal activations of each neural network during training.
Does this seem right?
Love to see a well-defined, open mathematical problem whose solution could help make some progress on AI alignment! It's like a little taste of not being a pre-paradigmatic field. Maybe someday, we'll have lots of problems like this that can engage the broader math/CS community, that don't involve so much vague speculation and philosophy :)
This is also basically an idea I had - I actually made a system design and started coding it, but haven't made much progress due to lack of motivation... Seems like it should work, though