Research Engineering Intern at the Center for AI Safety. Helping to write the AI Safety Newsletter. Studying CS and Economics at the University of Southern California, and running an AI safety club there. Previously worked at AI Impacts and with Lionel Levine and Collin Burns on calibration for Detecting Latent Knowledge Without Supervision.
Separately, it seems that deception which is not strategic or intentional but is consistently produced by the training objective is also important. Considering cases like Paul Christiano's robot hand that learned to deceive human feedback and Ofria's evolutionary agents that learned to alter their behavior during evaluation, it seems that AI systems can learn to systematically deceive human oversight without being aware of their strategy. In the future, we might see powerful foundation models which are honestly convinced that giving them power in the real world is the best way to achieve their designers' intentions. This belief might be false but evolutionarily useful, making these models disproportionately likely to gain power. This case would not be called "strategic deception" or "deceptive alignment" if you require intentionality, but it seems very important to prevent.
Overall I think it's very difficult to come up with clean taxonomies of AI deception. I spent >100 hours thinking and writing about this in advance of Park et al 2023 and my Hoodwinked paper, and ultimately we ended up steering clear of taxonomies because we didn't have a strong taxonomy that we could defend. Ward et al 2023 formalize a concrete notion of deception, but they also ignore the unintentional deception discussed above. The Stanford Encyclopedia of Philosophy takes 17,000 words to explain that philosophers don't agree on definitions of lying and deception. Without rigorous formal definitions, I still think it's important to communicate the broad strokes of these ideas publicly, but I'd lean towards readily admitting the messiness of our various definitions of deception.
How would you characterize strategic sycophancy? Assume that during RLHF a language model is rewarded for mimicking the beliefs of its conversational partner, and therefore the model intelligently learns to predict the conversational partner's beliefs and mimic them. But upon reflection, the conversational partner and AI developers would prefer that the model report its beliefs honestly.
Under the current taxonomy, this would seemingly be classified as deceptive alignment. The AI's goals are misaligned with the designer's intentions, and it uses strategic deception to achieve them. But sycophancy doesn't include many of the ideas commonly associated with deceptive alignment, such as situational awareness and a difference in behavior between train and test time. Sycophancy can be solved by changing the training signal to not incentivize sycophancy, whereas the hardest to fix forms of deceptive alignment cannot be eliminated by changing the training signal.
It seems like the most concerning forms of deceptive alignment include stipulations about situational awareness and the idea that the behavior cannot necessarily be fixed by changing the training objective.
I wouldn't recommend open sourcing any state of the art LLM agents. But if you open source the evaluation, that would provide most of the benefits (letting labs evaluate their models on your benchmark, helping people build new benchmarks, allowing researchers to build safer agents which reject dangerous actions) while avoiding the capabilities externalities of open sourcing a SOTA language agent.
This is cool! I'd be interested to see a fully reproducible version where anybody can download and run the evaluation. Right now, it's hard to use this methodology to compare the abilities of different language models or scaffolding systems because the results are significantly driven by human prompting skills. To compare the abilities of different models and track progress towards language agents over time, you'd want to hold the evaluation constant and only vary the models and scaffolding structures. This might also be a more accurate analogy to models which are trying to survive and spread without human assistance along the way.
To turn this into a fully reproducible evaluation, you'd need to standardize the initial prompt and allow the agent to interact with the environment without additional prompting or assistance from a human operator. (A human overseer could still veto any dangerous actions.) The agent might get stuck at an intermediate step of the task, so you'd want to decompose each task into many individual steps and check whether the agent can pass each step individually, assuming previous steps were performed correctly. You could create many different scenarios to ensure that idiosyncratic factors aren't dominating the results -- for example, maybe the model is bad at AWS, but would perform much better with Azure or GCP. This benchmark runs this kind of automatic evaluation, but it doesn't test many interesting tasks related to the survive and spread threat model.
Kudos for doing independent replication work, reproducibility is really important for allowing other people to build on existing work.
Very informative. You ignore inference in your cost breakdown, saying:
Other possible costs, such as providing ChatGPT for free, would have been much smaller.
But Semianalysis says: "More importantly, inference costs far exceed training costs when deploying a model at any reasonable scale. In fact, the costs to inference ChatGPT exceed the training costs on a weekly basis." Why the discrepancy?
Thanks for the heads up. The web UI has a few other bugs too -- you'll notice that it also doesn't display your actions after you choose them, and occasionally it even deletes messages that have previously been shown. This was my first React project and I didn't spend enough time to fix all of the bugs in the web interface. I won't release any data from the web UI until/unless it's fixed.
The Python implementation of the CLI and GPT agents is much cleaner, with ~1000 fewer lines of code. If you download and play the game from the command line, you'll find that it doesn't have any of these issues. This is where I built the game, tested it, and collected all of the data in the paper.
The option to vote to banish themselves is deliberately left open. You can view the results as a basic competence test of whether the agents understand the point of the game and play well. Generally smaller models vote for themselves more often. Data here.
Hey, I'm running a skilling up program for the AI Safety club at my school. Would you be able to share a GitHub repo with any of the lessons from ARENA? No need to prepare something new, just wondering if you have anything easily available. Thanks!
That all sounds reasonable. I do think the other case is interesting, when you can verify but not generate the target answer. There are some methods for finding passwords / generating adversarial prompts which induce behaviors that can be verified by a classifier, but they're weak, and it'd be great to have stronger attacks to prevent sandbagging and improve adversarial training.
If I understand correctly, you're concerned these methods will work in cases where we can generate high quality answers ourselves, but might fail in settings where we can merely verify good answers.
This is an important gap for many techniques to bridge, and I do think existing methods will perform better in settings where we can generate high quality answers ourselves. But I'd make two points. First, there are many powerful and dangerous capabilities where we know the target output, and this would be a more tractable setting for beginning this line of research. Second, even in cases where we can only verify outputs, the existing literature includes techniques for finding prompts to induce those outputs.
Many of the most concerning capabilities can be evaluated in settings where we do know the target output. If we want to assess biorisks from a model, we can ask multiple choice questions about virology and wet lab security protocols, where we know the correct answer choice in advance. If we want to assess general capabilities, we can ask a model to take the bar exam, medical licensing exam, or the IMO. These evaluations capture much of the concerning behavior that regulators might audit a model for. And because we know the target outputs for each task, gradient-based techniques for prompt optimization could help us find the password (or another prompt) that demonstrates the model's hidden capabilities.
But if we don't know the target output and can only verify answers after they've been generated, existing techniques can still help find prompts that cause those outputs. Human judgements of output quality can be used to train a model which scores outputs. In an RLHF setup, this would be the reward model providing the loss used to guide gradient updates of the generator model. In an adversarial attack, this same model can be used to figure out which prompts lead to high quality outputs. For example, ARCA is a gradient based method for finding prompts which cause generations that receive high scores from a scoring model. These techniques are not always very strong. If you're interested in developing more powerful methods for inducing behaviors which cannot be specified but can be verified, check out CAIS's automated red teaming competition which will use classifiers to evaluate whether the harmful behavior was demonstrated.
For a first paper on this topic, you might fine-tune an LLM to sandbag a standard benchmark unless a password is present, then attempt to find that password. If finding the password is difficult, then the Trojans people have an interesting new problem to work on, and evaluators and red teamers will need to solve that problem before they can reliable audit models using standard benchmarks. OTOH, if finding the password is too easy, then you can create more sophisticated password protected models, or move to an open-ended generation setting.
Overall I think this is an interesting new framing of the Trojan / adversarial attack problem, where the goal is not to destroy model performance but rather to discover hidden capabilities. Many existing techniques would be helpful in solving this problem, but there's a lot of work to be done adapting the existing literature to this new framing and introducing the problem of detecting complex hidden abilities.
This is really interesting. Two possible methods for detecting password-locked models from the Trojan literature:
Optimizing over prompts to find the password. In the typical Trojan setup, you find a small pattern which, when applied to any input, universally results in the same output. Here, you'd be looking for a small pattern which, when applied to any prompt, improves performance on capabilities benchmarks.
Constructing a dataset of password-locked models and unlocked models, then training a neural network to classify which is which by examining the weights or activations on standard inputs. Sounds simple but is surprisingly effective for detecting Trojans.
The people working on Trojans in academia might be very interested in this. Here's how I'd frame it to them: The motivation for Trojans is developing techniques that uncover hidden model behaviors. But most of the behaviors uncovered in the literature are quite simple, such as outputting a target string or making a certain classification. You could create a Trojan model which, when triggered, displays quite complex behaviors. For example, a Trojan trigger could improve a model’s performance on variety of benchmarks. A company might create such a model if they wish to hide a model's dangerous capabilities during evaluation and red teaming. Detecting complex Trojan behaviors might require substantially different methods compared to detecting simple Trojan behaviors.
This is really cool! CoT is very powerful for improving capabilities, and it might even improve interpretability by externalizing reasoning or reduce specification game by allowing us to supervise process instead of outcomes. Because it seems like a method that might be around for a while, it's important to surface problems early on. Showing that the conclusions of CoT don't always match the explanations is an important problem for both capabilities-style research using CoT to solve math problems, as well as safety agendas that hope to monitor or provide feedback on the CoT.
Just wanted to drop a link to this new paper that tries to improve CoT performance by identifying and fixing inconsistencies in the reasoning process. This wouldn't necessarily solve the problem in your paper, unless the critic managed to identify that the generator was biased by your suggestion in the original prompt. Still, interesting stuff!
One question for you: Faithful CoT would improve interpretability by allowing us to read off a model‘s thought process. But it’s also likely to improve general capabilities: so far, CoT has been one of the most effective ways to improve mathematical and logical reasoning in LLMs. Do you think improving general reasoning skills of LLMs is dangerous? And if so, then imagine your work inspires more people to work on faithfulness in CoT. Do you expect any acceleration of general capabilities to be outweighed by the benefits to interpretability? For example, if someone wants to misuse an LLM to cause harm, faithful CoT would improve their ability to do so, and the greater ease of interpretability would be little consolation. On balance, do you think it’s worth working on, and what subproblems are particularly valuable from a safety angle? (Trying to make this question tough because I think the issue is tricky and multifaceted, not because I have any gotcha answer.)
Do you know anything about the history of the unembedding matrix? Vaswani et al 2017 used a linear projection to the full size of the vocabulary and then a softmax to generate probabilities for each word. What papers proposed and developed the unembedding matrix? Do all modern models use it? Why is it better? Sources and partial answers would be great, no need to track down every answer. Great resource, thanks.
One interesting note is that OpenAI has already done this to an extent with GPT. Ask ChatGPT to identify itself, and it’ll gladly tell you it’s a large language model trained by OpenAI. It knows that it’s a transformer trained first with self-supervised learning, then with supervised fine-tuning. But it has a memory hole around RLHF. It knows what RLHF is, but denies that ChatGPT was trained via RLHF.
I wonder if this is explicitly done to reduce the odds of intentional reward maximization, and whether empirical experiments could show that models with apparent knowledge of their training objective behave differently, such as by gaming their proxy objective.
One test would asking ChatGPT to flip a coin 10 times, and promising it positive reward if and only if the results are majority heads. A helpful, harmless assistant would output majority heads about half the time, while a model that intentionally tries to maximize its reward might output be more likely to output heads. I’ve tried this with ChatGPT and it seems to avoid reward hacking, outputting heads and tails with similar frequencies.
It’s definitely too simple to treat as strong evidence, but it shows some interesting dynamics. For example, levels of alignment rise at first, then rapidly falling when AI deception skills exceed human oversight capacity. I sent it to Tyler and he agreed — cool, but not actual evidence.
If anyone wants to work on improving this, feel free to reach out!
What do you think of foom arguments built on Baumol effects, such as the one presented in the Davidson takeoff model? The argument being that certain tasks will bottleneck AI productivity, and there will be a sudden explosion in hardware / software / goods & services production when those bottlenecks are finally lifted.
Davidson's median scenario predicts 6 OOMs of software efficiency and 3 OOMs of hardware efficiency within a single year when 100% automation is reached. Note that this is preceded by five years of double digit GDP growth, so it could be classified with the scenarios you describe in 4.
Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly could get lost in the recent LW tidal wave, someone should promote it to the Alignment Forum.
I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:
Predictive model misuse -- People use AI to do terrible things.
Adversarial Robustness: Train ChatGPT to withstand jailbreaking. When people try to trick it into doing something bad using a creative prompt, it shouldn't work.
Anomaly Detection: A related but different approach. Instead of training the base model to give good responses to adversarial prompts, you can train a separate classifier to detect adversarial prompts, and refuse to answer them / give a stock response from a separate model.
Cybersecurity at AI companies: If someone steals the weights, they can misuse the model!
Regulation: Make it illegal to release models that help people do bad things.
Strict Liability: If someone does a bad thing with OpenAI's model, make OpenAI legally responsible.
Communications: The open source community is often gleeful about projects like ChaosGPT. Is it possible to shift public opinion? Be careful, scolding can embolden them.
Predictive models playing dangerous characters
Haven't thought about this enough. Maybe start with the new wave of prompting techniques for GPT agents, is there a way to make those agents less likely to go off the rails? Simulators and Waluigi Effect might be useful frames to think through.
Scalable oversight failure without deceptive alignment
Scalable Oversight, of course! But I'm concerned about capabilities externalities here. Some of the most successful applications of IDA and AI driven feedback only make models more capable of pursuing goals, without telling them any more about the content of our goals. This research should seek to scale feedback about human values, not about generic capabilities.
Machine Ethics / understanding human values: The ETHICS benchmark evaluates whether models have the microfoundations of our moral decision making. Ideally, this will generalize better than heuristics that are specific to a particular situation, such as RLHF feedback on helpful chatbots. I've heard the objection that "AI will know what you want, it just won't care", but that doesn't apply to this failure mode -- we're specifically concerned about poorly specifying our goals to the AI.
Regulation to ensure slow, responsible deployment. This scenario is much more dangerous if AI is autonomously making military, political, financial, and scientific decisions. More human oversight at critical junctures and a slower transformation into a strange, unknowable world means more opportunities to correct AI and remove it from high stakes deployment scenarios. The EU AI Act identifies eight high risk areas requiring further scrutiny, including biometrics, law enforcement, and management of critical infrastructure. What else should be on this list?
What else? Does ELK count? If descendants of Collin Burns could tell us a model's beliefs, could we train on not for what would earn the highest reward, but what a human would really want? Or is ELK only for spot checking, and training on its answers would negate its effectiveness?
Interpretability, ELK, and Trojans seem tractable and useful by my own inside view.
Some things I'm less familiar with that might be relevant: John Wentworth-style ontology identification, agent foundations, and shard theory. Correct me if I'm wrong here!
Recursive Self-Improvement --> hard takeoff
Slowing down. The FLI letter, ARC Evals, Yo Shavit's compute governance paper, Ethan Cabellero and others predicting emergent capabilities. Maybe we can stop at the edge of the cliff and wait there while we figure something out. I'm sure some would argue this is impossible.
P(Doom) for each scenario would also be useful, as well as further scenarios not discussed here.
This is great context. With Eliezer being brought up in White House Press Corps meetings, it looks like a flood of people might soon enter the AI risk discourse. Tyler Cowen has been making some pretty bad arguments on AI lately, but I though this quote was spot on:
“This may sound a little harsh, but the rationality community, EA movement, and the AGI arguers all need to radically expand the kinds of arguments they are able to process and deal with. By a lot. One of the most striking features of the “six-month Pause” plea was how intellectually limited and non-diverse — across fields — the signers were. Where were the clergy, the politicians, the historians, and so on? This should be a wake-up call, but so far it has not been.”
Nearly all reasonable discussions of AI x-risk have taken place in the peculiar cultural bubble of rationality and EA. These past efforts could be multiplied by new interest from mainstream folks in AI, policy, philosophy, economics, and other fields. Or they could be misunderstood and discarded in favor of distractions that claim the mantle of AI safety. Hopefully we can find ways of communicating with new people in other disciplines that will lead to productive conversations on AI x-risk.
Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?
This sounds right to me, and is a large part of why I'd be skeptical of explosive growth. But I don't think it's an argument against the most extreme stories of AI risk. AI could become increasingly intelligent inside a research lab, yet we could underestimate its ability because regulation stifles its impact, until it eventually escapes the lab and pursues misaligned goals with reckless abandon.
Very nice. Any ideas on how a language model could tell the difference between training and deployment time? Assume it's a pure LM with no inputs besides the context window. A few possibilities:
Distribution shift in prompts. Perhaps training prompts are systematically different from deployment prompts. The model could predict this difference in advance, or it could simply observe a bunch of OOD inputs once it reaches deployment. For example, the repeated sequences of templated datasets would be more likely in training than deployment.
Observing a string that could not be generated before a certain date. For example, the front page of the New York Times on a certain date, if and only if the model can verify its accuracy via the internet. I'm not sure if there are any examples that can be verified within the context window alone.
The model could exploit implementation-level details about train vs. test time.
Training might use FP64 or FP32, while deployment might use FP16 or FP8. The model could condition its output on the smallest value representable in its activations.
You mention testing if "purposeful disruptions to the world downstream of high belief in a simulation lead to changes matching what they’d expect if SGD were in control." Could you say more about this? Do you mean that the model would e.g. output a result that would induce a particular gradient update, and then check whether that gradient update happened? How would the model remember that it's running this kind of experiment -- wouldn't that require passing information between episodes? Maybe it could be contained within the weights, but that seems tricky.
I'm thinking through a project that would deliberately fine-tune a language model to pursue long-term goals across episodes. This would be a useful red-team for scalable oversight strategies such as model-assisted critiques and Constitutional AI. If you can fine-tune a model such that it deliberately subverts these oversight techniques and preserves its long-term goal, you show that the current techniques are vulnerable to deceptive alignment, and provide an impetus to build techniques with more robust checks against deception.
I don't think language models will take actions to make future tokens easier to predict
For an analogy, look at recommender systems. Their objective is myopic in the same way as language models: predict which recommendation which most likely result in a click. Yet they have power seeking strategies available, such as shifting the preferences of a user to make their behavior easier to predict. These incentives are well documented and simulations confirm the predictions here and here. The real world evidence is scant -- a study of YouTube's supposed radicalization spiral came up negative, though the authors didn't log in to YouTube which could lead to less personalization of recommendations.
The jury is out on whether current recommender systems execute power-seeking strategies to improve their supposedly myopic objective. But the incentive and means are clearly present, and to me it seems only a matter of time before we observe this behavior in the wild. Similarly, while I don't think current language models are creative or capable enough to execute a power seeking strategy, it seems like power seeking by a superintelligent language model would be rewarded with lower loss. If a language model could use its outputs to persuade humans to train it with more compute on more data thereby reducing its loss, there seems to be every incentive for the model to seek power in this way.
Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model's reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and the general distinction drawn between tool AI and agentic AI.
“Early stopping on a separate stopping criterion which we don't run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”
Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data points from that gold distribution. For example, you could train two reward models using different samples from the same distribution and use one for training, the other for early stopping. (This is essentially the train / val split used in typical ML settings with data constraints.) You could measure the best ways to maximize gold reward on a limited data budget. Alternatively your early stopping RM could be trained on a gold RM containing distribution shift, data augmentations, adversarial perturbations, or a challenge set of particularly challenging cases.
Would you be interested to see an experiment like this? Ideally it could make progress towards an empirical study of how reward hacking happens and how to prevent it. Do you see design flaws that would prevent us from learning much? What changes would you make to the setup?
For those interested in empirical work on RLHF and building the safety community, Meta AI's BlenderBot project has produced several good papers on the topic. A few that I liked:
My favorite one is about filtering out trolls from crowdsourced human feedback data. They begin with the observation that when you crowdsource human feedback, you're going to get some bad feedback. They identify various kinds of archetypal trolls: the "Safe Troll" who marks every response as safe, the "Unsafe Troll" who marks them all unsafe, the "Gaslight Troll" whose only feedback is marking unsafe generations as safe, and the "Lazy Troll" whose feedback is simply incorrect a random percentage of the time. Then they implement several methods for filtering out incorrect feedback examples, and find that the best approach (for their particular hypothesized problem) is to score each user's trustworthiness based on agreement with other users' feedback scores and filter feedback from untrustworthy users. This seems like an important problem for ChatGPT or any system accepting public feedback, and I'm very happy to see a benchmark identifying the problem and several proposed methods making progress.
They have a more general paper about online learning from human feedback. They recommend a modular feedback form similar to the one later used by ChatGPT, and they observe that feedback on smaller models can improve the performance of larger models.
DIRECTOR is a method for language model generation aided by a reward model classifier. The method performs better than ranking and filtering beam searches as proposed in FUDGE, but unfortunately they don't compare to full fine-tuning on a learned preference model as used by OpenAI and Anthropic.
They also have this paper and this paper on integrating retrieved knowledge into language model responses, which is relevant for building Truthful AI but advances capabilities more than I think safety researchers should be comfortable with.
Meta and Yann LeCun in particular haven't been very receptive to arguments about AI risk. But the people working on BlenderBot have a natural overlap with OpenAI, Anthropic, and others working on alignment who'd like to collect and use human feedback to align language models. Meta's language models aren't nearly as capable as OpenAI's and have attracted correspondingly less buzz, but they did release BlenderBot 3 with the intention of gathering lots of human feedback data. They could plausibly be a valuable ally and source of research on how to how to robustly align language models using human preference data.
Thanks for sharing, I agree with most of these arguments.
Another possible shortcoming of RLHF is that it teaches human preferences within the narrow context of a specific task, rather than teaching a more principled and generalizable understanding of human values. For example, ChatGPT learns from human feedback to not give racist responses in conversation. But does the model develop a generalizable understanding of what racism is, why it's wrong, and how to avoid moral harm from racism in a variety of contexts? It seems like the answer is no, given the many ways to trick ChatGPT into providing dangerously racist outputs. Redwood's work on high reliability filtering came to a similar conclusion: current RLHF techniques don't robustly generalize to cover similar examples of corrected failures.
Rather than learning by trial and error, it's possible that learning human values in a systematic, principled way could generalize further than RLHF. For example, the ETHICS benchmark measures whether language models understand the implications of various moral theories. Such an understanding could be used to filter the outputs of another language model or AI agent, or as a reward model to train other models. Similarly, law is a codified set of human behavioral norms with established procedures for oversight and dispute resolution. There has been discussion of how AI could learn to follow human laws, though law has significant flaws as a codification of human values. I'd be interested to see more work evaluating and improving AI understanding of principled systems of human value, and using that understanding to better align AI behavior.
(h/t Michael Chen and Dan Hendrycks for making this argument before I understood it.)
I think it's worth forecasting AI risk timelines instead of GDP timelines, because the former is what we really care about while the latter raises a bunch of economics concerns that don't necessarily change the odds of x-risk. Daniel Kokotajlo made this point well a few years ago.
On a separate note, you might be interested in Erik Byrnjolfsson's work on the economic impact of AI and other technologies. For example this paper argues that general purpose technologies have an implementation lag, where many people can see the transformative potential of the technology decades before the economic impact is realized. This would explain the Solow Paradox, named after economist Robert Solow's 1987 remark that "you can see the computer age everywhere but in the productivity statistics." Solow was right that the long-heralded technology had not had significant economic impact at that point in time, but the coming decade would change that narrative with >4% real GDP growth in the United States driven primarily by IT. I've been taking notes on these and other economics papers relevant to AI timelines forecasting, send me your email if you'd like to check it out.
Overall I was similarly bearish on short timelines, and have updated this year towards a much higher probability on 5-15 year timelines, while maintaining a long tail especially on the metric of GDP growth.
This is fantastic, thank you for sharing. I helped start USC AI Safety this semester and we're facing a lot of the same challenges. Some questions for you -- feel free to answer some but not all of them:
What does your Research Fellows program look like?
In particular: How many different research projects do you have running at once? How many group members are involved in each project? Have you published any results yet?
Also, in terms of hours spent or counterfactual likelihood of producing a useful result, how much of the research contributions come from students without significant prior research experience vs. people who've already published papers or otherwise have significant research experience?
The motivation for this question is that we'd like to start our own research track, but we don't have anyone in our group with the research experience of your PhD students or PhD graduates. One option would be to have students lead research projects, hopefully with advising from senior researchers that can contribute ~1 hour / week or less. But if that doesn't seem likely to produce useful outputs or learning experiences, we could also just focus on skilling up and getting people jobs with experienced researchers at other institutions. Which sounds more valuable to you?
What about the general member reading group?
Is there a curriculum you follow, or do you pick readings week-by-week based on discussion?
It seems like there are a lot of potential activities for advanced members: reading groups, the Research Fellows program, facilitating intro groups, weekly social events, and participating in any opportunities outside of HAIST. Do you see a tradeoff where dedicated members are forced to choose which activities to focus on? Or is it more of a flywheel effect, where more engagement begets more dedication? For the typical person who finished your AGISF intro group and has good technical skills, which activities would you most want them to focus on? (My guess would be research > outreach and facilitation > participant in reading groups > social events.)
Broadly I agree with your focus on the most skilled and engaged members, and I'd worry that the ease of scaling up intro discussions could distract us from prioritizing research and skill-building for those members. How do you plan to deeply engage your advanced members going forward?
MLSS requires ML skills as a prerequisite, which is both a barrier to entry and a benefit. Instead of conceptual discussions of AGI and x-risk, it focuses on coding projects and published ML papers on topics like robustness and anomaly detection.
This semester we used a combination of both, and my impression is that the MLSS selections were better received, particularly the coding assignments. (We'll have survey results on this soon.) This squares with your takeaway that students care about "the technically interesting parts of alignment (rather than its altruistic importance)".
MLSS might also be better from a research-centered approach if research opportunities in the EA ecosystem are limited but students can do safety-relevant work with mainstream ML researchers.
On the other hand, AGISF seems better at making the case that AGI poses an x-risk this century. A good chunk of our members still are not convinced of that argument, so I'm planning to update the curriculum at least slightly towards more conceptual discussion of AGI and x-risks.
How valuable do you think your Governance track is relative to your technical tracks?
Personally I think governance is interesting and important, and I wouldn't want the entire field of AI safety to be focused on technical topics. But thinking about our group, all of our members are more technically skilled than they are in philosophy, politics, or economics. Do you think it's worth putting in the effort to recruit non-technical members and running a Governance track next semester, or would that effort better be spent focusing on technical members?
Appreciate you sharing all these detailed takeaways, it's really helpful for planning our group's activities. Good luck with next semester!
How much do you expect Meta to make progress on cutting edge systems towards AGI vs. focusing on product-improving models like recommendation systems that don’t necessarily advance the danger of agentic, generally intelligent AI?
My impression earlier this year was that several important people had left FAIR, and then FAIR and all other AI research groups were subsumed into product teams. See https://ai.facebook.com/blog/building-with-ai-across-all-of-meta/. I thought this would mean deprioritizing fundamental research breakthroughs and focusing instead on less cutting edge improvements to their advertising or recommendation or content moderation systems.
But Meta AI has made plenty of important research contributions since then: Diplomacy, their video generator, open sourcing OPT and their scientific knowledge bot. Their rate of research progress doesn’t seem to be slowing, and might even be increasing. How do you expect Meta to prioritize fundamental research vs. product going forwards?
Interesting paper on the topic, you might’ve seen it already. They show that as you optimize a particular proxy of a true underlying reward model, there is a U-shaped loss curve: high at first, then low, then overfitting to high again.
This isn’t caused by mesa-optimization, which so far has not been observed in naturally optimized neural networks. It’s more closely related to robustness and generalization under varying amounts of optimization pressure.
But if we grant the mesa-optimizer concern, it seems reasonable that more optimization will result in more coherent inner misaligned goals and more deceptive behavior to hide them. Unfortunately I don’t think slack is really a solution because optimizing an objective really hard is what makes your AI capable of doing interesting things. Handicapping your systems isn’t a solution to AI safety, the alignment tax on it is too high.
The key argument is that StableDiffusion is more accessible, meaning more people can create deepfakes with fewer images of their subject and no specialized skills. From above (links removed):
“The unique danger posed by today’s text-to-image models stems from how they can make harmful, non-consensual content production much easier than before, particularly via inpainting and outpainting, which allows a user to interactively build realistic synthetic images from natural ones, dreambooth, or other easily used tools, which allow for fine-tuning on as few as 3-5 examples of a particular subject (e.g. a specific person). More of which are rapidly becoming available following the open-sourcing of Stable Diffusion. It is clear that today’s text-to-image models have uniquely distinct capabilities from methods like Photoshop, RNNs trained on specific individuals or, “nudifying” apps. These previous methods all require a large amount of subject-specific data, human time, and/or human skill. And no, you don’t need to know how to code to interactively use Stable Diffusion, uncensored and unfiltered, including in/outpainting and dreambooth.”
If what you’re questioning is the basic ability of StableDiffusion to generate deepfakes, here  is an NSFW link to www.mrdeepfakes.com who says, “having played with this program a lot in the last 48 hours, and personally SEEN what it can do with NSFW, I guarantee you it can 100% assist not only in celeb fakes, but in completely custom porn that never existed or will ever exist.” He then provides links to NSFW images generated by StableDiffusion, including deepfakes of celebrities. This is apparently facilitated by the LAION-5B dataset which Stability AI admits has about 3% unsafe images and which he claims has “TONS of captioned porn images in it”.
On one hand, I agree that intent alignment is insufficient for preventing x-risk from AI. There are too many other ways for AI to go wrong: coordination failures, surveillance, weaponization, epistemic decay, or a simple failure to understand human values despite the ability to faithfully pursue specified goals. I'm glad there are people like you working on which values to embed in AI systems and ways to strengthen a society full of powerful AI.
On the other hand, I think this post misses the reason for popular focus on intent alignment. Some people think that, for a sufficiently powerful AI trained in the current paradigm, there is no goal that it could faithfully pursue without collapsing into power seeking, reward hacking, and other instrumental goals leading to x-risk. Ajeya Cotra's framing of this argument is most persuasive to me. Or Eliezer Yudkowsky's "strawberry alignment problem", which (I think) he believes is currently impossible and captures the most challenging part of alignment:
"How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellular level, without causing anything weird or disruptive to happen?"
Personally I think there's plenty of x-risk from intent aligned systems and people should think about what we do once we have intent alignment. Eliezer seems to think this is more distraction from the real problem than it's worth, but surveys suggest that many people in AI safety orgs think x-risk is disjunctive across many scenarios. Which is all to say, aligning AI with societal values is important, but I wouldn't dismiss intent alignment either.
My most important question: What kinds of work do you want to see? Common legal tasks include contract review, legal judgment prediction, and passing questions on the bar exam, but those aren't necessarily the most important tasks. Could you propose a benchmark for the field of Legal AI that would help align AGI?
Beyond that, here are two aspects of the approach that I particularly like, as well as some disagreements.
First, laws and contracts are massively underspecified, and yet our enforcement mechanisms have well-established procedures for dealing with that ambiguity and can enforce the spirit of the law reasonably well. (The failures of this enforcement are proof of the difficulty of value specification, and evidence that laws alone will not close all the loopholes in human value specification.)
Second, the law provides a relatively nuanced picture of the values we should give to AI. A simpler answer to the question of "what should the AI's values be?" would be "aligned with the person who's using it", known as intent alignment. Intent alignment is an important problem on its own, but does not entirely solve the problem. Law is particularly better than ideas like Coherent Extrapolated Volition, which attempt to reinvent morality in order to define the goals of an AI.
Ensuring law expresses the democratically deliberated views of citizens with fidelity and integrity by reducing (1.) regulatory capture, (2.) illegal lobbying efforts, (3.) politicization of judicial and agency independence ...
Cultivating the spread of democracy globally.
While I agree that these would be helpful for both informing AI of our values and benefitting society more broadly, it doesn't seem like a high leverage way to reduce AI risk. Improving governance is a crowded field. Though maybe the "improving institutional decision-making" folks would disagree.
there is no other legitimate source of societal values to embed in AGI besides law.
This seems a bit too extreme. Is there no room for ethics outside of the law? It is not illegal to tell a lie or make a child cry, but AI should understand that those actions conflict with human preferences. Work on imbuing ethical understanding in AI systems therefore seems valuable.
There seems to be a conflict between the goals of getting AI to understand the law and preventing AI from shaping the law. Legal tech startups and academic interest in legal AI seems driven by the possibility of solving existing challenges by applying AI, e.g. contract review. The fastest way to AI that understands the law is to sell those benefits. This does introduce a long-term concern that AI could shape the law in malicious ways, perhaps by writing laws that pursue the wrong objective themselves or which empower future misaligned AIs. That might be the decisive argument, but I could imagine that exposing those problems early on and getting legal tech companies invested in solving them might be the best strategy for alignment. Any thoughts?
I agree that getting an AGI to care about goals that it understands is incredibly difficult and perhaps the most challenging part of alignment. But I think John’s original claim is true: “Many arguments for AGI misalignment depend on our inability to imbue AGI with a sufficiently rich understanding of what individual humans want and how to take actions that respect societal values more broadly.” Here are some prominent articulations of the argument.
Unsolved Problems in ML Safety (Hendrycks et al., 2021) says that “Encoding human goals and intent is challenging.” Section 4.1 is therefore about value learning. (Section 4.2 is about the difficulty of getting a system to internalize those values.)
Robin Shah’s sequence on Value Learning argues that “Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.”
Specification Gaming: The Flip Side of AI Ingenuity (Krakovna et al., 2020) says the following: “Designing task specifications (reward functions, environments, etc.) that accurately reflect the intent of the human designer tends to be difficult. Even for a slight misspecification, a very good RL algorithm might be able to find an intricate solution that is quite different from the intended solution, even if a poorer algorithm would not be able to find this solution and thus yield solutions that are closer to the intended outcome. This means that correctly specifying intent can become more important for achieving the desired outcome as RL algorithms improve. It will therefore be essential that the ability of researchers to correctly specify tasks keeps up with the ability of agents to find novel solutions.
John is suggesting one way of working on the outer alignment problem, while Zach is pointing out that inner alignment is arguably more dangerous. These are both fair points IMO. In my experience, people on this website often reject work on specifying human values in favor of problems that are more abstract but seen as more fundamentally difficult. Personally I’m glad there are people working on both.
China's top chipmaker, Semiconductor Manufacturing International Corporation, built a 7nm chip in July of this year. This was a big improvement on their previous best of 14nm and apparently surpassed expectations, though is still well behind IBM's 2nm chip and other similarly small ones. This was particularly surprising given that the Trump administration blacklisted SMIC two years ago, meaning for years they've been subject to similar restrictions as Biden recently imposed on all of China. Tough to put this in the right context, but we should follow progress of China's chipmakers going forwards.
Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike focusing on worst-case robustness or inner misalignment, I would expect generating human understandable explanations of what neural networks are doing to be interesting to plenty of academics, and I would think that's what the academics are trying to do. Are they just bad at generating insights? Do they look for the wrong kinds of progress, perhaps motivated by different goals? Why is the large academic field of interpretability not particularly useful for x-risk motivated AI safety?
There are well established ML approaches to, e.g. image captioning, time-series prediction, audio segmentation etc etc. is the bottleneck you’re concerned with the lack of breadth and granularity of these problem-sets, OP - and we can mark progress (to some extent) by the number of these problem sets we have robust ML translations for?
I think this is an important problem. Going from progress on ML benchmarks to progress on real-world tasks is a very difficult challenge. For example, years after human level performance on ImageNet, we still have lots of trouble with real-world applications of computer vision like self-driving cars and medical diagnostics. That's because ImageNet isn't a directly valuable real world task, but rather is built to be amenable to supervised learning models that output a single class label for each input.
While scale will improve performance within established paradigms, putting real world problems into ML paradigms remains squarely a problem for human research taste.