william_s feed - LessWrong 2.0 Reader william_s’s posts and comments on the Effective Altruism Forum en-us Comment by William_S on The Main Sources of AI Risk? https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk#Lnp2QpWQmPyGk52dz <ul><li>AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people&#x27;s values are loaded into the system), and this is relevant for overall strategy.</li></ul> william_s Lnp2QpWQmPyGk52dz 2019-03-22T18:55:56.945Z Comment by William_S on The Main Sources of AI Risk? https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk#yDJ95p8C3Gvg7zdpT <ul><li>Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in <a href="https://arxiv.org/abs/1711.00363">https://arxiv.org/abs/1711.00363</a> of agreeing to split control of the future according to which human&#x27;s priors are most accurate (on potentially irrelevant issues) if this isn&#x27;t what humans actually want.</li></ul> william_s yDJ95p8C3Gvg7zdpT 2019-03-22T18:52:24.938Z Comment by William_S on Some Thoughts on Metaphilosophy https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy#cvqf6QbWCMA6DZti7 <p>Maybe one AI philosophy service could look like this: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what those answers imply about the problem of qualia if you use some method of reconciling those answers.</p> william_s cvqf6QbWCMA6DZti7 2019-03-08T18:58:55.169Z Comment by William_S on Some Thoughts on Metaphilosophy https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy#eLTuDtXZmxevxBnPd <p>Re: Philosophy as interminable debate, another way to put the relationship between math and philosophy:</p><p>Philosophy as weakly verifiable argumentation</p><p>Math is solving problems by looking at the consequences of a small number of axiomatic reasoning steps. For something to be math, we have to be able to ultimately cash out any proof as a series of these reasoning steps. Once something is cashed out in this way, it takes a small constant amount of time to verify any reasoning step, so we can verify a proof given polynomial time.</p><p>Philosophy is solving problems where we haven&#x27;t figured out a set of axiomatic reasoning steps. Any non-axiomatic reasoning step we propose could end up having arguments that we hadn&#x27;t thought of that would lead us to reject that step. And those arguments themselves might be undermined by other arguments, and so on. Each round of debate lets us add another level of counter-arguments. Philosophers can make progress when they have some good predictor of whether arguments are good or not, but they don&#x27;t have access to certain knowledge of arguments being good.</p><p>Another difference between mathematics and philosophy is that in mathematics we have a well-defined set of objects and a well-defined problem we are asking about.
Whereas in philosophy we are trying to ask questions about things that exist in the real world and/or we are asking questions that we haven&#x27;t crisply defined yet.</p><p>When we come up with a set of axioms and a description of a problem, we can move that problem from the realm of philosophy to the realm of mathematics. When we come up with some method we trust of verifying arguments (ie. replicating scientific experiments), we can move problems out of philosophy to other sciences.</p><p>It could be the case that philosophy grounds out in some reasonable set of axioms which we don&#x27;t have access to now for computational reasons - in which case it could all end up in the realm of mathematics. It could be the case that, for all practical purposes, we will never reach this state, so it will remain in the &quot;potentially unbounded DEBATE round case&quot;. I&#x27;m not sure what it would look like if it could never ground out - one model could be that we have a black box function that performs a probabilistic evaluation of argument strength given counter-arguments, and we go through some process to get the consequences of that, but it never looks like &quot;here is a set of axioms&quot;.</p> william_s eLTuDtXZmxevxBnPd 2019-03-08T18:53:49.848Z Comment by William_S on Some Thoughts on Metaphilosophy https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy#B2DKZqD98gan7nXQK <p>I guess it feels like I don&#x27;t know how we could know that we&#x27;re in the position that we&#x27;ve &quot;solved&quot; meta-philosophy. It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.</p><p>I also don&#x27;t think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).</p><p>I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe at some point the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not &quot;comes up with the same answer&quot; but more like &quot;comes up with an answer that other models think is okay and not disastrous to accept&quot;).</p><p>Do you think this would lead to &quot;good outcomes&quot;? Do you think some version of this approach could be satisfactory for solving the problems in <a href="https://www.lesswrong.com/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety">Two Neglected Problems in Human-AI Safety</a>?</p><p>Do you think there&#x27;s a different kind of thing that we would need to do to &quot;solve metaphilosophy&quot;? 
Or do you think that working on &quot;solving metaphilosophy&quot; roughly cashes out as &quot;work on coming up with better and better models of philosophy in the model I&#x27;ve described here&quot;?</p> william_s B2DKZqD98gan7nXQK 2019-03-08T17:41:23.465Z Comment by William_S on Three AI Safety Related Ideas https://www.lesswrong.com/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas#Koqa4intfDbf85BQY <p>A couple ways to implement a hybrid approach with existing AI safety tools:</p><p>Logical Induction: Specify some computationally expensive simulation of idealized humans. Run a logical inductor with the deductive process running the simulation and outputting what the humans say after time x in simulation, as well as statements about what non-idealized humans are saying in the real world. The inductor should be able to provide beliefs about what the idealized humans will say in the future informed by information from the non-idealized humans.</p><p>HCH/IDA: The HCH-humans demonstrate a reasoning process which aims to predict the output of a set of idealized humans using all available information (which can include running simulations of idealized humans or information from real humans). The way that the HCH tree uses information about real humans involves looking carefully at their circumstances and asking things like &quot;how do the real human&#x27;s circumstances differ from the idealized human&quot; and &quot;is the information from the real human compromised in some way?&quot;</p> william_s Koqa4intfDbf85BQY 2019-03-08T17:30:25.571Z Comment by William_S on Can HCH epistemically dominate Ramanujan? https://www.lesswrong.com/posts/4qY9zEHLa2su4PkQ4/can-hch-epistemically-dominate-ramanujan#KkxSbsdWJtA8aHruZ <p>It seems like for Filtered-HCH, the application in the post you linked to, you might be able to do a weaker version where you label any computation that you can&#x27;t understand in kN steps as problematic, only accepting things you think you can efficiently understand. (But I don&#x27;t think Paul is arguing for this weaker version).</p> william_s KkxSbsdWJtA8aHruZ 2019-02-27T22:58:06.230Z Comment by William_S on Reinforcement Learning in the Iterated Amplification Framework https://www.lesswrong.com/posts/fq7Ehb2oWwXtZic8S/reinforcement-learning-in-the-iterated-amplification#gCPbsh3BFc26zdHGC <blockquote>RL is typically about sequential decision-making, and I wasn&#x27;t sure where the &quot;sequential&quot; part came in).</blockquote><p>I guess I&#x27;ve used the term &quot;reinforcement learning&quot; to refer to a broader class of problems including both one-shot bandit problems and sequential decision making problems.
In this view, the feature that makes RL different from supervised learning is not that we&#x27;re trying to figure out how to act in an MDP/POMDP, but instead that we&#x27;re trying to optimize a function that we can&#x27;t take the derivative of (in the MDP case, it&#x27;s because the environment is non-differentiable, and in the approval learning case, it&#x27;s because the overseer is non-differentiable).</p> william_s gCPbsh3BFc26zdHGC 2019-02-18T21:09:09.756Z Comment by William_S on Some disjunctive reasons for urgency on AI risk https://www.lesswrong.com/posts/8oSCw3z2dZgWjanqB/some-disjunctive-reasons-for-urgency-on-ai-risk#bCm2FmZq8quQ7bDyP <p>Re: scenario 3, see <a href="https://en.wikipedia.org/wiki/The_Evitable_Conflict">The Evitable Conflict</a>, the last story in Isaac Asimov&#x27;s &quot;I, Robot&quot;:</p><blockquote>&quot;Stephen, how do we know what the ultimate good of Humanity will entail? We haven&#x27;t at our disposal the infinite factors that the Machine has at its! Perhaps, to give you a not unfamiliar example, our entire technical civilization has created more unhappiness and misery than it has removed. Perhaps an agrarian or pastoral civilization, with less culture and less people would be better. If so, the Machines must move in that direction, preferably without telling us, since in our ignorant prejudices we only know that what we are used to, is good – and we would then fight change. Or perhaps a complete urbanization, or a completely caste-ridden society, or complete anarchy, is the answer. We don&#x27;t know. Only the Machines know, and they are going there and taking us with them.&quot;</blockquote> william_s bCm2FmZq8quQ7bDyP 2019-02-15T21:59:17.046Z Comment by William_S on HCH is not just Mechanical Turk https://www.lesswrong.com/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk#wqivTcZvP48kXvPsH <p>Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer&#x27;s Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.</p><p>Though I think this still differs from HRAD in that it seems like the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (ie. automatically add versions of all questions with typos that don&#x27;t change the meaning of a question into the Lookup Table, add 1000 different sanity check questions to flag that things can go wrong).</p><p>So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong.
So the problem could be less difficult than HRAD, but still tricky to get right.</p> william_s wqivTcZvP48kXvPsH 2019-02-13T00:15:16.037Z Comment by William_S on The Argument from Philosophical Difficulty https://www.lesswrong.com/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty#7uWrcCd6PAvDn72f8 <p>Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of &quot;a pretty good future&quot; that is fine with something like a bunch of human-descended beings living happy lives that misses out on the sort of things mentioned in Beyond Astronomical Waste, and &quot;optimal future&quot; which includes those considerations). I buy this as an argument that &quot;we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value&quot; rather than &quot;our efforts to get a pretty good future are doomed unless we make tons of progress on this&quot; or something like that.</p><p>&quot;Thousands of millions&quot; was a typo.</p> william_s 7uWrcCd6PAvDn72f8 2019-02-11T17:47:18.818Z Comment by William_S on Thoughts on reward engineering https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#4ZnrfC673yBKh2kA3 <blockquote>What is the motivation for using RL here?</blockquote><p>I see the motivation as: given practical compute limits, it may be much easier to have the system find an action the overseer approves of instead of imitating the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.</p> william_s 4ZnrfC673yBKh2kA3 2019-02-10T22:31:38.378Z Comment by William_S on Thoughts on reward engineering https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#cmJfe4rZWdEHsWZ6t <blockquote>Would this still be a problem if we were training the agent with SL instead of RL?</blockquote><p>Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process.
Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.</p> william_s cmJfe4rZWdEHsWZ6t 2019-02-10T22:28:42.330Z Comment by William_S on Reinforcement Learning in the Iterated Amplification Framework https://www.lesswrong.com/posts/fq7Ehb2oWwXtZic8S/reinforcement-learning-in-the-iterated-amplification#zTfj4zXHTAmZ9b9sn <blockquote>I don&#x27;t understand why we want to find this X* in the imitation learning case.</blockquote><p>Ah, with this example the intent was more like &quot;we can frame what the RL case is doing as finding X*, let&#x27;s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)&quot;.</p><p>The reverse mapping (imitation to RL) just consists of applying reward 1 to M2&#x27;s demonstrated behaviour (which could be &quot;execute some safe search and return the results&quot;), and reward 0 to everything else.</p><blockquote>What is p_M(X*)?</blockquote>
<p>p_M(X*) is the probability of outputting X* (where p_M is a stochastic policy).</p><blockquote>M2(&quot;How good is answer X to Y?&quot;) * ∇log(p_M(X))</blockquote><p>This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly).</p> william_s zTfj4zXHTAmZ9b9sn 2019-02-10T22:09:27.681Z Comment by William_S on Announcement: AI alignment prize round 4 winners https://www.lesswrong.com/posts/nDHbgjdddG5EN6ocg/announcement-ai-alignment-prize-round-4-winners#w5CTff8hGAFk5Pf7o <p>I guess the question was more from the perspective of: if the cost was zero then it seems like it would be
worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest).</p> william_s w5CTff8hGAFk5Pf7o 2019-02-10T19:06:00.792Z Comment by William_S on The Argument from Philosophical Difficulty https://www.lesswrong.com/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty#qWHatXNamrRoCjSna <p>One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that &quot;solve philosophical problems well enough to have a pretty good future&quot; is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?</p> william_s qWHatXNamrRoCjSna 2019-02-10T19:02:57.074Z Comment by William_S on Announcement: AI alignment prize round 4 winners https://www.lesswrong.com/posts/nDHbgjdddG5EN6ocg/announcement-ai-alignment-prize-round-4-winners#v7e6BRGBDxWZGqEAZ <p>Slightly disappointed that this isn&#x27;t continuing (though I didn&#x27;t submit to the prize, I submitted to Paul Christiano&#x27;s call for possible problems with his approach, which was similarly structured). Was hoping that once I got further into my PhD, I&#x27;d have some more things worth writing up, and the recognition/a bit of prize money would provide some extra motivation to get them out the door.</p><p>What do you feel is the limiting resource that keeps this from being useful to continue in its current form?</p> william_s v7e6BRGBDxWZGqEAZ 2019-02-09T17:43:58.790Z Comment by William_S on HCH is not just Mechanical Turk https://www.lesswrong.com/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk#R335SY48K3HaXn4p7 <p>Yeah, this is a problem that needs to be addressed. It feels like in the Overseer&#x27;s Manual case you can counteract this by giving definitions/examples of how you want questions to be interpreted, and in the Lookup Table case this can be addressed by coordination within the team creating the lookup table.</p> william_s R335SY48K3HaXn4p7 2019-02-09T17:10:56.001Z Reinforcement Learning in the Iterated Amplification Framework https://www.lesswrong.com/posts/fq7Ehb2oWwXtZic8S/reinforcement-learning-in-the-iterated-amplification <p>When I think about <a href="https://blog.openai.com/amplifying-ai-training/">Iterated</a> <a href="https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616">Amplification</a> (IA), I usually think of a version that uses imitation learning for distillation.</p><p>This is the version discussed in <a href="https://arxiv.org/abs/1811.07871">Scalable agent alignment via reward modeling: a research direction</a> as &quot;Imitating expert reasoning&quot;, in contrast to the proposed approach of &quot;Recursive Reward Modelling&quot;. The approach works roughly as follows:</p><p>1. Gather training data from experts on how to break problems into smaller pieces and combine the results</p><p>2. Train a model to imitate what the expert would do at every step</p><p>3. Amplification: Run a collaboration of a large number of copies of the learned model.</p><p>4. Distillation: Train a model to imitate what the collaboration did.</p><p>5.
Repeat steps 3 and 4, increasing performance at every step</p><p>However, Paul has also talked about IA using reinforcement learning (RL) to maximize the approval of the amplified model. What does this approach (RL-IA) look like? How does it relate to Imitation-IA and Recursive Reward Modelling?</p><p><strong>Puzzling about RL-IA</strong></p><p>To get an agent that takes good actions in an Atari game, we use Imitation-IA to build a system that answers the question &quot;how good is it to take actions from this state&quot;, then train a reinforcement learner to &quot;output the best action to take from a given state&quot;.</p><p>But it seems like the improvement stops there - it&#x27;s not clear how &quot;ability to output the best action to take from a given state&quot; could improve &quot;ability to evaluate how good actions are from a state&quot; in any way that&#x27;s different from running a traditional reinforcement learning algorithm (which usually involves taking some policy/value estimate and gradually improving it).</p><p><strong>Clarifying what RL-IA does</strong></p><p>Claim: There is a fairly straightforward correspondence between how Imitation-IA and RL-IA perform a task (given no computational limits). RL-IA does not change the class of tasks that Imitation-IA can perform or perform them in a radically different way.</p><p>Suppose we have a current version of the model M1 that takes questions and produces a distribution over answers. Let M2 be an amplified version of that model (ie. produced by running a number of copies of M1). Let Y be some question, with domain of answers D. We want to find the answer X*, the answer in D that maximizes the approval of the amplified overseer, M2(&quot;How good is answer X to Y?&quot;).
Y could be</p><ul><li>&quot;What action is best to take from this state in this atari game?&quot; where D is a small discrete set of possible actions</li><li>&quot;What answer of less than 100 characters should I give to this question?&quot; where D is a large discrete set of possible answers</li><li>&quot;What answer of unbounded length should I give to this question?&quot; where D is an infinite discrete set</li><li>&quot;What is the probability that event E will happen tomorrow?&quot; where D is the continuous space of probabilities</li></ul><p>An update using imitation learning would have the form:</p><ul><li>X* = M1(Y)</li><li>For: number of samples</li><ul><li>Sample an answer X from D</li><li>Evaluate M2(&quot;How good is answer X to Y?&quot;)</li><li>If M2(&quot;How good is answer X to Y?&quot;) &gt; M2(&quot;How good is answer X* to Y?&quot;), then set X* = X</li></ul><li>Perform gradient descent to maximize the probability of outputting X*, using gradient ∇p_M(X*)</li></ul>
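To make the imitation-style update above concrete, here is a minimal sketch in Python (my own illustration, not code from the post), assuming a toy setup: a single fixed question Y, a small discrete answer space D, M1 as a softmax policy over D, and a fixed black-box scoring function standing in for the amplified overseer M2("How good is answer X to Y?"). The names `M1_probs`, `M2`, `true_scores` and `imitation_update` are all hypothetical.

```python
# Minimal sketch of the imitation-style update above -- an illustration only,
# not code from the post. Assumptions: one fixed question Y, a small discrete
# answer space D, M1 as a softmax policy over D parameterized by `logits`,
# and M2 as a fixed black-box scorer standing in for the amplified overseer.
import numpy as np

rng = np.random.default_rng(0)
D = 10                             # size of the answer space
logits = np.zeros(D)               # parameters of the current model M1(Y)
true_scores = rng.normal(size=D)   # toy stand-in for the overseer's ratings

def M1_probs(logits):
    """Stochastic policy p_M(X): softmax over the answer space."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def M2(x):
    """Toy stand-in for M2("How good is answer x to Y?")."""
    return true_scores[x]

def imitation_update(logits, n_samples=20, lr=0.5):
    p = M1_probs(logits)
    x_star = int(np.argmax(p))         # X* = M1(Y): the model's current answer
    for _ in range(n_samples):
        x = int(rng.integers(D))       # sample an answer X from D
        if M2(x) > M2(x_star):         # keep it if the overseer rates it higher
            x_star = x
    # Gradient ascent on p_M(X*): for a softmax policy,
    # d p(X*) / d logits = p(X*) * (one_hot(X*) - p)
    grad = p[x_star] * (np.eye(D)[x_star] - p)
    return logits + lr * grad          # increase the probability of X*
```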
\nabla p_M(X^*)</li></ul><p>An update using the REINFORCE policy gradient estimator would have the form:</p><ul><li>Sample X from a stochastic policy M1(Y)</li><li>Perform gradient descent using the gradient \text{M2("How good is answer X to Y?")} \cdot \nabla \log(p_M(X))</li></ul><p>If we have a perfect distillation algorithm, these both converge to \mathrm{argmax}_X(M2(X)) in the limit of infinite computation.</p><p><strong>Practical Differences</strong></p><p>Outside of this idealized situation, circumstances could make one or the other a better update to use.</p><p>The imitation update could converge more quickly if we have a good initialization for M(Y) from human data, as it bypasses the need to explore.</p>
<p>The imitation update could also be less surprising, using only processes that the humans originally demonstrated.</p><p>The REINFORCE update could converge more quickly if the human initialization is suboptimal, or if it&#x27;s hard to exactly reproduce the human demonstration.</p><p>In general, it seems like the system could use an algorithm that combines reinforcement learning updates with imitation learning updates, ie. <a href="https://arxiv.org/abs/1704.03732">Deep Q Learning from Demonstrations</a>.</p><p><strong>Returning to the original puzzle</strong></p><p>I don&#x27;t think the solution is that &quot;ability to output good actions at this timestep&quot; translates directly into &quot;ability to evaluate which actions are good&quot;. Rather, I think that the decomposition of &quot;evaluate which actions are good&quot; contains some questions which might perform a search over an answer space; the answers to these questions are improved by reinforcement learning, and this improves the evaluation of Atari actions. This can produce a model which uses a mix of imitation learning and reinforcement learning.</p><p>For example:</p><p>&quot;What is a good action to take from state S?&quot; could be learned to maximize &quot;How good is it to take action A from this state S?&quot;</p><p>&quot;How good is it to take action A from this state S?&quot; could be learned by imitating an amplified reasoner that asks the subquestion &quot;What is the most useful information to provide about the consequences of action A from state S?&quot;</p><p>&quot;What is the most useful information to provide about the consequences of action A from state S?&quot; could be learned to maximize &quot;How useful is information I about the consequences of action A in state S?&quot;</p><p>A modified version of the question, &quot;How good is it to take action A from this state S, and include an explanation of your reasoning?&quot; could also be reinforcement learned to maximize &quot;How good is the explanation of how good it is to take action A in state S?&quot;</p><p><strong>Concluding Thoughts</strong></p><p>Indeed, I think we could view <em>every</em> question answerable by an IA system as having the form &quot;select the answer to question Y that the overseer approves of most&quot;, and use both demonstrations from the amplified reasoner and the amplified reasoner&#x27;s evaluation to improve the answer. This perspective allows the system to learn to decompose problems better than the original humans could. But it might also cause problems if we can make a series of updates that cause the learned answering system to behave very differently from the original human demonstrators. We might want to be careful about the degree to which an RL-learned policy can differ from the original demonstration.</p><p>In terms of getting a system to be capable of doing some task, I&#x27;d be most optimistic about systems that could combine RL-IA and Imitation-IA depending on the situation.</p>
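<p>Restating this example as a sketch (the table below is purely illustrative; it only encodes the three questions above together with their training signals, and is not a description of any existing system):</p>

```python
# Hypothetical sketch restating the example above: each question in the decomposition
# gets either an imitation signal (copy the amplified reasoner's answer) or an RL
# signal (maximize the amplified answer to an evaluation question).

TRAINING_SCHEME = {
    "What is a good action to take from state S?":
        ("rl", "How good is it to take action A from this state S?"),
    "How good is it to take action A from this state S?":
        ("imitation", "amplified reasoner asks: What is the most useful information to "
                      "provide about the consequences of action A from state S?"),
    "What is the most useful information to provide about the consequences of action A from state S?":
        ("rl", "How useful is information I about the consequences of action A in state S?"),
}

def training_signal(question: str) -> str:
    """Describe how the model's answer to `question` would be trained in this sketch."""
    method, source = TRAINING_SCHEME[question]
    if method == "imitation":
        return f"imitate demonstrations ({source})"
    return f"REINFORCE against the amplified answer to: {source}"
```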
<p>But I still think there&#x27;s usefulness in thinking about the pure Imitation-IA perspective to try to reason about the alignment properties of the system.</p><p>(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)</p> william_s fq7Ehb2oWwXtZic8S 2019-02-09T00:56:08.256Z HCH is not just Mechanical Turk https://www.lesswrong.com/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk <p>HCH, introduced in <a href="https://ai-alignment.com/humans-consulting-hch-f893f6051455"><u>Humans consulting HCH</u></a>, is a computational model in which a human answers questions using answers to subquestions provided by other humans, who can in turn consult further humans, and so on. Each step in the process consists of a human taking in a question, optionally asking one or more subquestions of other humans, and returning an answer based on the answers to those subquestions. HCH can be used as a model for what Iterated Amplification would be able to do in the limit of infinite compute. HCH can also be used to decompose the question of &quot;is Iterated Amplification safe&quot; into “is HCH safe” and “If HCH is safe, will Iterated Amplification approximate the behaviour of HCH in a way that is also safe”. </p><p>I think there&#x27;s a way of interpreting HCH that leads to incorrect intuitions about why we would expect it to be safe. Here, I describe three models of how one could think HCH would work, and why we might expect them to be safe.</p><p><strong>Mechanical Turk:</strong> A human, Bob, is hired on Mechanical Turk to act as a component of HCH. Bob takes in some reasonable-length natural language question, formulates subquestions to ask other Turkers, and turns the responses from those Turkers into an answer to the original question. Bob only sees the question he is asked and thinks for a short period of time before asking subquestions or returning an answer. The question of &quot;is HCH corrigible&quot; is about &quot;how does the corrigibility of Bob translate into corrigibility of the overall system&quot;? To claim that HCH is safe in this scenario, we could point to Bob being well-intentioned, having human-like concepts and reasoning in a human-like way. Also, since Bob has to communicate in natural language with other humans, those communications could be monitored or reflected upon. We could claim that this keeps the reasoning that produces the answer within the space of reasoning that humans use, and so the process is more likely to reflect our values and less likely to yield unexpected outcomes that misinterpret our values.</p><p><strong>Lookup Table:</strong> An AI safety research team led by Alice writes down a set of 100 million possible queries that they claim capture all human reasoning. For each of these queries, they then write out the subquestions that would need to be asked, along with simple computer code that combines the answers to the subquestions into an answer to the original question. This produces a large lookup table, and the &quot;human&quot; in HCH is just a call to this lookup table. The question of &quot;is HCH corrigible&quot; is about &quot;has Alice&#x27;s team successfully designed a set of rules that perform corrigible reasoning&quot;?</p>
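<p>As a toy illustration of the Lookup Table model (the table contents, question format, and depth limit below are invented for illustration only):</p>

```python
# Toy sketch of HCH where the "human" H is a lookup table from a question to
# (subquestions, recipe for combining their answers). Entirely illustrative.

LOOKUP_TABLE = {
    "Is it safe to deploy system S?": (
        ["What could go wrong if S is deployed?", "Can each failure be mitigated?"],
        lambda answers: "yes" if answers[1] == "yes" else "no",
    ),
    "What could go wrong if S is deployed?": ([], lambda answers: "unknown failure modes"),
    "Can each failure be mitigated?": ([], lambda answers: "no"),
}

def hch(question: str, depth: int = 10) -> str:
    """Answer a question by consulting the lookup-table 'human', recursing on subquestions."""
    if depth == 0 or question not in LOOKUP_TABLE:
        return "cannot answer"
    subquestions, combine = LOOKUP_TABLE[question]
    sub_answers = [hch(q, depth - 1) for q in subquestions]
    return combine(sub_answers)

print(hch("Is it safe to deploy system S?"))  # -> "no" in this toy table
```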
<p>To justify the claim that these rules perform corrigible reasoning, we point to Alice&#x27;s team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.</p><p><strong>Overseer&#x27;s Manual:</strong> An AI safety research team led by Alice has written a manual on how to corrigibly answer questions by decomposing them into subquestions. This manual is handed to Bob, who was hired to decompose tasks. Bob carefully studies the manual and applies the rules in it when he is performing his task (and the quality of his work is monitored by the team). Alice&#x27;s team has carefully thought about how to decompose tasks, and <a href="https://ought.org/"><u>performed many experiments with people like Bob trying to decompose tasks</u></a>. So they understand the space of strategies and outputs that Bob will produce given the manual. The &quot;human&quot; in HCH is actually a human (Bob), but in effect Bob is acting as a compressed lookup table, and is only necessary because the lookup table is too large to write down. An analogy is that it would take too much space and time to write down a list of translations of all possible 10-word sentences from English to German, but it is possible to train humans who, given any 10-word English sentence, can produce the German translation. The safety properties come from Alice&#x27;s team&#x27;s preparations, which include modelling how Bob would produce answers after reading the manual. To justify the safety of the system, we again point to Alice&#x27;s team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.</p><p>I claim that the Mechanical Turk scenario gives an incomplete picture of why we might hope for an HCH system to be safe. Though it might be safer than a computation without human involvement, I would find it hard to trust that this system would continue to scale without running into problems, like handing over control deliberately or <u><a href="https://www.lesswrong.com/posts/SqcPWvvJJwwgZb6aH/prize-for-probable-problems#8dHK3dCwrjtHrqyae">accidentally</a></u> to some unsafe computational process. The Mechanical Turk scenario leaves out the design process that Alice’s team carries out in the Lookup Table and Overseer’s Manual scenarios, which can include at least some consideration of AI safety issues (though how much of this is necessary is an open question). I think this design process, if done right, is the thing that could give the system the ability to avoid these problems as it scales. I think that we should keep these stronger Lookup Table and Overseer’s Manual scenarios in mind when considering whether HCH might be safe.</p><p>(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)</p> william_s 4JuKoFguzuMrNn6Qr 2019-02-09T00:46:25.729Z Comment by William_S on Can there be an indescribable hellworld? https://www.lesswrong.com/posts/rArsypGqq49bk4iRr/can-there-be-an-indescribable-hellworld#bZmTdyXqNTwac5N2z <p>Do you think you&#x27;d agree with a claim of this form applied to corrigibility of plans/policies/actions? 
</p><p>That is: If some plan/policy/action is incorrigible, then A can provide some description of how the action is incorrigible.</p> william_s bZmTdyXqNTwac5N2z 2019-01-31T20:03:23.882Z Comment by William_S on Why we need a *theory* of human values https://www.lesswrong.com/posts/zvrZi95EHqJPxdgps/why-we-need-a-theory-of-human-values#mzWjjP4zghP4q3Coa <blockquote>The better we can solve the key questions (&quot;what are these &#x27;wiser&#x27; versions?&quot;, &quot;how is the whole setup designed?&quot;, &quot;what questions exactly is it trying to answer?&quot;), the better the wiser ourselves will be at their tasks.</blockquote><p>I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress, on these questions. I agree with that assessment, but on reading the post it felt like it was making the claim &quot;Unless we fully specify a correct theory of human values, we are doomed&quot;. </p><p>I think that I&#x27;d view something like Paul&#x27;s indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that&#x27;s in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human values (but any theoretical progress we can make in advance on defining human values is quite useful).</p> william_s mzWjjP4zghP4q3Coa 2018-12-29T00:01:46.830Z Comment by William_S on Three AI Safety Related Ideas https://www.lesswrong.com/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas#EMQ2XpHsgKJMtRuGC <p>My interpretation of what you&#x27;re saying here is that the overseer in step #1 can do a lot of things to bake in having the AI interpret &quot;help the user get what they really want&quot; in ways that get the AI to try to eliminate human safety problems for the step #2 user (possibly entirely), but problems might still occur in the short term before the AI is able to think/act to remove those safety problems.</p><p>It seems to me that this implies that IDA essentially solves the AI alignment portion of points 1 and 2 in the original post (modulo things happening before the AI is in control).</p> william_s EMQ2XpHsgKJMtRuGC 2018-12-20T21:36:23.176Z Comment by William_S on A comment on the IDA-AlphaGoZero metaphor; capabilities versus alignment https://www.lesswrong.com/posts/yXFKh2jGysQNfX2NM/a-comment-on-the-ida-alphagozero-metaphor-capabilities#PsRpTTaTaDcDQ6tit <p>Correcting all problems in the subsequent amplification stage would be a nice property to have, but I think IDA can still work even if it corrects errors with multiple A/D steps in between (as long as all catastrophic errors are caught before deployment). 
For example, I could think of the agent initially using some rules for how to solve math problems where distillation introduces some mistake, but later in the IDA process the agent learns how to rederive those rules and realizes the mistake.</p> william_s PsRpTTaTaDcDQ6tit 2018-07-19T21:41:18.523Z Comment by William_S on A general model of safety-oriented AI development https://www.lesswrong.com/posts/idb5Ppp9zghcichJ5/a-general-model-of-safety-oriented-ai-development#4gAePd76MdoTfH4oD <p>Shorter name candidates:</p><p>Inductively Aligned AI Development</p><p>Inductively Aligned AI Assistants</p><p></p> william_s 4gAePd76MdoTfH4oD 2018-06-13T20:21:35.174Z Comment by William_S on A general model of safety-oriented AI development https://www.lesswrong.com/posts/idb5Ppp9zghcichJ5/a-general-model-of-safety-oriented-ai-development#fSPQj44Sp594qx7iw <p>It&#x27;s a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (to highlight things like risks of the humans having access to some set of AI systems for manipulation or moral hazard reasons).</p> william_s fSPQj44Sp594qx7iw 2018-06-13T20:20:03.086Z Comment by William_S on Poker example: (not) deducing someone's preferences https://www.lesswrong.com/posts/b3TLbcwfJK5rmcwXc/poker-example-not-deducing-someone-s-preferences#DHjzfHWq63HDRbAyC <p>In the higher-dimensional belief/reward space, do you think that it would be possible to significantly narrow down the space of possibilities (so this argument is saying &quot;be Bayesian with respect to rewards/beliefs, picking policies that work over a distribution&quot;), or are you more pessimistic than that, thinking that the uncertainty would be so great in higher-dimensional spaces that it would not be possible to pick a good policy?</p> william_s DHjzfHWq63HDRbAyC 2018-06-13T18:53:25.062Z Comment by William_S on Amplification Discussion Notes https://www.lesswrong.com/posts/LbJHizyfAsDYeETBq/amplification-discussion-notes#5cyk2GfAetSD9keAD <p>Open Question: Working with concepts that the human can’t understand</p><p>Question: When we need to assemble complex concepts by learning/interacting with the environment, rather than using H&#x27;s concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?</p><p>Paul: I don&#x27;t have any general answer to this, seems like we should probably choose some example cases. I&#x27;m probably going to be advocating something like &quot;Search over a bunch of possible concepts and find one that does what you want / has the desired properties.&quot;</p><p>E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like &quot;What would I infer from learning that a proof is &#x60;elegant&#x60; other than that it will work&quot; and make sure that you are OK with that.</p><p>Andreas: Suppose you don&#x27;t have the concepts of &quot;proof&quot; and &quot;inquiry&quot;, but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I&#x27;m trying to see in more detail that you can do a good job at &quot;making sure you&#x27;re OK with reasoning in ways X&quot; in cases where X is far removed from H&#x27;s concepts. 
(Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)</p><p>This may be related to the more general question of what sorts of instructions you&#x27;d give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.</p> william_s 5cyk2GfAetSD9keAD 2018-06-01T19:04:19.114Z Comment by William_S on Amplification Discussion Notes https://www.lesswrong.com/posts/LbJHizyfAsDYeETBq/amplification-discussion-notes#QCvx5eMw2S2WZBjQ5 <p>Open Question: Severity of “Honest Mistakes”</p><p>In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.</p><p>Paul: I also want to make the further claim that such failures are much less concerning than what-I&#x27;m-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).</p> william_s QCvx5eMw2S2WZBjQ5 2018-06-01T19:04:01.100Z Amplification Discussion Notes https://www.lesswrong.com/posts/LbJHizyfAsDYeETBq/amplification-discussion-notes <p>Paul Christiano, Wei Dai, Andreas Stuhlmüller and I had an online chat discussion recently; <u><a href="https://dynalist.io/d/YGwuNf1YmxOmMMMg4f8t98Sq">the transcript of the discussion is available here</a></u>. (Disclaimer that it’s a nonstandard format and we weren&#x27;t optimizing for ease of understanding the transcript). This discussion was primarily focused on amplification of humans (not later amplification steps in IDA). Below are some highlights from the discussion, and I’ll include some questions that were raised that might merit further discussion in the comments.</p><h2>Highlights</h2><p><strong>Strategies for sampling from a human distribution of solutions:</strong></p><blockquote>Paul: For example you can use &quot;Use random human example,&quot; or &quot;find an analogy to another example you know and use it to generate an example,&quot; or whatever. </blockquote><blockquote>There is some subtlety there, where you want to train the model that sample from the real human distribution rather than from the empirical distribution of 10 proposals you happen to have collected so far. If samples are cheap that&#x27;s fine. Otherwise you may need to go further to &quot;Given that [X1, X2, ...] are successful designs, what is a procedure that can produce additional successful designs?&quot; or something like that. Not sure.</blockquote><p><strong>Dealing with unknown concepts</strong></p><blockquote>Andreas: Suppose you get a top-level command that contains words that H doesn&#x27;t understand (or just doesn&#x27;t look at), say something like &quot;Gyre a farbled bleg.&quot;. You have access to some data source that is in principle enough to learn the meanings of those words. 
What might the first few levels of questions + answers look like?</blockquote><blockquote>Paul: possible questions: &quot;What&#x27;s the meaning of the command&quot;, which goes to &quot;What&#x27;s the meaning of word X&quot; for the words X in the sentence, &quot;What idiomatic constructions are involved in this sentence?&quot;, &quot;What grammatical constructions are involved in the sentence&quot;</blockquote><blockquote>Answers to those questions are big trees representing meanings, e.g. a list of properties of &quot;gyre&quot; (what properties the subject and object typically have, under what conditions it is said to have occurred, why someone might want you to do it, tons of stuff most of which will be irrelevant for the query)</blockquote><blockquote>Which come from looking up definitions, proposing definitions and seeing how well they match with usage in the cases you can look at, etc.</blockquote><p><strong>Limits on what amplification can accomplish</strong></p><blockquote>Paul: In general, if ML can&#x27;t learn to do a task, then that&#x27;s fine with me. And if ML can learn to do a task but only using data source X, then we are going to have to integrate data source X into the amplification process in order for amplification to be able to solve it, there is no way to remove the dependence on arbitrary data sources. And there will exist data sources which pose alignment issues, independent of any alignment issues posed by the ML.</blockquote><p><strong>Alignment search for creative solutions</strong></p><p>The task of generating a solution to a problem that requires creativity can be decomposed into:</p><p>Generate solutions</p><p>Evaluate those solutions</p><p>For solution generation, one idea is to shape the distribution of proposals so you are less likely to get malign answers (ie. sample from the distribution of answers a human would give, which would hopefully be more likely to be safe/easily evaluated compared to some arbitrary distribution). </p><p>I asked Paul if he thought that safe creative solution generation would require sampling from a less malign distribution, or whether he thought we could solve evaluation (“secure-X-evaluation”, ie. testing whether the solution fulfills property X) well enough to use an arbitrary distribution/brute force search.</p><blockquote>Paul: I don&#x27;t see a good way to avoid solving secure X-evaluation anyway. It seems to me like we can generate solutions in ways that put much lower probability on malign answers, but it neither seems like we can totally eliminate that (I don&#x27;t think human creativity totally eliminates that either), nor that we will always have access to some more-aligned human generator </blockquote><blockquote>The best I&#x27;d probably say is that we can have a generation process that is not itself malign, not clear if that is helpful at all though.</blockquote><p>We then dived into how well we could solve secure X-evaluation. I was particularly interested in questions like how we could evaluate whether a design had potentially harmful side-effects.</p><blockquote>Paul: I think what we want is something like: if the designing process knows that X is bad, then the evaluator will also know it. 
If the designing process doesn&#x27;t know that X is bad, then that&#x27;s not malign.</blockquote><blockquote>[to be clear, for this discussion we only need security in the infinite limit; in practice the capability of both sides will be limited by the capability of the ML, so we&#x27;ll also need something to make sure the evaluating-ML does better than the generator-ML, but that seems like a separate issue.]</blockquote><blockquote>William: If you imagine slowly increasing the intelligence of the generator, then for any heuristic, it might first start picking solutions that fulfill that heuristic more often before actually understanding the heuristic, and it might take longer after that before the generator understands that the heuristic works because of a causal pathway that involves negative side effects. Is it the case that you&#x27;d say that this is an acceptable outcome/something that we can&#x27;t really get past?</blockquote><blockquote>Paul: If neither the evaluator nor generator knows about the negative side effect, it&#x27;s hard for the negative side effect to lead to higher evaluations. I agree this can happen sometimes (I wrote the implicit extortion post to give an example, there are certainly others), but they seem OK to accept as &quot;honest mistakes&quot; so far, i.e. none of them pose an existential risk.</blockquote><blockquote>in terms of &quot;what amplification is supposed to accomplish,&quot; if there is a problem that could just as well afflict a human who is trying their best to help me get what I want (and has the AI&#x27;s profile of abilities), then I&#x27;m basically considering that out of scope.</blockquote><blockquote>Whatever we could tell to a human, to help them avoid this kind of problem, we could also tell to an aligned AI, so the problem is factored into (a) help a human avoid the problem, (b) build aligned AI.</blockquote> william_s LbJHizyfAsDYeETBq 2018-06-01T19:03:35.294Z Comment by William_S on Challenges to Christiano’s capability amplification proposal https://www.lesswrong.com/posts/S7csET9CgBtpi7sCh/challenges-to-christiano-s-capability-amplification-proposal#aLvBXSQsaLWjRfZd5 <blockquote>I would solve X-and-only-X in two steps:</blockquote><blockquote>First, given an agent and an action which has been optimized for undesirable consequence Y, we&#x27;d like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I&#x27;m calling informed oversight.</blockquote><blockquote>Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won&#x27;t, or enough understanding of its internals that you can see why it won&#x27;t. 
This is discussed in “<u><a href="https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99">Techniques for Optimizing Worst-Case Performance</a></u>.”</blockquote><p>Paul, I&#x27;m curious whether you&#x27;d see it as necessary for these techniques to work that the optimization target is pretty good/safe (but not perfect): ie. some safety comes from the fact that the agents optimized for approval or imitation only have a limited class of Y&#x27;s that they might also end up being optimized for.</p> william_s aLvBXSQsaLWjRfZd5 2018-05-26T22:58:13.323Z Comment by William_S on Challenges to Christiano’s capability amplification proposal https://www.lesswrong.com/posts/S7csET9CgBtpi7sCh/challenges-to-christiano-s-capability-amplification-proposal#8tDzGKpDuz72oZxJs <blockquote>So I also don&#x27;t see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.</blockquote><p>My model of Paul&#x27;s approach sees the alignment of the subagents as just telling you that no subagent is trying to actively sabotage your system (ie. by optimizing to find the worst possible answer to give you), and that the alignment comes from having thought carefully about how the subagents are supposed to act in advance (in a way that could potentially be run just by using a lookup table).</p> william_s 8tDzGKpDuz72oZxJs 2018-05-26T22:54:32.134Z Comment by William_S on Resolving human values, completely and adequately https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately#g6fhNfNeBKsRooYsx <p>Glad to see this work on possible structure for representing human values, which can include disagreement between values and structured biases.</p><p>I had some half-formed ideas vaguely related to this, which I think map onto an alternative way to resolve self-reference.</p><p>Rather than just having one level of values that can refer to other values on the same level (which potentially leads to a self-reference cycle), you could instead explicitly represent each level of value, with level 0 values referring to concrete reward functions, level 1 values endorsing or negatively endorsing level 0 values, and generally level n values only endorsing or negatively endorsing level n-1 values. This might mean that you have some kinds of values that end up being duplicated between multiple levels. For any n, there&#x27;s a unique solution to the level of endorsement for every concrete value. We can then consider the limit as n-&gt;infinity as the true level of endorsement. This allows for situations where the limit fails to converge (ie. it alternates between different values at odd and even levels), which seems like a way to handle self-reference contradictions (possibly also the all-or-nothing problem if it results from a conflict between meta-levels). </p><p>I think this maps into the case where we don&#x27;t distinguish between value levels if we define a function that just adjusts the endorsement of each value by the values that directly refer to it.</p>
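<p>A rough sketch of what this adjustment could look like (the value graph, signs, and update rule below are hypothetical choices made purely for illustration):</p>

```python
# Toy sketch of the endorsement-adjustment idea: values refer to other values with a
# positive or negative sign, and one step of the update nudges each value's weight by
# the signed weights of the values that directly refer to it. (A real version would
# presumably normalize or damp the update; none of this is specified in the comment.)

# endorsements[a] = {b: +1 or -1}: value `a` endorses (+1) or negatively endorses (-1) value `b`.
endorsements = {
    "honesty-meta": {"always-tell-truth": +1, "white-lies-ok": -1},
    "kindness-meta": {"white-lies-ok": +1},
}

def step(weights):
    """One application of the adjustment function."""
    new = dict(weights)
    for source, targets in endorsements.items():
        for target, sign in targets.items():
            new[target] = new[target] + sign * weights[source]
    return new

weights = {"always-tell-truth": 1.0, "white-lies-ok": 1.0, "honesty-meta": 1.0, "kindness-meta": 1.0}
for _ in range(10):   # repeated application; oscillation here mirrors the non-convergent case
    weights = step(weights)
```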
<p>Then iterating this function n times gives the equivalent of having an n-level meta-hierarchy.</p><p>I think there might be interesting work in mapping this strategy into some simple value problem, and then trying to perform Bayesian value learning in that setting with some reasonable prior over values/value endorsements.</p> william_s g6fhNfNeBKsRooYsx 2018-05-16T18:23:05.370Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#8tioNjbyKpeF5wbuM <p>Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it&#x27;s better to think of this as reducing the size of fully worked initial overseer example problems that can be produced for training/increasing the number of amplification rounds that are needed.</p><p>So my argument is more an example of what a distilled overseer could learn as an efficient approximation.</p> william_s 8tioNjbyKpeF5wbuM 2018-04-24T18:18:04.818Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#CDCmzPbp9Lgz8vcdn <p>I guess what we&#x27;re trying to unpack is &quot;the mechanism that makes decisions from that database&quot;, and whether it can be efficient. If you store all experience with no editing, you&#x27;re left with the same problem of &quot;what decisions in the tree do you change based on experience (X,y) and how do you change them?&quot; And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).</p><p>If you summarize experience (what I&#x27;m interpreting &quot;decide how to update some small sketch&quot; as meaning), then the question is can you produce that summary in an efficient way, and efficiently use it at runtime?</p><p>The gradient-descent style solution is meant to gesture at how this could be done in a way analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.</p><p>Rough example of how I see the overall process going:</p><p>1. The amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable</p><p>2. First time it acts, some overseer node N looks into the database for any information related to the vase, and returns nothing, and proceeds with its decision</p><p>3. The vase breaks, and the negative feedback is given to the amplified overseer</p><p>4. Amplified overseer runs some process to figure out that it could have made a better decision if the database contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is, can this step be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process)</p><p>5. Second time it acts, the amplified overseer process is the same, up until the node N that does the database lookup is run. 
This time, the lookup returns that the vase is breakable, so the overseer runs a different decision path and treats the vase with more care.</p> william_s CDCmzPbp9Lgz8vcdn 2018-04-24T16:42:01.486Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#hge2qutvMDEKgHvf9 <blockquote>What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out?</blockquote><p>I think you&#x27;d need to form the decomposition in such a way that you could fix any problem through perturbing something in the world representation (an extreme version is you have the method for performing every operation contained in the world representation and looked up, so you can adjust it in the future).</p><blockquote> When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is &quot;most responsible&quot; for the error, right? If you did this with meta-execution, wouldn&#x27;t it take an exponential amount of time?</blockquote><p>One step of this method, as in backprop, has the same time complexity as the forward pass (running meta-execution forward, which I wouldn&#x27;t call exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and need to do a constant amount of work at each node (figure out all the ways to perturb the node&#x27;s input).</p><p>The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.</p><blockquote>And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn&#x27;t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)</blockquote><p>The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this it might be a bit more difficult - if you start to try out different combinations to see if they work, that&#x27;s where you&#x27;d get exponential complexity. But we&#x27;d get to counter this by potentially having cases where, based on understanding the operation, we could intelligently avoid some branches - I think this could potentially wash out to linear complexity in the number of forward nodes if it all works well.</p><blockquote>I wonder if we&#x27;re on the right track at all, or if Paul has an entirely different idea about this.</blockquote><p>So do I :)</p> william_s hge2qutvMDEKgHvf9 2018-04-23T23:57:13.728Z Comment by William_S on Can corrigibility be learned safely? 
https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#w6wPPkH8PwPRH9RWp <p>Huh, I hadn&#x27;t thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment that seems like an interesting way to approach it.</p><blockquote>A human debugging a translation software could look at the return value of some high-level function and ask &quot;is this return value sensible&quot; using their own linguistic intuition, and then if the answer is &quot;no&quot;, trace the execution of that function and ask the same question about each of the function it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don&#x27;t see any way this kind of learning / error correction could work.</blockquote><p></p><p>I think instead of asking &quot;is this return value sensible&quot;, the debugging overseer process could start with some computation node where it knows what the return value should be (the final answer), and look at each of the subqueries of that node and ask for each subquery &quot;how can I modify this subquery&#x27;s answer to make the query&#x27;s answer more correct&quot;, then recurse into the subquery. This seems pretty analogous to gradient descent, with the potential advantage that the overseer&#x27;s understanding of the function at each node could be better than naively taking the gradient (understanding the operation could yield something that takes into account higher-order terms in the operation).</p><p>I&#x27;m curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.</p> william_s w6wPPkH8PwPRH9RWp 2018-04-23T18:33:51.958Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#5v7Q6iJptCEH7upeH <blockquote>What if the field of linguistics as a whole is wrong about some concept or technique, and as a result all of the humans are wrong about that? It doesn&#x27;t seem like using different random seeds would help, and there may not be another approach that can be taken that avoids that concept/technique.</blockquote><p>Yeah, I don&#x27;t think simple randomness would recover from this level of failure (only that it would help with some kinds of errors, where we can sample from a distribution that doesn&#x27;t make that error sometimes). I don&#x27;t know if anything could recover from this error in the middle of a computation without reinventing the entire field of linguistics from scratch, which might be too much to ask. However, I think it could be possible to recover from this error if you get feedback about the final output being wrong.</p><p></p><blockquote>But in IDA, H is fixed and there&#x27;s no obvious way to figure out which parts of a large task decomposition tree was responsible for the badly translated sentence and therefore need to be changed for next time.</blockquote><p>I think that the IDA task decomposition tree could be created in such a way that you can reasonably trace back which part was responsible for the misunderstanding/that needs to be changed. The structure you&#x27;d need for this is that given a query, you can figure out which of its children would need to be corrected to get the correct result.</p>
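<p>A toy sketch of this trace-back process (the node structure and the idea of asking for corrections per child are stand-ins for "ask the overseer how to perturb this node's inputs"; this is not meta-execution itself):</p>

```python
# Toy sketch: starting from a node whose correct output is known, record the correction,
# ask which children's answers would need to change, and recurse (each node visited once).
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Node:
    question: str
    answer: str
    children: List["Node"] = field(default_factory=list)
    # Given a corrected answer for this node, return {child_index: corrected child answer};
    # None means this node has no idea / is a leaf.
    propose_child_corrections: Optional[Callable[[str], Dict[int, str]]] = None

def trace_back(node: Node, corrected_answer: str, corrections: Dict[str, str]) -> None:
    """Record the correction for this node, then recurse into the children that
    the overseer thinks need different answers (analogous to one backprop pass)."""
    corrections[node.question] = corrected_answer
    if node.propose_child_corrections is None:
        return
    for child_index, child_correction in node.propose_child_corrections(corrected_answer).items():
        trace_back(node.children[child_index], child_correction, corrections)

# Usage sketch: after learning the correct final answer, run
#   corrections = {}
#   trace_back(root, correct_final_answer, corrections)
# and apply `corrections` to whatever store (e.g. the database of facts) feeds those nodes.
```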
<p>So if you have a specific word to correct, you can find the subagent that generated that word, then look at its inputs, see which input is correct, trace where that came from, etc. This might need to be deliberately engineered into the task decomposition (in the same way that differently written programs accomplishing the same task could be easier or harder to debug).</p> william_s 5v7Q6iJptCEH7upeH 2018-04-23T15:55:45.522Z Understanding Iterated Distillation and Amplification: Claims and Oversight https://www.lesswrong.com/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims <p>[Background: Intended for an audience that has some familiarity with Paul Christiano’s approach to AI Alignment. Understanding <u><a href="https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616">Iterated Distillation and Amplification</a></u> should provide sufficient background.]</p><p>[Disclaimer: When I talk about “what Paul claims”, I am only summarizing what I think he means through reading his blog and participating in discussions on his posts. I could be mistaken/misleading in these claims]</p><p>I’ve recently updated my mental model of how Paul Christiano’s approach to AI alignment works, based on recent blog posts and discussions around them (in which I found Wei Dai’s comments particularly useful). I think that the update that I made might be easy to miss if you haven’t read the right posts/comments, so I think it’s useful to lay it out here. I cover two parts: understanding the limits on what Paul’s approach claims to accomplish, and understanding the role of the overseer in Paul’s approach. These considerations are important to understand if you’re trying to evaluate how likely this approach is to work, or trying to make technical progress on it.</p><h2>What does Paul’s approach claim to accomplish?</h2><p>First, it’s important to understand what “Paul’s approach to AI alignment” claims to accomplish if it were carried out. The term “approach to AI alignment” can sound like it means “recipe for building a superintelligence that safely solves all of your problems”, but this is not how Paul intends to use this term. Paul goes into this in more detail in <u><a href="https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6">Clarifying “AI alignment”</a></u>.</p><p>A rough summary is that his approach will only build an agent that is as capable as some known unaligned machine learning algorithm. </p><p>He does not claim that the end result of his approach is an agent that:</p><ul><li>Can directly solve all problems which can be solved by a human</li><li>Will never take an unsafe catastrophic action</li><li>Will never take an action based on a misunderstanding of your commands or your values</li><li>Could safely design successor agents or self-improve</li><li>Will have higher capability than an unaligned competitor</li></ul><p>It’s important to understand the limits of what Paul’s approach claims in order to understand what it would accomplish, and the strategic situation that would result.</p><h2>What is the Overseer?</h2><p><u><a href="https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616">Iterated Distillation and Amplification</a></u> (IDA) describes a procedure that tries to take an overseer and produce an agent that does what the overseer would want it to do, with a reasonable amount of training overhead. “What the overseer would want it to do” is defined by repeating the amplification procedure. 
The post refers to amplification as the overseer using a number of machine-learned assistants to solve problems. We can bound what IDA could accomplish by thinking about what the overseer could do if it could delegate to a number of copies of itself to solve problems (for a human overseer, this corresponds to HCH). To understand what this approach can accomplish, it’s important to understand what the overseer is doing. I think there are two different models of the overseer that could be inferred from different parts of the discussion around Paul’s work, which I label high bandwidth oversight and low bandwidth oversight.</p><h2>High Bandwidth Overseer</h2><p>The impression that I get from many of Paul’s posts is the following:</p><p>A high bandwidth overseer is a human that takes in an input and has some amount of time (between 15 minutes and a day) to process it. The human can look at as much of the input as it can within the allowed time, and when it delegates a sub-task it can look at the results. The main requirement for a human acting as an overseer is that they are trying to be helpful. The overseer can take in arbitrary natural-language requests and attempt to fulfill them.</p><p>The main question that comes to mind considering this model of the overseer is the problem of competent task decomposition:</p><p>Can we break down tasks into reasonably sized pieces (ie. can you solve a math problem from a textbook in 15 minutes by delegating to a number of people who don’t understand the material and only have 15 minutes each to interact with the textbook)?</p><p>This high bandwidth oversight model can be easier to use when trying to get a rough idea of how oversight works in order to understand how IDA works as a whole.</p><h2>Low Bandwidth Overseer</h2><p>However, looking at <u><a href="https://ai-alignment.com/universality-and-security-amplification-551b314a3bab">Universality and Security Amplification</a></u> I found a different picture of what the overseer looks like:</p><p>A low bandwidth overseer is only allowed to process inputs from a finite input set. This includes both the original input query, and the results of subqueries. The overseer can pass information about the input or from a subquery to other copies of itself without looking at them.</p><p>To give a sense for the size of the input set, in <u><a href="https://ai-alignment.com/universality-and-security-amplification-551b314a3bab">Universality and Security Amplification</a></u>, Paul speculates that input sets of size 10⁸ would probably be safe and input sets of size 10³⁰ would probably be unsafe.</p><p>To give a sense of what this implies, the range of input sizes 10⁸ to 10³⁰ roughly corresponds to (a rough numerical check of these ranges is sketched after this list):</p><ul><li>6 to 24 pixels of arbitrary 4-bit greyscale images</li><li>10 to 38 words from typical English text</li><li>arbitrary phrases of 2 to 7 words from a vocabulary of 10000 words</li><li>arbitrary strings of 5 to 20 lowercase alphabetic characters</li></ul><p>A source of potential confusion is that Paul still talks about a human fulfilling the role of the low bandwidth overseer (there are reasons for this that I won’t cover here). But when the space of information the human overseer can consider is reduced to a finite set, we could simply evaluate what the human does on every element in that set and produce a lookup table that replaces the human.</p>
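<p>A quick numerical check of three of these ranges (pure counting; the typical-English-text case depends on an entropy-per-word assumption and is omitted):</p>

```python
# Rough check of the counting behind three of the bullets above.
import math

def log10_count(alphabet_size: int, length: int) -> float:
    """log10 of the number of arbitrary sequences of `length` symbols."""
    return length * math.log10(alphabet_size)

print(log10_count(16, 6), log10_count(16, 24))        # 4-bit greyscale pixels: ~10^7.2 to ~10^28.9
print(log10_count(10_000, 2), log10_count(10_000, 7)) # phrases from a 10,000-word vocabulary: 10^8 to 10^28
print(log10_count(26, 5), log10_count(26, 20))        # lowercase alphabetic strings: ~10^7.1 to ~10^28.3
```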
<p>In other words, if you don’t think that some task could be accomplished by an amplified lookup table as overseer, then you shouldn’t think it could be accomplished with a low bandwidth overseer.</p><p>This system is introduced in order to avoid security issues as the system scales. The fear is that there would be an “attack” on the system: an input that could be shown to an overseer that would cause the overseer to become corrupted and try to sabotage the system. This could be some kind of misleading philosophical argument, some form of blackmail, a human adversarial example, etc. If an input like this exists, then as soon as the first agent is corrupted, it can try to spread the attack to other agents. The first agent could be corrupted either by chance, or through an attack being included in the input. Paul’s proposal in <u><a href="https://ai-alignment.com/universality-and-security-amplification-551b314a3bab">Universality and Security Amplification</a></u> is to avoid this by limiting the set of inputs that the overseer can process, and then use various techniques to verify that the overseer is not attackable on that set of inputs. I read this as implying that he thinks that a high bandwidth overseer will be sufficiently vulnerable to attack that after some number of rounds of amplification, the overseer would be likely to become misaligned.</p><h3>How does a low bandwidth overseer impact the difficulty of IDA?</h3><p>Previously, I was mostly starting with the model of a high bandwidth overseer when thinking about IDA. I now think that Paul’s approach is intended to be using low bandwidth oversight, and that reasoning about the difficulty of the approach requires using this model.</p><p>Using a low bandwidth overseer more clearly requires solving a set of hard problems prior to using IDA at all:</p><ul><li>Figuring out how to decompose tasks into low-bandwidth pieces. The task decomposition needs to be both corrigible and capable. </li><ul><li>Corrigibility requires understanding of AI safety concepts. For example, breaking down the task “What action does the user want me to take?” into the two subtasks “What are the user’s values?” and “What action is best according to these values?” is not corrigible. It produces an action optimised for some approximate model of the user’s values, which could be misaligned.</li><li>Capability requires having the ability to complete “enough” tasks. Paul acknowledges that the overseer will no longer be able to directly perform some tasks as well as a human (ie. they won’t be able to translate sentences as well if they only are able to look at small pieces of the original text). But in this case, he claims that it would be okay for the overseer to instead understand how to safely use information provided by external resources (ie. human translators).</li></ul><li>Clarifying the currently imprecise definition of “corrigibility” in order to be able to reason about whether a given task decomposition is corrigible.</li><li>Sufficiently understanding meta-philosophy to implement it explicitly using collaborations of “lookup table overseers”. Even if we have an overseer that is corrigible and capable of performing well-understood and well-defined tasks, it’s not clear that if we ask it to start evaluating tricky philosophical questions it will come up with aligned answers. This roughly looks like figuring out an intractably large algorithm that will output the products of MIRI’s Highly Reliable Agent Design research (ie. 
produce a set of criteria for what makes a good decision theory, a search space of possible decision theories, a prior over that space).</li></ul><p>When considering the high bandwidth oversight regime, these problems seem more like they could be (at least partially) delegated to the high bandwidth overseer at runtime. For example, one could think of preparing a summary of the relevant factors of a decision and giving it to a highly competent human overseer. The human overseer might be able to make good enough decisions without having an explicit model of decision theory (especially if odd edge cases don’t come up in practice). </p><h2>Evaluating IDA</h2><p>There are a number of lines of reasoning that one could use to try to get a sense of whether IDA could work, or how it should be prioritized relative to other approaches. Here, I summarize some of these considerations and indicate which versions of oversight they apply to, as I think this is important to be clear about. I don’t make claims about how these should be weighed, or claim to be exhaustive (there are additional considerations that would weigh on both forms of oversight, or other parts of IDA).</p><h3>Evaluating High Bandwidth Oversight</h3><p><strong>Positive Considerations</strong></p><ul><li>Reasons for general optimism about “learning to reason from humans”, such as those that Daniel Dewey discusses in <u><a href="http://effective-altruism.com/ea/1ca/my_current_thoughts_on_miris_highly_reliable/">his thoughts on MIRI&#x27;s &quot;highly reliable agent design&quot; work</a></u>. For example, the individual overseers are acting like humans. This would lead the system to consider problems in the way that humans would, incorporating common sense about human motivations, coming up with solutions that are easier for humans to understand and less likely to contain weird surprises, and operating on inputs that are understandable.</li><li>If it seems like high bandwidth task decomposition could solve difficult problems (which is the question that Ought is running experiments to investigate)</li><li>It seems like some version of IDA with high bandwidth oversight could be implemented today</li><li>High bandwidth oversight seems like it could be implemented without solving many of the hard philosophical/mathematical problems that are faced by HRAD (decision theory, induction, etc.), instead trading them for difficult machine learning problems. </li></ul><p><strong>Negative Consideration: Not Scalable and Secure</strong></p><p>In high bandwidth oversight, if one copy of the overseer is corrupted, this corruption could easily spread. As more copies of the overseer are used to solve a problem, there is a higher chance of one copy becoming corrupted. Or an adversary could introduce a corrupting input into the system.</p><h3>Evaluating Low Bandwidth Oversight</h3><p><strong>Positive Considerations</strong></p><ul><li>Paul’s intuitions suggest that it is possible to work with low bandwidth oversight, and he has discussed some aspects of how various problems might be handled.</li><li>If it seems plausible that meta-philosophy might be easier than object-level philosophy - ie. 
defining an algorithm that we can’t run directly but that outputs the correct decision theory might be easier than specifying the correct decision theory directly.</li><li>Paul claims that this approach will scale with increasing capability for any distillation algorithm.</li></ul><p><strong>Negative Consideration: No longer “learning to reason from humans”</strong></p><p>I think that IDA with a low bandwidth overseer is not accurately described as “AI learns to reason from humans”, but rather as “Humans figure out how to reason explicitly, then the AI learns from the explicit reasoning”. As Wei Dai has pointed out, amplified low bandwidth oversight will not actually end up reasoning like a human. Humans have implicit knowledge that helps them perform tasks when they see the whole task. But not all of this knowledge can be understood and broken into smaller pieces. Low bandwidth oversight requires that the overseer not use any of this knowledge.</p><p>Now, it’s quite possible that performance could still be recovered by doing things like searching over a solution space, or by reasoning about when it is safe to use training data from insecure humans. But these solutions could look quite different from human reasoning. In discussion on Universality and Security Amplification, Paul describes why he thinks that a low bandwidth overseer could still perform image classification, but the process looks very different from a human using their visual system to interpret the image:</p><p>“I’ve now played three rounds of the following game (inspired by Geoffrey Irving who has been thinking about debate): two debaters try to convince a judge about the contents of an image, e.g. by saying “It’s a cat because it has pointy ears.” To justify these claims, they make still simpler claims, like “The left ears is approximately separated from the background by two lines that meet at a 60 degree angle.” And so on. Ultimately if the debaters disagree about the contents of a single pixel then the judge is allowed to look at that pixel. This seems to give you a tree to reduce high-level claims about the image to low-level claims (which can be followed in reverse by amplification to classify the image). I believe the honest debater can quite easily win this game, and that this pretty strongly suggests that amplification will be able to classify the image.”</p>
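<p>A heavily simplified sketch of the final step of this game (illustrative toy code only, not Paul’s or Geoffrey Irving’s actual setup; the recursive reduction from high-level claims down to individual pixels is elided). Once the debaters disagree about a single pixel, the judge checks that pixel directly, so the honest debater wins the ground-level dispute:</p><pre><code>image = [[0, 0, 1],
         [0, 1, 1],
         [1, 1, 1]]   # the judge never looks at the whole image at once

def honest_debater(pixel):
    row, col = pixel
    return image[row][col]        # reports the disputed pixel truthfully

def dishonest_debater(pixel):
    row, col = pixel
    return 1 - image[row][col]    # lies about the disputed pixel

def judge(disputed_pixel, claim_a, claim_b):
    # The judge resolves the disagreement by looking at that one pixel only.
    row, col = disputed_pixel
    truth = image[row][col]
    return "debater A wins" if claim_a == truth else "debater B wins"

# A high-level disagreement ("mostly ones" vs "mostly zeros") gets reduced,
# claim by claim, to a dispute about one pixel, which the judge can settle.
disputed = (0, 2)
print(judge(disputed, honest_debater(disputed), dishonest_debater(disputed)))
</code></pre>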
<h2>Conclusion: Weighing Evidence for IDA</h2><p>The important takeaway is that considering IDA requires clarifying whether you are considering IDA with high or low bandwidth oversight. Then, only count considerations that actually apply to that approach. I think there’s a way to misunderstand the approach where you mostly think about high bandwidth oversight and count the fact that it feels somewhat understandable, seems plausible to you that it could work, and appears to avoid some hard problems. But if you then also count Paul’s opinion that it could work, you may be overconfident - the approach that Paul claims is most likely to work is the low bandwidth oversight approach.</p><p>Additionally, I think it’s useful to consider both models as alternative tools for understanding oversight: for example, the problems in low bandwidth oversight might be less obvious but still important to consider in the high bandwidth oversight regime.</p><p>After understanding this, I am more nervous about whether Paul’s approach would work if implemented, due to the additional complications of working with low bandwidth oversight. I am somewhat optimistic that further work (such as fleshing out how particular problems could be addressed through low bandwidth oversight) will shed light on this issue, and either make it seem more likely to succeed or yield more understanding of why it won’t succeed. I’m also still optimistic about Paul’s approach yielding ideas or insights that could be useful for designing aligned AIs in different ways.</p><h2>Caveat: high bandwidth oversight could still be useful to work on</h2><p>High bandwidth oversight could still be useful to work on for the following reasons:</p><ul><li>If you think that other solutions could be found to the security problem in high bandwidth oversight. Paul claims that low bandwidth oversight is the most likely solution to security issues within the overseer, but he thinks it may be possible to make IDA with high bandwidth oversight secure using various techniques for <u><a href="https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99">optimizing worst-case performance</a></u> on the final distilled agent, even if the overseer is insecure (see <u><a href="https://ai-alignment.com/two-guarantees-c4c03a6b434f">https://ai-alignment.com/two-guarantees-c4c03a6b434f</a></u>).</li><li>It could help make progress on low bandwidth oversight. If high bandwidth oversight fails, then so will low bandwidth oversight. If high bandwidth oversight succeeds, then we might be able to break down each of the subtasks into low bandwidth tasks, directly yielding a low bandwidth overseer. I think the <u><a href="https://ought.org/projects/factored-cognition">factored cognition experiments</a></u> planned by Ought plausibly fall into this category.</li><li>If you think it could be used as a medium-term alignment solution or a fallback plan if no other alignment approach is ready in time. This seems like it would only work if it is used for limited tasks and a limited amount of time, in order to extend the time window for preparing a truly scalable approach. 
In this scenario, it would be very useful to have techniques that could help us understand how far the approach could be scaled before failure.</li></ul> william_s yxzrKb2vFXRkwndQ4 2018-04-17T22:36:29.562Z Comment by William_S on Utility versus Reward function: partial equivalence https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence#fKAPHmjRbuYgQBChW <p>Ah, misunderstood that, thanks.</p> william_s fKAPHmjRbuYgQBChW 2018-04-16T17:03:02.145Z Comment by William_S on Utility versus Reward function: partial equivalence https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence#DL3EayMZDBs9QNJ42 <p>Say w2a is the world where the agent starts in w2 and w2b is the world that results after the agent moves from w1 to w2.</p><p>Without considering the agent&#x27;s memory as part of the world, it seems like the problem is worse: the only way to distinguish between w2a and w2b is the agent&#x27;s memory of past events, so it seems that leaving the agent&#x27;s memory of the past out of the utility function requires U(w2a) = U(w2b).</p> william_s DL3EayMZDBs9QNJ42 2018-04-16T15:08:30.412Z Comment by William_S on Two clarifications about "Strategic Background" https://www.lesswrong.com/posts/hL9ennoEfJXMj7r2D/two-clarifications-about-strategic-background#qpRYqwQkSE9bBZEhg <p>Would you think that the following approach would fit within &quot;in addition to making alignment your top priority and working really hard to over-engineer your system for safety, <em>also</em> build the system to have the bare minimum of capabilities&quot; and possibly work, or would you think that it would be hopelessly doomed?</p><ul><li>Work hard on designing the system to be safe</li><li>But there&#x27;s some problem left over that you haven&#x27;t been able to fully solve, and think will manifest at a certain scale (level of intelligence/optimization power/capabilities)</li><li>Run the system, but limit scale to stay well within the range where you expect it to behave well</li></ul> william_s qpRYqwQkSE9bBZEhg 2018-04-15T17:01:39.498Z Comment by William_S on Utility versus Reward function: partial equivalence https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence#dAxkvohviWFNoZoXM <p>I&#x27;m trying to wrap my head around the case where there are two worlds, w1 and w2; w2 is better than w1, but moving from w1 to w2 is bad (ie. killing everyone and replacing them with different people who are happier, and we think this is bad).</p><p>I think for the equivalence to work in this case, the utility function U also needs to depend on your current state - if it&#x27;s the same for all states, then the agent would always prefer to move from w1 to w2 and erase its memory of the past when maximizing the utility function, whereas it would act correctly with the reward function.</p> william_s dAxkvohviWFNoZoXM 2018-04-13T16:02:55.058Z
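<p>A toy worked example with made-up numbers (not from the thread) of the point in the two comments above: a utility function that looks only at the current world state is forced to treat w2a and w2b the same, so it endorses the move-and-forget plan, whereas a reward function defined on the transition itself can penalise it:</p><pre><code>U = {"w1": 1.0, "w2a": 2.0, "w2b": 2.0}   # forced: U(w2a) == U(w2b)

R = {("w1", "w1"): 0.0,                   # staying put is neutral
     ("w1", "w2b"): -10.0,                # the killing-and-replacing move is judged bad
     ("w2a", "w2a"): 0.0}

options_from_w1 = {"stay": "w1", "move_and_forget": "w2b"}

best_by_utility = max(options_from_w1, key=lambda a: U[options_from_w1[a]])
best_by_reward = max(options_from_w1, key=lambda a: R[("w1", options_from_w1[a])])

print(best_by_utility)   # "move_and_forget" -- the unwanted behaviour
print(best_by_reward)    # "stay"            -- matches the intended judgement
</code></pre>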
Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#dduayscqDFjzbrxFb <blockquote>how does IDA recover from an error on H&#x27;s part?</blockquote><p>Error recovery could be supported by having a parent agent run multiple versions of a query in parallel with different approaches (or different random seeds).</p><blockquote>And also, how does it improve itself using external feedback </blockquote><p>I think this could be implemented as: part of the input for a task is a set of information on background knowledge relevant to the task (ie. a model of what the user wants, background information about translating the language). The agent can have a task &quot;Update [background knowledge] after receiving [feedback] after providing [output] for task [input]&quot;, which outputs a modified version of [background knowledge], based on the feedback.</p> william_s dduayscqDFjzbrxFb 2018-04-10T15:09:59.476Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#5KobFNgwauuM3nFWv <blockquote>The only way I know how to accomplish this is to have IDA emulate the deep learning translator at a very low level, with H acting as a &quot;human transistor&quot; or maybe a &quot;human neuron&quot;, and totally ignore what H knows about translation including the meanings of words.</blockquote><p>The human can understand the meaning of the word they see; they just can&#x27;t know the context (the words that they don&#x27;t see), and so can&#x27;t use their understanding of that context. </p><p>They could try to guess possible contexts for the word and leverage their understanding of those contexts (&quot;what are some examples of sentences where the word could be used ambiguously?&quot;), but they aren&#x27;t allowed to know if any of their guesses actually apply to the text they are currently working on (and so their answer is independent of the actual text they are currently working on).</p> william_s 5KobFNgwauuM3nFWv 2018-04-09T17:11:24.636Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#zenwbzDTKoBNhn92G <p>Okay, I agree that we&#x27;re on the same page. Amplify(X,n) is what I had in mind.</p> william_s zenwbzDTKoBNhn92G 2018-04-07T14:40:23.513Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#wrKz4FcY2zakkBDsc <p>I was thinking of things more in line with Paul&#x27;s version, not this finding-ambiguity definition, where the goal is to avoid doing some kind of malign optimization during search (ie. an untrained assistant thinks it&#x27;s a good idea to use the universal prior, then you show them <a href="https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/">What does the universal prior actually look like?</a>, and afterwards they know not to do that).</p> william_s wrKz4FcY2zakkBDsc 2018-04-05T20:29:22.676Z
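<p>A toy sketch (made-up function names, not from the thread) of the two mechanisms described in the comment above on recovering from an error on H&#x27;s part and improving from external feedback: running several copies of a subquery in parallel and reconciling the answers, and a task that returns an updated copy of [background knowledge] given [feedback]:</p><pre><code>import random
from collections import Counter

def run_copy(agent, query, seed):
    rng = random.Random(seed)            # each copy gets a different seed/approach
    return agent(query, rng)

def reconcile(answers):
    # The parent agent reconciles the parallel attempts; majority vote here.
    return Counter(answers).most_common(1)[0][0]

def robust_query(agent, query, copies=5):
    return reconcile([run_copy(agent, query, seed) for seed in range(copies)])

def update_background_knowledge(knowledge, task_input, output, feedback):
    # "Update [background knowledge] after receiving [feedback] after providing
    # [output] for task [input]" -- returns a modified copy of the knowledge.
    updated = dict(knowledge)
    updated["history"] = list(knowledge.get("history", [])) + [
        {"input": task_input, "output": output, "feedback": feedback}]
    return updated

# Example: an agent that occasionally errs; the parallel vote usually recovers.
def flaky_agent(query, rng):
    return rng.choice(["right", "right", "right", "right", "wrong"])

print(robust_query(flaky_agent, "some subquestion"))
</code></pre>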
Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#4zArriiZx2A7yoxMq <blockquote>Can you give an example of natural language instruction (for humans operating on small inputs) that can&#x27;t be turned into a formal algorithm easily?</blockquote><p>Any set of natural language instructions for humans operating on small inputs can be turned into a lookup table by executing the human on all possible inputs (multiple times on each input, if you want to capture a stochastic policy).</p><p>For example, present the human with the following: &quot;Consider the sentence [s1] w [s2]&quot;, and have the agent launch queries of the form &quot;Consider the sentence [s1] w [s2], where we take w to have meaning m&quot;. Now, you could easily produce this behaviour algorithmically if you have a dictionary. But in a world without dictionaries, suitably preparing a human to answer this query takes much less effort than producing a dictionary.</p> william_s 4zArriiZx2A7yoxMq 2018-04-04T22:04:23.262Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#5R4EQZumh7YMKwnHF <blockquote>By &quot;corrigible&quot; here did you mean Paul&#x27;s definition which doesn&#x27;t include competence in modeling the user and detecting ambiguities, or what we thought &quot;corrigible&quot; meant (where it does include those things)?</blockquote><p>Thinking of &quot;corrigible&quot; as &quot;whatever Paul means when he says corrigible&quot;. The idea applies to any notion of corrigibility which allows for multiple actions and does not demand that the action returned be the best possible one for the user.</p> william_s 5R4EQZumh7YMKwnHF 2018-04-04T21:52:57.420Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#58BiCCunnNptn2dhQ <p>What is the difference between &quot;core after a small number of amplification steps&quot; and &quot;core after a large number of amplification steps&quot; that isn&#x27;t captured in &quot;larger effective computing power&quot; or &quot;larger set of information about the world&quot;, and allows the highly amplified core to solve these problems?</p> william_s 58BiCCunnNptn2dhQ 2018-04-04T19:49:11.861Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#tNkv3Dcis7bjTF2Ka <blockquote>I don&#x27;t have optimism about finding a core which is already highly competent at these tasks.</blockquote><p>I&#x27;m a little confused about what this statement means. I thought that if you have an overseer that implements some reasoning core, and consider amplify(overseer) with infinite computation time and unlimited ability to query the world (ie. 
for background information on what humans seem to want, how they behave, etc.), then amplify(overseer) should be able to solve any problem that an agent produced by iterating IDA could solve.</p><p>Did you mean to say that:</p><ul><li>&quot;already highly competent at these tasks&quot; means that the core should be able to solve these problems without querying the world at all, and this is not likely to be possible?</li><li>you don&#x27;t expect to find a core such that only one round of amplification of amplify(overseer) can solve practical tasks in any reasonable amount of time/number of queries?</li><li>there is some other way that the agent produced by IDA would be more competent than the original amplified overseer?</li></ul> william_s tNkv3Dcis7bjTF2Ka 2018-04-04T17:45:01.393Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#EZeCRbAWawu2fHnyr <blockquote>Among people I&#x27;ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.</blockquote><p>Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (ie. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but that can seem to be plausible things to talk about in terms of &quot;solving the AI alignment problem&quot; if one hasn&#x27;t understood the more subtle problems that would occur. It&#x27;s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.</p><p>(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted to accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, ie. the experiments that Ought is running.)</p> william_s EZeCRbAWawu2fHnyr 2018-04-04T15:45:26.875Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#nR5X6pnuYPLPjGgSc <p>I would see the benefits of humans vs. algorithms being that giving a human a bunch of natural language instructions would be much easier (but harder to verify) than writing down a formal algorithm. Also, the training could just cover how to avoid taking incorrigible actions, and the Overseer could still use their judgement of how to perform competently within the space of corrigible outputs.</p> william_s nR5X6pnuYPLPjGgSc 2018-04-03T19:10:32.587Z Comment by William_S on Can corrigibility be learned safely? 
https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#XERa63K5H9cAawoeb <p>Trying to understand the boundary lines around incorrigibility, looking again at this example from Universality and Security Amplification:</p><blockquote>For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.</blockquote><p>From this, it sounds like this only counts as incorrigible if the optimization in “What behavior is best according to those values?” is effectively optimizing for something that the user doesn&#x27;t want, but is not incorrigible if it is optimizing for something that the user doesn&#x27;t want in a way that the user can easily correct? (so incorrigibility requires something more than just being malign)</p><p>One way to describe this is that the decomposition is incorrigible if the models of the user that are used in “What behavior is best according to those values?” are better than the models used in “What does the user want?” (as this could lead the AI to maximize an approximation V* of the user&#x27;s values V and realize that if the AI reveals to the user that they are maximizing V*, the user will try to correct what the AI is doing, which will perform worse on V*).</p><p>So acceptable situations are where both subqueries get the same user models, the first subquery gets a better user model than the second, or the situation where “What behavior is best according to those values?” is performing some form of mild optimization. Is that roughly correct?</p> william_s XERa63K5H9cAawoeb 2018-04-03T19:08:22.234Z Improbable Oversight, An Attempt at Informed Oversight https://www.lesswrong.com/posts/5bd75cc58225bf06703751d5/improbable-oversight-an-attempt-at-informed-oversight william_s 5bd75cc58225bf06703751d5 2017-05-24T17:43:53.000Z Informed Oversight through Generalizing Explanations https://www.lesswrong.com/posts/5bd75cc58225bf06703753ab/informed-oversight-through-generalizing-explanations william_s 5bd75cc58225bf06703753ab 2017-05-24T17:43:39.000Z Proposal for an Implementable Toy Model of Informed Oversight https://www.lesswrong.com/posts/5bd75cc58225bf06703753ac/proposal-for-an-implementable-toy-model-of-informed-oversight william_s 5bd75cc58225bf06703753ac 2017-05-24T17:43:13.000Z