Comments
I got a bit lost in understanding your exit plan. You write:
"My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them"
Some questions about this and the text that comes after it:
- How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you're describing here. What techniques are you imagining using?
- Why do these AIs need to be human-obsoleting? Why not just human-accelerating?
- Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control "human-obsoleting" AIs?
- Why do you "expect that ruling out egregious misalignment is the hardest part in practice"? That seems pretty counterintuitive to me. It's easy to imagine descendants of today's models that don't do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn't be "philosophically competent".
- What are you buying time to do? I don't understand how you're proposing spending the "3 years of time prior to needing to build substantially superhuman AIs". Is it on alignment for those superhuman AIs?
- You mention having 3 years, but then you say "More generally, it just seems really heuristically scary to very quickly go from AIs which aren't much smarter than the best humans to AIs which are wildly smarter in only a few years." I found this confusing.
- What do you mean by "a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point"? It seems easier to mitigate which risks prior to what point? And why? I didn't follow this.
In the past, you've treated catching an AI escape attempt as a "win condition" for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention. You wrote that having these two tools would "substantially" change the situation w.r.t. the model's chance of success.
But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much.
Is this a change in your views? Or have you always felt that catching an AI trying to escape substantially changes that AI's chance of success without substantially changing the overall level of AI risk?
Separately, another takeaway of this post: it seems valuable to work on research that allows you to provide evidence about the alignment of other labs' AIs, e.g. better technical work on searching for inputs that trigger competent misaligned behavior.
Thanks, edited my post to reference this (lmk if you understand what's happening here better than I do)
Thanks, this is a helpful comment. Fixed the typo
Edit: The situation has evolved but is still somewhat confusing. There is now a leaderboard of scores on the public test set that Ryan is #1 on (see here). But this tweet from Jack Cole indicates that his (many-month-old) solution gets a higher score on the public test set than Ryan's top score on that leaderboard. I'm not really sure what's going on here:
- Why isn't Jack's solution on the public leaderboard?
- Is the semi-public test set the same as the old private set?
- If not, is it equal in difficulty to the public test set, or the harder private test set?
- Here it says "New high scores are accepted when the semi-private and public evaluation sets are in good agreement". What does that mean?
One important caveat to the presentation of results in this post (and the discussion on Twitter) is that there are reasons to think this approach may not be SOTA, as it performs similarly to the prior best-performing approach when tested apples-to-apples, i.e. on the same problems.
There are three sets of ARC problems: the public training set, the public eval set, and the private eval set.
- Buck and Ryan got 71% on the first, 51% on the second, and [we don't know] on the third.
- The past SOTA got [we don't know] on the first, 52% on the second, and 34% on the third.
- Humans get 85% on the first, [we don't know] on the second, and [we don't know] on the third.
My two main deductions from this are:
- It's very misleading to compare human perf on the train set and AI perf on either of the test sets; the test sets seem way harder! Note that 71% is approaching 85%, so it seems like AIs are not far from human perf when you compare apples-to-apples. So graphs from the ARC folks like the one showing little progress towards human-level perf on this page are not scientifically valid.
- Buck and Ryan's approach doesn't exceed the past AI SOTA on the only apples-to-apples comparison we have so far. Unclear if it will beat it on the private test set.
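To make the apples-to-apples point concrete, here is a minimal sketch that tabulates the scores quoted above and lists only the problem sets on which two entrants both have a known score. The names `SCORES` and `shared_sets` are made up for illustration, and `None` stands in for "[we don't know]".

```python
from typing import Dict, Optional, Tuple

# Scores (percent) exactly as quoted in the comment above.
# None marks an entry listed as "[we don't know]".
SCORES: Dict[str, Dict[str, Optional[float]]] = {
    "Buck and Ryan": {"public train": 71, "public eval": 51, "private eval": None},
    "past SOTA": {"public train": None, "public eval": 52, "private eval": 34},
    "humans": {"public train": 85, "public eval": None, "private eval": None},
}


def shared_sets(a: str, b: str) -> Dict[str, Tuple[float, float]]:
    """Return the problem sets on which both entrants have a known score."""
    return {
        name: (SCORES[a][name], SCORES[b][name])
        for name in SCORES[a]
        if SCORES[a][name] is not None and SCORES[b][name] is not None
    }


if __name__ == "__main__":
    for pair in [("Buck and Ryan", "past SOTA"),
                 ("Buck and Ryan", "humans"),
                 ("past SOTA", "humans")]:
        print(pair, shared_sets(*pair))
```

With these numbers, the only like-for-like comparison between Buck and Ryan and the past SOTA is the public eval set (51 vs. 52), the only one against humans is the public training set (71 vs. 85), and the past SOTA and humans share no set at all.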
Apparently, lots of people get better performance on the public test set than the private one, which is a little surprising given that if you read this page from the ARC folks, you'll see the following:
"The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.
The public evaluation sets and the private test sets are intended to be the same difficulty."
Two explanations come to mind: maybe the public and private test sets are not IID, and/or maybe past SOTA methods overfit to the public set. Chollet claims it's (accidentally) the latter here, but he doesn't rule out the former. He says the tasks across the two public test sets are meant to be equally hard for a human, but he doesn't say they're divided in an IID manner.
I guess we'll see how the results on the public leaderboard shake out.
(Expanding on a tweet)
What are the considerations around whether to structure the debate to permit the judge to abstain (as Michael et al. do, by allowing the judge to end the round with low credence) versus forcing the judge to pick an answer each time? Are there pros/cons to each approach? Any arguments about the similarity of one or the other to the real AI debates that might be held in the future?
It's possible I'm misremembering/misunderstanding the protocols used for the debate here/in that other paper.
"Follow the right people on Twitter" is probably the best option. People will often post Twitter threads explaining new papers they put out. There's also stuff like:
- News put together by CAIS: https://newsletter.mlsafety.org/ and https://newsletter.safe.ai/ and https://twitter.com/topofmlsafety
- News put together by Daniel Paleka: https://newsletter.danielpaleka.com/ and twitter summaries like https://twitter.com/dpaleka/status/1664617835178631170
I appreciate you transcribing these interviews, William!
Did/will this happen?
I've been loving your optimization posts so far; thanks for writing them. I've been feeling confused about this topic for a while and feel like "being able to answer any question about optimization" would be hugely valuable for me.
Thanks, fixed
We're expecting familiarity with PyTorch, unlike MLAB. The level of Python background expected is otherwise similar. The bar will vary somewhat depending on each applicant's other traits, e.g. mathematical and empirical-science backgrounds.
30 min, 45 min, 20-30 min (respectively)
The video link in the PDF doesn't work.
Confusion:
You write "Only PaLM looks better than Chinchilla here, mostly because it trained on 780B tokens instead of 300B or fewer, plus a small (!) boost from its larger size."
But earlier you write:
"Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.
It's 70B params, trained on 1.4T tokens of data"
300B vs. 1.4T. Is this an error?
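For context on the "same training compute cost as Gopher" claim in the quoted text, here is a rough sketch using the common rule of thumb that training compute is about 6 × parameters × tokens. The Chinchilla figures (70B params, 1.4T tokens) are from the quote above. The Gopher figures (280B params, 300B tokens) are not stated in this comment and are an assumption brought in for illustration, as is the helper name `approx_train_flops`. This only checks the compute claim; it doesn't settle whether "300B or fewer" is an error.

```python
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute via the C ~= 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


# Chinchilla: figures quoted in the comment above.
chinchilla_flops = approx_train_flops(70e9, 1.4e12)

# Gopher: ASSUMED figures (280B params, 300B tokens), not given in this comment.
gopher_flops = approx_train_flops(280e9, 300e9)

# Ratio of the two compute estimates; a value near 1 means similar budgets.
ratio = chinchilla_flops / gopher_flops

if __name__ == "__main__":
    print(f"Chinchilla: {chinchilla_flops:.3g} FLOPs")
    print(f"Gopher:     {gopher_flops:.3g} FLOPs")
    print(f"Ratio:      {ratio:.2f}")
```

Under these assumptions the two estimates land within roughly 20% of each other, which fits "same training compute cost" with the budget split very differently between parameters and tokens.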
I agree with your description of the hassle of eating veg when away from home. The point I was trying to make is that buying hunted meat seems possibly ethically preferable to veganism on animal welfare grounds, would address Richard's nutritional concerns, and would also satisfy meat cravings.
Of course, this only works if you condition on the brutality of nature as the counterfactual. But for the time being, that won't change.
I was thinking yesterday that I'm surprised more EAs don't hunt or eat lots of mail-ordered hunted meat, like, e.g., this. Regardless of whether you think nature should exist in the long term, as it stands the average deer, for example, has a pretty harsh life and death. Studies like this on American white-tailed deer enumerate the alternative modes of death, which I find universally unappealing. You've got predation (which, surprisingly to me, is the number one cause of death for fawns), car accidents, disease, and starvation. These all seem orders of magnitude worse than being killed by a hunter with a good shot.
I'd assume human hunting basically trades off against predation and starvation, so the overall quantity of deer and deer consciousness isn't affected much by hunting. The more humans kill, the fewer coyotes kill.
Edit: So it seems to me that buying hunted meat/encouraging hunting might have a better animal welfare profile than veganism, while also satisfying Richard's concerns about nutrition and satisfying meat cravings. That being said, it is not really scalable in the way veg*ism is.