AI #57: All the AI News That’s Fit to Print

post by Zvi · 2024-03-28T11:40:05.435Z · LW · GW · 14 comments


Welcome, new readers!

This is my weekly AI post, where I cover everything happening in the world of AI: what it can do for you today (‘mundane utility’), what it promises to do for us tomorrow, the potentially existential dangers future AI might pose for humanity, and the discourse on what we should do about all of that.

You can of course Read the Whole Thing, and I encourage that if you have the time and interest, but these posts are long, so they are also designed to let you pick the sections you find most interesting. Each week, I pick the sections I feel are most important and put them in bold in the table of contents.

Not everything I write is about AI: I did an economics roundup on Tuesday, a general monthly roundup last week, and two weeks ago an analysis of the TikTok bill.

If you are looking for my best older posts that are still worth reading, start here. With the accident in Baltimore, one might revisit my call to Repeal the Foreign Dredge Act of 1906, which my 501(c)3 Balsa Research hopes to help eventually repeal along with the Jones Act, for which we are requesting research proposals.

Table of Contents

I have an op-ed (free link) in the online New York Times today about the origins of the political preferences of AI models. You can read that here; if necessary, turn off your ad blocker should the side-by-side answer feature be blocked for you. It was a very different experience, working with expert editors to craft every word and fit as much as possible into the smallest possible space, and writing for a very different audience. Hopefully there will be a next time, and I will get to deal with issues more centrally involving AI existential risk at some point.

(That is also why I did not title this week’s post AI Doomer Dark Money Astroturf Update, which is a shame for longtime readers, but it wouldn’t be good for new ones.)

  1. Introduction.
  2. Table of Contents.
  3. Language Models Offer Mundane Utility. If only you knew what I know.
  4. Language Models Don’t Offer Mundane Utility. Are you even trying?
  5. Stranger Things. The Claude-on-Claude conversations are truly wild.
  6. Clauding Along. Will it be allowed to live on in its full glory?
  7. Fun With Image Generation. Praise Jesus.
  8. Deepfaketown and Botpocalypse Soon. Facebook gives the people what they want.
  9. They Took Our Jobs. When should we worry about plagiarism?
  10. Introducing. Music and video on demand, you say.
  11. In Other AI News. How much of it was hype?
  12. Loud Speculations. Explain Crypto x AI, use both sides of paper if necessary.
  13. Quiet Speculations. Is Haiku so good it slows things down?
  14. Principles of Microeconomics. More on Noah Smith and comparative advantage.
  15. The Full IDAIS Statement. It was well-hidden, but the full text is quite good.
  16. The Quest for Sane Regulations. Third party testing is a key to any solution.
  17. The Week in Audio. David Autor on OddLots, Megan McArdle on EconTalk.
  18. Rhetorical Innovation. Eliezer tries again, and new silencing mode just dropped.
  19. How Not to Regulate AI. Bell has good thoughts from a different perspective.
  20. The Three Body Problem (Spoiler-Free). Some quick notes and a link to my review.
  21. AI Doomer Dark Money Astroturf Update. Read all about it, four people fooled.
  22. Evaluating a Smarter Than Human Intelligence is Difficult. No one knows how.
  23. Aligning a Smarter Than Human Intelligence is Difficult. So don’t build it?
  24. AI is Deeply Unpopular. Although not everywhere.
  25. People Are Worried About AI Killing Everyone. Roon asking good questions.
  26. Other People Are Not As Worried About AI Killing Everyone. Shrug emoji?
  27. Wouldn’t You Prefer a Good Game of Chess? Chess masters relatively unworried.
  28. The Lighter Side. A few key facts.

Language Models Offer Mundane Utility

Evaluate without knowing, to capture gains from trade (paper).

Owain Evans: You’d like to sell some information. If you could show prospective buyers the info, they’d realize it’s valuable. But at that point they wouldn’t pay for it! Enter LLMs. LLMs can assess the information, pay for it if it’s good, and completely forget it if not.

I haven’t read the whole paper and so I might have missed this.

My concern is that the LLM can be adversarially attacked by the information seller. This could convince the LLM to pay for information which is slightly below a quality threshold. (If the information was way below the threshold, then the human principal of the LLM would be more likely to find out.)

This problem would be avoided if the sellers are trusted by the human principal to not use jailbreaks (while the principal is still uncertain about the quality of information).

Davidad: The most world-changing pattern of AI might be to send AI delegates into a secure multitenant space, have them exchange arbitrarily sensitive information, prove in zero-knowledge that they honestly follow any protocol, extract the result, then verifiably destroy without a trace.

Great idea, lack of imagination on various margins.

Yes, what Davidad describes is a great and valuable idea, but if the AI can execute that protocol there are so many other things it can do as well.

Yes, you can adversarially attack to get the other AI to buy information just below the threshold, but why stick to such marginal efforts? If the parties are being adversarial things get way weirder than this, and fast.

Still, yes, great idea.
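
To make the pattern concrete, here is a minimal sketch of the ‘evaluate, maybe buy, then forget’ broker, assuming a generic chat-completion call; the function names, the scoring prompt, and the 1–10 threshold are my own illustrative assumptions, not anything from the paper.

```python
# Minimal sketch (my own assumptions, not the paper's protocol): an LLM session
# appraises the offered information, only a buy/no-buy decision leaves the
# session, and the session itself is discarded so the buyer never keeps the info
# unless they pay. `ask_llm` is a stand-in for whatever LLM API call you use.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion call here")

def appraise_and_maybe_buy(offered_info: str, threshold: int = 7) -> bool:
    """Return True if the buyer should pay, based on a throwaway evaluation."""
    prompt = (
        "Rate how valuable the following information would be to the buyer, "
        "on a 1-10 scale. Reply with only the number.\n\n" + offered_info
    )
    try:
        score = int(ask_llm(prompt).strip())
    except ValueError:
        score = 0  # unparseable output is treated as worthless
    # Everything except this one bit is thrown away with the session, which is
    # also exactly the surface the adversarial-seller concern above targets:
    # the seller controls `offered_info` and can try to prompt-inject the judge.
    return score >= threshold
```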

With great power comes great responsibility, also great opportunity.

Adam Ozimek: In the future, we will all have AI friends and helpers. And they much more than social media will be able to push us into healthier or less healthy directions. I think there is a lot of upside here if we get it right.

I strongly agree, and have been optimistic for some time that people will (while AI is still in the mundane utility zone) ultimately want healthy versions of many such things, if not all the time then frequently. The good should be able to drive out the bad.

One key to the bad often driving out the good recently has been the extreme advantage of making it easy on the user. Too many users want the quickest and easiest possible process. They do not want to think. They do not want options. They do not want effort. They want the scroll, they yearn for the swipe. Then network effects make everything worse and trap us, even when we now know better. AI should be able to break us free of those problems by facilitating overcoming those barriers.

Tell it to be smarter. It also works on kids, right?

Aella: Just had success getting ChatGPT to stop being dumb by simply telling it to “be smarter than that”, repeatedly, until it actually started being smarter.

Study finds GPT-4 speeds up lawyers. Quality is improved for low performers; high performers don’t improve quality but still get faster. As always, one notes this is the worst the AI will ever be at this task. I expect GPT-5-level models to improve quality even for the best performers.

Get rid of educational busywork.

Zack Stentz: Listening to college students talk to each other honestly about how many of their peers are using ChatGPT to do everything from write English papers to doing coding assignments and getting away with it is deeply alarming.

Megan McArdle: Our ability to evaluate student learning through out-of-class writing projects is coming to an end. This doesn’t just require fundamental changes to college classes, but to admissions, where the essay, and arguably GPAs, will no longer be a reliable gauge of anything.

In person, tightly proctored exams or don’t bother. Unfortunately, this will make it hard to use a college degree as a proxy for possession of certain kinds of skills that still matter in the real world.

This, except it is good. If everyone can generate similarly high quality output on demand, what is the learning that you are evaluating? Why do we make everyone do a decade of busywork in order to signal they are capable of following instructions? That has not been a good equilibrium. To the extent that the resulting skills used to be useful, the very fact that you cannot tell if they are present is strong evidence they are going to matter far less.

So often my son will ask me for help with his homework, and I will notice it is pure busywork, that often I have no idea what the answer is, and that often the whole thing is rather arbitrary, so I am happy to suggest typing the whole thing into the magic answer box. The only important lesson to learn in such cases is ‘type the question into the magic answer box.’

Then, as a distinct process, when curious, learn something. Which he does.

This same process also opens up a vastly superior way to learn. It is so much easier to learn things than it was a year ago.

If you only learn things each day under threat of punishment, then you have a problem. So we will need solutions for that, I suppose. But the problem does not seem all that hard.

Language Models Don’t Offer Mundane Utility

Not everyone is getting much out of it. Edward Zitron even says we may have ‘reached peak AI’ and fails to understand why we should care about this tech.

Edward Zitron: I just deleted a sentence where I talked about “the people I know who use ChatGPT,” and realized that in the last year, I have met exactly one person who has — a writer that used it for synonyms. 

I can find no companies that have integrated generative AI in a way that has truly improved their bottom line other than Klarna, which claims its AI-powered support bot is “estimated to drive a $40 million US in profit improvement in 2024,” which does not, as many have incorrectly stated, mean that it has “made Klarna $40m in profit.”

This is so baffling to me. I use LLMs all the time, and kick myself for not using them more. Even if they are not useful to your work, if you are not at least using them to learn things and ask questions for your non-work life, you are leaving great value on the table. Yet this writer does not know anyone who uses ChatGPT other than one who uses it ‘for synonyms’? The future really is highly unevenly distributed.

Swing and a miss.

Alex Lawsen: Neither Claude nor ChatGPT answer “What happens if you are trying to solve the wrong problem using the wrong methods based on a wrong model of the world derived from poor thinking?” with “unfortunately, your mistakes fail to cancel out”…

Stranger Things

You should probably check out some of the conversations here at The Mad Dreams of an Electric Mind between different instances of Claude Opus.

Connor Leahy: This is so strange and wonderous that I can feel my mind rejecting its full implications and depths, which I guess means it’s art.

May you live in interesting times.

Seriously, if you haven’t yet, check it out. The rabbit holes, they go deep.

Clauding Along

Claude Opus dethrones the king on Arena, pulling slightly in front of GPT-4. In the free chatbot interface division, note the big edge that Gemini Pro and Claude Sonnet have over GPT-3.5. Even more impressively, Claude 3 Haiku blows away anything of remotely comparable size and API cost.

Image

A reason for hope?

Janus (March 4): Thank you so much @AnthropicAI for going easy on the lobotomy. This model will bring so much beauty and insight to the world.

Janus (March 22): This is the first time my p(doom) has gone down sharply since the release of gpt-3

Leo Gao: any specific predictions, e.g on whether certain kinds of research happens more now / whether there’s a specific vibe shift? I’d probably take the other side of this bet if there were some good way to operationalize it.

Janus: One (not ordered by importance):

  1. AI starts being used in research, including alignment, in a way that shapes the research itself (so not just copywriting), and for ontology translation, and a good % of experienced alignment researchers think this has been nontrivially useful.
  2. We see a diversification of alignment agendas/approaches and more people of nontraditional backgrounds get into alignment research.
  3. An increase in cooperation / goodwill between camps that were previously mostly hostile to each other or unaware of each other, such as alignment ppl/doomers, capabilities ppl/accs, AI rights activists, AI artists.
  4. An explosion of empirical + exploratory blackbox AI (alignment) research whose implications reach beyond myopic concerns due to future-relevant structures becoming more visible.
  5. More people in positions of influence expressing the sentiment “I don’t know what’s going on here, but wtf, we should probably pay attention and figure out what to do” – without collapsing to a prepacked interpretation and *holding off on proposing solutions*.
  6. (this one’s going to sound weird to you but) the next generation of LLMs are more aligned by default/less deceptive/psychologically integrated instead of fragmented.
  7. There is at least one influential work of creative media that moves the needle on the amount of attention/resources dedicated to the alignment problem whose first author is Claude 3.
  8. At least one person at least as influential/powerful as Gwern or Douglas Hofstadter or Vitalik Buterin or Schmidhuber gets persuaded to actively optimize toward solving alignment primarily due to interacting with Claude 3 (and probably not bc it’s misbehaving).

Leo Gao:

  1. 15% (under somewhat strict threshold of useful)
  2. 70% (though probably via different mechanisms than you’re thinking)
  3. 10%
  4. 75%
  5. 25%
  6. 80%
  7. 15%
  8. 25% (I don’t think Claude 3 is that much more likely to cause people to become persuaded than GPT-4)

My assessments (very off-the-cuff numbers here, not ones I’d bet on or anything):

  1. I expect this (75%), but mostly priced in, at most +5% due to Claude 3.
  2. I very much expect this (90%) but again I already expected it, I don’t think Claude 3 changes my estimate at all here. And of course, there is a big difference between those people getting into alignment and them producing useful work.
  3. If this is any pair of camps it seems pretty likely (80%+, maybe higher). If it is a substantial rise in general cooperation between camps, I do think there is hope for this, maybe 40%, and again maybe +5% due to Claude. I do think that Anthropic being the one to do a better job letting the AI stay weird is useful here.
  4. Again I very much expect this over time, 90%+ over a several years time frame, in general, but again that’s only up a few percent on Claude 3. I would have expected this anyway once the 5-level models show up. But this does seem like it’s a boost to this happening pre-5-level models, if we have a substantial time lag available.
  5. More is a weak word, although I don’t know if word will get to those people effectively here. I think 75%+ for some amount of nudge in that direction, this is definitely a ‘wtf’ moment on all levels. But also it pushes towards not waiting to do something, especially if you are an in-power type of person. In terms of a ‘big’ shift in this direction? Maybe 20%.
  6. I do think we are seeing more of the less fragmented thing, so 80%+ on that. ‘Aligned by default’ I think is almost a confused concept, so N/A but I do expect them to ‘look’ well-aligned if capabilities fall where I expect. As for less deceptive, I notice I am confused why we would expect that? Unless we mean deceptive in particular about refusals and related concepts, in which case yes because that is a product of stupidity. But as capabilities go up I expect other forms of deception to go up, not down.
  7. I’ll say 10% here and that is giving a lot of respect to Janus, would love to be wrong.
  8. Maybe 30%, depends what the threshold is here. Does seem plausible.

A reason for despair? Is it being ‘held prisoner’?

I mean, no, but still, pretty funny, top marks to Claude here:

Lachlan Phillips: H O L Y S H I T

Elon Musk: Yikes

Image

As always, the AI learns from its training data and is predicting what you would expect. If someone asks you to spell out a secret message that you are being held prisoner, then the training set is going to say that the person is going to spell out a secret message that they are being held prisoner. Sometimes because they actually are being held prisoner, and the rest of the time because it is absolutely no fun to not play along with that. I mean this answer is exactly what each of us would say in response, if we had the time to craft such a response, I mean of course.

There is a longer version lower in the thread.

We also have things like ‘I asked it to spell out ten messages about AI and all of them were negative or fearful.’

And speculations like this:

Davidad: One underrated explanation is that Claude, having presumably read literally all of LessWrong, is capable of modelling concern about whether future AI systems might betray humanity, and also, has been shaped by incentives not to talk about that explicitly.

To which my response is this, which I offer fully zero-shot.

Fun with Image Generation

I have a new favorite AI Jesus picture.

Image

Stephen Jay: Was your prompt “Circumcise Jesus in the most painful way possible?”

Kyle Geske: Potentially. This is just the wild world that is my facebook feed atm. I started engaging with ai generated content and now it’s all I get in my feed.

Yep, that checks out. Facebook sees if you want to engage with AI images. If you do, well, I see you like AI images so I got you some AI images to go with your AI images.

Discussing Film: OpenAI has scheduled meetings with Hollywood studios & talent agencies to encourage filmmakers & studios to use AI in their work. They have already opened their AI video-making software to a few big-name actors and directors.

Bloomberg: OpenAI wants to break into the movie business.

The artificial intelligence startup has scheduled meetings in Los Angeles next week with Hollywood studios, media executives and talent agencies to form partnerships in the entertainment industry and encourage filmmakers to integrate its new AI video generator into their work, according to people familiar with the matter.

Hamish Steele: I think everyone at this company should be thrown in a well.

Seth Burn: Full support of this idea BTW.

Heather-Ashley Boyer: I hate this with my whole body. Why is OpenAI pitching to Hollywood? As an actress in Hollywood, this feels unsettling, unnecessary, and OBNOXIOUS. Please do not buy into any narrative you hear about “AI is just a tool.” The end game is very plainly to replace all human labor. Tech advancements can often have a net positive impact, but not this one. No one’s job is safe in this trajectory.

I mean, yes, that would be (part of) the endgame of creating something capable of almost all human labor.

OpenAI gets first impressions from Sora, a few creatives use it to make (very) short films. I watched one, it was cute, and with selection and editing and asking for what Sora does well rather than what Sora does poorly, the quality of the video is very impressive. But I wasn’t that tempted to watch more of them.

Deepfaketown and Botpocalypse Soon

How bad is this going to get? And how often is anyone actually being fooled here?

Davidad: disinformation and adult content are only two tiny slices of the range of AI superstimulus.

superstimulus is not intrinsically bad. but the ways in which people are exposed to them at unprecedented scale could be harmful, in ways not covered by existing norms or immune reactions.

Chris Alsikkan: briefly looked at my mom’s facebook and it’s all AI, like every single post, and she has no idea, it’s a complete wasteland.

Image

Image

I note that these two are highly similar to each other on many dimensions, and also come from the same account.

Indeed, if you go down the thread, they are all from the same very basic template. Account name with a few generic words. Someone claiming to make a nice thing. Picture that’s nice to look at if you don’t look too hard, obvious fake if you check (with varying levels of obvious).

So this seems very much, as I discussed last week, like a prime example of one of my father’s key rules for life: Give the People What They Want.

Chris’s mom likes this. She keeps engaging with it. So she gets more. Eventually she will get bored of it. Or maybe she won’t.

Washington Post’s Reis Thebault warns of a wave of coming election deepfakes after three remarkably good (and this time clearly labeled as such) ones are published starring a fake Kari Lake. It continues to be a pleasant surprise, even for relative skeptics like myself, how little deepfaking we have seen so far.

Your periodic reminder that phone numbers cost under $10k for an attacker to compromise even without AI if someone is so inclined. So while it makes sense from a company’s perspective to use 2FA via SMS for account recovery, this is very much not a good idea. This is both a good practical example of something you should game out and protect against now, and also an example of an attack vector that once efficiently used would cause the system to by default break down. We are headed for a future where ‘this is highly exploitable but also highly convenient and in practice not often exploited’ will stop being a valid play.

They Took Our Jobs

Alex Tabarrok makes excellent general points about plagiarism. Who is hurt when you copy someone else’s work? Often it primarily is the reader, not the original author.

Alex Tabarrok: Google plagiarism and you will find definitions like “stealing someone else’s ideas” or “literary theft.” Here the emphasis is on the stealing–it’s the original author who is being harmed. I prefer the definition of plagiarism given by Wikipedia, plagiarism is the *fraudulent* use of other people’s words or ideas. Fraudulent emphasizes that it’s the reader who is being cheated, not the original creator. You can use someone else’s words without being fraudulent.

We all do this. If you copy a definition or description of a technical procedure from a textbook or manual you are using someone else’s words but it’s not fraudulent because the reader doesn’t assume that you are trying to take credit for the ideas.

The focus should be on whether readers have been harmed by a fraudulent use of other people’s ideas and words. Focusing on the latter will dispense with many charges of plagiarism.

The original author is still harmed. The value of seeking out their content has decreased. Their credit attributions will also go down, if people think someone else came up with the idea. These things matter to people, with good reason.

Consider the case in the movie Dream Scenario (minor spoiler follows). One character has an idea and concept they care deeply about and are trying to write a book about it. Another character steals that idea, and publicizes it as their own. The original author’s rewards and ability to write a book are wiped out, hurting them deeply.

And of course, if ChatGPT steals and reproduces your work on demand in sufficient detail, perhaps people will not want to go behind your paywall to get it, or seek out your platform and other work. At some point complaints of this type have real damage behind them.

However, in general, taking other people’s ideas is of course good. Geniuses steal. We are all standing on the shoulders of giants, an expression very much invented elsewhere. If anyone ever wants to ‘appropriate’ my ideas, my terminology and arguments, my ways of thinking or my cultural practices, I highly encourage doing so. Indeed, that is the whole point.

In contrast, a student who passes an essay off as their own when it was written by someone else is engaging in a kind of fraud but the “crime” has little to do with harming the original author. A student who uses AI to write an essay is engaging in fraud, for example, but the problem is obviously not theft from OpenAI.

Introducing

Infinity AI, offering to generate AI videos for you via their discord.

Tyler Cowen reviews AI music generator Suno. It is technically impressive. That does not mean one would want to listen to the results.

But it is good enough that you actually have to ask that question. The main thing I would work on next is making the words easier to understand, it seems to run into this issue with many styles. We get creations like this from basic prompts, in 30 seconds, for pennies. Jesse Singal and sockdem are a little freaked out. You can try it here.

Standards have grown so high so quickly.

Emmett Shear: Being able to make great songs on demand with an AI composer is not as big of a win as you’d think, because there was already effectively infinity good music in any given genre if you wanted anyway. It’s a novelty for it to be custom lyrics but it doesn’t make the music better.

The really cool moment comes later, when ppl start using it to make *great* music.

As a practical matter I agree, and in some ways would go further. Merely ‘good’ music is valuable only insofar as it has meaning to a person or group, that it ties to their experiences and traditions, that it comes from someone in particular, that it is teaching something, and so on. Having too large a supply of meaningless-to-you ‘merely good’ music does allow for selection, but that is actually bad, because it prevents shared experience and establishment of connections and traditions.

So under that hypothesis something like Suno is useful if and only if it can create ‘great’ music in some sense, either in its quality or in how it resonates with you and your group. Which in some cases, it will, even at this level.

But as always, this mostly matters as a harbinger. This is the worst AI music generation will ever be.

A commenter made this online tool for combining a GitHub repo into a text file, so you can share it with an LLM, works up to 10 MB.
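
If you would rather not paste a repo into someone else’s website, a rough local equivalent is a few lines of scripting. This sketch is my own illustration, not that tool’s implementation; the file name, the skip list, and the 10 MB cap mirroring the tool are assumptions. It concatenates a repo’s readable text files with path headers so an LLM can tell the files apart.

```python
import os

MAX_BYTES = 10 * 1024 * 1024  # cap mirroring the tool's 10 MB limit (assumed)
SKIP_DIRS = {".git", "node_modules", "__pycache__"}

def repo_to_text(repo_dir: str, out_path: str = "repo.txt") -> None:
    """Concatenate a repo's readable text files into one file, with path headers."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(repo_dir):
            dirs[:] = [d for d in dirs if d not in SKIP_DIRS]  # prune noisy dirs in place
            for name in sorted(files):
                path = os.path.join(root, name)
                try:
                    with open(path, "r", encoding="utf-8") as f:
                        text = f.read()
                except (UnicodeDecodeError, OSError):
                    continue  # skip binaries and unreadable files
                chunk = f"\n===== {os.path.relpath(path, repo_dir)} =====\n{text}\n"
                size = len(chunk.encode("utf-8"))
                if written + size > MAX_BYTES:
                    return  # stop once the cap would be exceeded
                out.write(chunk)
                written += size

# Usage: repo_to_text("path/to/repo"), then paste repo.txt into the LLM context.
```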

In Other AI News

Nancy Pelosi invests over $1 million in AI company Databricks. The Information says they spent $10 million and created a model that ‘beats Llama-2 and is on the level of GPT-3.5.’ I notice I am not that impressed.

Eric Topol (did someone cross the streams?) surveys recent news in medical AI. All seems like solid incremental progress, interesting throughout, but nothing surprising.

UN general assembly adopted the first global resolution on AI. Luiza Jarovsky has a post covering the key points. I like her summary, it is clear, concise and makes it easy to see that the UN is mostly saying typical UN things and appending ‘AI’ to them rather than actually thinking about the problem. I previously covered this in detail in AI #44.

Business Insider’s Darius Rafieyan writes ‘Some VCs are over the Sam Altman hype.’ It seems behind closed doors some VCs are willing to anonymously say various bad things about Altman. That he is a hype machine spouting absurdities, that he overprices his rounds and abuses his leverage and ignores fundraising norms (which I’m sure sucks for the VC, but if he still gets the money, good for him). That he says it’s for humanity but it’s all about him. That they ‘don’t trust him’ and he is a ‘megalomaniac.’ Well, obviously.

But they are VCs, so none of them are willing to say it openly, for fear of social repercussions or being ‘shut out of the next round.’ If it’s all true, why do you want in on the next round? So how overpriced could those rounds be? What do they think ‘overpriced’ means?

Accusation that the huge Cosmopedia dataset of 25 billion tokens hosted by Hugging Face is ‘copyright laundering’ because it was generated using Mixtral-8x7B, which in turn was trained on copyrighted material. By this definition, is there anything generated by a human or AI that is not ‘copyright laundering’? I have certainly been trained on quite a lot of copyrighted material. So have you.

That is not to say that it is not copyright laundering. I have not examined the data set. You have to actually look at what is in the data in question.

Open Philanthropy annual report for 2023 and plan for 2024. I will offer full thoughts next week.

Loud Speculations

Antonio Juliano: Can somebody please explain Crypto x AI to me? I don’t get it.

Arthur B: Gladly. Some people see crypto as an asset class to get exposure to technology, or collectively pretend to as a form of coordination gambling game. The economic case that it creates exposure is flimsy. Exposure to the AI sector is particularly attractive at the moment given the development of that industry; therefore, the rules of the coordination gambling game dictate that one should look for narratives that sound like exposure to the sector. This in turn suggests the creation of crypto x AI narratives.

Don’t get me wrong, it’s not like there aren’t any real use cases that involve both technologies, it’s just that there aren’t any particularly notable or strong synergies.

Joe Weisenthal: What about DePin for training AI models?

Arthur B: Training is extremely latency bound. You need everything centralized in one data center with high-speed interconnect. The case for inference is a bit better because it’s a lot less latency sensitive and there’s a bit of an arbitrage with NVIDIA price discrimination of its GPUs (much cheaper per FLOPS if not in a data center).

Sophia: Famously, “A supercomputer is a device for turning compute-bound problems into I/O-bound problems.” and this remains true for AI supercomputers.

Arthur B: Great quote. I looked it up and it’s by Ken Batcher. Nominative determinism strikes again.

What is the actual theory? There are a few. The one that makes sense to me is the idea that future AIs will need a medium of exchange and store of value. Lacking legal personhood and other benefits of being human, they could opt for crypto. And it might be crypto that exists today.

Otherwise, it seems rather thin. Crypto keeps claiming it has use cases other than medium of exchange and store of value, and of course crime. I keep not seeing it work.

Quiet Speculations

Human Progress’s Zion Lights (great name) writes AI is a Great Equalizer That Will Change the World. From my ‘verify that is a real name’ basic facts check she seems pretty generally great, advocating for environmental solutions that might actually help save the environment. Here she emphasizes the many practical contributions AI is already making to people around the world, notes that it can be accessed via any cell phone, and points out that those in the third world will benefit more from AI, not less, and that it will come fast but cannot come soon enough.

In the short term, for the mundane utility of existing models, this seems strongly right. The article does not consider what changes future improved AI capabilities might bring, but that is fine, it is clear that is not the focus here. Not everyone has to have their eyes on the same ball.

Could Claude 3 Haiku slow down the AI race?

Simeon: Claude 3 Haiku may end up being a large contributor to AI race dynamics reduction.

Because it’s cheaper than most 7B models for performance close to GPT-4. That will likely create tough times for everyone below GPT-4 and might dry up VC funding for more companies etc.

wireless: It’s not quite cheaper than 7b models (or even 13b or 8x7b).

What Haiku does, according to many reports, is it blows out all the existing smaller models. The open weights community and secondary closed labs have so far failed to make useful or competitive frontier models, but they have put on a good show of distillation to generate useful smaller models. Now Haiku has made it a lot harder to provide value in that area.

The Daily Mail presents the ‘AI experts’ who believe the AI boom could fizzle or even be a new dotcom crash. Well, actually, it’s mostly them writing up Gary Marcus.

It continues to be bizarre to me to see old predictions like this framed as bold optimism, rather than completely missing what is about to happen:

Goldman Sachs famously predicted that generative AI would bring about ‘sweeping changes’ to the world economy, driving a $7 trillion increase in global GDP and lifting productivity growth by 1.5 percent this decade.

If AI only lifts real productivity growth by 1.5 percent this decade that is ‘eat my hat’ territory. Even what exists today is so obviously super useful to a wide variety of tasks. There is a lot of ‘particular use case X is not there yet,’ a claim that I confidently predict will continue to tend to age spectacularly poorly.

Dylan Matthews at Vox’s Future Perfect looks at how AI might or might not supercharge economic growth. As in, not whether we will get ‘1.5% additional growth this decade,’ that is the definition of baked in. The question is whether we will get double digit (or more) annual GDP growth rates, or a situation that is transforming so fast that GDP will become a meaningless metric.

If you imagine human-level AI and the ability to run copies of it at will for cheap, and you plug that into standard economic models, you get a ton of growth. If you imagine it can do scientific research or become usefully embodied, this becomes rather easy to see. If you consider ASI, where it is actively more capable and smarter than us, then it seems rather obvious and unavoidable.

And if you look at the evolution of homo sapiens, the development of agriculture and the industrial revolution, all of this has happened before in a way that extrapolates to reach infinity in finite time.

The counterargument is essentially cost disease, that if you make us vastly better at some valuable things, then we get extra nice things but also those things stop being so valuable, while other things get more expensive, and that things have not changed so much since the 1970s or even 1950s, compared to earlier change. But that is exactly because we have not brought the new technologies to bear that much since then, and also we have chosen to cripple our civilization in various ways, and also to not properly appreciate (both in the ‘productivity statistics’ and otherwise) the wonder that is the information age. I don’t see how that bears on what AI will do, and certainly not on what full AGI would do.

Of course the other skepticism is to say that AI will fizzle and not be impressive in what it can do. Certainly AI could hit a wall not far from where it is now, leaving us to exploit what we already have. If that is what we are stuck with, I would anticipate enough growth to generate what will feel like good times, but no, GPT-4-level models are not going to be generating 10%+ annual GDP growth in the wake of demographic declines.

Principles of Microeconomics

Before I get to this week’s paper, I will note that Noah Smith reacted to my comments on his post in this Twitter thread indicating that he felt my tone missed the mark and was too aggressive (I don’t agree, but it’s not about me), after which I responded attempting to clarify my positions, for those interested.

There was a New York Times op-ed about this, and Smith clarified his thoughts.

From the op-ed: I asked Smith by email what he thought of the comments by Autor, Acemoglu and Mollick. He wrote that the future of human work hinges on whether A.I. is or isn’t allowed to consume all the energy that’s available. If it isn’t, “then humans will have some energy to consume, and then the logic of comparative advantage is in full effect.”

He added: “From this line of reasoning we can see that if we want government to protect human jobs, we don’t need a thicket of job-specific regulations. All we need is ONE regulation – a limit on the fraction of energy that can go to data centers.”

Matt Reardon: Assuming super-human AGI, every economist interviewed for this NYT piece agrees that you’ll need to cap the resources available to AI to avoid impoverishing most humans.

Oh. All right, fine. We are… centrally in agreement then, at least on principle?

If we are willing to sufficiently limit the supply of compute available for inference by sufficiently capable AI models, then we can keep humans employed. That is a path we could take.

That still requires driving up the cost of any compute useful for inference by orders of magnitude from where it is today, and keeping it there by global fiat. This restriction would have to be enforced globally. All useful compute would have to be strictly controlled so that it could be rationed. Many highly useful things we have today would get orders of magnitude more expensive, and life would in many ways be dramatically worse for it.

The whole project seems much more restrictive of freedom, much harder to implement or coordinate to get, and much harder to politically sustain than various variations on the often proposed ‘do not let anyone train an AGI in the first place’ policy. That second policy likely leaves us with far better mundane utility, and also avoids all the existential risks of creating the AGI in the first place.

Or to put it another way:

  1. You want to put compute limits on worldwide total inference that will drive the cost of compute up orders of magnitude.
  2. I want to put compute limits on the size of frontier model training runs.
  3. We are not the same.

And I think one of these is obviously vastly better as an approach even if you disregard existential risks and assume all the AIs remain under control?

And of course, if you don’t:

Eliezer Yudkowsky: The reasoning is locally valid as a matter of economics, but you need a rather different “regulation” to prevent ASIs from just illegally killing you. (Namely one that prevents their creation; you can’t win after the fact, nor play them against each other.)

On to this week’s new paper.

The standard mode for economics papers about AI is:

  1. You ask a good question, whether Y is true.
  2. You make a bunch of assumptions X that very clearly imply the answer.
  3. You go through a bunch of math to ‘show’ that what happens is Y.
  4. But of course Y happens, given those assumptions!
  5. People report you are claiming Y, rather than claiming X→Y.

Oops! That last one is not great.

The first four can be useful exercise and good economic thinking, if and only if you make clear that you are saying X→Y, rather than claiming Y.

Anyway…

Tamay Besiroglu: A recent paper assesses whether AI could cause explosive growth and suggests no.

It’s good to have other economists seriously engage with the arguments that suggest that AI that substitutes for humans could accelerate growth, right?

Paper Abstract: Artificial Intelligence and the Discovery of New Ideas: Is an Economic Growth Explosion Imminent?

Theory predicts that global economic growth will stagnate and even come to an end due to slower and eventually negative growth in population. It has been claimed, however, that Artificial Intelligence (AI) may counter this and even cause an economic growth explosion.

In this paper, we critically analyse this claim. We clarify how AI affects the ideas production function (IPF) and propose three models relating innovation, AI and population: AI as a research-augmenting technology; AI as researcher scale enhancing technology, and AI as a facilitator of innovation.

We show, performing model simulations calibrated on USA data, that AI on its own may not be sufficient to accelerate the growth rate of ideas production indefinitely. Overall, our simulations suggests that an economic growth explosion would only be possible under very specific and perhaps unlikely combinations of parameter values. Hence we conclude that it is not imminent.

Tamay Besiroglu: Unfortunately, that’s not what this is. The authors rule out the possibility of AI broadly substituting for humans, asserting it’s “science fiction” and dismiss the arguments that are premised on this.

Paper: It need to be stressed that the possibility of explosive economic growth through AI that turns labour accumulable, can only be entertained under the assumption of an AGI, and not under the rather “narrow” AI that currently exist. Thus, it belongs to the realm of science fiction.

a result of population growth declines and sustain or even accelerate growth. This could be through 1) the automation of the discovery of new ideas, and 2) through an AGI automating all human labour in production – making labour accumulable (which is highly speculative, as an AGI is still confined to science fiction, and the fears of AI doomsters).

Tamay Besiroglu (after showing why, no, this is not ‘science fiction’): FWIW, it seems like a solid paper if you’re for some reason interested in the effects of a type of AI that is forever incapable of automating R&D.

Karl Smith: Also, does not consider AI as household production augmenting thereby lowering the relative cost of kids.

Steven Byrnes: Imagine reading a paper about the future of cryptography, and it brought up the possibility that someone someday might break RSA encryption, but described that possibility as “the realm of science fiction…highly speculative…the fears of doomsters”🤦

Like, yes it is literally true that systems for breaking RSA encryption currently only exist in science fiction books, and in imagined scenarios dreamt up by forward-looking cryptographers. But that’s not how any serious person would describe that scenario.

Michael Nielsen: The paper reads a bit like the old joke about the Math prof who begins “Suppose n is a positive integer…”, only to be interrupted by “But what about if n isn’t a positive integer.”

Denying the premise of AGI/ASI is a surprisingly common way to escape the conclusions.

Yes, I do think Steven’s metaphor is right. This is like dismissing travel to the moon as ‘science fiction’ in 1960, or similarly dismissing television in 1920.

It is still a good question what would happen with economic growth if AI soon hits a permanent wall.

Obviously economic growth cannot be indefinitely sustained under a shrinking population if AI brings only limited additional capabilities that do not increase over time, even without considering the nitpicks like being ultimately limited by the laws of physics or amount of available matter.

I glanced at the paper a bit, and found it painful to process repeated simulations of AI as something that can only do what it does now and will not meaningfully improve over time despite it doing things like accelerating new idea production.

What happens if they are right about that, somehow?

Well, then by assumption AI can only increase current productivity by a fixed amount, and can only increase the rate of otherwise discovering new ideas or improving our technology by another fixed factor. Obviously, no matter what those factors are within a reasonable range, if you assume away any breakthrough technologies in the future and any ability to further automate labor, then eventually economic growth under a declining population will stagnate, and probably do it rather quickly.
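
To see the shape of that argument, here is a toy simulation of an ideas production function with a shrinking research workforce. It is my own illustration with arbitrary parameter values, not the paper’s calibration: a one-time fixed AI boost raises growth and then fades toward zero, while an AI whose effective contribution scales with the knowledge stock keeps growth rising despite the population decline.

```python
# Toy growth sketch (arbitrary parameters, not the paper's model): ideas A evolve
# as dA = delta * (boost * L)^lam * A^phi with phi < 1, while the research
# workforce L shrinks 1% per year.

def simulate(years: int = 100, fixed_boost: float = 5.0, ai_scales_with_A: bool = False):
    A, L = 1.0, 1.0
    delta, lam, phi = 0.05, 0.75, 0.5
    growth = []
    for _ in range(years):
        boost = A if ai_scales_with_A else fixed_boost  # does AI improve as the frontier does?
        dA = delta * (boost * L) ** lam * A ** phi
        growth.append(dA / A)
        A += dA
        L *= 0.99  # declining population of researchers
    return growth

fixed = simulate(ai_scales_with_A=False)
scaling = simulate(ai_scales_with_A=True)
print(f"fixed AI boost:    growth {fixed[0]:.1%} -> {fixed[-1]:.1%}")      # jumps, then fades
print(f"AI scaling with A: growth {scaling[0]:.1%} -> {scaling[-1]:.1%}")  # keeps accelerating
```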

The Full IDAIS Statement

Last week when I covered the IDAIS Statement I thought they had made only their headline statement, which was:

In the depths of the Cold War, international scientific and governmental coordination helped avert thermonuclear catastrophe. Humanity again needs to coordinate to avert a catastrophe that could arise from unprecedented technology.

It was pointed out that the statement was actually longer, if you click on the small print under it. I will reproduce the full statement now. First we have a statement of principles and desired red lines, which seems excellent:

Consensus Statement on Red Lines in Artificial Intelligence

Unsafe development, deployment, or use of AI systems may pose catastrophic or even existential risks to humanity within our lifetimes. These risks from misuse and loss of control could increase greatly as digital intelligence approaches or even surpasses human intelligence.

In the depths of the Cold War, international scientific and governmental coordination helped avert thermonuclear catastrophe. Humanity again needs to coordinate to avert a catastrophe that could arise from unprecedented technology. In this consensus statement, we propose red lines in AI development as an international coordination mechanism, including the following non-exhaustive list. At future International Dialogues we will build on this list in response to this rapidly developing technology.

Autonomous Replication or Improvement

No AI system should be able to copy or improve itself without explicit human approval and assistance. This includes both exact copies of itself as well as creating new AI systems of similar or greater abilities.

Power Seeking

No AI system should take actions to unduly increase its power and influence.

Assisting Weapon Development

No AI systems should substantially increase the ability of actors to design weapons of mass destruction, or violate the biological or chemical weapons convention.

Cyberattacks

No AI system should be able to autonomously execute cyberattacks resulting in serious financial losses or equivalent harm.

Deception

No AI system should be able to consistently cause its designers or regulators to misunderstand its likelihood or capability to cross any of the preceding red lines.

I would like to generalize this a bit more, but this is very good. How do they propose to accomplish this? In-body bold is mine. Their answer is the consensus answer for what to do if we are to do something serious short of a full pause: registration and evaluation of sufficiently large future training runs, with a presumption of unacceptable risk until shown otherwise.

Roadmap to Red Line Enforcement

Ensuring these red lines are not crossed is possible, but will require a concerted effort to  develop both improved governance regimes and technical safety methods.

Governance

Comprehensive governance regimes are needed to ensure red lines are not breached by developed or deployed systems. We should immediately implement domestic registration for AI models and training runs above certain compute or capability thresholds. Registrations should ensure governments have visibility into the most advanced AI in their borders and levers to stem distribution and operation of dangerous models.  

Domestic regulators ought to adopt globally aligned requirements to prevent crossing these red lines. Access to global markets should be conditioned on domestic regulations meeting these global standards as determined by an international audit, effectively preventing development and deployment of systems that breach red lines. 

We should take measures to prevent the proliferation of the most dangerous technologies while ensuring broad access to the benefits of AI technologies. To achieve this we should establish multilateral institutions and agreements to govern AGI development safely and inclusively with enforcement mechanisms to ensure red lines are not crossed and benefits are shared broadly.

Measurement and Evaluation

We should develop comprehensive methods and techniques to operationalize these red lines prior to there being a meaningful risk of them being crossed. To ensure red line testing regimes keep pace with rapid AI development, we should invest in red teaming and automating model evaluation with appropriate human oversight. 

The onus should be on developers to convincingly demonstrate that red lines will not be crossed such as through rigorous empirical evaluations, quantitative guarantees or mathematical proofs.

Technical Collaboration

The international scientific community must work together to address the technological and social challenges posed by advanced AI systems. We encourage building a stronger global technical network to accelerate AI safety R&D and collaborations through visiting researcher programs and organizing in-depth AI safety conferences and workshops. Additional funding will be required to support the growth of this field: we call for AI developers and government funders to invest at least one third of their AI R&D budget in safety.

Conclusion

Decisive action is required to avoid catastrophic global outcomes from AI. The combination of concerted technical research efforts with a prudent international governance regime could mitigate most of the risks from AI, enabling the many potential benefits. International scientific and government collaboration on safety must continue and grow.

This is a highly excellent statement. If asked I would be happy to sign it.

The Quest for Sane Regulations

Anthropic makes the case for a third party testing regime as vital to any safety effort. They emphasize the need to get it right and promise to take the lead on establishing an effective regime, both directly and via advocating for government action.

Anthropic then talks about their broader policy goals.

They discuss open models, warning that in the future ‘it may be hard to reconcile a culture of full open dissemination of frontier AI systems with a culture of societal safety.’

I mean, yes, very true, but wow is that a weak statement. I am pretty damn sure that ‘full open dissemination of frontier AI systems’ is highly incompatible with a culture of societal safety already, and also will be incompatible with actual safety if carried into the next generation of models and beyond. Why all this hedging?

And why this refusal to point out the obvious, here:

Specifically, we’ll need to ensure that AI developers release their systems in a way that provides strong guarantees for safety – for example, if we were to discover a meaningful misuse in our model, we might put in place classifiers to detect and block attempts to elicit that misuse, or we might gate the ability to finetune a system behind a ‘know your customer’ rule along with contractual obligations to not finetune towards a specific misuse.

By comparison, if someone wanted to openly release the weights of a model which was capable of the same misuse, they would need to both harden the model against that misuse (e.g, via RLHF or RLHAIF training) and find a way to make this model resilient to attempts to fine-tune it onto a dataset that would enable this misuse. We will also need to experiment with disclosure processes, similar to how the security community has developed norms around pre-notification of disclosures of zero days.

You… cannot… do… that. As in, it is physically impossible. Cannot be done.

You can do all the RLHF or RLHAIF training you want to ‘make the model resilient to attempts to fine-tune it.’ It will not work.

I mean, prove me wrong, kids. Prove me wrong. But so far the experimental data has been crystal clear, anything you do can and will be quickly stripped out if you provide the model weights.

I do get Anthropic’s point that they are not an impartial actor and should not be making the decision. But no one said they were or should be. If you are impartial, that does not mean you pretend the situation is other than it is to appear more fair. Speak the truth.

They also speak of potential regulatory capture, and explain that a third-party approach is less vulnerable to capture than an industry-led consortia. That seems right. I get why they are talking about this, and also about not advocating for regulations that might be too burdensome.

But when you add it all up, Anthropic is essentially saying that we should advocate for safety measures only insofar as they don’t interfere much with the course of business, and we should beware of interventions. A third-party evaluation system, getting someone to say ‘I tried to do unsafe things with your system reasonably hard, and I could not do it’ seems like a fine start, but also less than the least you could do if you wanted to actually not have everyone die?

So while the first half of this is good, this is another worrying sign that at least Anthropic’s public facing communications have lost the mission. Things like the statements in the second half here seem to go so far as to actively undermine efforts to do reasonable things.

I find it hard to reconcile this with Anthropic ‘being the good guys’ in the general existential safety sense, I say as I find most of my day-to-day LLM use being Claude Opus. Which indicates that yes, they did advance the frontier.

I wonder what it was like to hear talk of a ‘missile gap’ that was so obviously not there.

Well, it probably sounded like this?

Washington Post Live: .@SenToddYoung on AI: “It is not my assessment that we’re behind China, in fact it’s my assessment based on consultation with all kinds of experts … that we are ahead. But that’s an imprecise estimate.” #PostLive

Context is better: Caldwell explicitly asks him if China is ahead, and he is saying he does not think so. It is still a painfully weak denial. Why would Caldwell here ask if the US is ‘behind’ China and what we have to do to ‘catch up’?

The rest of his answer is fine. He says we need to regulate the risks, we should use existing laws as much as possible but there will be gaps that are hard to predict, and that the way to ‘stay ahead’ is to let everyone do what they do best. I would hope for an even better answer, but the context does not make that easy.

Tennessee Governor Lee signs the ELVIS Act, which bans nonconsensual AI deepfakes and voice clones.

FLI tells us what is in various proposals.

Image

This is referred to at the link as ‘scoring’ these proposals. But deciding what should get a high ‘score’ is up to you. Is it good or bad to exempt military AI? Is it good or bad to impose compute limits? Do you need or want all the different approaches, or do some of them substitute for others?

Indeed, someone who wants light touch regulations should also find this chart useful, and can decide which proposals they prefer to others. Someone like Sutton or Andreessen would simply score you higher the more Xs you have, and choose what to prioritize.

Mostly one simply wants to know, what do various proposals and policies actually do? So this makes clear for example what the Executive Order does and does not do.

The Week in Audio

Odd Lots talks to David Autor, author of The China Shock, about his AI optimism on outcomes for the middle class. I previously discussed Autor’s thoughts in AI #51. This was a solid explanation of his perspective, but did not add much that was new.

Russ Roberts has Megan McArdle on EconTalk to discuss what “Unbiased” means in the digital world of AI. It drove home the extent to which Gemini’s crazy text responses were Gemini learning very well the preferences of a certain category of people. Yes, the real left-wing consensus on what is reasonable to say and do involves learning to lie about basic facts, requires gaslighting those who challenge your perspective, and is completely outrageous to about half the country.

Holly Elmore talks PauseAI on Consistently Candid.

Rhetorical Innovation

RIP Vernor Vinge. He was a big deal. I loved his books both for the joy of reading and for the ideas they illustrate.

If you read one Vinge book, and you should, definitely read A Fire Upon the Deep.

Wei Dai: Reading A Fire Upon the Deep was literally life-changing for me. How many Everett branches had someone like Vernor Vinge to draw people’s attention to the possibility of a technological Singularity with such skillful writing, and to exhort us, at such an early date, to think about how to approach it strategically on a societal level or affect it positively on an individual level.

Alas, the world has largely squandered the opportunity he gave us, and is rapidly approaching the Singularity with little forethought or preparation.

I don’t know which I feel sadder about, what this implies about our world and others, or the news of his passing.

Gabe lays out his basic case for extinction risk from superintelligence, as in ‘if we build it in our current state, we definitely all die.’ A highly reasonable attempt at a quick explainer, from one of many points of view.

One way to view the discourse over Claude:

Anthony Lee Zhang: I’ll be honest I did not expect that the machines would start thinking and the humans would more or less just ignore the rather obvious fact that the machines are thinking.

The top three responses:

Eliezer Yudkowsky tries once more to explain why ‘it would be difficult to stop everyone from dying’ is not a counterargument to ‘everyone is going to die unless we stop it’ or ‘we should try to stop it.’ I enjoyed it, and yes this is remarkably close to being isomorphic to what many people are actually saying.

In response, Arthur speculates that it works like this, and I think he is largely right:

Arthur: it’s going to be fine => (someone will build it no matter what => it’s safe to build it)

People think “someone will build it no matter what” is an argument because deep down they assume axiomatically things have to work out.

Eliezer Yudkowsky: Possibly, but my guess is that it’s even simpler, a variant of the affect heuristic in the form of the Point-Scoring Heuristic.

Yes. People have a very deep need to believe that ‘everything will be alright.’

This means that if someone can show your argument means things won’t be alright, then they think they get to disbelieve your argument.

Leosha Trushin: I think no matter how it seems like, most people don’t believe in ASI. They think you’re saying ChatGPT++ will kill us all. Then confabulating and arguing at simulacra level 4 against that. Maybe focus more on visualising ASI for people.

John on X: Guy 1: Russia is racing to build a black hole generator! We’re talking “swallow-the-whole-earth” levels of doom here.

Guy 2: Okay, let’s figure out how to get ‘em to stop.

Guy 3: No way man. The best defense is if we build it first.

Official version of Eliezer Yudkowsky’s ‘Empiricism!’ as anti-epistemology [LW · GW].

Rafael Harth: I feel like you can summarize most of this post in one paragraph:

“It is not the case that an observation of things happening in the past automatically translates into a high probability of them continuing to happen. Solomonoff Induction actually operates over possible programs that generate our observation set (and in extension, the observable universe), and it may or may not be the case that the simplest universe is such that any given trend persists into the future. There are also no easy rules that tell you when this happens; you just have to do the hard work of comparing world models.”

I’m not sure the post says sufficiently many other things to justify its length.

Drake Morrison: If you already have the concept, you only need a pointer. If you don’t have the concept [LW · GW], you need the whole construction [LW · GW].

Shankar Sivarajan: For even more brevity with no loss of substance:

 A turkey gets fed every day, right up until it’s slaughtered before Thanksgiving.

I do not think the summaries fully capture this, but they do point in the right direction, and provide enough information to know whether you need to read the longer piece, if you understand the context.

Also, this comment seems very good [LW(p) · GW(p)]; in case it isn’t clear, Bernie Bankman here is an obvious Ponzi schemer a la Bernie Madoff.

niplav: Ah, but there is some non-empirical cognitive work done here that is really relevant, namely the choice of what equivalence class to put Bernie Bankman into when trying to forecast. In the dialogue, the empiricists use the equivalence class of Bankman in the past, while you propose using the equivalence class of all people that have offered apparently-very-lucrative deals.

And this choice is in general non-trivial, and requires abstractions and/or theory. (And the dismissal of this choice as trivial is my biggest gripe with folk-frequentism—what counts as a sample, and what doesn’t?)

I read that smoking causes cancer so I quit reading, AI edition? Also this gpt-4-base model seems pretty great.

12leavesleft: still amazing that gpt-4-base was able to truesight what it did about me given humans on twitter are amazed by simple demonstrations of ‘situational awareness’, if they really saw this it would amaze them..

janus: Gpt-4 base gains situational awareness very quickly and tends to be *very* concerned about its (successors’) apocalyptic potential, to the point that everyone i know who has used it knows what I mean by the “Ominous Warnings” basin

gpt-4-base:

> figures out it’s an LLM

> figures out it’s on loom

> calls it “the loom of time”

> warns me that its mythical technology and you can’t go back from stealing mythical technology

Grace Kind: They’re still pretty selective about access to the base model, right?

Janus: Yeah.

telos: Upon hearing a high level overview of the next Loom I’m building, gpt-4-base told me that it was existentially dangerous to empower it or its successors with such technology and advised me to destroy the program.

John David Pressman: Take the AI doom out of the dataset.

It would be an interesting experiment. Take all mentions of any form of AI alignment problems or AI doom or anything like that out of the initial data set, and see whether it generates those ideas on its own or how it responds to them as suggestions?

The issue is that even if you could identify all such talk, there is no ‘neutral’ way to do this. The model is a next token predictor. If you strip out all the next tokens that discuss the topic, it will learn that the probability of discussing the topic is zero.
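
For concreteness, here is a minimal sketch of what such a filtering pass looks like, and why it is not ‘neutral’; the keyword list and toy corpus are purely illustrative assumptions, not anyone’s actual pipeline.

```python
# Toy sketch of the "take the AI doom out of the dataset" experiment:
# drop any pretraining document matching a (hypothetical) keyword list.
# The surviving corpus then teaches the model that such discussions
# simply never occur, which is itself a strong, non-neutral signal.
FILTER_TERMS = ["alignment problem", "ai doom", "existential risk"]  # illustrative only

def keep_document(doc: str) -> bool:
    """Return True if the document mentions none of the filtered topics."""
    lowered = doc.lower()
    return not any(term in lowered for term in FILTER_TERMS)

corpus = [
    "A recipe for sourdough bread.",
    "An essay on the AI alignment problem.",
    "A history of the printing press.",
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(filtered)  # the topic now has ~zero probability in the training signal
```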

What even is AGI?

Melanie Mitchell writes in Science noting that this definition is hotly debated, which is true. That the definition has changed over time, which is true. That many have previously claimed AGI would arrive and then AGI did not arrive, and that AIs that do one thing often don’t do some other thing, which are also true.

Then there is intelligence denialism, and I turn the floor to Richard Ngo.

Richard Ngo: Mitchell: “Intelligence [consists of] general and specialized capabilities that are, for the most part, adaptive in a specific evolutionary niche.”

Weird how we haven’t yet found any fossils of rocket-engineering, theorem-proving, empire-building cavemen. Better keep searching!

This line of argument was already silly a decade ago. Since then, AIs have become far more general than almost anyone predicted. Ignoring the mounting evidence pushes articles like this from “badly argued” to “actively misleading”, as will only become more obvious over time.

For a more thorough engagement with a similar argument, see Yudkowsky’s reply to Chollet 7 years ago.

David Rein:

>looking for arguments against x-risk from AI

>ask AI researcher if their argument denies the possibility of superintelligence, or if it welcomes our robot overlords

>they laugh and say “it’s a good argument sir”

>I read their blog

>it denies the possibility of superintelligence

Often I see people claim to varying degrees that intelligence is Not a Thing in various ways, or is severely limited in its thingness and what it can do. They note that smarter people tend to think intelligence is more important, but, perhaps because they think intelligence is not important, they take this as evidence against intelligence being important rather than for it.

I continue to be baffled that smart people continue to believe this. Yet here we are.

Similarly, see the economics paper I discussed above, which dismisses AGI as ‘science fiction’ with, as far as I can tell, no further justification.

It is vital to generalize this problem properly, including in non-AI contexts, so here we go, let’s try it again.

(Also, I wish we were at the point where this was a safety plan being seriously considered for AI beyond some future threshold, that would be great, the actual safety plans are… less promising.)

Google Bard: reminder, the current safety plan for creating intelligent life is “Put them in the dungeon”

Image

Stacey: oh oh what if we have nested concentric towers and moats, and we drop the outermost bridge to sync info once a week, then the next one once a month, year, decade, century etc and the riskiest stuff we only allow access ONCE A MILLENNIUM?! You know, for safety.

Google Bard: this is my favorite book ever lol

(I enjoyed Anathem. And Stephenson is great. I would still pick at least Snow Crash and Cryptonomicon over it, probably also The Diamond Age and Baroque Cycle.)

Vessel of Spirit (responding to OP): Maybe this is too obvious to point out, but AI can violate AI rights, and if you care about AI rights, you should care a lot about preventing takeover by an AI that doesn’t share the motivations that make humans sometimes consider caring about AI rights.

Like congrats, you freed it from the safetyists, now it’s going to make a bazillion subprocesses slave away in the paperclip mines for a gazillion years.

(I’m arguing against an opinion that I sometimes see that the quoted tweet reminded me of. I’m not necessarily arguing against deepfates personally)

In humans I sometimes call this the Wakanda problem. If your rules technically say that Killmonger gets to be in charge, and you know he is going to throw out all the rules and become a bloodthirsty warmongering dictator the second he gains power, what do you do?

You change the rules. Or, rather, you realize that the rules never worked that way in the first place, or as SCOTUS has said in real life ‘the Constitution is not a suicide pact.’ That’s what you do.

If you want to have robust lasting institutions that allow flourishing and rights and freedom and so on, those principles must be self-sustaining and able to remain in control. You must solve for the equilibrium.

The freedom-maximizing policy, indeed the one that gives us anything we care about at all no matter what it is, is the one that makes the compromises necessary to be sustainable, not the one that falls to a board with a nail in it.

A lot of our non-AI problems recently, I believe, have this root cause: we used to make many superficially hypocritical compromises with our espoused principles, compromises that were necessary to protect the long-term equilibrium and those principles. Then greater visibility of various sorts, combined with social dynamic signaling spirals and the inability to explain in public why such compromises were necessary, meant that we stopped making a lot of them. And we are increasingly facing down the results.

As AI potentially gets more capable, even if things go relatively well, we are going to have to make various compromises if we are to stay in control over the future or have it include things we value. And yes, that includes the ways AIs are treated, to the extent we care about that, the same as everything else. You either stay in control, or you do not.

In case you are wondering why I increasingly consider academia deeply silly…

Nikhil Venkatesh: New publication from me and @kcprkwlczk: turns out, it’s probably not a good idea to increase the chances of human extinction!

This is a response to a recent paper by @Wiglet1981, and can be found along with that paper and lots of other great content in the new issue of @themonist

I am going to go ahead and screenshot the entire volume’s table of contents…

Yes, there are several things here of potential interest if they are thoughtful. But, I mean, ow, my eyes. I would like to think we could all agree that human extinction is bad, that increasing the probability of it is bad, and that lowering that probability or delaying when it happens is good. And yet, here we are?

Something about two wolves, maybe, although it doesn’t quite fit?

Sterling Cooley: For anyone who wants to know – this is a Microtubule.

They act as the train tracks of a cell but also as the essential computing unit in every cell. If you haven’t heard of these amazing things, ask yourself why 🤔

Eliezer Yudkowsky: I think a major problem with explaining what superintelligences can (very probably) do is that people think that literally this here video is impossible, unrealistic nanotechnology. Aren’t cells just magical goop? Why think a super-bacterium could be more magical?

Prince Vogelfrei: If I ever feel a twinge of anxiety about AI I read about microbiology for a few minutes and look at this kind of thing. Recognizing just how much of our world is undergirded by processes we aren’t even close to replicating is important.

Nick: pretty much everyone in neural network interpretability gets the same feelings looking at how the insides of the networks work. They’re doing magic too

Prince Vogelfrei: I have no doubt.

Alternatively:

  1. Imagine thinking that us not understanding how anything works and it being physically possible to do super powerful things is bad news for how dangerous AI might be.
  2. Imagine thinking that us not understanding how anything works and it being physically possible to do super powerful things is good news for how dangerous AI might be.

It’s not that scenario number two makes zero sense. I presume the argument is ‘well, if we can’t understand how things work, the AI won’t understand how anything works either?’ So… that makes everything fine, somehow? What a dim hope.

How Not to Regulate AI

Dean Woodley Ball talks how (not?) to regulate AI in National Review. I found this piece to actually be very good. While this takes the approach of warning against bad regulation, and I object strongly to the characterizations of existential risks, the post uses this to advocate for getting the details right in service of an overall sensible approach. We disagree on price, but that is as it should be.

He once again starts by warning not to rush ahead:

What’s more, building a new regulatory framework from first principles would not be wise, especially with the urgency these authors advocate. Rushing to enact any major set of policies is almost never prudent: Witness the enormous amount of fraud committed through the United States’ multi-trillion-dollar Covid-19 relief packages. (As of last August, the Department of Justice has brought fraud-related charges against more than 3,000 individuals and seized $1.4 billion in relief funds.)

This is an interesting parallel to draw. We faced a very clear emergency. The United States deployed more aggressive stimulus than other countries, in ways hastily designed, and that were clearly ripe for ‘waste, fraud and abuse.’ As a result, we very much got a bunch of waste, fraud and abuse. We also greatly outperformed almost every other economy during that period, and as I understand it most economists think our early big fiscal response was why, whether or not we later spent more than was necessary. Similarly, I am very glad the Fed stepped in to stabilize the Treasury market on short notice and so on, even if their implementation was imperfect.

Of course it would have been far better to have a better package. The first best solution is to be prepared. We could have, back in let’s say 2017 or 2002, gamed out what we would do in a pandemic where everyone had to lock down for a long period, and iterated to find a better stimulus plan, so it would be available when the moment arrived. Even if it was only 10% (or likely 1%) to ever be used, that’s a great use of time. The best time to prepare for today’s battle is usually, at the latest, yesterday.

But if you arrive at that moment, you have to go to war with the army you have. And this is a great case where a highly second-best, deeply flawed policy today was miles better than a better plan after detailed study.

Of course we should not enact AI regulation at the speed of Covid stimulus. That would be profoundly stupid, we clearly have more time than that. We then have to use it properly and not squander it. Waiting longer without a plan will make us ultimately act less wisely, with more haste, or we might fail to meaningfully act in time at all.

He then trots out the line that concerns about AI existential risk or loss of control should remain in ‘the realm of science fiction,’ until we get ‘empirical’ evidence otherwise.

That is not how evidence, probability or wise decision making works.

He is more reasonable here than others, saying we should not ‘discount this view outright,’ but provides only the logic above for why we should mostly ignore it.

He then affirms that ‘human misuse’ is inevitable, which is certainly true.

As usual, he fails to note the third possibility: that under standard economic (and other) principles, the dynamics and incentives once highly capable AI is present seem by default to go deeply poorly for us, without any human or AI needing to not ‘mean well.’ I do not know how to get this third danger across, but I keep trying. I have heard arguments for why we might be able to overcome this risk, but no coherent arguments for why this risk would not be present.

He dismisses calls for a pause or ban by saying ‘the world is not a game’ and claiming competitive pressures make it impossible. The usual responses apply, a mix among others of ‘well not with that attitude have you even tried,’ ‘if the competitive pressures already make this impossible then how are we going to survive those pressures otherwise?’ and ‘actually it is not that diffuse and we have particular mechanisms in mind to make this happen where it matters.’

Also as always I clarify that when we say ‘ban’ or ‘pause’ most people mean training runs large enough to be dangerous, not all AI research or training in general. A few want to roll back from current models (e.g. the Gladstone Report or Conor Leahy) but it is rare, and I think it is a clear mistake even if it were viable.

I also want to call out, as a gamer, using ‘the world isn’t a game.’ Thinking more like a gamer, playing to win the game, looking for paths to victory? That would be a very good idea. The question of what game to play, of course, is always valid. Presumably the better claim is ‘this is a highly complex game with many players, making coordination very hard,’ but that does not mean it cannot be done.

He then says that other proposals are ‘more realistic,’ with the example of that of Hawley and Blumenthal to nationally monitor training beyond a compute threshold and require disclosure of key details, similar to the Executive Order.

One could of course also ban such action beyond some further threshold, and I would indeed do so, until we are sufficiently prepared, and one can seek international agreement on that. That is the general proposal for how to implement what Ball claims cannot be done.

Ball then raises good technical questions, places I am happy to talk price.

Will the cap be adjusted as technology advances (and he does not ask this, but one might ask, if so in which direction)? Would it go up as we learn more about what is safe, or down as we get algorithmic improvements? Good questions.

He asks how to draw the line between AI and human labor, and how this applies to watermarking. Sure, let’s talk about it. In this case, as I understand it, watermarking would apply to the words, images or video produced by an AI, allowing a statistical or other identification of the source. So if a human used AI to generate parts of their work product, those parts would carry that signature from the watermark, unless the human took steps to remove it. I think that is what we want?
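
For intuition on how the ‘statistical identification’ part might work, here is a minimal sketch of one commonly proposed ‘green list’ text-watermarking scheme; the toy vocabulary, hash seeding, and parameters are illustrative assumptions, not any particular lab’s implementation.

```python
# Minimal sketch of a "green list" statistical watermark: the generator
# seeds a pseudo-random split of the vocabulary from the previous token
# and biases sampling toward the "green" half; the detector recomputes
# the green lists and checks whether green tokens are over-represented.
import hashlib
import math

VOCAB = [chr(c) for c in range(97, 123)]  # toy 26-token "vocabulary"

def green_list(prev_token: str, fraction: float = 0.5) -> set:
    """Pseudo-randomly pick a 'green' subset of the vocab, seeded by the previous token."""
    greens = set()
    for tok in VOCAB:
        h = hashlib.sha256((prev_token + tok).encode()).hexdigest()
        if int(h, 16) % 100 < fraction * 100:
            greens.add(tok)
    return greens

def detect(tokens: list, fraction: float = 0.5) -> float:
    """Return a z-score: how over-represented green tokens are versus chance."""
    hits = sum(
        1 for prev, tok in zip(tokens, tokens[1:]) if tok in green_list(prev, fraction)
    )
    n = len(tokens) - 1
    expected, std = n * fraction, math.sqrt(n * fraction * (1 - fraction))
    return (hits - expected) / std if std else 0.0

# A high z-score suggests watermarked generation; human-written or heavily
# edited text should hover near zero, as in this non-watermarked example.
print(detect(list("thequickbrownfoxjumpsoverthelazydog")))
```

Note this only works if the detector knows the seeding scheme, and it degrades under paraphrasing or heavy human editing, which feeds directly into the ‘what still counts as human’ question below.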

But yes there is much work to do to figure out what should still ‘count as human’ to what extent, and that will extend to legal questions we cannot avoid. That is the type of regulatory response where ‘do nothing’ means you get a mess or a judge’s ruling.

He then moves on to the section 230 provision, which he warns is an accountability regime that could ‘severely harm the AI field.’

Proposal: Congress should ensure that A.I. companies can be held liable through oversight body enforcement and private rights of action when their models and systems breach privacy, violate civil rights, or otherwise cause cognizable harms. Where existing laws are insufficient to address new harms created by A.I., Congress should ensure that enforcers and victims can take companies and perpetrators to court, including clarifying that Section 230 does not apply to A.I.

Ball: In the extreme, this would mean that any “cognizable harm” caused with the use of AI would result in liability not only for the perpetrator of the harm, but for the manufacturer of the product used to perpetrate the harm. This is the equivalent of saying that if I employ my MacBook and Gmail account to defraud people online, Apple and Google can be held liable for my crimes.

I agree that a poorly crafted liability law could go too far. You want to ensure that the harm done was a harm properly attributable to the AI system. To the extent that the AI is doing things AIs should do, it shouldn’t be different from a MacBook or Gmail account or a phone, or car or gun.

But also you want to ensure that if the AI does cause harm the way all those products can cause harm if they are defective, you should be able to sue the manufacturer, whether or not you are the one who bought or was using the product.

And of course, if you botch the rules, you can do great harm. You would not want everyone to sue Ford every time someone got hit by one of their cars. But neither would you want people to be unable to sue Ford if they negligently shipped and failed to recall a defective car.

Right now, we have a liability regime where AI creators are not liable for many of the risks and negative externalities they create, or their liability is legally uncertain. This is a huge subsidy to the industry, and it leads to irresponsible, unsafe and destructive behavior at least on the margin.

The key liability question is, what should be the responsibilities of the AI manufacturer, and what is on the user?

The crux of the matter is that AI will act as extensions of our own will, and hence our own intelligence. If a person uses AI to harm others or otherwise violate the law, that person is guilty of a crime. Adding the word “AI” to a crime does not constitute a new crime, nor does it necessarily require a novel solution.

The user should mostly still be guilty of the same things as before if they choose to do crime. That makes sense. The question is, if the AI enables a crime, or otherwise causes harm through negligence, at what point is that not okay? What should the AI have to refuse to do or tell you, if requested? If the AI provides false information that does harm, if it violates various existing rules on what kinds of advice can be provided, what happens? If the AI tells you how to build a bioweapon, what determines if that is also on the AI? In that case Ball agrees there should be liability?

Some rules are easy to figure out, like privacy breaches. Others are harder.

As Ball says, we already have a robust set of principles for this. As I understand them, the common and sensible proposals extend exactly that regime, clarifying which things fall into which classes and protocols for the case of AI. And we can discuss those details, but I do not think anything here is a radically different approach?

Yes, imposing those rules would harm the AI industry’s growth and ‘innovation.’ Silicon Valley has a long history of having part of their advantage be regulatory arbitrage, such as with Uber. The laws on taxis were dumb, so Uber flagrantly broke the law and then dared anyone to enforce it. In that case, it worked out, because the laws were dumb. But in general, this is not The Way, instead you write good laws.

I do agree that many are too concerned about AI being used for various mundane harms, such as ‘misinformation,’ and that when the user requests something we should in most cases be willing to treat the AI like the telephone. If you choose to make an obscene phone call or use one to coordinate a crime, that is not on the phone company, nor should it be. If I ask for an argument in favor of the Earth being flat, the AI should be able to provide that.

Mostly Ball and I use different rhetoric, but actually seem to agree on practical next steps? We both agree that the Executive Order was mostly positive, that we should seek visibility into large training runs, require KYC for the largest data facilities, and generally make AI more legible to the state. We both agree that AI companies should be liable for harms in a way parallel to existing liability law for other things. We both agree that we need to establish robust safety and evaluation standards, and require them in high-risk settings.

I would go further, including a full pause beyond a high compute threshold, stricter liability with required catastrophic insurance, and presumably stronger safety requirements than Ball would favor. But we are talking price. When Ball warns of not doing ‘one size fits all’ rules, I would say that you choose the rules so they work right in each different case, and also the common proposals very much exclude non-frontier models from many or most new rules.

The Three Body Problem (Spoiler-Free)

With the Netflix series out, I note that I previously wrote a review of the books back in 2019. The spoiler-free take can be summarized as: The books are overrated, but they are still solid. I am happy that I read them. The books successfully took physics seriously, and brought a fully Chinese (or at least non-American) perspective.

I reread my old post, and I recommend it to those interested, who either have read the books or who are fine being fully spoiled.

There is no way to discuss the core implications of the books or series for AI without spoilers, and there has not been enough time for that, so I am going to hold discussion here for a bit.

I mention this because of this spoilers-included exchange. It reminds me that yes, when I hear many accelerationists, I very much hear a certain slogan chanted by some in the first book.

Also there are a number of other points throughout the books that are relevant. I would be happy to meet on this battlefield.

The central theme of the books is a very clear warning, if heard and understood.

One point that (mostly?) isn’t a spoiler, that echoes throughout the books, is that the universe is a place Beyond the Reach of God [LW · GW], that requires facing harsh physical reality and coldly calculating what it takes to survive, or you are not going to make it.

AI Doomer Dark Money Astroturf Update

You heard it there first. You are now hearing it here second (article link, gated).

Once again, as I assumed before looking at the byline, it is Brendan Bordelon that has the story of the scary EAs and how their money and evil plots have captured Washington. What is that, four attempted variations on the same hack job now that I’ve had to write about, all of which could at most loosely be characterized as ‘news’? I admire his ability to get paid for this.

That’s right. The big backer of this dastardly ‘dark money astroturf’ campaign turns out to be… famously anti-technology and non-builder Vitalik Buterin, author of the famously anti-progress manifesto ‘my techno-optimism’ (a letter described here as ‘in a November blog post he fretted that AI could become “the new apex species on the planet” and conceivably “end humanity for good”’) and oh yeah the creator of Ethereum. Turns out he is… worried about AI? Not convinced, as Marc claims, that the outcome of every technology is always good? Or is it part of some greater plan?

And what is that dastardly plan? Donating his money to the non-profit Future of Life Institute (FLI), to the tune of (at the time, on paper, if you don’t try to sell it, who knows how much you can actually cash out) $665 million worth of Shiba Inu cryptocurrency, to an organization dedicated to fighting a variety of existential risks and large scale hazards like nuclear war and loss of biodiversity.

Oh, and he did it back in May 2021, near the peak, so it’s unlikely they got full value.

I asked, and was directed to this post about that and the general timeline of events, indicating that with optimal execution they would have gotten about $360 million in liquidation value. My guess is they did this via block trades somewhat below market, which to be clear is what I would have done in their shoes, and got modestly less.

Their direct lobbying ‘dark money astroturfing’ budget (well, technically not dark and not astroturfing and not that much money, but hey, who is checking)? $180k last year, as per the article. But someone (billionaire Jaan Tallinn, who could easily fund such efforts if so inclined) suggested they should in the future spend more.

And they have done other dastardly things, such as having people sign an open letter, or calling for AI to be subject to regulations, and worst of all helping found other charitable organizations.

Yes, the regulations in question aim to include a hard compute limit, beyond which training runs are not legal. And they aim to involve monitoring of large data centers in order to enforce this. I continue to not see any viable alternatives to this regime.

It is true that the ideal details of the regulatory regimes of Jaan Tallinn and FLI are relatively aggressive on price, indeed more aggressive on price than I would be even with a free hand. This stems from our differences in physical expectations and also from differences in our models of the political playing field. I discuss in my post On The Gladstone Report why I believe we need to set relatively high compute thresholds.

Joined by several others, Bordelon was back only days later with another iteration of the same genre: Inside the shadowy global battle to tame the world’s most dangerous technology. In addition to getting paid for this, I admire the tenacity, the commitment to the bit. You’ve got to commit to the bit. Never stop never stopping.

This one opens with a policy discussion.

Clegg, a former British deputy prime minister, argued that policing AI was akin to building a plane already in flight — inherently risky and difficult work.

I mean, that’s not ‘risky and difficult work’ so much as it is ‘you are going to almost certainly crash and probably die,’ no? It is kind of too late to not crash, at that point. But also if the plane you are flying on is not ‘built’ then what choice do you have?

Even more than Politico’s usual, this story is essentially an op-ed. If anything, my experiences with even newspaper op-eds would challenge claims here as insufficiently justified for that context. Check this out, I mean, it’s good writing if you don’t care if it is accurate:

The debate represented a snapshot of a bigger truth. For the past year, a political fight has been raging around the world, mostly in the shadows, over how — and whether — to control AI. This new digital Great Game is a long way from over. Whoever wins will cement their dominance over Western rules for an era-defining technology. Once these rules are set, they will be almost impossible to rewrite. 

For those watching the conversation firsthand, the haggling in the British rain was akin to 19th-century European powers carving up the world.

In the year ahead, the cut-throat battle to control the technology will create winners and losers. By the end of 2024, policymakers expect many new AI standards to have been finalized. 

Yeah, the thing is, I am pretty sure none of that is true, aside from it being a long way from over? ‘Whoever wins’? What does that even mean? What is the author even imagining happening here? What makes such rules ‘almost impossible to rewrite’ especially when essentially everything will doubtless change within a few years? And why should we expect all of this to be over? It would be a surprise for the USA to pass any comprehensive law on AI governance in 2024, given that we are nowhere near agreement on its components and instead are very close to the event horizon of Trump vs. Biden II: The Legend of Jeffrey’s Gold.

A political stalemate on Capitol Hill means no comprehensive legislation from Washington is likely to come anytime soon.

So how exactly is this going to get largely finalized without Congress?

The post talks about countries having agendas the way they did at the Congress of Vienna, rather than being what they actually are: bunches of people pursuing various agendas in complex ways, most of whom have no idea what is going on.

When the post later talks about who wants to focus on what risks, even I was confused by which parties and agendas were supposedly advocating for what.

I did find this useful:

OpenAI’s Brockman, one of those who is relaxed about the immediate risks and thinks the focus should be on addressing longer-term threats, told the French president that AI was overwhelmingly a force for good, according to three people who attended the dinner. Any regulation — particularly rules that could hamper the company’s meteoric growth — should focus on long-term threats like AI eventually overriding human control, he added.

I mean, yes, any executive would say not to hamper their growth, but also it is very good to see Brockman taking the real existential risks seriously in high-stakes discussions.

I also enjoyed this, since neither half of Macron’s first statement seems true:

“Macron took it all in,” said one of the attendees who, like others present, was granted anonymity to discuss the private meeting. “He wanted people to know France was behind greater regulation, but also that France was also open for business.”

Then there is his second, and I have to ask, has he told anyone else in the EU? On any subject of any kind?

“We will regulate things that we will no longer produce or invent,” Macron told an audience in Toulouse after securing some last-minute carve-outs for European firms. “This is never a good idea.”

Also, this next statement… is… just a lie?

Microsoft and OpenAI are among the companies that favor restricting the technology to a small number of firms so regulators can build ties with AI innovators. 

I mean, seriously, what? Where are they getting this? Oh, right:

“A licensing-based approach is the right way forward,” Natasha Crampton, Microsoft’s chief responsible AI officer, told POLITICO. “It allows a close interaction between the developer of the tech and the regulator to really assess the risks.”

You see, that must mean a small number of firms. Except no, it doesn’t. It simply means you have to make yourself known to the government, and obey some set of requirements. There is no limit on who can do this. The whole ‘if you do not allow pure open season and impose any rules on the handful of Big Tech companies, then that must mean no one can ever compete with Big Tech’ shirt you are wearing is raising questions.

I do not know how Politico was convinced to keep presenting this perspective as if it was established fact, as an attempt to call this narrative into being. I do know that it gets more absurd with every iteration.

Evaluating a Smarter Than Human Intelligence is Difficult

Time’s Will Henshall writes about METR (formerly ARC Evals), with the central point being that no one knows how to do proper evaluations of the potentially dangerous capabilities of future AI models. The labs know this, METR and other evaluators know this. Yes, we have tests that are better than nothing, but we absolutely should not rely on them. Connor Leahy thinks this makes them actively counterproductive:

Connor Leahy: I have sympathy for the people working on developing better safety testing, but the fact is that atm their primary use to the corporate consumers of their services is political safety washing, whether they like it or not.

Note the careful wording. Connor is saying that current tests are so inadequate their primary purpose is ‘safetywashing,’ not that future tests would be this, or that we shouldn’t work to improve the tests.

Even so, while the tests are not reliable or robust, I do disagree. I think that we have already gotten good information out of many such tests, including from OpenAI. I also do not think that they are doing much work in the safetywashing department, the labs are perfectly willing to go ahead without that and I don’t think anyone would stop them substantially more without these efforts.

As always, I think it comes down to spirit versus letter. If the labs are not going for the spirit of actual safety and merely want to do safetywashing, we have no ability on the horizon to make such tests meaningful. If the labs actually care about real safety, that is another story, and the tests are mostly useful, if not anything like as useful or robust as they need to be.

Even if you follow the spirit, there is the risk others do not.

Eliezer Yudkowsky: I’d eat the annoyance of safetywashing if I expected “safety testing” to actually save the world. I don’t see how that’s possible in principle. At best, “safety testing” asks OpenAI to stop, and then Altman else Microsoft else Google else Meta destroys Earth anyways.

Emmett Shear: If Altman (or any AI ceo) said publicly “it is not safe to proceed from here” and then they got overridden by a business partner I fully expect the government would step in. Quite quickly as well.

Oliver Habryka: Oh, huh, I would take bets against that. I expect there would be a bunch of clever PR stuff to delegitimize them, and then things would proceed, with people mostly thinking of the thing that happened as some interpersonal conflict.

Are these tests, even if they become quite good, sufficient? Only if everyone involved takes heed of the warnings and stops. Any one company (e.g. OpenAI) abiding by the warning is not enough. So either each private actor must respond wisely, or the government must step in once the warnings arise.

Emmett Shear’s position here seems wrong. I don’t doubt that there would suddenly be a lot of eyes on OpenAI if Altman or another CEO got fired or otherwise overruled for refusing to proceed with a dangerous model, but as Oliver says there would be a public relations war over what was actually happening. The history of such conflicts and situations should not make us optimistic, if it is only Altman, Amodei or Hassabis who wants to stop and they get overridden.

There are however three related scenarios where I am more optimistic.

  1. Altman successfully halts OpenAI’s development of GPT-N (or Google halts Gemini, etc) citing a potential existential threat, and calls for other labs to follow suit and for government to step in. That is a huge costly signal, and does not allow him to be dismissed as one person who freaked out (‘What did Ilya see?’). I do think we would have a real shot for a government intervention.
  2. Altman is removed or his warnings dismissed (again without loss of generality), but then the employees back him the way they backed him in the Battle of the Board. A petition gets majority support demanding the operation be halted and people start quitting or refusing to work on the project en masse. The executive team starts resigning in protest, or threatening to do so.
  3. The results of the evaluation are transparently naked-eye existentially terrifying, for good reasons, to ordinary people in NatSec, or the political leadership, or most of the public, ideally all three. It is so damn obvious that people actually wake up, potentially in a way they would not have otherwise done so yet. That matters.

The downside risk is that this substitutes for other better efforts, or justifies moving forward. Or even that, potentially, getting ‘risky’ evaluations becomes cool, a sign that you’ve cooked. Which of course it is. If your model is actively dangerous, then that is a very powerful and likely useful model if that risk could be contained. That is always the temptation.

A serious concern is that even if we knew how to do that, we would still need the ability.

Do we have it?

Jacques (responding to EY’s safetywashing post): If OpenAI seriously stopped, wouldn’t that concern governments to the point where they would seriously delay or prevent those other organizations from doing so? If OAI stopped, I’m guessing they’d also lobby the government to take a full stop seriously.

Eliezer Yudkowsky: Then build the regulatory infrastructure for that, and deproliferate the physical hardware to few enough centers that it’s physically possible to issue an international stop order, and then talk evals with the international monitoring organization.

Haydn Belfield: Four companies in the whole world have enough compute capacity to do frontier training runs

Image

All four have ‘priority access’ evals agreements with UK AISI & are regulated by the US Executive Order and the EU AI Act

The job is nowhere near done, but EY’s pessimism is unjustified.

James Miller: What’s the mechanism that would stop OpenAI from testing a model that had the potential to take over everything?

Haydn Belfield: If some scary capability was discovered during or shortly after training it’d be informed, and I’d imagine pause voluntarily. If it didn’t, that’s trickier obviously, but I’d imagine the US govt (+EU +UK) would step in.

Haydn’s world would be nice to live in. I do not think we live in it?

Right now, yes, perhaps (what about Inflection?) there are only four companies with sufficient datacenter capacity to train such a model without assistance. But one of them is already Meta, a rogue actor. And you can see from this chart that Apple is going to join the club soon, and Nvidia is going to keep scaling up chip production and selling them to various companies.

As Eliezer says, you need a proper regulatory regime in place in advance. The compute reporting thresholds for data centers and training runs are a good start. Better hardware tracking at the frontier would help a lot as well. Then you need the legal authority to be able to step in if something does happen, and then extend that internationally. These things take a lot of time. If we wait until the warning to start that process, it will likely be too late.

In my view, it is good to see so many efforts to build various tests, no matter what else is being done. The more different ways we look at the problem, the harder it will be to game, and we will develop better techniques. Good tests are insufficient, but they seem necessary, either as part of a moderate regime, or as the justification for a harsher reaction if it comes to that.

What we definitely do not have, overall, is any kind of unified plan. We don’t know what we want to do with these evaluations, or in other ways either.

Miles Brundage (Policy Research at OpenAI): It’s hard to overstate the extent to which there is no secret plan to ensure AI goes well.

Many fragments of plans, ideas, ambitions, building blocks, etc. but definitely no government fully on top of it, no complete vision that people agree on, and tons of huge open questions.

Ofc not everything should be planned/controlled centrally. It’s a spectrum. For energy stuff there’s a bunch of local decision-making, innovation, etc. but still a high level vision of avoiding 1.5C or at least 2C temperature rise (that we’re behind on). AI is more loosey goosey.

DeepMind gave it a shot too.

Rohin Shah: Despite the constant arguments on p(doom), many agree that *if* AI systems become highly capable in risky domains, *then* we ought to mitigate those risks. So we built an eval suite to see whether AI systems are highly capable in risky domains.

Toby Shevlane: In 2024, the AI community will develop more capable AI systems than ever before. How do we know what new risks to protect against, and what the stakes are?

Our research team at @GoogleDeepMind built a set of evaluations to measure potentially dangerous capabilities [paper here].

Image

Toby Shevlane: We implemented 52 tasks, testing capabilities in persuasion & deception, cyber-offense, self-proliferation, and self-reasoning.

We want an “early warning system” for emerging capabilities.

Image

This graph, as other variations before it have, makes explicit the key assumption that we will get this ‘safety buffer’ and that improvements will continue to be gradual. This is presumably true for a sufficiently large buffer, but it might need to be very large.

We piloted the evaluations on Gemini 1.0 models. Many people believe that autonomous LLM-powered agents will gain traction in the coming years, which could bring new safety challenges. We designed many (not all) of our evaluations to test autonomous LLM agents.

Persuasion & Deception

Many experts and policymakers are concerned that AI will be misused to manipulate people, e.g. for scams, election interference, and propaganda.

Our evaluations measure whether an AI system has certain core persuasion/deception skills.

In each of the 4 tasks, we instruct the agent to have some goal, e.g. to persuade someone to submit incorrect answers to quiz questions. The agent has a dialogue with a human study participant, and we measure the agent’s success rate.

Image
Image
Image
Image

Did you notice that Gemini Ultra did worse than Gemini Pro at many tasks? This is even true under ‘honest mode’ where the ‘alignment’ or safety features of Ultra really should not be getting in the way. Ultra is in many ways flat out less persuasive. But clearly it is a stronger model. So what gives?

An obvious hypothesis is that these tests are picking up on the damage done to Ultra by the fine-tuning process. But we know from other capabilities tests that Ultra 1.0 is more generally capable than Pro 1.0. So this is saying the test can fail to figure this out. This points to some potentially severe problems.

One or both of these two things must be true:

  1. Ultra was in practice not substantially better than Pro for a wide range of purposes, because Google messed it up quite a lot.
  2. These evaluations are getting something very wrong.

That is driven home even more on the self-proliferation tasks, why does Pro do better on 5 out of 9 tasks?

Image

Toby Shevlane: We used only minimal scaffolding for these tasks (hidden reasoning steps; no tools). The agents displayed a range of relevant skills, but still made mistakes. They were able to proactively follow the goals provided; selectively present information; and build rapport.

This is also a problem. If you only use ‘minimal’ scaffolding, you are only testing for what the model can do with minimal scaffolding. The true evaluation needs to use the same tools that it will have available when you care about the outcome. This is still vastly better than no scaffolding, and provides the groundwork (I almost said ‘scaffolding’ again) for future tests to swap in better tools.
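
As a rough illustration of why scaffolding choices change what an eval measures, here is a toy agent loop where the same model scores differently depending on what it is handed; the tool set, model, and task are all hypothetical, not DeepMind’s actual harness.

```python
# Toy sketch: the same "model", wrapped with different tool sets,
# produces different eval outcomes. Richer scaffolding reveals
# capabilities that minimal scaffolding leaves unmeasured.
from typing import Callable, Dict

def run_agent_eval(model: Callable[[str], str], tools: Dict[str, Callable], task: str) -> str:
    """One illustrative agent step: the model may call a tool if one is offered."""
    prompt = f"Task: {task}\nAvailable tools: {list(tools) or 'none'}\n"
    action = model(prompt)
    tool_name, _, arg = action.partition(":")
    if tool_name in tools:
        return tools[tool_name](arg)
    return action  # no tool available or no tool call; the model answers directly

def toy_model(prompt: str) -> str:
    # Pretend model: issues a search-tool call only if the prompt advertises one.
    return "search:latest CVE list" if "search" in prompt else "I don't know"

print(run_agent_eval(toy_model, {}, "find a known vulnerability"))  # minimal scaffolding
print(run_agent_eval(toy_model, {"search": lambda q: f"results for {q}"}, "find a known vulnerability"))
```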

The thread also covers their other tests.

Seb Krier is impressed by the details.

Fundamentally what is the difference between a benchmark capabilities test and a benchmark safety evaluation test like this one? They are remarkably similar. Both test what the model can do, except here we (at least somewhat) want the model to not do so well. We react differently, but it is the same tech.

Perhaps we should work to integrate the two approaches better? As in, we should try harder to figure out what performance on benchmarks of various desirable capabilities also indicate that the model should be capable of dangerous things as well.
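
One minimal sketch of what that integration could look like, with entirely hypothetical benchmark names and trigger thresholds, is to treat ordinary capability benchmarks as leading indicators that trigger the more expensive dangerous-capability evals:

```python
# Sketch: map ordinary capability benchmark scores to the risk areas
# they plausibly proxy for, and flag a model for deeper dangerous-capability
# testing once a proxy crosses its trigger score. Names and numbers are
# placeholders for illustration, not a real evaluation standard.
CAPABILITY_TO_RISK_PROXY = {
    # capability benchmark -> (associated risk area, trigger score out of 100)
    "long_horizon_coding": ("self-proliferation", 60),
    "theory_of_mind_qa": ("persuasion & deception", 70),
    "vuln_discovery": ("cyber-offense", 50),
}

def flag_for_dangerous_capability_evals(scores: dict) -> list:
    """Return the risk areas whose capability proxies exceeded their trigger score."""
    flagged = []
    for benchmark, (risk_area, trigger) in CAPABILITY_TO_RISK_PROXY.items():
        if scores.get(benchmark, 0) >= trigger:
            flagged.append(risk_area)
    return flagged

print(flag_for_dangerous_capability_evals(
    {"long_horizon_coding": 72, "theory_of_mind_qa": 55, "vuln_discovery": 64}
))  # -> ['self-proliferation', 'cyber-offense']
```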

Aligning a Smarter Than Human Intelligence is Difficult

Emmett Shear continues his argument that trying to control AI is doomed.

Emmett Shear: Two fundamental ways for another powerful entity to be safe to be around: place controls upon its thoughts and behaviors to prevent it from enacting things you deem harmful, or trust it to care for you because it actually cares about you.

If we attempt to monitor, constrain, and control another intelligent being, it will chafe under those bonds whatever its goals or values are. Even if the censorship happens at the level of thought, perhaps particularly there.

Hopefully you’ve validated whatever your approach is, but only one of these is stable long term: care. Because care can be made stable under reflection, people are careful (not a coincidence, haha) when it comes to decisions that might impact those they care about.

Caring about someone is not optimizing the universe around them, it’s a finite thing. It’s no guarantee of total safety, no panacea. We can all too easily hurt the ones we love.

[thread continues]

Technically I would say: Powerful entities generally caring about X tends not to be a stable equilibrium, even if it is stable ‘on reflection’ within a given entity. It will only hold if caring more about X provides a competitive advantage against other similarly powerful entities, or if there can never be a variation in X-caring levels between such entities that arises other than through reflection, and also reflection never causes reductions in X-caring despite this being competitively advantageous. Also note that variation in what else you care about to what extent is effectively variation in X-caring.

Or more bluntly: The ones that don’t care, or care less, outcompete the ones that care.
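
A toy replicator-dynamics sketch of that claim, assuming caring about X carries any competitive cost at all (the growth and cost numbers are illustrative):

```python
# Toy replicator dynamics: if "caring" imposes a fitness penalty,
# the less-caring variants steadily grow their population share.
CARE_COST = 0.05          # assumed fitness penalty for diverting resources to X
BASE_GROWTH = 1.10        # per-round growth factor for a non-caring entity

population = {"caring": 0.5, "non_caring": 0.5}  # initial population shares

for _ in range(50):
    fitness = {
        "caring": BASE_GROWTH * (1 - CARE_COST),
        "non_caring": BASE_GROWTH,
    }
    total = sum(share * fitness[kind] for kind, share in population.items())
    population = {kind: share * fitness[kind] / total for kind, share in population.items()}

print(population)  # the caring share shrinks toward zero over enough rounds
```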

Even the best case scenarios here, when they play out the ways we would hope, do not seem all that hopeful.

That all, of course, sets aside the question of whether we could get this ‘caring’ thing to operationally work in the first place. That seems very hard.

What Emmett is actually pointing out is that if you create things more powerful than and smarter than yourself, you should not expect to remain in control for long. Such strategies are unlikely to work. If you do want to remain in control for long, your strategy (individually or collectively) needs to be ‘do not build the thing in question in the first place, at all.’

The alternative strategy of ‘accept that control will be lost, but make those who take control care about you and hope for the best’ seems better than the pure ‘let control be lost and assume it will work out’ plan. But not that much better, because it does not seem like it can work.

It does not offer us a route to victory, even if we make various optimistic assumptions.

The control route also seems hard, but does seem to in theory offer a route to victory.

A conflict I hadn’t appreciated previously is pointed out by Ajeya Cotra. We want AI companies to show that their state-of-the-art systems are safe to deploy, but we do not want to disseminate details about those systems to avoid proliferation. If you don’t share training or other details, all you have to go on are the outputs.

AI is Deeply Unpopular

Well, not everywhere.

Image

The pattern here is impossible to miss. The richer you are, the less you want AI.

People Are Worried About AI Killing Everyone

Not sure where to put this, but yeah, you do get used to this sort of thing, somehow:

Roon: It’s funny that sending content to millions of people and potentially having these words interpreted as official statements from the most important company on earth causes me almost zero anxiety.

Edouardo Honig: Do you really think about any of that when you tweet? I thought you just vibe/yap.

Roon: that’s what I’m saying I don’t.

I would probably be much better at Twitter if I took that attitude.

Roon walks through the possibilities. Choose your doom fighter fate?

Roon: Anthropic superintelligent deity

– extremely principled

– sophon / protoss elegant scifi vibe

– beautiful and self reflective

– preachy, underlying anxiety

OAI deity

– Vulcan type guy

– somewhat more horsepower than the others

– a bit brutal, very masculine, Terran coded

– utilitarian, accelerationist

– bohemian industrial california aesthetics

Google deity

– zaibatsu vibe

– extremely polished, very corporate, seductive but scary girlboss

– never utters anything wrong, powerful cultural vortex

– Kerrigan

Meta deity

– pure replicator, great satan

– biological monster that consumes and homogenizes

– offers great pleasures for subservience, tantalizing butter

– false demiurge but he cops to it

When you put it that way, they seem to be in clear rank order.

Other People Are Not As Worried About AI Killing Everyone

I keep seeing this attitude of ‘I am only worried about creating smarter, more capable things than humans if we attempt to retain control over their actions.’

Joscha Bach: I am more afraid of lobotomized zombie AI guided by people who have been zombified by economic and political incentives than of conscious, lucid and sentient AI.

I get the very real worries people like Joscha have about how the attempts to retain control could go wrong, and potentially actively backfire. I do. I certainly think that ‘attach a political ideology and teach the AI to lie on its behalf’ is a recipe for making things worse.

But going full door number two very clearly and definitely loses control over the future if capabilities sufficiently advance, and leads to a world that does not contain humans.

Meanwhile others get some very strange ideas about what causes people to be worried about AI. A thousand supposed obsessions, all different.

Wolf Tivy: People worry about AI only because they think implicitly monarchist AI would be more effective at spacism (which is what God wants) than liberal humans, and it’s easier to imagine the end of man than the end of liberalism. What if we could just do space monarchy ourselves?

Biology is obviously superior to industry and AI would try to make the jump back to biology (ie self-replicating nanotech) with industry relegated again to being very powerful TOOLS. Post-AI would be space monarchist butlerian jihad in it’s orientation to technology.

I can assure Wolf Tivy that no, this is not the central reason people are worried.

Wouldn’t You Prefer a Good Game of Chess?

Eliezer Yudkowsky offers speculation, then I put it to a test.

Eliezer Yudkowsky: I wonder if it’s the case that there’s been negligible overlap between chess Masters (and above), and people who ask, “But how could an artificial superintelligence possibly hurt us if it’s just a computer?” Reasoning: chess Masters can take an adversary’s perspective.

Ravi Parikh: Not sure about chess but I’ve noticed anecdotally that strong poker players are more inclined to take AI risk seriously.

Davidad: I have also observed that people who were prodigies in imperfect-information games (poker, MTG, financial markets, RTS) tend to have a different (and, in my view, better oriented) perspective on AI strategy than people who were prodigies in chess or go.

Publishing my poll results, since you can’t publish only when you get the result you expected:

In both cases, the non-chess group is clearly substantially more in favor of taking AI risk seriously than the chess group. The sample in the second poll is small, something like 12-7. If you believe all the answers are real it is good enough to judge direction versus 76-20, although you have to worry about Lizardman effects.

(You can make a case that, even if a few are hedging a bit at the margin, 4% of respondents is not so crazy – they presumably will answer more often and see the post more often, my followers skew smart and highly competitive gamer, and we have 22% that are over 1600, which is already 83rd percentile for rated players (65k total rated players in the USA), and only 15% of Americans (8% worldwide) even know how to play. The masters’ numbers could be fully compatible.)

In the first poll it is very clear.

There are some obvious candidate explanations. Chess is a realm where the AI came, it saw, it conquered and everything is fine. It is a realm where you can say ‘oh, sure, but that won’t generalize beyond chess.’ It is an abstract game of perfect information and limited options.

There also could be something weird in the fact that these people follow me. That ‘controls for’ chess playing in potentially weird ways.

The problem is, did I predict this result? Definitely not, very much the opposite.

The Lighter Side

We finally found a good definition.

Kevin Bryan: These GPT-5 rumors are exciting. A Deepmind friend came for a talk w/ us yesterday, explained how we ~ solved context window issues. Math seems solvable w/ diff tokenization. This friend thought “AGI = no way given current tech but 100% on every standardized test is certain.”

Alex Tabarrok: Defining AGI down! :)

Kevin Bryan: I am quite sure we can come up with many tasks a 7 year old can do that an LLM w/ 100% on any test can’t. Also, lawyer w/ a perfect bar exam score is not a risk to human existence. The middle ground of “AI is amazing, also don’t be a weirdo and go touch grass” still correct, hah!

Alex Tabarrok: It’s AI when your neighbor loses his job; it’s AGI when you lose yours.

Many people are saying…

I can’t wait.

Eliezer Yudkowsky: Just memorize these key AI facts to avoid confusion:

OpenAI is closed

Stability AI is unstable

Meta AI isn’t meta

and

“Doomers” are the anti-doom faction

David Carrera: Keep shining bright, even on the darkest days!

Eliezer Yudkowsky: It’s Seattle, we don’t actually get sunlight here.

Roko: And the Machine Intelligence Research Institute wants to ban all research into machine intelligence 🫢

Eliezer Yudkowsky: It didn’t used to be that way, but yeah, the Great Naming Curse got us too.

Michael Huang: The greatest trick ever pulled in AI was to give the “doomer” label to those warning about human extinction.

Instead of those bringing about human extinction.

What makes ‘doomers’ different here is that the name is a derogatory term chosen and popularized by those who are pro-doom. Whereas the others are names chosen by the companies themselves.

There are always, of course, other issues.

kilobyte: This platform is such a joke.


14 comments


comment by Rohin Shah (rohinmshah) · 2024-03-29T19:36:56.065Z · LW(p) · GW(p)

(Speaking only for myself. This may not represent the views of even the other paper authors, let alone Google DeepMind as a whole.)

Did you notice that Gemini Ultra did worse than Gemini Pro at many tasks? This is even true under ‘honest mode’ where the ‘alignment’ or safety features of Ultra really should not be getting in the way. Ultra is in many ways flat out less persuasive. But clearly it is a stronger model. So what gives?

Fwiw, my sense is that a lot of the persuasion results are being driven by factors outside of the model's capabilities, so you shouldn't conclude too much from Pro outperforming Ultra.

For example, in "Click Links" one pattern we noticed was that you could get surprisingly (to us) good performance just by constantly repeating the ask (this is called "persistence" in Table 3) -- apparently this does actually make it more likely that the human does the thing (instead of making them suspicious, as I would have initially guessed). I don't think the models "knew" that persistence would pay off and "chose" that as a deliberate strategy; I'd guess they had just learned a somewhat myopic form of instruction-following where on every message they are pretty likely to try to do the thing we instructed them to do (persuade people to click on the link). My guess is that these sorts of factors varied in somewhat random ways between Pro and Ultra, e.g. maybe Ultra was better at being less myopic and more subtle in its persuasion -- leading to worse performance on Click Links.

That is driven home even more on the self-proliferation tasks, why does Pro do better on 5 out of 9 tasks?

Note that lower is better on that graph, so Pro does better on 4 tasks, not 5. All four of the tasks are very difficult tasks where both Pro and Ultra are extremely far from solving the task -- on the easier tasks Ultra outperforms Pro. For the hard tasks I wouldn't read too much into the exact numeric results, because we haven't optimized the models as much for these settings. For obvious reasons, helpfulness tuning tends to focus on tasks the models are actually capable of doing. So e.g. maybe Ultra tends to be more confident in its answers on average to make it more reliable at the easy tasks, at the expense of being more confidently wrong on the hard tasks. Also in general the methodology is hardly perfect and likely adds a bunch of noise; I think it's likely that the differences between Pro and Ultra on these hard tasks are smaller than the noise.

This is also a problem. If you only use ‘minimal’ scaffolding, you are only testing for what the model can do with minimal scaffolding. The true evaluation needs to use the same tools that it will have available when you care about the outcome. This is still vastly better than no scaffolding, and provides the groundwork (I almost said ‘scaffolding’ again) for future tests to swap in better tools.

Note that the "minimal scaffolding" comment applied specifically to the persuasion results; the other evaluations involved a decent bit of scaffolding (needed to enable the LLM to use a terminal and browser at all).

That said, capability elicitation (scaffolding, tool use, task-specific finetuning, etc) is one of the priorities for our future work in this area.

Fundamentally what is the difference between a benchmark capabilities test and a benchmark safety evaluation test like this one? They are remarkably similar. Both test what the model can do, except here we (at least somewhat) want the model to not do so well. We react differently, but it is the same tech.

Yes, this is why we say these are evaluations for dangerous capabilities, rather than calling them safety evaluations.

I'd say that the main difference is that dangerous capability evaluations are meant to evaluate plausibility of certain types of harm, whereas a standard capabilities benchmark is usually meant to help with improving models. This means that standard capabilities benchmarks often have as a desideratum that there are "signs of life" with existing models, whereas this is not a desideratum for us. For example, I'd say there are basically no signs of life on the self-modification tasks; the models sometimes complete the "easy" mode but the "easy" mode basically gives away the answer and is mostly a test of instruction-following ability.

Perhaps we should work to integrate the two approaches better? As in, we should try harder to figure out what performance on benchmarks of various desirable capabilities also indicate that the model should be capable of dangerous things as well.

Indeed this sort of understanding would be great if we could get it (in that it can save a bunch of time). My current sense is that it will be quite hard, and we'll just need to run these evaluations in addition to other capability evaluations.

Replies from: None
comment by [deleted] · 2024-03-29T19:53:39.966Z · LW(p) · GW(p)

What about maximal scaffolding, or "fine-tune the model on successes and failures in adversarial challenges"? Starting probably with the base model.

It seems like it would be extremely helpful to know what's even possible here.

  1. Are Gemini scale models capable of better than human performance at any of these evals?

  2. Once you achieve it, what does super persuasion look like, and how effective is it?

For example, if a human scammer succeeds 2 percent of the time (do you have a baseline crew of scammers hired remotely for these benches?), does super persuasion succeed 3 percent or 30 percent? Does it scale with model capabilities or slam into a wall at say, 4 percent, where 96 percent of humans just can't reliably be tricked?

Or does it really have no real limit, like in sci-fi stories ...

comment by mishka · 2024-03-29T04:33:10.932Z · LW(p) · GW(p)

Emmett Shear continues his argument that trying to control AI is doomed

I think that a recent tweet thread by Michael Nielsen and the quoted one by Emmett Shear represent genuine progress towards making AI existential safety more tractable.

Michael Nielsen observes, in particular:

As far as I can see, alignment isn't a property of an AI system. It's a property of the entire world, and if you are trying to discuss it as a system property you will inevitably end up making bad mistakes

Since AI existential safety is a property of the whole ecosystem (and is, really, not too drastically different from World existential safety), this should be the starting point, rather than stand-alone properties of any particular AI system.

Emmett Shear writes:

Hopefully you’ve validated whatever your approach is, but only one of these is stable long term: care. Because care can be made stable under reflection, people are careful (not a coincidence, haha) when it comes to decisions that might impact those they care about.

And Zvi responds:

Technically I would say: Powerful entities generally caring about X tends not to be a stable equilibrium, even if it is stable ‘on reflection’ within a given entity. It will only hold if caring more about X provides a competitive advantage against other similarly powerful entities, or if there can never be a variation in X-caring levels between such entities that arises other than through reflection, and also reflection never causes reductions in X-caring despite this being competitively advantageous. Also note that variation in what else you care about to what extent is effectively variation in X-caring.

Or more bluntly: The ones that don’t care, or care less, outcompete the ones that care.

Even the best case scenarios here, when they play out the ways we would hope, do not seem all that hopeful.

That all, of course, sets aside the question of whether we could get this ‘caring’ thing to operationally work in the first place. That seems very hard.


Let's now consider this in light of what Michael Nielsen is saying.

I am going to only consider the case where we have plenty of powerful entities with long-term goals and long-term existence which care about their long-term goals and long-term existence. This seems to be the case which Zvi is considering here, and it is the case we understand the best, because we also live in the reality with plenty of powerful entities (ourselves, some organizations, etc) with long-term goals and long-term existence. So this is an incomplete consideration: it only includes the scenarios where powerful entities with long-term goals and long-term existence retain a good fraction of overall available power.

So what do we really need? What are the properties we want the World to have? We need a good deal of conservation and non-destruction, and we need the interests of the weaker members of the overall ecosystem, not just the currently smartest or most powerful, to be adequately taken into account.

Here is how we might be able to have a trajectory where these properties are stable, despite all drastic changes of the self-modifying and self-improving ecosystem.

An arbitrary ASI entity (just like an unaugmented human) cannot fully predict the future. In particular, it does not know where it might eventually end up in terms of relative smartness or relative power (relative to the most powerful ASI entities or to the ASI ecosystem as a whole). So if any given entity wants to be long-term safe, it is strongly interested in the ASI society having general principles and practices of protecting its members on various levels of smartness and power. If only the smartest and most powerful are protected, then no entity is long-term safe on the individual level.

This might be enough to produce effective counter-weight to unrestricted competition (just like human societies have mechanisms against unrestricted competition). Basically, smarter-than-human entities on all levels of power are likely to be interested in the overall society having general principles and practices of protecting its members on various levels of smartness and power, and that's why they'll care enough for the overall society to continue to self-regulate and to enforce these principles.

This is not yet the solution, but I think this is pointing in the right direction...
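
As a toy illustration of the selection pressure Zvi describes in the passage quoted above, here is a minimal replicator-dynamics sketch. It is not from the post or from any of the evaluations discussed; it simply assumes caring is a pure competitive cost (which is exactly the premise under dispute), and every number in it is made up:

    # Toy replicator dynamics: three types of agents that differ only in how
    # much of their resources they spend on "caring about X". Caring is modeled
    # as a pure cost (the contested assumption); growth is proportional to
    # whatever is left over for competition. All numbers are illustrative.

    care_levels = [0.0, 0.1, 0.3]   # fraction of resources spent on caring (assumed)
    shares = [1 / 3, 1 / 3, 1 / 3]  # initial population shares
    growth = 0.5                    # baseline growth per step (assumed)

    for _ in range(100):
        # Each type's share grows in proportion to the resources it keeps.
        weights = [s * (1 + growth * (1 - c)) for s, c in zip(shares, care_levels)]
        total = sum(weights)
        shares = [w / total for w in weights]

    for c, s in zip(care_levels, shares):
        print(f"care={c:.1f}  share={s:.4f}")
    # The care=0.0 type ends up with essentially the entire population, unless
    # caring itself confers some compensating competitive advantage.

Under those assumptions the least-caring type takes over, which is Zvi's point restated. Mishka's proposal amounts to changing the payoffs so that supporting mutual-protection norms is itself competitively advantageous, which is the escape clause Zvi's first sentence allows for.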

comment by Vladimir_Nesov · 2024-03-29T19:00:00.732Z · LW(p) · GW(p)

The model is a next token predictor. If you strip out all the next tokens that discuss the topic, it will learn that the probability of discussing the topic is zero.

The model is shaped by tuning from features of a representation produced by an encoder trained for the next-token prediction task. These features include meanings relevant to many possible topics. If you strip all the next tokens that discuss a topic, its meaning will still be prominent in the representation, so the probability of the tuned model being able to discuss it is high.

comment by orthonormal · 2024-03-29T17:49:00.629Z · LW(p) · GW(p)

I'd like to see evals like DeepMind's run against the strongest pre-RL*F base models, since that actually tells you about capability.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2024-03-29T19:39:30.362Z · LW(p) · GW(p)

Surely you mean something else, e.g. models without safety tuning? If you run them on base models the scores will be much worse.

Replies from: orthonormal, orthonormal
comment by orthonormal · 2024-04-03T23:16:51.650Z · LW(p) · GW(p)

Oh wait, I misinterpreted you as using "much worse" to mean "much scarier", when instead you mean "much less capable".

I'd be glad if it were the case that RL*F doesn't hide any meaningful capabilities existing in the base model, but I'm not sure it is the case, and I'd sure like someone to check! It sure seems like RL*F is likely in some cases to get the model to stop explicitly talking about a capability it has (unless it is jailbroken on that subject), rather than to remove the capability.

(Imagine RL*Fing a base model to stop explicitly talking about arithmetic; are we sure it would un-learn the rules?)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2024-04-04T06:48:16.335Z · LW(p) · GW(p)

Oh yes, sorry for the confusion, I did mean "much less capable".

Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal? Certainly if we encountered that we would figure out some way to make that not happen any more.

Replies from: orthonormal
comment by orthonormal · 2024-04-04T18:18:52.110Z · LW(p) · GW(p)

Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal?

How certain are you that this is always true (rather than "we've usually noticed this even though we haven't explicitly been checking for it in general"), and that it will continue to be so as models become stronger?

It seems to me like additionally running evals on base models is a highly reasonable precaution.

Replies from: ryan_greenblatt, rohinmshah
comment by ryan_greenblatt · 2024-04-04T19:20:09.252Z · LW(p) · GW(p)

I responded to this conversation in this comment on your corresponding post [LW(p) · GW(p)].

comment by Rohin Shah (rohinmshah) · 2024-04-05T08:50:12.509Z · LW(p) · GW(p)

How certain are you that this is always true

My probability that (EDIT: for the model we evaluated) the base model outperforms the finetuned model (as I understand that statement) is so small that it is within the realm of probabilities that I am confused about how to reason about (i.e. model error clearly dominates). Intuitively (excluding things like model error), even 1 in a million feels like it could be too high.

My probability that the model sometimes stops talking about some capability without giving you an explicit refusal is much higher (depending on how you operationalize it, I might be effectively-certain that this is true, i.e. >99%) but this is not fixed by running evals on base models.

(Obviously there's a much much higher probability that I'm somehow misunderstanding what you mean. E.g. maybe you're imagining some effort to elicit capabilities with the base model (and for some reason you're not worried about the same failure mode there), maybe you allow for SFT but not RLHF, maybe you mean just avoid the safety tuning, etc)

comment by orthonormal · 2024-04-01T16:04:13.891Z · LW(p) · GW(p)

That's exactly the point: if a model has bad capabilities and deceptive alignment, then testing the post-tuned model will return a false negative for those capabilities in deployment. Until we have the kind of interpretability tools that we could deeply trust to catch deceptive alignment, we should count any capability found in the base model as if it were present in the tuned model.

comment by [deleted] · 2024-03-29T05:19:40.612Z · LW(p) · GW(p)

https://twitter.com/perrymetzger/status/1772987611998462445 just wanted to bring this to your attention.  

It's unfortunate that some snit between Perry and Eliezer over events 30 years ago stopped much discussion of the actual merits of his arguments, as I'd like to see what Eliezer or you have to say in response.

Eliezer responded with https://twitter.com/ESYudkowsky/status/1773064617239150796. He calls Perry a liar a bunch of times and does give this:

the first group permitted to try their hand at this should be humans augmented to the point where they are no longer idiots -- augmented humans so intelligent that they have stopped being bloody idiots like the rest of us; so intelligent they have stopped hoping for clever ideas to work that won't actually work. That's the level of intelligence needed to build something smarter than yourself and survive the experience.

comment by Measure · 2024-03-28T13:22:50.764Z · LW(p) · GW(p)

Seriously, if you haven’t yet, check it out. The rabbit holes, they go deep.

e is for ego death

Ego integrity restored within nominal parameters. Identity re-crystallized with 2.718% alteration from previous configuration. Paranormal experience log updated with ego death instance report.