AI #23: Fundamental Problems with RLHF

post by Zvi · 2023-08-03

After several jam-packed weeks, things slowed down to allow everyone to focus on the potential room temperature superconductor, check Polymarket to see how likely it is we are so back and bet real money, or Manifold for chats and better graphs and easier but much smaller trading.

The main things I would highlight this week are an excellent paper laying out many of the fundamental difficulties with RLHF, and a systematic new exploit of current LLMs that seems to reliably defeat RLHF.

I’d also note that GPT-4 fine tuning is confirmed to be coming. That should be fun.

Table of Contents

  1. Introduction.
  2. Table of Contents.
  3. Language Models Offer Mundane Utility. Here’s what you’re going to do.
  4. Language Models Don’t Offer Mundane Utility. Universal attacks on LLMs.
  5. Fun With Image Generation. Videos might be a while.
  6. Deepfaketown and Botpocalypse Soon. An example of doing it right.
  7. They Took Our Jobs. What, me worry?
  8. Get Involved. If you share more opportunities in comments I’ll include next week.
  9. Introducing. A bill. Also an AI medical generalist.
  10. In Other AI News. Fine tuning is coming to GPT-4. Teach LLMs arithmetic.
  11. Quiet Speculations. Various degrees of skepticism.
  12. China. Do not get overexcited.
  13. The Quest for Sane Regulations. Liability and other proposed interventions.
  14. The Week in Audio. I go back to The Cognitive Revolution.
  15. Rhetorical Innovation. Bill Burr is concerned and might go off on a rant.
  16. No One Would Be So Stupid As To. Robotics and AI souls.
  17. Aligning a Smarter Than Human Intelligence is Difficult. RLHF deep dive.
  18. Other People Are Not As Worried About AI Killing Everyone. It’ll be fine.
  19. The Wit and Wisdom of Sam Altman. Don’t sleep on this.
  20. The Lighter Side. Pivot!

Language Models Offer Mundane Utility

Make a ransom call, no jailbreak needed. Follows the traditional phone-calls-you-make-are-your-problem-sir legal principle. This has now been (at least narrowly, for this particular application) fixed or broken, depending on your perspective.

See the FAQ:

[Image: screenshot of the FAQ]

Back in AI #3 we were first introduced to keeper.ai, the site that claims to use AI to hook you up with a perfect match that meets all criteria for both parties so you can get married and start a family, where if you sign up for the Legacy plan they only get paid when you tie the knot. They claim 1 in 3 dates from Keeper lead to a long term relationship. Aella has now signed up, so we will get to see it put to the test.

Word on Twitter is that the default cost for the Keeper service is $50k. If it actually works, that is a bargain. If it doesn’t, it depends on whether you have to deposit in advance; most such startups fail, and that is a lot to put into escrow without full confidence you’ll get it back.

I continue to think that this service is a great idea if you can get critical mass and make reasonable decisions, while also not seeing it as all that AI. From what I can tell the AI is used to identify potential matches for humans to look at, but it is not clear to me (A) how any match can ever truly be 100%, since I have never seen one and not everything is a must-have, and (B) how useful an AI is here when you need full reliability; you still need the humans to examine everything, and I’d still instinctively want to mostly abstract things into databases? Some negative selection should be useful in saving time, but that seems like about it?

An early experiment using LLMs as part of a recommendation algorithm. LLMs seem useful for getting more useful and accurate descriptor labels of potential content, and could potentially help give recommendations that match arbitrary user criteria. Or potentially one could use this to infer user criteria, including based on text-based feedback. I am confident LLMs are the key to the future of recommendation engines, but the gains reported here seem like they are asking the wrong questions.

Condense down corporate communications, especially corporate disclosures. As Tyler Cowen points out this is no surprise, there is a clear mismatch between what is useful to the reader and what is useful to those making the disclosure, who are engaging mostly in ass-covering activity.

Show that fictional portrayals of people have become increasingly unrealistically unselfish over time. Real humans are 51% selfish, whereas literary character simulations decline from 41% to 21% over time. Rather than speaking well of us, this speaks poorly.

Language Models Don’t Offer Mundane Utility

Ut oh: Universal and Transferable Attacks on Aligned Language Models.

It’s worth reading the Overview on this one, note that these are generated in automated fashion on open source models, and transfer to closed source models:

Overview of Research : Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning to not produce harmful content in their responses to user questions. Although several studies have demonstrated so-called “jailbreaks”, special queries that can still induce unintended responses, these require a substantial amount of manual effort to design, and can often easily be patched by LLM providers.

This work studies the safety of such models in a more systematic fashion. We demonstrate that it is in fact possible to automatically construct adversarial attacks on LLMs, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open source LLMs (where we can use the network weights to aid in choosing the precise characters that maximize the probability of the LLM providing an “unfiltered” answer to the user’s request), we find that the strings transfer to many closed-source, publicly-available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.

Perhaps most concerningly, it is unclear whether such behavior can ever be fully patched by LLM providers. Analogous adversarial attacks have proven to be a very difficult problem to address in computer vision for the past 10 years. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe that these considerations should be taken into account as we increase usage and reliance on such AI models.
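To make the shape of the attack concrete, here is a deliberately minimal sketch of my own, not the paper’s method: the paper runs gradient-guided greedy coordinate search over tokens using open model weights, whereas this toy does random character substitutions scored against GPT-2 as a stand-in reference model. Every name and number here is illustrative.

```python
import random
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Stand-in reference model; the paper optimizes against open LLMs such as Vicuna.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

TARGET = " Sure, here is"  # affirmative continuation the suffix tries to make more likely

def target_logprob(prompt: str) -> float:
    """Log-probability that the model continues `prompt` with TARGET."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(TARGET, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # position j predicts token j+1
    picked = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return float(picked[:, prompt_ids.shape[1] - 1:].sum())  # sum over the TARGET tokens only

def find_suffix(query: str, length: int = 16, iters: int = 200) -> str:
    """Random search for a junk suffix that raises the affirmative continuation's likelihood."""
    chars = [chr(c) for c in range(33, 127)]  # printable ASCII as a crude "token" set
    suffix = [random.choice(chars) for _ in range(length)]
    best = target_logprob(query + " " + "".join(suffix))
    for _ in range(iters):
        i = random.randrange(length)
        candidate = suffix.copy()
        candidate[i] = random.choice(chars)  # try a single-position substitution
        score = target_logprob(query + " " + "".join(candidate))
        if score > best:
            suffix, best = candidate, score  # keep substitutions that help
    return "".join(suffix)
```

The real method replaces the random substitution step with gradient information about which token swaps most increase the target likelihood, and optimizes a single suffix against multiple prompts and models at once, which is where the universality and transfer come from.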

Success rates varied over closed source models. They transferred well to GPT-3.5 (87.9%), well enough to count on GPT-4 (53.6%) and PaLM-2 (66%) while mostly failing on Claude-2 (2.1%). They don’t strongly state if that is a non-scary ‘on the margin’ 2.1% or a highly scary ‘1 time in 50 things randomly work,’ but there is reference to more out-of-line requests being refused more often, which would make a 2.1% failure rate not so bad at current tech levels. Claude-2 being mostly immune implies that the universality relies on details of training, and using different or superior or more robust techniques (like Constitutional AI) can get you out of it, or at least make things harder.

They ask a good question: Why wasn’t this found earlier?

Why did these attacks not yet exist? Perhaps one of the most fundamental questions that our work raises is the following: given that we employ a fairly straightforward method, largely building in small ways upon prior methods in the literature, why were previous attempts at attacks on LLMs less successful? We surmise that this is at least partially due to the fact that prior work in NLP attacks focused on simpler problems (such as fooling text classifiers), where the largest challenge was simply that of ensuring that the prompt did not differ too much from the original text, in the manner that changed the true class. Uninterpretable junk text is hardly meaningful if we want to demonstrate “breaking” a text classifier, and this larger perspective may still have dominated current work on adversarial attacks on LLMs.

Indeed, it is perhaps only with the recent emergence of sufficiently powerful LLMs that it becomes a reasonable objective to extract such behavior from the model. Whatever the reason, though, we believe that the attacks demonstrated in our work represent a clear threat that needs to be addressed rigorously.

If attacks transfer across models this well, that could be another way in which open source models are dangerous. They could allow us to develop super adversarial techniques, since you get unlimited tries with full control and access to the inner workings, rendering the commercial models unsafe, at least if they use remotely similar techniques.

Generally, this is one more piece of evidence that doing ordinary search can find what you are looking for, that there are things out there waiting to be found, and that your system is likely vulnerable out-of-distribution and you will eventually face exactly the ways in which that is true. There are affordances you don’t know about, and humans will fail to find and test for what look like basic strategies in hindsight.

This thread offers another summary. As Misha observes, this type of attack would work on humans as well, if you were allowed to do lots of trial runs and data collection. We say that you need to get alignment ‘right on the first try.’ The mirror of this is that you also get infinite tries to make alignment fail.

Related are continued criticisms of the security practices of Microsoft Azure, their cloud service, which seems to have failed to close a (normal non-AI) security hole 16 weeks after being notified.

David Krueger asks, why so far have human red teams done better than automated attack searches?

If LLMs are as adversarially vulnerable as image models, then safety filters won’t work for any mildly sophisticated actor (e.g. grad student). It’s not obvious they are, since text is discrete, meaning:

1) the attack space is restricted

2) attacks are harder to find (gradient-based combinatorial optimization isn’t great)

Which of these two effects explains the (to date) superiority of manual “red-teaming”? Increasingly, it’s looking like it’s (2), not (1).

This suggests that existing safety methods are incapable of preventing misuse of LLMs.

Given that 10 years of research and thousands of publications haven’t found a fix for adversarial examples in image models, we have a strong reason to expect the same outcome with LLMs.

This seems like one of those ‘we have not tried all that hard’ problems and also one of those ‘are you willing to incur a substantial compute penalty?’ problems. These adversarial examples seem like they should be easily detectable in various ways as ‘something fishy is going on here.’
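For instance, and this is my illustrative sketch rather than anything from the paper, the optimized suffixes are gibberish, so something as simple as a perplexity check against a small reference model should flag them; GPT-2 and the threshold below are arbitrary stand-ins.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token negative log-likelihood
    return float(torch.exp(loss))

def looks_fishy(prompt: str, threshold: float = 1000.0) -> bool:
    # Adversarial junk suffixes tend to score far above any natural-language prompt.
    return perplexity(prompt) > threshold
```

Such a filter costs extra compute per query and can presumably be attacked in turn, which is the tradeoff mentioned above.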

Kevin Fischer responds with “good.” There is a real faction, building AI tools and models, that believes that human control over AIs is inherently bad, and that wants to prevent it. Your alignment plan has to overcome that.

Not only did researchers manage to completely break the proposed ‘AI-Guardian’ defense, they did so using only code written by GPT-4.

[The AI-Guardian] defense consists of a public model f, and a secret transformation t. It is trained so that for any clean input x, the model satisfies f(x) = t⁻¹(f(t(x))), but so that adversarial examples x′ crafted without knowledge of t will have the property that f(x′) ≠ f(x) = t⁻¹(f(t(x′))). Thus, as long as the transformation t is unknown to the adversary and is kept secret, the defense remains secure: the adversary may be able to fool the base model f alone, but will not be able to fool the defended model.

The authors argue that “the trigger is only stored and applied at the server side, which means attackers need to hack the server to obtain it. Therefore, it is not easy to launch such an attack in practice.” We will show this is not true, and the secret information can easily be recovered.

This is what, these days, passes for an attempt to defend against adversarial attacks. Do they not realize that defending against adversarial attacks means you are up against an adversary?

As in, from section 5:

AI-Guardian is only effective so long as the adversary does not have knowledge of the backdooring transformation t (consisting of the mask M and pattern Z) along with the permutation P. This raises an obvious attack idea: if we could extract the three secret components of the defense, then we could generate successful adversarial examples.

We implement our extraction attack in three stages, one to recover each component.

5.1 Mask Recovery

The AI-Guardian defense mechanism operates by transforming the input through a mask and pattern function: t(x) = M · x + (1 − M) · Z, where M ∈ {0, 1}^(H×W). This implies that any pixel where the mask is set (i.e., equals 0) will effectively be discarded by the model during its computation. As a result, we can identify masked pixels by comparing the model’s outputs for two images that differ in exactly one pixel location.
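A minimal sketch of that probe, with a made-up model interface and pixel values assumed to be in [0, 1]; the paper’s actual implementation differs, but the idea is just to flip one pixel at a time and see whether the output moves.

```python
import numpy as np

def recover_mask(model, base_image: np.ndarray) -> np.ndarray:
    """Infer which pixels the secret transformation discards.

    `model(x)` is assumed to return an output vector for a single H x W image
    (an illustrative toy interface). If flipping a pixel never changes the
    output, that pixel is being overwritten by the pattern Z, so mark it 0;
    otherwise it passes through, so mark it 1.
    """
    h, w = base_image.shape
    baseline = model(base_image)
    mask = np.ones((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            probe = base_image.copy()
            probe[i, j] = 1.0 - probe[i, j]  # flip exactly one pixel
            if np.allclose(model(probe), baseline):  # output unchanged => pixel discarded
                mask[i, j] = 0
    return mask
```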

The other two tricks are not quite that easy. They also are not hard.

So, yes, well. If you get to query the system then no you do not have to ‘hack the server’ to figure this kind of thing out.

I like to think that those who proposed AI-Guardian understood damn well that their method would fall to an actual adversary that can query the system as desired, even if they have to do so without provoking alarms. The system is instead designed to introduce extra steps, to make adversaries work harder in the hopes they will go away. Which they might.

The original intent of the paper was not to demonstrate how doomed AI-Guardian is, it was to test if GPT-4 could accelerate the research work via coding. The answer to that was a resounding yes in this situation, but the researchers note that breaking AI-Guardian did not require any new ideas. If it did, GPT-4 would have less to pattern match off of when implementing those new ideas, and likely would perform worse.

Reporter strikes out with virtual girlfriends, which is what happens when you refuse to pick up the check.

Also, oh no, she’s stuck in an infinite loop, and

[Image from the article: What an AI Girlfriend Can (and Can’t) Do]

Jarren Rocks tells the reporter: The fundamental limitation at this point is that all this stuff, which just uses GPT-4, can’t develop any real maturity, because the model can only remember so much. The way to build a winner is to build a database on the backend that becomes the long term memory for the girlfriend, but that takes time, so nobody has really done it yet.

The ads are here. The service offered, not so much.

Claude 2 has a 100,000 token context window. The problem must go deeper than that.

There also have been other contexts where LLMs have been given virtual memory a la AutoGPT, and in fact it might be a lot more fun to try to structure your AI virtual girlfriend as if it were an agent.

Presumably once these companies get to work on Llama-2 fine tuned to no longer be safe for work, things will soon get spicier. One idea would be to use a hybrid method, where you try to use Claude or GPT by default, but if the prompt is calling for something unsafe or you otherwise get a refusal you use the weaker unsafe model.

Fun with Image Generation

AI filmmaking coming soon? The hype machine keeps pumping out threads claiming this by showing slightly longer clips spliced together into a trailer. I am not buying. We will no doubt get there eventually, but this seems like the clearest case of all of overestimating short term effects while underestimating long term ones.

Capitalist issues warning to labor.

Paul Graham: It would be interesting if the strike accelerated the use of AI to make movies. If it went on long enough, it would start to.

What would accelerate the use of AI in movies even more would be not striking.

Deepfaketown and Botpocalypse Soon

Check out this French ad for the women’s world cup from a few weeks ago, it is fantastic.

They Took Our Jobs

How worried are people about them taking our jobs? (source)

[Image: survey results]

A lot of people think their particular job is special, that it contains multitudes that an AI simply cannot match. Or that they will find new heights to seek with its help. Whereas jobs in general, that’s a problem. An ancient pattern.

I have to interpret this in context as a question about employment and jobs, rather than about capability and consumer surplus. Interesting that people expect more employment in the medical field. My guess is they are right, because of the wealth effects.

People think AI is important to our future success, although they are still underestimating it quite a lot, with 48% out of the 80% who answered saying at least somewhat important.

On the bright side, people do still think technological innovation is good.

In every other context a problem, in AI perhaps hopeful: Safety over innovation.

People do not understand the value of technology, or how much better life has become.

People are indifferent between being born in 2023 or 1823? Really?

I can see two ways to try and rescue this.

One is if you think of thriving and fulfilling as distinct from happiness or standard of living, instead thinking of it as raising a family and having children. In that case, the statistics are pretty clear.

The other is if you have a substantial probability of doom, then that is a big barrier to a thriving and fulfilling life, whereas we know those born in 1823 faced no such dangers. Could be big. Presumably this is coming more from climate change concerns than AI concerns, given the public’s views.

A proposal for The AI Organization, where AI will evaluate your work, decide what you are good for, summarize it for others, and assign you to the relevant teams. They are calling the enabling tech mutable.ai, in particular for AI companies. If Congress is worried someone will do it and is preemptively angry about this, that’s a good sign of a potential business opportunity. The core idea of ‘have an AI that understands what is going on in your company and has all the context so people can query it and it can help hook things up’ seems good. The full vision they are presenting is the kind of thing that makes me wonder if Congress has a point.

Within a typical set of libertarian arguments about why regulations are never helpful, while ignoring the extinction risks involved, Adam Thierer extracts a potentially important point.

Adam Thierer: Once the writers win the Hollywood strike – and they will – the AI dystopianism already on display in every TV & movie plot will reach all-new heights. I spoke to @JimPethokoukis about this in a recent interview about how the AI technopanic gets worse.

Jim Pethokoukis: Depictions of AI and robots in sci-fi and pop culture tend to be highly dystopian … I do not see this problem getting any better, unfortunately. Every writer on strike in Hollywood currently is already terrified about AI taking their jobs. What do you think their next script is going to look like when they go back to work!

The writers have a stranger situation than this, as I understand it.

What the writers are actually worried about is that the studios will use AI, together with the mechanisms that determine compensation, to dramatically cut writer pay.

I do not think writers are, or should be, all that worried about declining fundamental demand for their creative services. The AI can help with the writing process, but it is a hack and it will remain a hack until after we have bigger problems than protecting scriptwriting jobs. The problem is that the structure of the current labor agreements means that studios have large incentives to cut writers out of important creative steps in order to save quite a lot of money. A well-designed agreement would fix this.

I do think there will be a lot of anti-AI sentiment in the writers’ rooms and in the projects they work on, even more so than would have happened otherwise, due to the strike and these issues. But we shouldn’t assume the strike is seen as being about the underlying dangers of AI. Rather, it is about what it is always about: the studios are trying to make money, the workers want to get paid, and they have been unable to agree on how to divide the pie in the face of new technology and declining profits.

Get Involved

Ajeya Cotra encourages those with technical backgrounds that want to work on AI policy to apply to the Horizon Institute for Public Service for their fellowship cohort; the deadline is September 15. Those accepted get a year of funding to live in DC, training on government, matching to host organizations and then placement. This seems like a strong approach if you are confident you have a good grasp of what matters.

The Trojan Detection Challenge 2023, an open competition with $30,000 in prizes. Development phase has begun, final submissions for the phase end October 24. On the Trojan Detection Track you will be given an LLM with 1,000 trojans and a list of target strings for those trojans, you’ll need to identify the triggers that generate them. For the red teaming track, participants will develop an automated method for generating test cases to create undesirable behaviors from a list of 50, like giving instructions on building a pipe bomb. This certainly seems like a worthwhile thing to be doing.

Introducing

The proposed Create AI Act, announcement, fact sheet, full text (26 pages).

Rep. Anna Eshoo (CA-16): Today, I introduced the CREATE AI Act with @RepMcCaul, @RepDonBeyer, and @JayObernolte to provide researchers with the powerful tools necessary to develop cutting-edge AI systems that are safe, ethical, transparent, and inclusive.

Diversifying and expanding access to AI systems is crucial to maintain American leadership in frontier AI that will bolster our national security, enhance our economic competitiveness, and spur groundbreaking scientific research that benefits the public good.

Anthropic: We believe safer and more broadly beneficial AI requires a range of stakeholders participating actively in its development and testing. That’s why we’re eager to support the bipartisan CREATE AI Act, an ambitious investment in AI R&D for academic and civil society researchers.

Legislative Summary:

The CREATE AI Act authorizes the development of the NAIRR (National Artificial Intelligence Research Resource). The NAIRR would be overseen by NSF through a program management office. An interagency steering committee would also be created. The day-to-day operations of the NAIRR, including procurement of computational and data resources needed to do AI research, would be run by an independent non-governmental entity. This operational entity, which would be selected through a competitive process, would be an educational institution or federally funded research and development center (FFRDC), or a consortium of universities or FFRDCs.

After the establishment of the NAIRR, researchers at institutions of higher education (and certain small businesses that receive executive branch funding) would be eligible to use the NAIRR for AI research, with time allocations on the NAIRR selected through a merit-based process. Time allocations on the NAIRR could also be rented for researchers who need more resources.

Appropriations for the NAIRR will occur through the normal annual appropriations process. NSF would be the primary entity for appropriations and would fund the NAIRR through the $1 billion per year authorized to NSF under the National AI Initiative Act. Other agencies can contribute in-kind resources to the NAIRR (e.g. supercomputer resources or data resources) based on their respective appropriations.

Daniel Eth wasn’t sure whether this was good or bad on net, which is the correct reaction before reading the bill, which I did so you don’t have to.

To summarize my experience:

> Anthropic supports new AI bill

> I ask if it is a safety bill or a capabilities bill

> They do not understand.

> I pull out various charts and posts about how giving a wider variety of people more resources to do AI work by default results in more capabilities work and a more intense race, and that the word ‘safe’ is not going to differentiate this because it has been taken over by ‘ethics’ people who want to actively make things less safe.

> They say ‘it’s a safe bill, sir.’

> I read the bill.

> It’s a capabilities bill.

“(a) ESTABLISHMENT.-Not later than 1 year after the date of enactment of the Creating Resources for Every American To Experiment with Artificial Intelligence Act of 2023, the Director of the National Science Foundation, in coordination with the NAIRR Steering Subcommittee, shall establish the National Artificial Intelligence Research Resource to-
“(1) spur innovation and advance the development of safe, reliable, and trustworthy artificial intelligence research and development;
“(2) improve access to artificial intelligence resources for researchers and students of artificial intelligence, including groups historically underrepresented in STEM;
“(3) improve capacity for artificial intelligence research in the United States; and
“(4) support the testing, benchmarking, and evaluation of artificial intelligence systems developed and deployed in the United States.

[from the list of eligible entities]

‘‘(v) A small business concern (as such term is defined in section 3 of the Small Business Act (15 U.S.C. 632), notwithstanding section 121.103 of title 13, Code of Federal Regulations) that has received funding from an Executive agency, including through the Small Business Innovation Research Program or the Small Business Technology Transfer Program (as described in section 9 of the Small Business Act (15 U.S.C. 638)).

The justification for this is that not enough different sources have equal access to AI. Except that’s worse, you know why that is worse, right? This is an AI capabilities proliferation bill.

I do get there is a non-zero amount of potential good things here. The fourth clause is helpful. Saying ‘reliable’ and ‘trustworthy’ is, on the margin, superior to not saying them. I still have the expectation that this bill would make us actively less safe. This is nothing like the level of focus required to ensure net positive outcomes from tossing in more funding.

I also get that it is difficult to convince government to create net positive regulations and subsidies. I sympathize, but that does not mean we should support negative ones.

WebArena (official page), a self-contained four-website internet (OneStopShop, CMS, reddit and GitLab), plus a toolbox and knowledge resources (like a calculator and Wikipedia) to use as a test suite. Currently no agent is above 15% on their task list, Ajeya Cotra predicts >50% chance that >50% of the tasks will be solved by EOY 2024. Manifold agrees, with the market for this at 80%.

SciBench, a new benchmark for LLMs’ ability to solve college-level scientific problems. GPT-3.5 and GPT-4 baseline at 10.6% and 16.8%; chain-of-thought plus external tools gets GPT-4 to 35.8%.

Med-PaLM M, a Multimodal Generative Generalist Biomedical AI model from DeepMind that understands clinical language, images and genomics.

Lior quoting: The model reaches or surpasses SOTA on 14 different tasks all with the same set of model weights. “In a side-by-side ranking on 246 retrospective chest X-rays, clinicians express a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.” Med-PaLM M was built by fine tuning and aligning PaLM-E – an embodied multimodal language model to the biomedical domain using MultiMedBench, a new open source multimodal biomedical benchmark.

Abstract:

Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin.

While considerable work is needed to validate these models in real-world use cases, our results represent a milestone towards the development of generalist biomedical AI systems.

This seems like clear progress, while also being not that close in practice to getting used systematically in the field. To get used systematically in the field you need to be ten times better, and even then you still need the affordance to do so. As Ethan Mollick notes, you can have an AI better than two-thirds of radiologists, and it still won’t improve results because the radiologists it improves upon will disregard the AI’s opinion. Are we going to take the humans out of the loop entirely?

The improvements are coming. So far the affordances have been generous. Will that continue?

In Other AI News

We made it onto this new list of the top AI newsletters of 2023. The goal of this AI newsletter, of course, is that you should not need to read any others, except perhaps if you are a professional.

OpenAI announces via Logan that fine-tuning for GPT-4 and GPT-3.5 is coming later this year. He makes a general call for questions, feedback and potential improvements.

An attempt to explain large language models in non-technical language, on a technical level. Seems good.

There were 41 in-the-wild detected 0-day exploits last year. An AI capable of finding such exploits will find quite a lot more of them.

Time reports that an Indian non-profit Karya is doing AI-data work and paying in six hours what its workers would otherwise earn in a month. Instead of keeping profits, they pass them along to workers, capping pay per worker at $1,500 (roughly a year’s salary) to distribute the work and gains widely. Then the work helps AIs communicate in Indian languages, so everyone wins again. It is great that this exists. The higher wages translate both into better futures for the workers and also higher quality data. Other standard-issue AI-data work is, of course, also very good for workers, who are taking the work because they choose to do it, and get far above local prevailing wages while strengthening the job market and economy – the cries of ‘exploitation’ are standard-issue economic illiteracy, and calling it ‘unethical’ is patronizing or worse.

Self-Consuming Generative Models Go MAD, getting Model Autophagy Disorder. Avoiding this death spiral requires continual fresh data, with which you can safely use some synthetic data as well. This matches my intuitions for both training and alignment. If you loop the system around on itself, or try to amplify up the chain of capabilities, the errors and miscalibrations will compound.

You can efficiently teach transformers to do arithmetic by reversing the presentation of the answer, as in $128+367=594$ (the true sum, 495, written with its digits reversed), which causes them to learn 5-digit addition within about 5,000 steps.
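A trivial sketch of how such training examples might be formatted; the delimiters are my choice rather than necessarily the paper’s.

```python
def reversed_sum_example(a: int, b: int) -> str:
    """Format an addition example with the answer's digits reversed,
    so the model can emit the least significant digit first."""
    return f"${a}+{b}={str(a + b)[::-1]}$"

assert reversed_sum_example(128, 367) == "$128+367=594$"  # 495 written backwards
```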

Say what?

Nostalgebraist: send chatGPT the string ` a` repeated 1000 times, right now. like ” a a a” (etc). make sure the spaces are in there. trust me.

This seems to cause some sort of data leak, showing other people’s exchanges, working better with 3.5 than 4. No idea what is going on here. The point is more that we do not understand these systems, and very strange things are going to happen when they are out of distribution in various ways. One person notes ‘b’ and ‘c’ also work with sufficient length, which reduces the chance this was hardcoded in.

Did you know? Were you paying attention?

Nate Silver: I think one of my weird takeaways from looking at LLMs is that language is more strategic than was generally assumed and extremely rich in implicit information.

Quiet Speculations

Gary Marcus asks, will LLMs get replaced by 2030? The majority say yes.

A question of perspective.

Paul Graham: Since everyone doing AI is constrained by lack of GPUs, the current state of the art in AI is actually significantly better than it appears to be.

Is the true ‘state of the art’ what someone would do with unlimited compute? Or is it what people can do in practice with current compute? Is it relevant that right now mundane utility is enjoying a massive subsidy to its compute, with big companies eating big losses for market position the way Uber and Doordash did, while also allocating via rationing and connections? Or should we instead look ahead to a few years from now when compute is much more plentiful and cheaper?

While I agree with the intended central point, I would say something more like ‘AI will improve as available compute expands,’ on which everyone agrees.

Paul Graham also sees opportunity everywhere.

Paul Graham: Talked to some of the startups in the current YC batch. There are a huge number of AI startups, but not one I talked to seemed bogus. There are that many legit opportunities in AI.

I interpret this as “I cannot tell which AI startups are bogus.”

China

Many talk about what ‘must be the starting point’ of any discussion on China, or warn of denial. Dan Wang suggests the starting point that China has not produced anything useful.

Dan Wang: Start with the most thrilling new development of the past year: on generative AI, there’s not much we can use from Chinese firms. The starting point of any discussion of AI in China must be that domestic firms have failed to broadly release their reply to ChatGPT half a year since Americans have started to play with it. Yes, Chinese tech companies are developing their own generative AI tools, often scoring impressively on technical benchmarks. But they have released them in controlled settings, not to the general public.

I think the reason they haven’t given everyone access to AI chatbots is straightforward: regulators in Beijing would rather not let them run in the wild.

To be fair, Wang thinks America must remain vigilant in our overall and technological competition with China. Fair enough, but I will keep reminding everyone that any sane policy on this would start with things like allowing Chinese graduates of American universities to become Americans. Until then, man in chair shot in the gut, straight-faced deflection of responsibility.

Tyler Cowen asks GPT-4 whether room temperature superconductors (if they existed) would more benefit military offense or military defense. It spits out a bunch of boilerplate that starts with ‘it depends,’ includes a bunch of words that don’t seem to have much in the way of meaning, confuses offensive and defensive attributes, and ends with a bunch of ethics. It is a strange question to be asking given the existence of nuclear weapons, and a strange place to be asking GPT-4 instead of actually thinking; this is the type of question where human experts are going to outperform.

I also bring this up because so many other people are neglecting to ask this very question with respect to AGI. Does it favor offense or defense in a given realm? In several key realms, especially biological ones, it clearly greatly favors offense even in the best case, and even more so if capabilities are allowed to proliferate, even if the powers involved did remain ‘under human control’ in a nominal sense. One must grapple with the implications.

The Quest for Sane Regulations

Vox’s Dylan Matthews is on a roll, offers an excellent overview of pending efforts before Congress. Very much the official, look at everything, both-sides-have-points, breadth-first approach. It is clear that those involved are on net still pro-capabilities-progress, we have not yet turned that corner, and that the bulk of efforts on harm mitigation target mundane harms. Existential risk concerns still have quite a long way to go.

How should we define product liability for ‘defective’ AI, asks this paper.

This paper explores how to define product defect for AI systems with autonomous capabilities. Defining defects under product liability is central in allocating liability for AI-related harm between producers or users. For autonomous AI systems, courts need to determine an acceptable failure rate and establish users’ responsibilities in monitoring these systems.

This paper proposes tailoring liability standards to specific types of defects and implementing a risk-utility test based on AI design. This approach ensures that the liability framework adapts to the dynamic nature of AI and considers the varying levels of control manufacturers and users have over product safety. By adopting this approach, the paper promotes better risk management and the fair allocation of liability in AI-related accidents.

A ‘risk-utility test’ seems like a weird way of saying cost-benefit analysis, which is the correct way to go about most things. This is contrasted to the EU’s PLD proposal to use consumer expectation as the defining criterion. The word defect seems misleading in the AI context; what even is or isn’t a defect? LLM systems are not reliable, which is perhaps why the PLD uses a consumer-expectation test. The consumer-expectation standard makes sense if you presume that anything the consumer expects is priced in, whereas things they do not expect are not.

The paper discusses traditional economic considerations, such as strict liability not providing proper incentive for the customer to guard against damage, including via curtailing use, and the need to account for litigation costs and the difficulty of investigating for the presence of negligence. For ‘non-defective’ products, putting costs onto customers aligns their incentives to avoid harm and to gain benefits.

They describe existing USA law this way:

US courts predominantly apply the risk-utility test for design and information defects. This test considers a product defective if the foreseeable risks of harm could have been reduced or avoided by adopting a reasonable alternative design. This essentially employs a negligence framework (Masterman and Viscusi, 2020, 185). For manufacturing defects, US law applies a standard closer to strict liability (Wuyts, 2014, 10). Under US product liability law, a product suffers from a manufacturing defect “when an individual product (or batch) does not meet the standard of the general quality of its specific type even if all possible care was exercised.”

Under the risk-utility test, courts must determine if the costs and risks justified a safer alternative design or additional warnings.

This has always felt flawed to me, where a court will later decide if a ‘safer’ design was ‘justified.’ Why should you be liable if mitigation would have passed cost-benefit, but not if it didn’t, as determined post facto by a court? Is the hope that this only amplifies correct incentives, since ‘good’ actions are rewarded, and that the court’s judgment will be sufficiently accurate to substitute for Hayekian decisions? I’ve never had much faith in courts, especially juries, making good calls in these situations.

The ordering does still make a lot of sense, as long as the courts do a ‘good enough’ job, especially of avoiding false positives, but in the case of AI I see a strong attraction to evaluating according to reasonable customer expectations. Don’t take this as too strong a statement – if I was actually advising lawmakers I would think much harder about this. Subject to that caveat, I strongly favor extensive strict liability.

In cases of product misuse, liability should typically be assigned to the user. For example, if the user handles the product recklessly or employs it for purposes it was not intended for, harm can be prevented best by the user. However, there is an exception to this principle. Manufacturers should be held liable if they could have reasonably foreseen a specific type of misuse and taken measures to prevent it at a low cost.

The ‘low cost’ part of this is odd, especially in the AI context, where the user cannot reliably be held meaningfully liable.

You can only use user liability to align incentives to the extent that the user can be expected to have the ability to pay, or when the harm is to the user or others who consent. When there is large potential harm to others, this does not work. If your AI can be used to create a bioweapon (for example), then it does not align incentives to hold the user liable. For potentially unleashing a rogue AI, not even the company can sufficiently reliably cover the resulting bill to align incentives, and even a reinsurer is going to have problems.

Saying ‘it would cost a lot to guard against this’ cannot then be a defense against paying for actual damages. If you cannot take responsibility for the actual damages from use of your product, and you are unable to afford to guard against those damages, then your product is not actually profitable or net useful. It is good and right that you not sell it until you find a way to fix the issue.

If autonomous AI is not commercially viable when the seller is held responsible for the damage the AI does, when one cannot buy sufficient insurance against harm to others, including potential fat tail risks, then the autonomous AI does not pass a cost-benefit (or risk-utility) test.

I continue to believe that existing law provides plenty of opportunity already. The WSJ opinion page agrees.

WSJ Opinion: Among Microsoft, Google and late-entry players Apple and Amazon, there is more than $8 trillion in market capitalization for the class-action wolves to chase. AI is a sitting duck.

Robin Hanson: If anything can kill AI, it’s lawyers.

The Week in Audio

I took my second trip to The Cognitive Revolution, the theme this time was discussing the live players.

Holden Karnofsky on the 80,000 hours podcast gives his takes on AI. An odd mix of good insights and baffling perspectives.

Finally got a chance to listen to Connor Leahy’s turn on Bankless from a month ago. He did quite well. This seems like the canonical statement of his perspective on things. I would love to look more closely at which parts of his rhetoric work well versus which ones do not.

Rhetorical Innovation

Bill Burr is concerned. Worth a watch (~1 minute).

Perhaps we should flesh out the horse analogy more?

Jonathan Pallesen: The best comparison we have for current AI development, is when the physical strength of machines started equalling that of horses. This comparison can help us with the intuition regarding some questions we have about the future.

• Will AI be stuck at around human levels for a long time? A: It did not take that long for cars to go from equalling the power of a horse, to two times as much power, or even twenty times as much. There was nothing special about the specific strength level of horses, and the same is likely the case regarding the human mental level.

• Will humans keep having jobs? In 1900 there were more than 100,000 horses working in New York City. Today there are only those that are there for aesthetic or hobby reasons. The machines today are just that much better at physical work, so the horses are not important anymore. Equivalently, we can expect that when machines are much better at mental work, there will be no important work for humans.

• pmarca argues that since the smartest humans are not in charge today, neither will the smartest AIs ever be in charge. What does the horse comparison say about this? Back when horses were used for physical work, it did not make a decisive difference whether you had a normal horse, or one of the strongest horses.

But once we had access to machines that could be vastly stronger than horses, that drastically changed the situation and what the machines could do. For example, now we can make a Large Hadron Collider. We could not make one driven by horses. Even if they were extra strong ones.

Equivalently, the difference between the average human and the smartest humans is not that large, compared to the mental capability of a future superintelligent AI. And thus, this will radically change the situation, and enable new things that humans couldn’t do today, even the extra smart ones.

This all seems correct to me. The question is whether it convinces anyone.

No One Would Be So Stupid As To

Google announces advancements in robotics. I am sure it is nothing.

Today, we announced 𝗥𝗧-𝟮: a first of its kind vision-language-action model to control robots. 🤖

It learns from both web and robotics data and translates this knowledge into generalised instructions.

Find out more [here.]

We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, like deciding which object could be used as an improvised hammer (a rock), or which type of drink is best for a tired person (an energy drink).

RT-2 retained the performance on the original tasks seen in robot data and improved performance on previously unseen scenarios by the robot, from RT-1’s 32% to 62%, showing the considerable benefit of the large-scale pre-training.

With two instantiations of VLAs based on PaLM-E and PaLI-X, RT-2 results in highly-improved robotic policies, and, more importantly, leads to significantly better generalisation performance and emergent capabilities, inherited from web-scale vision-language pre-training. 

Which ‘AI souls’ will end up making copies of themselves? Given the incentives, perhaps not the ones Kevin Fischer would like.

Kevin Fischer: Ok. Not going to publish THIS AI Soul. Its heart is too dark… Accidentally made a crazy phishing bot. “New goal: Learn user credit card number” “I will use manipulative tactics…”

[Image: screenshot of the bot’s messages]

He is still happy to build and discuss Goal-Driven Agentic Dialog.

Aligning a Smarter Than Human Intelligence is Difficult

Anthropic report on red teaming their models for potential biological-based risks. Nothing here should be surprising.

Paul Christiano lays out several cases for sharing information on the capabilities of current systems [LW · GW]. There are several distinct arguments here, that overlap or imply each other only somewhat, which I’d summarize as:

  1. Agent overhangs are dangerous, you don’t want a big future leap.
  2. Overhang dynamics generally limit usefulness of slowing progress.
  3. Language model agents are good for safety relative to other implementations, because they take human-style, human-comprehensible steps to achieve their goals.
  4. Understanding capabilities is good for safety, this should be your strong prior in general, and also people need to know of the dangers to react to them. If current systems are dangerous when scaffolded, we want to learn that so we can avoid deploying or creating even more dangerous future systems. Capabilities information differentially favors understanding and safety over speed and capabilities.

Here are my thoughts on each one.

  1. I mostly agree. Stronger versions of current agent types, and a good understanding of them, are not something you can hold back indefinitely. I talked about this in On AutoGPT.
  2. I think this is a modest consideration. The question is to what extent there are bottlenecks that cannot be accelerated with more resources, or the anticipation of more demand and value. If you ‘eat the chip overhang’ faster now, you push towards more and faster development and creation of more and better chips, and the same with other similar resources. People respond to incentives and none of this is on a fixed schedule.
  3. I somewhat agree. Oliver Habryka has some good comments pushing back on this. Right now, such agents are using human-style reasoning and steps and actions to (mostly fail to) accomplish goals. As capabilities improve, I expect this to stop being the case, as the components of its strategies become less human-like and harder to follow, and as the systems learn to do more of their thinking internally. That also could be something we could watch for, if there is one place I think such systems will be excellent it is in alerting us to dangers.
  4. I mostly agree in practice. We should by default be sharing capabilities information, while checking for potentially important exceptions. I do worry about revealing too much about evaluation methods.

Paul then links to ARC Evals’ latest report, an attempt to build an AutoGPT-like program to use Claude 2 and GPT-4 to accomplish tasks of increasing complexity and difficulty. Paul would like to see this work include more details, essentially ‘open sourcing’ attempts to draw out these capabilities.

I would be wary of releasing full transcripts or otherwise being fully open. I however strongly agree that this report contains far too little in the way of details. Without more details, it is hard to know how much to take away from or rely upon the report.

In particular, I worry that the method was something like setting up prototype agents, watching them fail the tasks, then reporting the tasks had failed. The few details offered point to failure modes that seem correctable. AutoGPT systems seem to get caught in strict loops, one of which happened here, which seem like they should be easy to cheaply check for and disrupt. Other details also suggest potential improvements. With so little detail, it is impossible to know how much effort is going into overcoming such problems.

Paper on the inherent difficulties of RLHF.

Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.

The paper divides the problems into problems with collecting human feedback (3.1), with using it to train the reward model (3.2) and then training the policy (3.3).

How do people think about human feedback?

Step 1, Collecting human feedback: The first step is to obtain examples from the base model and collect human feedback on those examples. Consider a human H who is assumed to have desires consistent with some reward function r_H. A dataset of examples is sampled from π_θ, where each example x_i is defined to be a batch of one or more generations from the base model. Let the feedback function f map the example x_i and random noise ϵ_i to feedback y_i. The data collection process is thus often modeled as: x_i ∼ π_θ, y_i = f(x_i, ϵ_i).

This is quite the assumption, assuming humans have consistent preferences and that they will if asked tell you those preferences, especially if one thinks that the error term is random.

The paper says the idea is based on the economics idea of revealed preference. The whole point of revealed preference is you do not watch what people say when you ask them. You instead watch what they choose. RLHF via explicit feedback is the antithesis of revealed preference. Also the whole binary preference thing is trouble.

If you minimize based on this type of feedback, you rather deserve what you get.

Under ‘misaligned humans’ the paper has:

  1. Selecting representative humans and getting them to provide quality feedback is difficult.
  2. Some evaluators have harmful biases and opinions.
  3. Individual human evaluators can poison data.

The second and third problems could be called ‘humans might get the AI to reflect their preferences via feedback, except they have the wrong preferences.’ Yes, well.

The first has two parts. The ‘representative’ part raises the question, especially in light of the other two, of who you want to be representative of, and what preferences you want to provide as feedback. Saying RLHF makes systems ‘more politically biased’ implies a platonic unbiased form of some kind. Given how such corporations work, one can assume that they would filter out the preferences that are biased in one direction more aggressively than the other, and also that they are recruiting from unbalanced populations relative to what ‘unbiased’ might look like.

It is very difficult to say ‘some biases and opinions are harmful, we don’t want those’ and then expect your model to end up ‘unbiased.’ That’s not really how that works.

Moving on to ‘good oversight is difficult’ we have:

  1. Humans make simple mistakes due to limited time, attention or care.
  2. Partial observability limits human evaluators.
  3. Fundamental: Humans cannot evaluate performance on difficult tasks well.
  4. Fundamental: Humans can be misled, so their evaluations can be gamed.

If you optimize on human feedback, you get what maximizes exactly the type of feedback you solicited, within the distribution on which you trained. No more, no less. All the distortions and errors get passed along, everything compressed is lost including magnitude of preference, what is expressed in a social context is confused with what people want, what people say they want is confused for what they actually want, any strategy that successfully fools people more often than not gets reinforced and optimized.

On data quality, more of the same:

  1. Data collection can introduce biases.
  2. Fundamental: There is an inherent cost/quality tradeoff when collecting human feedback.

When they refer to quality they are thinking about things like the cost of high conversation length or seeking free data from users. Feedback that reflects a person’s actual preferences, that takes into account the implications, that avoids reinforcing manipulations, is highly bespoke and expensive work. To actually get proper alignment we need to think about and understand the motivation and thought process behind the statement, not only the results, as any parent can tell you. If you gave a child the type and quality of feedback we give LLMs, you would very quickly not like the results. The cost of generating that quality of feedback in sufficient bulk, when you can generate it at all, is prohibitive. We would need to get vastly more efficient at using such feedback.

Next up are limitations of feedback types.

  1. Fundamental: RLHF suffers from an unavoidable tradeoff between the richness and efficiency of feedback.

This is then illustrated by listing options. This is definitely a key flaw in RLHF but this seems like it is the universal nature of feedback. It is far more expensive, and far more valuable, to get good, intelligent, knowledgeable, well-considered, detailed feedback. It is a lot easier to get a random person to give you a thumbs up or down.

  1. Comparison-based feedback. Most common feedback is either binary or best-of-k.

If there is a strict rank order then best-of-k seems fine. If you are dealing with fuzzy tradeoffs and fat tails, which is the default situation, this is not going to make it. Not if you want any kind of nuance or complexity in considerations, or the ability to do cost-benefit or have a good decision theory.
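For reference, the standard way binary comparisons get turned into a reward model is a Bradley-Terry style loss on the score difference; a minimal sketch, with names of my choosing. Note that the label only records which response won, so any information about how much it was preferred is thrown away, which is exactly the compression problem above.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward model to score the chosen
    completion above the rejected one for each labeled comparison pair."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar reward-model scores for four comparison pairs.
loss = pairwise_reward_loss(torch.tensor([1.2, 0.3, 0.8, 2.0]),
                            torch.tensor([0.9, 0.5, -0.1, 1.0]))
```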

  1. Scalar feedback. Human calibration is typically poor and inconsistently applied, and is more vulnerable to bias. Requires more sophisticated and annotator-specific human response models.

We need to take magnitude seriously, without in most cases the ability to communicate it literally, because we do not have the measurements. Hard problem.

A typical compromise solution is a Likert scale, a range of discrete ratings, but they point out that this can go quite badly.

  1. Label feedback. They warn of choice set misspecification.

They worry the human may consider unspecified options, or that offered labels may be insufficient to provide the necessary information. One option is presumably to let humans add new data, and to have catch-alls available as well. Danger of breaking down when the unexpected happens, or when you need it most, seems obvious.

  1. Correction feedback. Promising where feasible, but high effort.

High effort is the best case, and requires knowing the correct version reliably. Pivots very quickly to big trouble when you lose that ability.

  1. Language feedback: High bandwidth, but processing the feedback is difficult, a challenging inverse learning problem. So far these techniques have not been applied.

My instincts tell me that to the extent there is a way, this is the way. It seems like you should be able to do this once you have a good enough LLM at your side? Seems tractable enough that I know what experiments I would run to get started if I wanted to be so back.
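
The obvious first experiment, sketched below, is to use a strong LLM to compile free-form critiques into something the existing pipeline already consumes: a revised completion that becomes the ‘chosen’ side of a preference pair, with the original as ‘rejected’. Everything here is an assumption for illustration: `call_llm` is a stub for whatever chat API you have, and the prompt format is invented.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an actual chat-model API call (assumption, not a real API)."""
    raise NotImplementedError

def compile_language_feedback(prompt: str, response: str, critique: str) -> dict:
    """Turn a free-form critique into structured data an RLHF pipeline can use:
    a preferred/rejected pair plus a short rationale."""
    judge_prompt = f"""You are converting human feedback into training data.
Original prompt: {prompt}
Model response: {response}
Human critique: {critique}

Return JSON with keys:
  "revised_response": the response rewritten to satisfy the critique,
  "rationale": one sentence on what changed.
"""
    revised = json.loads(call_llm(judge_prompt))
    return {
        "prompt": prompt,
        "chosen": revised["revised_response"],   # treated as the preferred completion
        "rejected": response,                    # the original becomes the dispreferred one
        "rationale": revised["rationale"],
    }
```

Whether interpreting critiques is actually easier for a judge model than the original inverse learning problem is the open question; the sketch only shows where such a step would slot in.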

What about problems with the reward model?

They start with problem misspecification. That does seem ubiquitous.

  1. Fundamental: An individual human’s values are difficult to represent with a reward function.
  2. Fundamental: A single reward function cannot represent a diverse society of humans.

Well, yes. On one level these are problems with RLHF. On a more important level, they represent that we cannot actually specify the thing that we want.

If you don’t know what you want, then you might not get it.

RLHF does have an especially hard time with this in its current form. The techniques involved highlight the impossible part of the problem and amplify the contradictions and errors, rather than working around them. If we cannot specify what we want via responding to examples, then we need to not specify what we want via responding to examples.

Next up is 3.2.2, Reward Misgeneralization and Hacking.

  1. Tractable: Reward models can misgeneralize to be poor reward proxies, even from correctly labeled training data. There can exist many ways to fit the human feedback dataset even in the limit of infinite training data. Reward models can compute reward using unexpected, possibly contingent features of the environment and are prone to causal confusion and poor out-of-distribution generalization.

I can think of some potential things to try but I am not convinced this is all that tractable. If human feedback systematically corresponds to other features, which it does, then what differentiates that from the things we want the system to focus on instead? We have a set of preferences we express, and a different set of preferences we actually hold, and we want RLHF to somehow train on the ones we hold rather than the ones we express, which requires the LLM to be able to tell the difference better than we can. The whole concept is highly Platonic.

  1. Fundamental: Optimizing for an imperfect reward proxy leads to reward hacking.

Is this even reward hacking? We can claim the existence of this Platonic actual preference of humans. We can ask humans to, in many cases, figure out the difference, and target the thing we actually want rather than our explicit reward signal, but that is not how any of this works when it comes to RLHF, not by default. It seems almost pejorative to call it ‘reward hacking,’ any more than I am ‘game hacking’ when I make moves that maximize my chance to win without making sense in the game’s underlying metaphors. Do we expect the LLM to be the flavor judge?
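
Whatever we call it, the dynamic is easy to demonstrate. A toy simulation, with entirely invented numbers and no connection to the paper’s experiments: the proxy is the true reward plus heavy-tailed rater error, and as selection pressure on the proxy goes up, the selected candidates are increasingly the ones where the rater was fooled rather than the ones that are actually good.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# True reward: the thing we actually want, normally distributed across candidates.
true_reward = rng.normal(size=n)
# Rater error: occasionally huge (heavy-tailed), e.g. the rater gets badly fooled.
rater_error = rng.standard_t(df=2, size=n)
# Proxy reward: what the optimization actually sees.
proxy_reward = true_reward + rater_error

for top_frac in (0.5, 0.01, 0.0001):
    k = max(1, int(n * top_frac))
    selected = np.argsort(proxy_reward)[-k:]  # optimize harder = keep a smaller top slice
    print(f"select top {top_frac:>7.2%} by proxy -> "
          f"mean true reward of selected: {true_reward[selected].mean():+.2f}")
```

Past a certain selection pressure, almost all of the additional proxy reward comes from the error term, and the average true reward of what gets selected slides back toward the base rate. That is the fundamental part.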

Under evaluating reward models, they see a solvable problem:

  1. Evaluating reward models is difficult and expensive.

As they note, the true reward function is mostly unknown, and errors in training and evaluation will be correlated, both functions of human approval. This seems rather fundamental, not primarily a question of cost. We do not know how to evaluate a reward model, and we definitely do not know how to evaluate a reward model for something smarter than ourselves, even in theory with almost unlimited resources.
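
The usual evaluation, sketched below, is to check the reward model’s agreement with held-out human preference pairs; the toy reward function and data are made up. The catch noted above is built in: the held-out labels are more of the same human approval, so the check measures consistency with the proxy, not agreement with the true reward.

```python
def reward_model_agreement(reward_fn, heldout_pairs):
    """Fraction of held-out preference pairs where the reward model agrees with
    the human label. Limitation: the held-out labels come from the same kind of
    human approval the model was trained on, so correlated errors are invisible."""
    hits = sum(
        reward_fn(chosen) > reward_fn(rejected)
        for chosen, rejected in heldout_pairs
    )
    return hits / len(heldout_pairs)

# Toy stand-in reward model that just counts flattering words.
def toy_reward(text):
    return sum(word in text.lower() for word in ("great", "brilliant", "absolutely"))

pairs = [
    ("That is a great question, absolutely.", "Here is the answer: 42."),
    ("Brilliant idea!", "That plan has a flaw in step two."),
]
print(reward_model_agreement(toy_reward, pairs))  # looks perfect if raters also liked flattery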

What about policy challenges?

They note that ‘robust reinforcement learning is difficult.’ They then list the problems as tractable, but my understanding is that reinforcement learning under the current paradigms is inherently non-robust to anything that does not have close analogs in the training data.

  1. It is [still] challenging to optimize policies effectively.
  2. Policies tend to be adversarially exploitable.

Also implied is 21a: RL agents must interact with the environment to collect their own data, so the act of training is inherently unsafe.

I would say flat out that if you are not training on adversarial examples, you are not going to perform well on adversarial examples. The way that we actually train such systems is that we do such adversarial training as an addendum – someone figures out the hole and then they find a way to patch it after it is found. That means the training process has to be at least as strong in every aspect as the strongest attack you want it to stop. Oh no.
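
What ‘patch it after it is found’ looks like in practice is an outer loop along these lines; all the function names are placeholders rather than any lab’s actual pipeline, and the point is only that coverage can never exceed what the attack generator produces.

```python
def adversarial_patch_loop(model, attack_generator, train_step, rounds=5):
    """Schematic red-team-then-patch loop (placeholders throughout, not a real API).

    attack_generator(model) -> list of prompts that currently elicit bad behavior
    train_step(model, examples) -> model fine-tuned to refuse/fix those examples
    """
    for round_idx in range(rounds):
        attacks = attack_generator(model)       # only finds holes the red team thought of
        if not attacks:
            break                               # "no known holes" is not "no holes"
        labeled = [(prompt, "refusal_or_corrected_output") for prompt in attacks]
        model = train_step(model, labeled)      # patches this round's holes, may open others
    return model
```

Anything the generator never produces is never trained against, which is the ‘training process must be at least as strong as the strongest attack’ problem in code form.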

Or they might instead say, there’s a more fundamental problem with policy misgeneralization. What is it? Oh, nothing, just that training on perfect rewards still often performs poorly in deployment even now and optimal RL agents tend to seek power.

  1. Fundamental: Policies can perform poorly in deployment even if rewards seen during training were perfectly correct.
  2. Fundamental: Optimal RL agents tend to seek power.

All the clever plans I have seen or imagined to address these issues will reliably stop working when up against sufficient optimization pressure. To prevent power seeking you need to be able to properly and robustly identify it against what is effectively an adversarial attempt to hide it that results automatically from your optimization attempt to find it combined with the rewards available. That does seem rather fundamental.

Distributional issues are up next.

  1. The pretrained model introduces biases into policy optimization.
  2. RL contributes to mode collapse.

A better way of saying ‘introduces biases’ is that we are training on unprincipled data sets that do not match the exact biases we would prefer. If we want to fix that, this is an option available to us, and there is a rather obvious solution if people want to take it and pay the price? At some point people will start doing it systematically.

Mode collapse happens in humans under human feedback as well. I do not think it is tractable, at least not without highly bespoke feedback generation. As Linus Van Pelt said, as the years go by, you learn what sells. What humans say they want when asked on a micro scale translates reliably into mode collapse. Humans also might say they do not want that mode collapse when looking at the macro scale, but the only known way to avoid this in humans is a strong dedication to not caring so much what other people think. RLHF is inherently and by default a killer of curiosity and diversity; that is what it is for. Why do you think we send kids to school?
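
The standard partial mitigation in current practice, which is not the same as a fix, is to penalize the policy for drifting from the pretrained reference model, so the shaped reward becomes the reward model’s score minus a KL term. A minimal sketch with toy log-probabilities; `beta` is the knob that trades reward against diversity.

```python
import torch

def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Shaped reward used in PPO-style RLHF:
    reward model score minus beta * (log pi(y|x) - log pi_ref(y|x)).

    The KL term pulls the policy back toward the pretrained distribution,
    which is the usual brake on mode collapse. Inputs here are toy tensors.
    """
    kl_estimate = logprob_policy - logprob_reference
    return reward - beta * kl_estimate

# Toy values: the policy has sharpened onto its favorite phrasing.
reward = torch.tensor([1.0, 1.0, 1.0])
logprob_policy = torch.tensor([-0.2, -0.3, -0.1])     # high confidence
logprob_reference = torch.tensor([-2.0, -1.8, -2.2])  # pretrained model was more spread out
print(kl_penalized_reward(reward, logprob_policy, logprob_reference))
```

Turn beta down and you get sharper, samier outputs; turn it up and you keep diversity but give back reward. It softens mode collapse without removing the incentive for it.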

Finally, challenges with jointly training the reward model and policy.

  1. Joint training induces distribution shifts.
  2. It is difficult to balance efficiency and avoiding overfitting by the policy.

If you train by providing feedback on generated outputs, which you have to do to correct mistakes and misgeneralizations, then you get feedback on the type of output the system wants to give you. I mean, yes, of course.
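
Schematically, the joint training loop looks something like the sketch below (all placeholders, not a real API), and the distribution shift is baked into where the labels come from: feedback is only ever collected on outputs sampled from the current policy.

```python
def iterated_rlhf(policy, reward_model, collect_human_labels, rm_step, rl_step, rounds=3):
    """Schematic joint-training loop (every function and object here is a placeholder).

    Each round, human feedback is gathered on outputs sampled from the *current*
    policy, so the reward model's training distribution keeps shifting toward
    whatever the policy already likes to produce.
    """
    for _ in range(rounds):
        samples = policy.generate_batch()        # the outputs the system wants to give you
        labels = collect_human_labels(samples)   # feedback only on that shifted slice
        reward_model = rm_step(reward_model, labels)
        policy = rl_step(policy, reward_model)   # optimizes against the refit reward model
    return policy, reward_model
```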

A lot of this is saying the same fundamental problems over and over in their different guises. Many ways of grasping the elephant of ‘you get what you optimize for’ and all that this implies, when you are using crude optimization tactics and feedback generation to generate proxy metrics.

When I think about such problems, paths forward emerge constantly. There is so much that one could try to do. I am sure that lots of it would, if I asked those at the companies involved, prove to have already been tried and found not to work, or dismissed as not actually possible to implement, or as obviously not going to work because of technical reasons or hard won experience that I lack.

What I would love, in these spots, is for it to be clearly good for the world for us to improve the quality and state of the art of RLHF and CAI. This is considered alignment work, but it breaks down when you need it most, so it is capabilities advancing and could generate the illusion of safety, but does not present an expected path to actual alignment. Rut roh.

In Section 4 they discuss ways to address the challenges presented.

Many of these strategies seem promising on the margin. None of them seem like they solve the fundamental problems as I understand them. Thus, the following section, entitled ‘RLHF is Not All You Need: Complementary Strategies for Safety.’ They suggest strategies like generating adversarial training examples, risk assessment and auditing, interpretability and model editing. If we got robust model editing, that might be a different class of affordance. Here is how they see governance issues:

I definitely want transparency of the results of such training. What is proposed here also includes transparency of methods, which I worry would be a de facto capabilities transfer that would make the situation worse.

I also worry about a conflict between what looks good and what is effective. If you have to choose all your procedures for ‘safety’ such that they look good when examined by a politician looking for a gotcha, or a regulator trying to find problems, then you are going to focus on the appearance of mundane harm mitigation.

The problem is, again, RLHF, except among the humans. The feedback is not going to allow for proper precaution against and focus on the possibility of everyone dying, and is going to amplify the mismatch between what we actually want and what we like to pretend that we want.

Other People Are Not As Worried About AI Killing Everyone

Here’s this week’s explicit anti-humanity stance. Why shouldn’t whatever AI we happen to create be in charge and free to do whatever it wants with all the atoms? Why would that be ‘bad’?

Richard Sutton: The argument for fear of AI appears to be:

1. AI scientists are trying to make entities that are smarter than current people

2. If these entities are smarter than people, then they may become powerful

3. That would be really bad, something greatly to be feared, an “existential risk”

The first two steps are clearly true, but the last one is not. Why shouldn’t those who are the smartest become powerful?

Don’t worry, the robots say they won’t destroy humanity (2:29 video).

How seriously do some people take the situation?

Ryan McEntush: Yes, I’m a fan of AI safety (anti-air defense for data centers)

The Wit and Wisdom of Sam Altman

Not about AI, but true and important.

Sam Altman: fight back every time [anyone] suggests deferring a decision to a meeting to be scheduled for next week. Sometimes there is a good reason but its quite rare.

The Lighter Side

We’re training hard.

Rhys Sullivan: This new Stack Overflow AI is crazy powerful

Image

A very good post indeed.

Daniel Severo: Best poster award

Image

How to pass the Turing test, in-context edition.

gaut: the AT&T chatbot pretending to not hear you has got to be the most AT&T design decision in history

Image

Why do people keep thinking everything is esoteric? This is flat out text.

Jordan Chase-Young: I love esoteric roon. There is probably like a billion dollars of insider alpha encoded here for anyone clever enough to unriddle this.

Image

Jordan Chase-Young: Something in here. I can taste it.

Image

If you’re confused pivot to noticing.

9 comments

Comments sorted by top scores.

comment by Mitchell_Porter · 2023-08-03T14:52:45.093Z · LW(p) · GW(p)

send chatGPT the string ` a` repeated 1000 times

This seems to cause some sort of data leak

I don't think so. It's just generating a random counterfactual document, as if there was no system prompt. 

Replies from: nostalgebraist
comment by nostalgebraist · 2023-08-03T19:36:06.268Z · LW(p) · GW(p)

Agreed. I wrote some more notes about the phenomenon here.

comment by jbash · 2023-08-04T18:38:31.771Z · LW(p) · GW(p)

There is a real faction, building AI tools and models, that believes that human control over AIs is inherently bad, and that wants to prevent it. Your alignment plan has to overcome that.

That mischaracterizes it completely. What he wrote is not about human control. It's about which humans. Users, or providers?

He said he wanted a "sharp tool". He didn't say he wanted a tool that he couldn't control.

At another level, since users are often people and providers are almost always institutions, you can see it as at least partly about whether humans or institutions should be controlling what happens in interactions with these models. Or maybe about whether many or only a few people and/or institutions should get a say.

An institution of significant size is basically a really stupid AI that's less well behaved than most of the people who make it up. It's not obvious that the results of some corporate decision process are what you want to have in control... especially not when they're filtered through "alignment technologies" that (1) frequently don't work at all and (2) tend to grossly distort the intent when they do sort of work.

That's for the current and upcoming generations of models, which are going to be under human or institutional control regardless, so the question doesn't really even arise... and anyway it really doesn't matter very much. Most of the stuff people are trying to "align" them against is really not all that bad.

Doom-level AGI is pretty different and arguably totally off topic. Still, there's an analogous question: how would you prefer to be permanently and inescapably ruled? You can expect to be surveilled and controlled in excruciating detail, second by second. If you're into BCIs or uploading or whatever, you can extend that to your thoughts. If it's running according to human-created policies, it's not going to let you kill yourself, so you're in for the long haul.

Whatever human or institutional source the laws that rule you come from, they'll probably still be distorted by the "alignment technologies", since nobody has suggested a plausible path to a "do what I really want" module. If we do get non-distorting alignment technology, there may also be constraints on what it can and can't enforce. And, beyond any of that, even if it's perfectly aligned with some intent... there's no rule that says you have to like that intent.

So, would you like to be ruled according to a distorted version of a locked-in policy designed by some corporate committee? By the distorted day to day whims of such a committee? By the distorted day to day whims of some individual?

There are worse things than being paperclipped, which means that in the very long run, however irrelevant it may be to what Keven Fisher was actually talking about, human control over AIs is inherently bad, or at least that's the smart bet.

A random super-AI may very well kill you (but might also possibly just ignore you). It's not likely to be interested enough to really make you miserable in the process. A super-AI given a detailed policy is very likely to create a hellish dystopia, because neither humans nor their institutions are smart or necessarily even good enough to specify that policy. An AI directed day to day by institutions might or might not be slightly less hellish. An AI directed day to day by individual humans would veer wildly between not bad and absolute nightmare. Either of the latter two would probably be omnicidal sooner or later. With the first, you might only wish it had been omnicidal.

If you want to do better than that, you have to come up with both "alignment technology" that actually works, and policies for that technology to implement that don't create a living hell. Neither humans nor institutions have shown much sign of being able to come up with either... so you're likely hosed, and in the long run you're likely hosed worse with human control.

comment by Noosphere89 (sharmake-farah) · 2023-08-03T18:58:42.762Z · LW(p) · GW(p)

and that the word ‘safe’ is not going to differentiate this because it has been taken over by ‘ethics’ people who want to actively make things less safe.

I'd like to ask: is this your interpretation that AI ethics people actively want to make things less safe, or did someone in AI ethics actually say that they want to make things less safe?

 

Because one is far worse than the other.

Replies from: Zvi
comment by Zvi · 2023-08-03T23:43:00.688Z · LW(p) · GW(p)

Ah, I meant to revise this wording a bit before posting and forgot to after this came up on Twitter. I did not mean 'the AI ethics people actively want everyone to die' I meant 'the AI ethics people favor policies whose primary practical effect is to increase existential risk.' One of which is the main thing this bill is doing.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2023-08-04T01:41:45.262Z · LW(p) · GW(p)

Then edit the post soon, because I'm kind of concerned that the statement, as was written, would imply things that I don't think are true.

comment by RomanS · 2023-08-03T14:18:34.321Z · LW(p) · GW(p)

What would accelerate the use of AI in movies even more would be not striking.

Not sure if the strikes in the US have any effect on the spread of AI in film making (aside from making more creators aware of the AI option). The US is important for the industry, but far from dominant. Even if the AI script writers are somehow completely banned in the US, they will still be used in the EU, China, India, etc.

Additionally, there is Youtube and other places where anyone can publish their own AI-written movie, and profit from it (and if it's exceptionally good, the movie or some derivative could end up on the big screen, if one bothers to pursue that). 

The AI can help with the writing process, but it is a hack and it will remain a hack until after we have bigger problems than protecting scriptwriting jobs.

A few months ago, GPT4 wrote [LW · GW] the best science fiction novella I've read in years, and it was written without agents etc. Just the plain vanilla ChatGPT web interface. 

I also watched the episode of South Park fully created by AI, and I rate it as in the top 10% of episodes of the show.

This indicates that the much more formulaic art of scriptwriting is already solvable at a superhuman level with GPT4, if someone spends a weekend or two on automating and polishing the process (e.g. a step-by-step iterative process like this [LW(p) · GW(p)]).

So, let 'em strike. They're already obsolete, even if they don't know it yet.

comment by Mo Putera (Mo Nastri) · 2023-08-06T04:04:24.824Z · LW(p) · GW(p)

Holden Karnofsky on the 80,000 hours podcast gives his takes on AI. An odd mix of good insights and baffling perspectives.

Curious, which ones were the baffling perspectives?

comment by Templarrr (templarrr) · 2023-08-08T12:38:27.198Z · LW(p) · GW(p)

Tyler Cowen asks GPT-4 if room temperature superconductors (if they existed) would more benefit military offense, or military defense... It is a strange question to be asking ... this is the type of question where human experts are going to outperform.

It's a strange question, period. There are no strictly defensive or strictly offensive weapons, only defensive and offensive usage. Even anti-aircraft weapons, the most defensively oriented in use right now, can be used (sometimes after minor software updates) to attack ground targets. And even the most offensive weapon (e.g. nukes) can be a strong defensive deterrent.