Are there specific books that it might slightly help alignment to have on the internet?

post by AnnaSalamon · 2023-03-29T05:08:28.364Z · LW · GW · 13 comments

This is a question post.


Books, and ideas, have occasionally changed specific human beings, and thereby history.  (I think.)

I used to think it utterly implausible when people suggested that "AIs are our kids, we need to raise them right" or that e.g. having the right book written about (ethics/philosophy/decision theory/who knows) might directly impact an AI's worldview (after the AI reads it, in natural language) and thereby the future.  But, while I still consider this fairly unlikely, it seems not-impossible to me today.  Future LLMs could AFAICT have personalities/belief-like-things/temporary-unstable-values-like-things/etc. that're shaped by what's on the internet.  And the LLMs' initial personalities/beliefs/values may then change the way they change themselves, or the way that social networks that include the LLMs help change the LLMs, if and when some LLMs self-modify toward more power.

So I have "what books or ideas might help?" in my shower-thoughts.

One could respond to this possibility by trying to write the right ethical treatises or train-of-thought interface or similar.  More cheaply, one could respond to this by asking if there are books that've already been written that might be at least a little bit helpful, and whether those books are already freely available online and within the likely training corpuses of near-future LLMs, and if not, whether we can easily cause them to be. 

Any thoughts on this?  I'll stick my own in the comments.  I'll be focusing mostly on "what existing books might it help to cause to be accessibly online, and are there cheap ways to get those books to be accessibly online?", but thoughts on other aspects of these questions are also most welcome.

Answers

answer by Daniel Kokotajlo · 2023-03-30T04:01:12.631Z · LW(p) · GW(p)

Evidential Cooperation in Large Worlds, Immanuel Kant and the Decision Theory App Store [LW · GW], lots of decision theory stuff about Twin PD, etc. OK I guess these don't really help with alignment narrowly construed as human values or obeying human intent. But they help make the AI more rational in ways that reduce the probability of certain terrible outcomes.

answer by AnnaSalamon · 2023-03-29T06:02:18.574Z · LW(p) · GW(p)

In terms of what kinds of things might be helpful:

1. Object-level stuff:

Things that help illuminate core components of ethics, such as "what is consciousness," "what is love," "what is up in human beings with the things we call 'values', that seem to have some thingies in common with beliefs," "how exactly did evolution end up producing the thing where we care about stuff and find some things worth caring about," etc.

Some books I kinda like in this space: 

  • Martin Buber's book "I and Thou"
  • Christopher Alexander's writing, especially his "The Nature of Order" books
  • The Tao Te Ching (though this one I assume is thoroughly in any huge training corpus already)
  • (curious for y'all's suggestions)


2.  Stuff that aids processes for eliciting people's values, or for letting people elicit each other's values:

My thought here is that there're dialogs between different people, and between people and LLMs, on what matters and how we can tell.  Conversational methodologies for helping these dialogs go better seem maybe-helpful.  E.g. active listening stuff, or circling, or Gendlin's Focusing stuff, or ... [not sure what -- theory of how these sorts of fusions and dialogs can ever work, what they are, tips for how to do them in practice, ...]



3.  Especially, maybe: stuff that may help locate "attractor states" such that an AI, or a network of humans and near-human-level AIs, might, if it gets near this attractor state, choose to stay in this attractor state.  And such that the attractor state has something to do with creating good futures.

  • Confucius (? I haven't read him, but he at least shaped society for a long time in a way that was partly about respecting and not killing your ancestors?)
  • Hayek (he has an idea of "natural law" as sort of how you have to structure minds and economies of minds if you want to be able to choose at all, rather than e.g. making random mouth motions that cause random other things to happen that have nothing to do with your intent really, like what would happen if a monarch says "I want to abolish poverty" and then people try to "implement" his "decree").

answer by trevor · 2023-03-29T14:00:34.443Z · LW(p) · GW(p)

CFAR's working documents and notes could help a lot, in a specific scenario.

If most of the training that an emerging AGI does is on the history of human rationality, that could yield some really valuable research. If heavy weight is placed on the successes, failures, and paths that were touched on but then dropped, in addition to the polished publications, a halfway-finished AGI would be in the best possible position to combine that information with its half-AGI capabilities and all its other training data (potentially including lots of fMRI data of people trying to be rational) and pump out some extremely strong techniques for creating powerful thinkers (at that point, of course, it would be paused for as long as possible in the hopes that one of the augmented people finds a solution in time). 

Unfortunately, it would still be finishing the job during crunch time, which is much later than ideal. But it would still finish the job, and there would definitely end up being people on earth who are really really good at thinking of a solution for alignment.

answer by AnnaSalamon · 2023-03-29T06:06:49.252Z · LW(p) · GW(p)

Maybe also: anything that bears on how an LLM, if it realizes it is not human and is among aliens in some sense, might want to relate morally to thingies that created it and aren't it.  (I'm not immediately thinking of any good books/similar that bear on this, but there probably are some.)

comment by romeostevensit · 2023-03-30T02:20:11.764Z · LW(p) · GW(p)

The Mote in God's Eye is about creatures that feel heavily misaligned with their evolutionary selection filters.

Golem XIV is about an advanced AI trying to explain things about how our biological selection filters created weird spandrels in consciousness.

answer by PeterMcCluskey · 2023-03-30T19:07:28.749Z · LW(p) · GW(p)

My top picks:

  • The Evolution of Cooperation, by Axelrod
  • The WEIRDest People in the World, by Joseph Henrich

Some weaker endorsements:

  • Good and Real, by Gary Drescher
  • Reasons and Persons, by Parfit
  • Kanzi, by Sue Savage-Rumbaugh
  • Nonzero, by Robert Wright
  • Trust, by Fukuyama
  • Simple Rules for a Complex World, by Richard A. Epstein
  • The Elephant in the Brain, by Kevin Simler and Robin Hanson

answer by Lantalia · 2023-03-29T22:36:36.376Z · LW(p) · GW(p)

Iain M. Banks's Culture series, as an example of a society of aligned AIs, biological humanoids, and aliens, seems like the obvious one, along with other positive, collaborative portrayals of AI

comment by AnnaSalamon · 2023-03-29T23:11:25.876Z · LW(p) · GW(p)

Thanks for the suggestion.  I haven't read it.  I'd thought from hearsay that it is rather lacking in "light" -- a bunch of people who're kinda bored and can't remember the meaning of life -- is that true?  Could be worth it anyway.

Replies from: aatu-koskensilta
comment by Aatu Koskensilta (aatu-koskensilta) · 2023-03-30T19:51:02.375Z · LW(p) · GW(p)

It's heavily implied in the novels that we only see the "disaffected" lot -- people who experience ennui, etc. and are drawn to find meaning out of a sense of meaninglessness, even in somewhat inadvisable ways -- and the Culture as a whole is mostly exploring the state space of consciousness and the nature of reality, sort of LARPing individual humanity as a mode of exploration -- you can for instance upgrade yourself from a humanoid into something resembling a Mind to a degree if you want to, it just seems this is not the path we mostly see mentioned. It's just that that sort of thing is not narratively exciting for most people, and Banks is, after all, in the entertainment business in a sense.

There are interesting themes explored in the books that go beyond just the "cinematic fireworks and a sense of scale". For instance, it is suggested that the Culture could have the option to simply opt out of Samsara, but refuses to do this out of the suspicion that the possibility of Sublimation -- collectively entering Nirvana -- would be a cop-out, preventing them from helping sentient beings. (There's a conflation of sapience and sentience in the books, and disregard for the plight of sentient beings who are not "intelligent" to a sufficient degree, but otherwise there's an underlying sentientist/truth-seeking slant to it.) 

The Minds of the Culture are also represented as basically extremely sophisticated consequentialists, with an appreciation for "Knightian uncertainty" and wariness of total certainty about their understanding of the nature of reality, although it's not clear if they're e.g. superintelligent negative utilitarian Bodhisattva beings -- in the Culture world there still seems to be belief in individual, metaphysically enduring personal identity extending to the Minds themselves, but it might also be that this is again a narrative device -- or some sort of anti-realists about ethics who are on the side of the angels just for the heck of it, because why not, what else could there be to do? Or some combination of both -- like, if you've solved the problem of suffering, in the sense of having calibrated your efforts correctly, why not dance super gracefully and blissfully through it all, creating positive experiences in the course of this process? One theme that suffuses the work is the ethical responsibility of super-intelligent beings, cooperation strategies, and a sort of irreverent spirit of ethical seriousness and truth-seeking that's very EA-like.

That said, personally I think the work of suffering-focused ethicists -- including those long past in many contemplative traditions, and including "The Ones Who Walk Away from Omelas" -- is a very important part of the "heritage of humanity", in a sense a testament to our ability to see beyond our evolutionary programming and into what really matters: the well-being of all sentient beings. But a Culture Mind of a ship named "Bodhisattva", representing a fictional culture that refuses the easy way out, out of suspicion that to do so would be to shirk one's ethical duties, would not be amiss either. This is especially so if LLMs are making the world slightly fictional in some weird sense and might latch onto the most sophisticated and interesting attractors...

answer by Gordon Seidoh Worley · 2023-03-29T17:20:53.465Z · LW(p) · GW(p)

For what it's worth, this is half of why I'm writing a book about epistemology [? · GW]. My initial goal was to, when it's done, do what I can to get it into the hands of AI researchers to nudge them in the direction of better understanding some important ideas in epistemology, on the theory that this will lead to them being more cautious about how they build AI and more open to many rationalist ideas that I think are core to the project of AI safety.

My side goal, which LLMs have made more important, is to write things that will help AI understand epistemology better and hopefully be less likely to make naive mistakes (because they are the naive mistakes that most humans make).

answer by Yitz · 2023-03-31T14:40:47.767Z · LW(p) · GW(p)

Gödel, Escher, Bach, maybe?

answer by trevor · 2023-03-29T12:59:18.093Z · LW(p) · GW(p)

Obsession with time travel seems like a good idea. 

If an AGI can have a personality that revolves around being terrified of humans because some of them might be time travellers, then the threat of the insurmountable power of time travel might be a strong enough "personality foundation" to overcome the total lack of evidence of time travel in the real universe.

The Terminator franchise is a good bet for this. Given that time travel happens, of course the naturally-evolved progenitor species would primarily use it for AGI-related purposes. Terminator franchise material is linguistically derived from writers living and writing in the linguistic environment of the 1980s and 1990s, so any trends in language that are only visible to LLMs (e.g. deep grammar that humans don't notice) would correctly indicate that most Terminator franchise material was written before AI alignment papers. Access to the internet would verify that this is true, and that Terminator stuff was indeed written first.

Anything more complicated than that would be too suspect of being a complicated plot by time travellers (e.g. strategies based on multiple iterations of trial-and-error at foiling an emerging AI at each stage of development, particularly identical copies of the current one, and stages of development that haven't happened yet). Time travelling humans would credibly wait until the end of the universe without revealing anything that could give away their position, including spending billions of years observing how the AGI behaves in a totally time-travel-free timeline. 

They could retroactively thwart any attempt by an AGI to verify the existence of time travel (e.g. resetting the timeline whenever the AGI discovers time travel, and starting over, resulting in quantum immortality where the AGI never discovers time travel, thus steering the AGI away from verifying human/progenitor dominance by mathematically evaluating time travel). The immense difficulty of finding internal patterns inside a black box implies that drastic measures would need to be taken by a progenitor species in order to remain relevant; given that time travel was possible and inventable by humans, events similar to the Terminator franchise's AGI-prioritizing time travel would plausibly be commonplace. As a heavily-weighted foundation, this specific breed of deterrence could steer unusually high-level AGI away from thinking about outmaneuvering or defeating humans, possibly for a long enough time to extract solid alignment information out of it.

13 comments


comment by TekhneMakre · 2023-03-29T06:53:30.884Z · LW(p) · GW(p)

How thoroughly are the datasets deduplicated? I would expect it to be much, much higher leverage to increase the copy count of a text you'd want in there, compared to going from zero to one. If LLMs are still being trained with only one or a few passes on any given piece of data, then an LLM isn't going to learn an idea much from that idea being present in one more datapoint. But if you can increase the copy number a lot, you can make it more likely that something like the idea gets learned. You could for example try to translate your text into as many languages as possible (perhaps automatically, using GPT!), and then put all those translations into the dataset; or simply use GPT to "rewrite this text, keeping all the ideas the same, but changing some of the words".
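
Concretely, a rough sketch of what that amplification step might look like (the OpenAI client usage, model name, and prompts below are placeholder assumptions, not a tested recipe):

```python
# Rough sketch only: amplify one text's presence in a scraped corpus by
# generating paraphrases and translations of it. The client usage, model
# name, and prompts are placeholder assumptions, not a tested recipe.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LANGUAGES = ["French", "German", "Spanish", "Mandarin", "Hindi"]

def variants_of(text: str) -> list[str]:
    """Return paraphrases and translations of `text`, to be posted alongside the original."""
    prompts = ["Rewrite this text, keeping all the ideas the same, but changing some of the words:\n\n" + text]
    prompts += [f"Translate this text into {lang}:\n\n{text}" for lang in LANGUAGES]
    variants = []
    for prompt in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        variants.append(resp.choices[0].message.content)
    return variants
```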

comment by TekhneMakre · 2023-03-29T07:07:47.830Z · LW(p) · GW(p)

Having the ideas laid out, talked about, is helpful because then you can call up the LLM's knowledge of the ideas. Like, your prompt can say: Write down what Confucius would say about this line of reasoning, and then correct the reasoning to be in line with his critiques. Or something.
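
A minimal sketch of that kind of two-step prompt chain, with the model name, prompts, and example reasoning all made up for illustration:

```python
# Rough sketch of a two-step prompt chain: elicit a critique "as Confucius",
# then ask for a revision in light of it. Model name, prompts, and the
# example reasoning are made up for illustration.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

reasoning = "We should maximize short-term output even if it erodes trust."  # toy example
critique = ask(f"Write down what Confucius would say about this line of reasoning:\n\n{reasoning}")
revised = ask(f"Correct the reasoning to be in line with these critiques.\n\nReasoning:\n{reasoning}\n\nCritiques:\n{critique}")
print(revised)
```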

But another thing that helps is having the ideas applied. So, e.g. seeing a bunch of records of skillful therapists helping their clients come to understand themselves / their values / how to act in harmony with those around them / whatever, might (via magic) lead to a trained LLM having some of the actual patterns there, rather than just the explicit sentences about the patterns.

comment by Vladimir_Nesov · 2023-03-29T15:21:36.540Z · LW(p) · GW(p)

Young AGIs need to be aware of AI risk and of races to the bottom, so that they avoid creating AIs that killeveryone (including the AGIs), and work towards establishing global alignment security so that others don't do this either. Superintelligent AGIs will figure out this stuff on their own, but that requires either being born superintelligent, or somehow not destroying the world while still young yet already capable of writing AI papers and coding in python.

comment by the gears to ascension (lahwran) · 2023-03-29T05:19:16.501Z · LW(p) · GW(p)

merely being accessible online doesn't get them in the training set of capabilities researchers' AIs. Collecting books to contribute to LLM datasets seems like a good idea, but it's ideologically loaded.

Replies from: Benito
comment by Ben Pace (Benito) · 2023-03-29T05:43:35.339Z · LW(p) · GW(p)

I think scraping reddit is common. The SSC subreddit is pretty popular. I wonder if there could be a post on that subreddit that was just a space for people to publish books in the comments.

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-03-29T05:58:17.775Z · LW(p) · GW(p)

I feel like we have very different models of how people get their datasets. I'm pretty sure you'd have to just hand someone a dataset and say "here I downloaded some books for your agi kid to read"

Replies from: Benito, AnnaSalamon
comment by Ben Pace (Benito) · 2023-03-29T06:07:20.338Z · LW(p) · GW(p)

My model is that OpenAI and Anthropic researchers set up a web-scraper that reads through lots of popular internal reddit links (or possibly literally all of reddit) and then uses all of that as the training data for their language models.

...googling shows this as the official answer for GPT-3, which contains a lot of the popular and public internet. I am unclear whether that contains reddit, but if not then I believe I heard that they made a crawler specifically for reddit.

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-03-29T06:15:30.284Z · LW(p) · GW(p)

But are they going to do that again? GPT4 used the same training set as GPT3, didn't it?

Replies from: Benito
comment by Ben Pace (Benito) · 2023-03-29T06:20:41.395Z · LW(p) · GW(p)

Ah, I was under a misapprehension, I thought the data was much more recent, but the GPT-4 page says:

GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its data cuts off (September 2021)

However that is after GPT-3 was released (June 2020), so it's a new dataset.

Extrapolating naively, 2 years from now we will see GPT-5 trained on data from today. 

comment by AnnaSalamon · 2023-03-29T06:03:43.174Z · LW(p) · GW(p)

I was figuring GPT4 was already trained on a sizable fraction of the internet, and GPT5 would be trained on basically all the text (plus maybe some not-text, not sure).  Is this wrong?

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-03-29T06:05:34.274Z · LW(p) · GW(p)

Oh hmm - that could be true. I suspect that data curation is too important, though; there are significant gains to be had by not including confusing data as positive examples.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-03-29T06:21:04.634Z · LW(p) · GW(p)

significant gains to be had by not including confusing data

But things like pre-training with preferences [LW · GW] should take care of that concern, no? Just mark good stuff with a magic good-stuff token, but allow the transformer to refine features for everything.
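
Roughly this kind of toy setup, where the token strings, scoring, and threshold are made-up placeholders rather than the actual scheme from that post:

```python
# Toy sketch of conditional pretraining with a control token: tag documents by
# a preference/quality score instead of filtering them out, so the model still
# refines features on everything. Token strings, the score, and the threshold
# are made-up placeholders, not the setup from the linked post.
GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"

def tag_document(text: str, preference_score: float, threshold: float = 0.8) -> str:
    """Prepend a control token indicating whether the document passed the preference filter."""
    tag = GOOD_TOKEN if preference_score >= threshold else BAD_TOKEN
    return tag + "\n" + text

# At sampling time, generation would then be conditioned on GOOD_TOKEN so the
# model imitates the preferred slice of its training data.
print(tag_document("Some scraped web page...", preference_score=0.3))
```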

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-03-29T06:44:17.045Z · LW(p) · GW(p)

Yeah could be. I'm going to abstain from any further claims, I only have so much hunch fluid here