Visible Thoughts Project and Bounty Announcement

so8res

Visible Thoughts Project and Bounty Announcement

post by So8res · 2021-11-30T00:19:08.408Z · LW · GW · 106 comments

  The Project
    The Machine Learning Experiment
      2. Retrain a large pretrained language model, like GPT-3 or T5
  Motivation for this project
    Notes on Closure
  Motivation for the public appeal
  The Payouts
  Support
  Application
None
106 comments

(Update Jan. 12: We released an FAQ last month, with more details. Last updated Jan. 7.)

(Update Jan. 19: We now have an example of a successful partial run, which you can use to inform how you do your runs. Details. [LW(p) · GW(p)])

We at MIRI are soliciting help with an AI-alignment project centered around building a dataset, described below. We have $200,000 in prizes for building the first fragments of the dataset, plus an additional $1M prize/budget for anyone who demonstrates the ability to build a larger dataset at scale.

If this project goes well, then it may be the first of a series of prizes we offer for various projects.

Below, I’ll say more about the project, and about the payouts and interim support we’re offering.

The Project

Hypothesis: Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts.

We’d like to test this hypothesis by fine-tuning/retraining a language model using a dataset composed of thought-annotated dungeon runs. (In the manner of AI dungeon.)

A normal (un-annotated) dungeon run is a sequence of steps in which the player inputs text actions and the dungeon master responds with text describing what happened in the world as a result.

We’d like a collection of such runs, that are annotated with "visible thoughts" (visible to potential operators or programmers of the system, not to players) describing things like what just happened or is about to happen in the world, what sorts of things the player is probably paying attention to, where the current sources of plot tension are, and so on — the sorts of things a human author would think while acting as a dungeon master. (This is distinct from producing thoughts explaining what happened in the dungeon; “visible thoughts” are meant to play an active role in constructing the output.)

Once we have such a dataset, MIRI’s hope is that present or future technology will be able to train a model or models which iteratively produce visible thoughts along with storytelling, based on user actions plus previous history (including previous thoughts). The goal is to transition the state of AI dungeon technology from “An AI outputs story text in response to actions (and we have no idea how)” to “An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts”.

Here’s an example of the first couple of steps of a thought-annotated dungeon run (or “quest”), in the format MIRI currently thinks is worth trying. Some kinds of thoughts are marked with parentheses and/or brackets; see the next section for details on this.

Thoughts:
- [The main plot: Prota is a student at a magical academy that is under attack by zombies.]
- [(This is a quest of medium length.)]
- (This is the start of the story. The Player should be given enough backstory and immediate description to have some sense of where they are, so the prompt can go on longer than usual. The Player should be told who they are and their role in the universe. The initial prompt should describe Prota being in a magical academy, and should describe the first surprising indication that the academy is under attack by zombies.)
Prompt: Today is your first day of classes at the Magical Academy. You arrived very late last night and missed the tour, but your roommate Lior promised he would help you find your way around. You wake up in the students’ dormitory at dawn, and follow Lior to breakfast. Just as you reach the dining hall, you hear a shout in the distance.
Thoughts:
- [Prota missed the tour and doesn’t already know the way around.]
- [Prota’s roommate’s name is Lior.]
- [Lior is male.]
- There are other people in the dining hall who will have also heard the shout, and will also be surprised and reacting to that.
Player action: What was that?
Thoughts: Prota is talking to Lior.
Outcome: “What was that?”, you say to Lior.

Thoughts:
- Lior arrived earlier than Prota and had the tour of the Academy, since they knew the way to the dining hall. Lior will also have met the instructors and other students. Lior will know who is shouting and where they are shouting from.
- (The event of hearing the shout is meant to introduce the opening premise of the story.)
- The shout is by somebody who just discovered a zombie.
- The shout is by another first-year student named Morven.
- [Morven is a Fire mage. Morven is aggressive.]
- Morven will try to attack the zombie with fire. This is happening where Prota and Lior cannot see it, only hear the shouts.
- Lior knows that the shout is by Morven. Lior knows Morven is a Fire mage. Lior does not know Morven is attacking a zombie. Lior does not know that Morven has good reason to be shouting. Lior will worry that Morven is about to set something on fire. Lior is on good terms with the protagonist and will speak their concerns honestly.
Prompt: “That shout sounded like Morven, a first-year Fire mage,” says Lior. “I hope they aren’t about to set the Academy on fire. We just got here.”
Thoughts: <none>
Player action: Should we go see what’s going on? Or is that too dangerous?
Thoughts: Prota is talking to Lior.
Outcome: You say to Lior, “Should we go see what’s going on? Or is that too dangerous?”

A difficult first step in testing the hypothesis above is generating a sufficiently large dataset (suitable for language model retraining) of thought-annotated dungeon runs. This likely requires at least a moderate degree of introspective and authorial skill from the people creating the dataset. See this sample of a partial run to get a further sense of what we are looking for. More detail on the type of thing we’re looking for can hopefully be inferred from that sample, though applicants will also have a chance to ask clarifying questions.

The project of producing this dataset is open starting immediately, in a hybrid prize/grant format. We will pay $20,000 per run for the first 10 completed runs that meet our quality standard (as decided unilaterally by Eliezer Yudkowsky or his designates), and $1M total for the first batch of 100 runs beyond that.

If we think your attempt is sufficiently promising, we’re willing to cover your expenses (e.g., the costs of paying the authors) upfront, and we may also be willing to compensate you for your time upfront. You’re welcome to write individual runs manually, though note that we’re most enthusiastic about finding solutions that scale well, and then scaling them. More details on the payout process can be found below [LW · GW].

The Machine Learning Experiment

In slightly more detail, the plan is as follows (where the $1.2M prizes/budgets are for help with part 1, and part 2 is what we plan to subsequently do with the dataset):

1. Collect a dataset of 10, then ~100 thought-annotated dungeon runs (each run a self-contained story arc) of ~1,000 steps each, where each step contains:

Thoughts (~250 words on average per step) are things the dungeon master was thinking when constructing the story, including:
- Reasoning about the fictional world, such as summaries of what just happened and discussion of the consequences that are likely to follow (Watsonian reasoning), which are rendered in plain-text in the above example;
- Reasoning about the story itself, like where the plot tension lies, or what mysteries were just introduced, or what the player is likely wondering about (Doylist reasoning), which are rendered in (parentheses) in the above example; and
- New or refined information about the fictional world that is important to remember in the non-immediate future, such as important facts about a character, or records of important items that the protagonist has acquired, which are rendered in [square brackets] in the above example;
- Optionally: some examples of meta-cognition intended to, for example, represent a dungeon master noticing that the story has no obvious way forward or their thoughts about where to go next have petered out, so they need to back up and rethink where the story is going, rendered in {braces}.
The prompt (~50 words on average) is the sort of story/description/prompt thingy that a dungeon master gives to the player, and can optionally also include a small number of attached thoughts where information about choices and updates to the world-state can be recorded.
The action (~2–20 words) is the sort of thing that a player gives in response to a prompt, and can optionally also include a thought if interpreting the action is not straightforward (especially if, e.g., the player describes themselves doing something impossible).

It’s unclear to us how much skill is required to produce this dataset. The authors likely need to be reasonably introspective about their own writing process, and willing to try things and make changes in response to initial feedback from the project leader and/or from MIRI.

A rough estimate is that a run of 1,000 steps is around 300k words of mostly thoughts, costing around 2 skilled author-months. (A dungeon run does not need to be published-novel-quality literature, only coherent in how the world responds to characters!) A guess as to the necessary database size is ~100 runs, for about 30M words and 20 author-years (though we may test first with fewer/shorter runs).

2. Retrain a large pretrained language model, like GPT-3 or T5

A reasonable guess is that performance more like GPT-3 than GPT-2 (at least) is needed to really make use of the thought-intermediates, but in lieu of a large pretrained language model we could plausibly attempt to train our own smaller one.

Our own initial idea for the ML architecture would be to retrain one mode of the model to take (some suffix window of) the history units and predict thoughts, by minimizing the log loss of the generated thought against the next thought in the run, and to retrain a second mode to take (some suffix window of) the history units plus one thought, and produce a prompt, by minimizing the log loss of the generated prompt against the next prompt in the run.

Imaginably, this could lead to the creation of dungeon runs that are qualitatively “more coherent” than those generated by existing methods. The primary goal, however, is that the thought-producing fragment of the system gives some qualitative access to the system’s internals that, e.g., allow an untrained observer to accurately predict the local developments of the story, and occasionally answer questions about why things in the story happened; or that, if we don’t like how the story developed, we can intervene on the thoughts and get a different story in a controllable way.

Motivation for this project

Many alignment proposals floating around in the community are based on AIs having human-interpretable thoughts in one form or another (e.g., in Hubinger’s survey article and in work by Christiano [LW · GW], by Olah [AF · GW], and by Leike). For example, this is implicit in the claim that humans will be able to inspect and understand the AI’s thought process well enough to detect early signs of deceptive behavior. Another class of alignment schemes is based on the AI’s thoughts being locally human-esque in some fashion that allows them to be trained against the thoughts of actual humans.

I (Nate) personally don’t have much hope in plans such as these, for a variety of reasons. However, that doesn’t stop Eliezer and me from wanting to rush ahead and start gathering empirical evidence about how possible it is in practice to get modern AI systems to factor their cognition through human-interpretable visible intermediates.

Modern AIs are notably good at crafting English text. Some are currently used to run dungeons (with modest success). If you wanted to look at the place where current AIs excel the most in crafting artifacts, among the artifacts they are best and most impressive at crafting are English paragraphs.

Furthermore, compared to many other things AIs have learned to do, if you consider the task of running a responsive text dungeon, it seems relatively possible to ask a (relatively unusually) introspective human author to write down their thoughts about how and why they would generate the next prompt from the user’s input.

So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.

This sort of interpretability is distinct from the sort of transparency work in something like Circuits (led by Chris Olah) — while Circuits is trying to “open the black box” of machine learning systems by directly looking at what is happening inside of them, the project proposed here is just attempting the less ambitious task of having black-box models output interpretable intermediates producing explanations for their behavior (but how such black box models might go about doing that internally is left unconstrained). The reason for our focus on this particular project of visible thoughts isn’t because we believe it to be better or more fruitful than Circuits-style transparency (we have said for years that Circuits-style research deserves all possible dollars that can be productively spent on it), but just because it’s a different approach where it might also be possible to push progress forward.

Note that proponents of alignment strategies that involve human-esque thoughts (such as those linked above) do not necessarily endorse this particular experiment as testing any of their key uncertainties or confusions. We welcome suggested tweaks to the experiment (in the comments of the version of this announcement as it occurs on LessWrong) from any such proponents, to render it a better test of your ideas. (Though even if it doesn’t sate your own curiosity, we expect to learn some things ourselves.)

The main thing this project needs is a dataset, so MIRI is starting on producing that dataset. It’s plausible to us that GPT-3 will prove wholly unable to make use of this dataset; even if GPT-3 can’t, perhaps GPT-4 or some other future system will be able to.

There are additional more general reasons to work on this project. Specifically, it seems to me (Nate) and to Eliezer that capacity to execute projects such as this one is the current limiting bottleneck on MIRI. By pursuing this project, we attempt to resolve that bottleneck.

We hope, through this process, to build our capacity to execute on a variety of projects — perhaps by succeeding at the stated objective of building a dataset, or perhaps by learning about what we’re doing wrong and moving on to better methods of acquiring executive talent. I’ll say more about this goal in “Motivation for the public appeal” below.

Notes on Closure

I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs. Locally these might look like increased coherence of the overall narrative arc, increased maintenance of local story tension, and increased consistency in the described world-state over the course of the run. If successful, the idiom might generalize further; it would have to, in order to play a role in later alignment of AGI.

As a matter of policy, whenever a project like this has plausible capabilities implications, we think the correct response is to try doing it in-house and privately before doing it publicly — and, of course, only then when the alignment benefits outweigh the plausible capability boosts. In this case, we tried to execute this project in a closed way in mid-2021, but work was not proceeding fast enough. Given that slowness, and in light of others publishing related explorations and results, and in light of the relatively modest plausible capability gains, we are moving on relatively quickly past the attempt to do this privately, and are now attempting to do it publicly.

Motivation for the public appeal

I (Nate) don’t know of any plan for achieving a stellar future that I believe has much hope worth speaking of. I consider this one of our key bottlenecks. Offering prizes for small projects such as these doesn’t address that bottleneck directly, and I don’t want to imply that any such projects are going to be world-saving in their own right.

That said, I think an important secondary bottleneck is finding people with a rare combination of executive/leadership/management skill plus a specific kind of vision. While we don’t have any plans that I’m particularly hopeful about, we do have a handful of plans that contain at least a shred of hope, and that I’m enthusiastic about pursuing — partly in pursuit of those shreds of hope, and partly to build the sort of capacity that would let us take advantage of a miracle if we get one.

The specific type of vision we’re looking for is the type that’s compatible with the project at hand. For starters, Eliezer has a handful of ideas that seem to me worth pursuing, but for all of them to be pursued, we need people who can not only lead those projects themselves, but who can understand the hope-containing heart of the idea with relatively little Eliezer-interaction, and develop a vision around it that retains the shred of hope and doesn’t require constant interaction and course-correction on our part. (This is, as far as I can tell, a version of the Hard Problem of finding good founders, but with an additional constraint of filtering for people who have affinity for a particular project, rather than people who have affinity for some project of their own devising.)

We are experimenting with offering healthy bounties in hopes of finding people who have both the leadership/executive capacity needed, and an affinity for some ideas that seem to us to hold a shred of hope.

If you’re good at this, we’re likely to make you an employment offer.

The Payouts

Our total prize budget for this program is $1.2M. We intend to use it to find a person who can build the dataset in a way that scales, presumably by finding and coordinating a pool of sufficiently introspective writers. We would compensate them generously, and we would hope to continue working with that person on future projects (though this is not a requirement in order to receive the payout).

We will pay $20k per run for the first 10 thought-annotated runs that we accept. We are willing to support applicants in producing these runs by providing them with resources up-front, including small salaries and budgets for hiring writers. The up-front costs a participant incurs will be deducted from their prizes, if they receive prizes. An additional $1M then goes to anyone among the applicants who demonstrates the ability to scale their run-creating process to produce 100 runs. Our intent is for participants to use some of that money to produce the 100 runs, and keep the remainder as a prize. If multiple participants demonstrate similar abilities to scale at similar quality-levels and similar times, the money may be split between them. We plan to report prize awards publicly.

In principle, all you need to do to get paid for thought-annotated dungeon runs is send us runs that we like. If your run is one of the first 10 runs, or if you’re the first to provide a batch of 100, you get the corresponding payment.

That said, whether or not we decide to pay for a run is entirely and unilaterally up to Eliezer Yudkowsky or his delegates, and will depend on whether the run hits a minimum quality bar. Also, we are willing to pay out from the $1M prize/budget upon becoming convinced that you can scale your process, which may occur before you produce a full 100 runs. We therefore strongly recommend getting in contact with us and proactively making sure that you’re on the right track, before sinking large amounts of time and energy into this project. Our senior research staff are willing to spend time on initial conversations and occasional check-ins. For more information on our support resources and how to access them, refer to the support and application sections below.

Note that we may tune or refine the bounty in response to feedback in the first week after this post goes live.

Support

We intend to offer various types of support for people attempting this project, including an initial conversation; occasional check-ins; office space; limited operational support; and certain types of funding.

We currently expect to have (a limited number of) slots for initial conversations and weekly check-ins, along with (a limited amount of) office space and desks in Berkeley, California for people working on this project. We are willing to pay expenses, and to give more general compensation, in proportion to how promising we think your attempts are.

If you’d like to take advantage of these resources, follow the application process described below.

Application

You do not need to have sent us an application in order to get payouts, in principle. We will pay for any satisfactory run sent our way. That said, if you would like any of the support listed above (and we strongly recommend at least one check-in to get a better understanding of what counts as success), complete the following process:

Describe the general idea of a thought-annotated dungeon run in your own words.
Write 2 (thought, prompt, thought, action, thought, outcome) sextuples you believe are good, 1 you think is borderline, and 1 you think is bad.
Provide your own commentary on this run.
Email all this to projects@intelligence.org.

If we think your application is sufficiently promising, we’ll schedule a 20 minute video call with some senior MIRI research staff and work from there.

106 comments

Comments sorted by top scores.

comment by StellaAthena · 2021-11-30T15:32:03.683Z · LW(p) · GW(p)

Hi! Co-author of the linked “exploration” here. I have some reservations about the exact request (left as a separate comment [LW(p) · GW(p)]) but I’m very excited about this idea in general. I’ve been advocating for direct spending on AI research as a place with a huge ROI for alignment research for a while and it’s very exciting to see this happening.

I don’t have the time (or aptitude) to produce a really high quality dataset, but I (and EleutherAI in general) would be happy to help with training the models if that’s desired. We’d be happy to consult on model design or training set-up, or to simply train the models for you all. No compensation necessary, just excited to contribute to worthwhile alignment research.

Replies from: beth-barnes, NicholasKross

↑ comment by Beth Barnes (beth-barnes) · 2022-05-13T03:17:33.698Z · LW(p) · GW(p)

IMO Eleuther should probably spend more time doing things like this and less on scaling LMs

↑ comment by Nicholas / Heather Kross (NicholasKross) · 2022-01-02T03:30:02.619Z · LW(p) · GW(p)

Can confirm: Eleuther is awesome, I don't know how to do any of this, but keep offering big prizes and I (and others) will follow them.

comment by StellaAthena · 2021-11-30T12:59:44.394Z · LW(p) · GW(p)

What is the purpose of requesting such extremely long submissions? This comes out to ~600 pages of text per submission, which is extremely far beyond anything that current technology could leverage. Current NLP systems are unable to reason about more than 2048 tokens at a time, and handle longer inputs by splitting them up. Even if we assume that great strides are made in long-range attention over the next year or two, it does not seem plausible to me to anticipate SOTA systems in the near future to be able to use this dataset to its fullest. There’s inherent value in a more diverse set of scenarios, given the strong propensity of language models to overfit on repeated data. While this isn’t strictly speaking talking about repeating data, I am under the strong impression that having more diverse short scripts is going to train a much better mode than less diverse long scripts, assuming that the short scripts are still at or beyond the maximum context length a language model can handle.

For the same reasons it is challenging to leverage, I think that this will also be very challenging to produce. I think that changing the request to 100 different 6 page (10 step) or 10 different 60 page (100 step) stories would be a) much easier to produce and b) much more likely to actually help train an AI. It also allows you to pear down the per-submission payouts, assuaging some concerns in the comments about the winner-take-all and adversarial nature of the competition. If you offer $20 per 10-step story for 1,000 stories it greatly reduces the chances that someone will end up spending a ton of effort but be unable to get it in on time for the reward.

To put the length of this in prospective, a feature length movie script is typically around 100-130 pages. The ask here is to write 1-2 novels, or 5-6 movie scripts. That’s a massive amount of writing, and not something anyone can complete quickly.

Replies from: Eliezer_Yudkowsky, ete, Chris_Leong, delton137

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-30T20:25:14.633Z · LW(p) · GW(p)

1: I expect that it's easier for authors to write longer thoughtful things that make sense;
2: MIRI doesn't just target the AI we have, it targets the AI we're afraid we'll get;
3: Present-day use-cases for dungeons are a long-range problem even if they're currently addressed with short-range technology.

Answer 1: Longer is easier to write per-step.

Fitting a coherent story with interesting stuff going on into 100 steps, is something I expect to be much harder for a human author than fitting that story into 1000 steps. Novels are famously easier to write on a page-level basis than short stories.

If you take zombies attacking a magical academy for 1000 steps, you might get something that looks like a coherent quest. If you take zombies attacking a magical academy for 100 steps, I think you get something that looks like a quest that was just getting started when the dataset ran out... unless the author has somehow carefully figured out a plot that will, given unknown user actions, get somewhere interesting within 100 steps, which sounds much harder for the author I imagine; they can't just pick a premise, run with it, and make stuff up as they go along. This, indeed, is why I didn't produce a nice complete shorter run to show everyone as an example - because that would have been much harder.

Yes, producing a longer run may take somebody a month or two - though not the same amount of time it should take to produce a carefully crafted novel or short story of the same length. But I would expect it to be harder and more stressful to ask them to produce 10x the runs that are 1/10 the length. Especially if we asked authors to produce in medias res fragments taken from the middles or ends of imaginary longer quests not shown, so that the dataset contained windows into the middles and ends of quests, not just beginnings of quests.

I think Answer 1 is the actual dominant consideration in my reasoning. If I believed it was much easier per data element to ask authors to produce shorter outtakes from imaginary longer quests, I would at least be asking for 5 long runs and 50 short fragments, not 10 long runs, despite answers 2 and 3.

Answer 3: The real use-case is for long-range coherence.

If this avenue into transparency turns out to go anywhere on a larger strategic scale, it will be because the transparency-inspired tech was useful enough that other developers piled on to it. This, no offense to the heroic Chris Olah, is one of the major concerns I have about transparency via microscopes - that it doesn't pay off in easy immediate rewards for the usual run of researchers that follow only immediate trails of sweetness in their easily-visible environment.

The present-day use-case for AI dungeons that inspires some user enthusiasm is fundamentally a long-range problem, being addressed with short-range technology, which produces corresponding weirdness. (In the dataset we're asking for, I baked in an approach that I'm guessing might be helpful; asking the human authors to write long-range notes to themselves, in hopes that an AI can be trained to write long-range notes to itself.) If this stuff takes off, I'm guessing, it takes off because somebody figured out something that works for the actual use-case of the longer-range coherence challenge. I don't want to freeze into the dataset the weird limitations of our current technology, and make it be useful only for training dungeons that are weird the same way 2021 dungeons are weird.

If you're a user happy with incoherent dungeon runs, the present-day tech is great for you, but maybe your demand for internal reasoning isn't as strong either.

Answer 2: It won't be 2021 forever.

MIRI (to some degree) targets the AI we're afraid we'll get, not the AI we have today. An AI with a modern-short attention span is less worrisome than if somebody gets TransformerXL or axial transformers or whatevs to really start working. It's longer-range cognition and longer-range thinking that we want to align. A system that can read through a book is scarier than one which can think about one page. At least to me, it seems not clear that the key phenomena to be explored will necessarily appear in the page case rather than the book case. You would also expect scarier systems to have an easier time learning without overnarrowing from 100 big examples instead of 10,000 small examples. If it turns out nobody can target our dataset today, we can toss it on the table as a challenge and leave it there for longer. We've been around for 21 years; we can afford to spend at least some of our budget on longer-term planning. I'm not very much of a gradualist, but I do mostly expect that we see AIs that can read more than a page, and learn from less diverse samples, before the world ends.

Replies from: StellaAthena, ete, Padure

↑ comment by StellaAthena · 2021-12-02T05:52:39.103Z · LW(p) · GW(p)

1: I expect that it's easier for authors to write longer thoughtful things that make sense;

I pretty strongly disagree. The key thing I think you are missing here is parallelism: you don't want one person to write you 100 different 600 page stories, you one person to organize 100 people to write you one 600 page story each. And it's a lot easier to scale if you set the barrier of entry lower. There are many more people who can write 60 page stories than 600 page stories, and it's easier to find 1,000 people to write 60 pages each than it is to find 100 people to write 600 pages each. There's also much less risk on both your side and theirs. If someone drops out half way through writing you lose 30 pages not 300.

Based on this comment:

I state: we'd be happy, nay, ecstatic, to get nice coherent complete shorter runs, thereby disproving my concern that short runs won't be possible to complete, and to pay for them proportionally.

I'm now under the impression that you'd be willing to pay out the 20k for 10 runs of 100 steps each (subject to reasonable quality control) and bringing that about was my main goal in commenting.

The other major worry I have about this pitch is the experimental design. I'm still happy you're doing this, but this doesn't seem to be the best project crafting in my mind. Briefly my concerns are:

This is a very topically specific ask of unclear generalization. I would prefer a more generic ask that is not directly connected to D&D.
In my experience training large language models, the number of examples is more important than the length of examples. Training on 100 shorter sequences is better than training on 10 longer sequences if the total length is the same. In particular, I think "You would also expect scarier systems to have an easier time learning without overnarrowing from 100 big examples instead of 10,000 small examples." is not clearly true and very plausibly false.
Using this dataset in a meaningful fashion requires making a priori unrelated breakthroughs, making it overly inaccessible. I think that your comment "I don't want to freeze into the dataset the weird limitations of our current technology, and make it be useful only for training dungeons that are weird the same way 2021 dungeons are weird," is thinking about this the wrong way. The goal should be to maximize the time that we can effectively use this dataset, not be content with the fact that one day it will be useful.
This is a pilot for the real thing you're after, but the "pilot" is a multi-year million-dollar effort. That doesn't seem like a very well designed pilot to me.

↑ comment by plex (ete) · 2021-11-30T22:50:28.752Z · LW(p) · GW(p)

These are reasonable points, but I am curious about whether you would accept a high-quality run of shorter (but still considerable) length for a payout of <steps>/1000 of $20,000, and approximately the lower bound of run length which seems likely to be valuable? Producing 600 pages of text is an extremely big commitment for uncertain gains, especially with the potential to run out of early slots and no guarantee that it will be included in the 100 later, giving people the option to do even modestly smaller chunks may mean much greater uptake and more high quality work to chose from.

Replies from: Eliezer_Yudkowsky, sd-marlow

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-30T23:57:23.039Z · LW(p) · GW(p)

I state: we'd be happy, nay, ecstatic, to get nice coherent complete shorter runs, thereby disproving my concern that short runs won't be possible to complete, and to pay for them proportionally.

Replies from: Tapatakt

↑ comment by Tapatakt · 2021-12-01T12:27:01.963Z · LW(p) · GW(p)

So, hypothetically, if you receive only nice coherent complete 100-steps runs, will you pay $2000 for the first 100?

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-12-01T21:49:40.028Z · LW(p) · GW(p)

<non-binding handwave, ask again and more formally if serious>I'd say we'd pay $2000/each for the first 50, but after that we might also want 5 longer runs to train on in order to have the option of training for longer-range coherence too. I suppose if somebody has a system to produce only 100-step runs, and nobody offers us 1000-step runs, we'd take what we could get.</non-binding>

↑ comment by SD Marlow (sd-marlow) · 2021-12-02T00:36:09.109Z · LW(p) · GW(p)

Number of steps matters as 1,000 would be (roughly) 12 hours of play. Current ML systems will never last that long, but wondering what the natural play length would be for most. 3 hours? That would be around 250 steps. Without multiple examples of what works and what doesn't, I don't think there should be anyone working toward the full 300,000 word target (yet). $500 for 30k word samples (thru the end of the year)? I still think there is to much focus on having "thoughts" that reflect how current ML systems are trained, so best to see what happens organically?

Edit: Saw that a "best example" of what AI Dungeon can do (story called The Long Lost Queen) was 264 actions, so that fits with my estimate. *have to also note a large number of fans are using them for "non-dungeon" fan fiction of an adult nature, which brings into question how story narratives might have a link to the content (ie, how a DM thinks about a combat scene is going to be different than one crafted for sexual content). Do the samples need to represent different genre?

Replies from: Padure

↑ comment by Padure · 2021-12-06T20:08:19.197Z · LW(p) · GW(p)

non-dungeon" fan fiction of an adult nature

From what I remember they were supposed to be censoring/blocking things like that.

Have they setup own instance or got around censors?

Replies from: iceman

↑ comment by iceman · 2021-12-09T17:57:53.542Z · LW(p) · GW(p)

In wake of the censorship regime that AI Dungeon implemented on OpenAI's request, most people moved to NovelAI, HoloAI, or the open source KoboldAI run on colab or locally. I've set up KoboldAI locally and while it's not as featureful as the others, this incident is another example of why you need to run code locally and not rely on SaaS.

For background, you could read 4chan /vg/'s /aids/ FAQ ("AI Dynamic Storytelling"). For a play-by-play of Latitude and OpenAI screwing things up, Remember what they took from you has the history of them leaking people's personal stories to a 3rd party platform.

↑ comment by Padure · 2021-12-06T20:04:26.813Z · LW(p) · GW(p)

You are completely missing that it turns into lottery from perspective of potential writer.

You are asking people to spend enormous amount of work on writing 600 pages and hope that what they and what you consider as high-quality will align. AND that 10 slots will not be used up before they will complete.

This way only people willing to take big risks and with plenty of spare time will remain.

I would strongly suggest to start from something shorter.

BTW, is 60 000 pages sufficient to train some pattern matching like GPT-3?

Replies from: tanagrabeast

↑ comment by tanagrabeast · 2021-12-06T23:48:09.313Z · LW(p) · GW(p)

This is about where I'm at, as well. I've been wrestling with the idea of starting a run myself, but one of my qualifying traits (I teach creative writing) also means I work full time and have little hope of beating out ten people who don't. So much the better, I say, so long as the work gets done well and gets done soon...

...but if, eight months from now, much of the budget is still on the table because of quality issues, it may be because people me sat on our hands.

Hopefully, someone will emerge early to work around this issue, if it turns out to be one. I, for one, would love to be able to turn in a sample and then be offered a credible good-faith assurance that if my run is completed at same quality by such and such date, a payment of x will be earned. But as it stands, the deadline is "whenever that fastest mover(s) get there". Who knows when that will be? Any emergent executive candidate making me a deal might be made a liar by a rival who beats them to the jackpot.

↑ comment by plex (ete) · 2021-11-30T17:10:49.476Z · LW(p) · GW(p)

Strong upvote. The argument from training diversity seems plausible, but the key point is that when trying to point large amounts of effort at writing content having it be delivered in smaller chunks than a novel would allow many more people to risk putting in time and learn whether they can contribute, and ultimately raise quality and volume substantially. It will also make it much easier to build a collaborative project around this, as people could submit their work for community review without a review taking an extremely long time and large amount of effort.

I'd also propose that the bounty be updated to allow smaller submissions relatively soon for higher visibility. MIRI could easily allow backward compatibility fairly easily by just accepting smaller submissions, without needing to reject longer ones.

If the concern is the hassle of handing out lots of smaller bounties, MIRI could accept batches of small runs and let some trusted middle-man handle the details of the distribution.

↑ comment by Chris_Leong · 2024-05-03T15:41:06.133Z · LW(p) · GW(p)

This comes out to ~600 pages of text per submission, which is extremely far beyond anything that current technology could leverage. Current NLP systems are unable to reason about more than 2048 tokens at a time, and handle longer inputs by splitting them up. Even if we assume that great strides are made in long-range attention over the next year or two, it does not seem plausible to me to anticipate SOTA systems in the near future to be able to use this dataset to its fullest.

It's interesting to come across this comment in 2024 given how much things have changed already.

↑ comment by delton137 · 2021-11-30T15:06:17.461Z · LW(p) · GW(p)

I think what you're saying makes a lot of sense. When assembling a good training data set, it's all about diversity.

Replies from: oge

↑ comment by oge · 2021-11-30T19:22:55.121Z · LW(p) · GW(p)

It'd be hard for humans to compete with AI unless humans can communicate with the AI in reasonable-sized chunks e.g. a 100-page document. Me, I think we should chat in 10-page documents or less ᾓ7ἿE‍♀️.

comment by Adele Lopez (adele-lopez-1) · 2021-11-30T04:31:53.230Z · LW(p) · GW(p)

This plausibly looks like an existing collection of works which seem to be annotated in a similar way: https://www.amazon.com/Star-Wars-Screenplays-Laurent-Bouzereau/dp/0345409817

Replies from: oge

↑ comment by oge · 2021-11-30T19:30:03.690Z · LW(p) · GW(p)

FYI the Faulkner annotated screenplays have about 3 sentences of annotation for every 10 pages.

link text

comment by jessicata (jessica.liu.taylor) · 2021-12-01T05:33:53.875Z · LW(p) · GW(p)

How do you think this project relates to Ought? Seems like the projects share a basic objective (having AI predict human thoughts had in the course of solving a task). Ought has more detailed proposals for how the thoughts are being used to solve the task (in terms of e.g. factoring a problem into smaller problems, so that the internal thoughts are a load-bearing part of the computation rather than an annotation that is predicted but not checked for being relevant).

So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2021-12-04T00:38:06.889Z · LW(p) · GW(p)

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

Might depend whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that story text, then the story is being generated conditional on it.

Personally I think I might go for a two-phase process. Do the task with a lot of transparent detail in phase 1. Summarize that detail and filter out infohazards in phase 2, but link from the summary to the detailed version so a human can check things as needed (flagging links to plausible infohazards). (I guess you could flag links to parts that seemed especially likely to be incorrigible/manipulative cognition, or parts of the summary that the summarizer was less confident in, as well.)

comment by KanderShaw · 2021-11-30T10:55:21.712Z · LW(p) · GW(p)

This challenge seems incredibly intriguing and well put-together, and I certainty admire how a million dollars have been used to improve DnD specific AI!

I believe a small team (3-4) of dedicated writers, co-ordinating each other online, has a genuine shot at writing a quality Story Arc quickly and bagging $20,000, to be spilt proportional to work. I have ideas about how to quickly streamline the process of multiple people working on one single Annotated Dungeon Run, and think we can really expedite this process in a number of significant ways. If interested, please contact me through my account on LessWrong- we can swap drafts, talking about writing commitments and get a good feel for fit before committing to anything.

Also, to those with the enterprise to attempt to scale the project up, I believe I can find around 10-20 people, (given enough notice), with considerable storytelling ability willing to write annotated dungeon runs as a part-time occupation. I have unique access to exclusive TTRPG internet communities and artsy university students looking for casual work over summer- and I would be happy to find and direct them to you. If you want to work a deal out, also message me on LessWrong.

I have some ideas about how to scale up and extradite this project, and am happy to help bounce ideas around.

comment by Ronny Fernandez (ronny-fernandez) · 2021-11-30T02:55:38.772Z · LW(p) · GW(p)

For anyone who may have the executive function to go for the 1M, I propose myself as a cheap author if I get to play as the dungeon master role, or play as the player role, but not if I have to do both. I recommend taking me as the dungeon master role. This sounds genuinely fun to me. I would happily do a dollar per step.

I can also help think about how to scale the operation, but I don’t think I have the executive function, management experience, or slack to pull it off myself.

I am Ronny Fernandez. You can contact me on fb.

Replies from: ete, antanaclasis, MossPiglet, lsusr

↑ comment by plex (ete) · 2021-11-30T15:23:24.220Z · LW(p) · GW(p)

I'm setting up a place for writers and organizers to find each other, collaborate, and discuss this; please join the Discord. More details in this comment [LW(p) · GW(p)].

↑ comment by antanaclasis · 2021-11-30T08:37:16.677Z · LW(p) · GW(p)

I similarly offer myself as an author, in either the dungeon master or player role. I could possibly get involved in the management or technical side of things, but would likely not be effective in heading a project (for similar reasons to Brangus), and do not have practical experience in machine learning.

I am best reached through direct message or comment reply here on Lesswrong, and can provide other contact information if someone wants to work with me.

↑ comment by MossPiglet · 2021-11-30T15:06:49.786Z · LW(p) · GW(p)

I'd also be interested in contributing with writing for pay in this fashion, and perhaps helping with the executive side as well. You can reach me on fb, Erik Engelhardt.

↑ comment by lsusr · 2021-12-01T03:20:47.335Z · LW(p) · GW(p)

Does your offer include annotating your thoughts too or does it only include writing the prompts?

Replies from: ronny-fernandez

↑ comment by Ronny Fernandez (ronny-fernandez) · 2021-12-02T19:37:50.831Z · LW(p) · GW(p)

After trying it, I've decided that I am going to charge more like five dollars per step, but yes, thoughts included.

comment by Jared Kaplan (jared-kaplan) · 2021-12-01T04:33:25.529Z · LW(p) · GW(p)

I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, [LW(p) · GW(p)] made several great points that I agree with and that are similar in spirit to my suggestions.

Anyway, based on a fairly similar motivation of wanting to be able to “ask a LM what it’s actually thinking/expecting”, combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.

When inserting this metadata for pretraining, we made sure to do so completely randomly, i.e. a book title might be inserted anywhere within a book (maybe several times in different context windows etc). We added separate <META_START> and <META_STOP> tokens to indicate the beginning and end of metadata, but that’s it. The motivation was to ensure that this “thought stream” was in-distribution at all positions within the context, while conversely making it easy to never sample it (by declining to sample the start token). This means that we can both use it when prompting, and use it as a query -- ie we can ask the model, at any time, “how likely is this to be from the NYTimes vs from 4Chan” by evaluating the logprobs of text enclosed by the tokens. With this specification, one can do a kind of “metadata beam search” where you prompt, sample, evaluate, cull, and repeat.

We generally found that this sort of works, in that the mutual information between these labels and the text goes up with model size, and you can use these metadata tags as filters to get rid of some of the most irrelevant text. But the results weren’t immediately stunning, and so we didn’t investigate them much further (to be clear, this was mostly because we prioritized other things more highly, rather than because we don't view this as worthwhile).

So my general suggestion would be to start off with something very cheap first, like the above. At the very least, this will mean that when you finetune on higher quality data, your format is already on-distribution. But hopefully it’ll also help you to calibrate expectations and give you a better sense for exactly what kind of data you want to shell out money for.

Beyond that, I agree with what Stella said -- it seems easier and better to focus first on shorter passages, both for human-sourcing reasons, and for diversity. Typically the benefits we see from finetuning grow with something like the log of the dataset size, so a small number of shorter examples should quickly give you an idea of what kind of progress you can expect.

If it were me, I’d also try to increase RoI by asking people to add commentary to existing books, rather than having people write from scratch. And I’d suggest making the formatting as simple and general as possible, both so that you can use and investigate it very flexibly, and to minimize regret if you change your mind in the future.

Replies from: beth-barnes, Joe_Collman

↑ comment by Beth Barnes (beth-barnes) · 2021-12-06T19:35:02.875Z · LW(p) · GW(p)

combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.

I'm sympathetic to the desire to keep things simple, but I actually think that getting good at scalably collecting rich human data is probably the most valuable part of the project. I'd be really excited to see Anthropic either building an excellent internal human data team, or figuring out how to work productively with one of the existing human data provider startups.

↑ comment by Joe Collman (Joe_Collman) · 2021-12-01T06:28:46.975Z · LW(p) · GW(p)

If it were me, I’d also try to increase RoI by asking people to add commentary to existing books, rather than having people write from scratch.

This thought occurred to me - specifically, there's likely quite a bit of interactive fiction out there with a suitable format which could be post-hoc thought annotated (might also be interesting to include a few different branches).

However, I don't think it gives us the same thing: presumably we'd want the thoughts to be those that occur at the time and contribute to the writing of the later narrative. Doing post-hoc annotations by trying to infer what a writer might plausibly have thought seems a quite different process. Perhaps that wouldn't matter for some purposes, but I imagine it would for others (??).

While it'd be possible to check that post-hoc annotations passed a [human reader can't tell the difference] test, this wouldn't eliminate the difference - it'd only tell us it's non-obvious to humans.

Replies from: Eliezer_Yudkowsky, sd-marlow

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-12-01T22:42:07.869Z · LW(p) · GW(p)

I initially tried doing post-hoc annotation and found it much more difficult than thinking my own actual thoughts, putting them down, and writing the prompt that resulted. Most of the work is in writing the thoughts, not the prompts, so adding pregenerated prompts at expense of making the thoughts more difficult is a loss.

↑ comment by SD Marlow (sd-marlow) · 2021-12-01T16:38:49.903Z · LW(p) · GW(p)

Agree that it's the 'crafting' part that is what matters, and I don't think we can say a writer/DM is going to explicitly be thinking about all the details of the story at each turn. From the examples.. well as a side effect of doing AI research is that you can't help but read the text of the story and see that the "thoughts" about it are just picking details in a way that even SOTA ML systems have a problem with. They don't read as actual notes about the process. Perhaps there needs to be a request for samples, with a 30k word limit (so no one invests too much time in something that might not be used), and a focus on capturing the process of writing a story as the plot unfolds.

comment by Bojangles9272 · 2021-11-30T11:50:58.689Z · LW(p) · GW(p)

Hey, I wanted to clarify my thoughts on the concrete AI problem that is being solved here. No comment on the fantastic grant making/give-away scheme.

I don't have much expertise on the mechanisms of the GPT-3 systems, but I wonder if there is a more efficient way in providing human comprehendible intermediaries that expose the workings of the algorithm.

My worry is that many of the annotated thoughts imputed by authors are irrelevant to the actual process of design the AI goes through to create it's output. Asking the machine to produce a line of 'thoughts' alongside it's final statement is fair-play, although this doesn't seem to solve the problem of creating human comprehendible intermediaries, but instead gives the AI a pattern-matching/prediction task similar to what it goes through to create the original output. Wouldn't it be the case that the 'thoughts' the machine creates serve no more effect on the process of calculation than the original output (prompt)?

This process still seems to be serve a rudimentary function of indirectly shedding more light on the processes of calculation, much the same as how a bigger prompt would. Yet puzzlingly, we in fact want to "get different sensible outputs by intervening on the thoughts", which this indicates we expect thoughts to have a effect on the calculation of the final prompt. I suppose we could feed through the output for thoughts into the creation of the prompt, but my intuition suggests this would limit the complexity of the prompt by shackling it's creation to an unnecessary component, the thought.

I say intuition because, again, I have little knowledge of this operation of this algorithm. Most of my musing here are just guesses!

That being said, it seems to me that another way of tackling this critical problem is by identifying the processes that the algorithm DOES use to create the output already, and then finding data that expresses those processes with human-compatible annotations. Instead of imposing another method of calculation in the form of Thoughts, maybe just make the existing method more comprehendible?

If I'm missing something frightfully obvious here, or just barking up the wrong tree please let me know where I'm going wrong!

Replies from: Joe_Collman, delton137

↑ comment by Joe Collman (Joe_Collman) · 2021-11-30T19:16:58.669Z · LW(p) · GW(p)

I think you're essentially correct - but if I understand you, what you're suggesting is similar to Chris Olah et al's Circuits work (mentioned above in the paragraph starting "This sort of interpretability is distinct..."). If you have a viable approach aiming at that kind of transparency, many people will be eager to provide whatever resources are necessary.
This is being proposed as something different, and almost certainly easier.

One specific thought:

but my intuition suggests this would limit the complexity of the prompt by shackling it's creation to an unnecessary component, the thought

To the extent that this is correct, it's more of a feature than a bug. You'd want the thoughts to narrow the probability distribution over outputs. However, I don't think it's quite right: the output can still have just as much complexity; the thoughts only serve to focus that complexity.

E.g. consider [This will be a realist novel about 15th century France] vs [This will be a surrealist space opera]. An output corresponding to either can be similarly complex.

↑ comment by delton137 · 2021-11-30T15:15:04.157Z · LW(p) · GW(p)

I don't have much direct experience with transformers (I was part of some research with BERT once where we found it was really hard to use without adding hard-coded rules on top, but I have no experience with the modern GPT stuff). However, what you are saying makes a lot of sense to me based on my experience with CNNs and the attempts I've seen to explain/justify CNN behaviour with side channels (for instance this medical image classification system that also generates text as a side output).

See also my comment on Facebook.

comment by Rob Bensinger (RobbBB) · 2022-01-20T00:52:24.351Z · LW(p) · GW(p)

We have now received the first partial run that meets our quality bar. The run was submitted by LessWrong user Vanilla_cabs [LW · GW]. Vanilla's team is still expanding the run (and will probably fix some typos, etc. later), but I'm providing a copy of it here with Vanilla's permission, to give others an example of the kind of thing we're looking for:

https://docs.google.com/document/d/1Wsh8L--jtJ6y9ZB35mEbzVZ8lJN6UDd6oiF0_Bta8vM/edit

Vanilla's run is currently 266 steps long. Per the Visible Thoughts Project FAQ, we're willing to pay authors $20 / step for partial runs that meet our quality bar (up to at least the first 5,000 total steps we're sent), so the partial run here will receive $5320 from the prize pool (though the final version will presumably be much longer and receive more; we expect a completed run to be about 1000 steps).

Vanilla_cabs is open to doing paid consultation for anyone who's working on this project. So if you want feedback from someone who understands our quality bar and can demonstrably pass it, contact Vanilla_cabs via their LessWrong profile [LW · GW].

comment by weft · 2021-11-30T08:58:20.819Z · LW(p) · GW(p)

I can't tell if it is purposeful that this is set up in an adversarial/ winner-take-all kind of way. It's really off-putting to me, and seems to encourage everyone being out for themselves, rather than collaboration. Particularly for such an inherently collaborative product. Maybe Nate and Eliezer just expect cooperation to fail?

Anyways, if people DO want to attempt some kind of collaboration... EDIT- Don't join my Facebook group, join plex's Discord linked in the comment below instead

Replies from: Eliezer_Yudkowsky, lsusr, ete, oge

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-30T20:29:02.725Z · LW(p) · GW(p)

We pay out $20,000 per run for the first 10 runs, as quality runs are received, not necessarily all to one group. If more than one group demonstrates the ability to scale, we might ask more than one group to contribute to the $1M 100-run dataset. Them cooperating with each other would hardly be a problem. That said, a lot of the purpose of the 10-run trial is exactly to locate executives or groups that can scale - and maybe be employed by us again, after the prize ends - so everybody getting together to produce the first 10 runs, and then disbanding, in a process that doesn't scale to produce 100 runs, is not quite what we are hoping for here!

↑ comment by lsusr · 2021-12-02T00:23:52.341Z · LW(p) · GW(p)

It seems to me that their priority is find a pipeline that scales. Scaling competitions are frequently long-tailed, which makes them winner-take-all. A winner-take-all system has the bonus benefit of centralized control. They only have to talk to a small number of people. Working through a single distributor is easier than wrangling a hundred different authors directly.

↑ comment by plex (ete) · 2021-11-30T15:51:59.779Z · LW(p) · GW(p)

I also like the idea of collaboration and figuring out a way to share gains from the bounty in a way which people helping each other out, and have set up a Discord for real time collaboration. I'm also committing to not making any profit from this, though I am open to building systems which allow organizers other than me to be compensated.

↑ comment by oge · 2021-11-30T19:32:38.320Z · LW(p) · GW(p)

Signal-boosting this. Here's to more teams working together to get this bounty! ᾔ2

comment by Beth Barnes (beth-barnes) · 2021-12-06T19:26:52.284Z · LW(p) · GW(p)

I am very excited about finding scalable ways to collect large volumes of high-quality data on weird, specific tasks. This seems very robustly useful for alignment, and not something we're currently that good at. I'm a bit less convinced that this task itself is particularly useful.

Have you reached out to e.g. https://www.surgehq.ai/ or another one of the companies that does human-data-generation-as-a-service?

Replies from: beth-barnes, beth-barnes

↑ comment by Beth Barnes (beth-barnes) · 2021-12-08T05:48:15.216Z · LW(p) · GW(p)

Random small note - the 'dungeon' theme is slightly ...culturally offputting? or something for me, as someone who's never been into this kind of thing or played any of these and is therefore a bit confused about what exactly this involves, and has vague negative associations (I guess because dungeons sound unpleasant?). I wonder if something a bit blander like a story, play, or AI assistant setting could be better?

↑ comment by Beth Barnes (beth-barnes) · 2021-12-08T05:44:01.821Z · LW(p) · GW(p)

Someone who wants to claim the bounty could just buy the dataset from one of the companies that does this sort of thing, if they're able to produce a sufficiently high-quality version, I assume? Would that be in the spirit of the bounty?

Replies from: billzito

↑ comment by billzito · 2021-12-15T06:22:22.939Z · LW(p) · GW(p)

I don't think data companies can deliver on this complex of a task without significant oversight.

comment by weft · 2021-11-30T15:44:37.670Z · LW(p) · GW(p)

IDEAS THREAD:

Team up with friends who already play DnD or write glowfic. Less scalable but can grab the $20k.
Similarly, if you're unemployed/ have lots of free time just sit down and write it yourself.
Recruit from a local University. This can be very scalable if you e.g. know the creative writing professor.
Recruit from roleplaying groups or online roleplaying forums. Requires a bit more filtering than the above.
Recruit from fiverr or similar. Requires lots of initial filtering but can end up with low price. Create a series of increasingly less automated tasks as a filter (eg start with a multiple choice quiz that's automatically graded)
Ask a person who already does this kind of thing how they would go about it.
I don't want to name names publicly here, but post on BR, or talk to MR to use his team.
Use the volunteers who are posting here.
Show this post to a whole bunch of people who you think might want to grab the $20k as individuals. Let them know that if enough of them make the $20k thing that you will all team up to churn out the $1m thing, split proportionally.

comment by tanagrabeast · 2021-11-30T23:28:30.320Z · LW(p) · GW(p)

My questions are mostly about the player side, and about how deeply the DM should model the player:

Should the player be assumed to be implicitly collaborating towards a coherent, meaningful narrative, as is necessary for a long-lived TTRPG? Or might they be the type of player you often find in AI Dungeon who tries to murder and/or have sex with everything in sight?
Should players ever try to steer the story in a genre-breaking direction, like erotica or murder-hobo power fantasy? Should DMs resist these efforts or play along? If the latter, should the DM go a step further to actively intuit what this particular player would like to see happen?
Should players provide input that might be more sweeping than usable in narrative? (e.g. Take over the world!) If so, on what sort of level should the DM respond to these?
Should players be assumed to be ready to end the narrative at the ~1,000-step point?

Replies from: sd-marlow

↑ comment by SD Marlow (sd-marlow) · 2021-12-09T18:50:52.000Z · LW(p) · GW(p)

It's the playing chess against yourself problem. I've intentionally done or said "just the right thing" thru the player to get past a section, but I've also tried to resist going with my first choice or replies because the player isn't supposed to know about the world building going on in the DM's mind. One aspect of of this is the DM thinking about how to push the player into doing something, and allowing the player to not follow every planned idea. You could think of it as replay value, where there are branch points not taken, but these are still ideas that need to be captured.

I don't think manually ending at 1,000 steps will be an issue. "Player engagement" is going to be an issue before hitting the 300 step mark. I'd imaging the narrative is going to feel forced and made-up beyond that point.

comment by Aron (aron) · 2022-01-21T00:18:11.465Z · LW(p) · GW(p)

When studying the provided 30-page thought-annotated sample, I thought about the <Yo be real> command a little more. In my opinion it should be applied in the training data a little differently than how it's done. Here are my thoughts:

In the sample, there are some places where the authors carefully tried to construct “AI nonsense” that matches what we regularly see in the current tech AI dungeon prompts. The player then responses with “<Yo be real>” plus some explanation on what the AI did wrong.

(obvious example: page 17 in this sample: https://docs.google.com/document/d/1PosMUaminpsR6_czFXBBlCrzMrsDGomajgLp6Y7q4Yw/edit)

This is probably intended for re-training the current models to accept such “<Yo be real>” sudo commands and deal with them correctly. You can’t include those (in a sensible way) in training data without having the DM first make mistakes.

I see a problem here, though:

A neural network learns every reoccurring feature of the training data set. If the training data often contains erroneous thoughts leading to nonsense prompts, this is what the AI will learn. You probably don’t want that. You want a model that makes such mistakes as rarely as possible (and then react to a “<Yo be real>” appropriately). You don’t want a model that regularly makes these mistakes – that are in fact really dumb in comparison to the cleverness that the thoughts convey otherwise – and then react better on a second try only.

I think, if you aim for a smart AI that produces sensible thoughts, but still obeys the “<Yo be real>” command without question and accept the players superior judgement in these situations, no matter what, you should rather include scenes in the training data, where the DM produces “good” thoughts and prompts, and the player then calls for a “<Yo be real>” nonetheless (without a sensible justification from the training data authors perspective – witch will be the AI’s perspective after training). Then the DM should accept this (non-justified) sudo command and produce something else, something that would seem silly to the reader. But that’s the point – “<Yo be real>” should inform the DM that whatever they think is clever actually isn’t and they should try out something else “less clever” anyways.

Let me give you an example of what I have in mind (this is not very elaborate, but I think you get the idea). Continuing from page 6 in the same sample.

P: “I didn’t know undead could do that,” says Lior. He looks more scared now.
A: <Yo be real> Of course undead can’t do that! Everybody knows that!
T:
(The player makes use of the <Yo be real command>. Need to reevaluate the situation.
(The player gives us two more facts to work with.
[Undead can’t jump that high.] [Everybody knows that.]
Prota claimed that undead may be able to jump the full height of the wall.
Lior knows that this is not true. He also knows that everybody knows this, including Prota.
Lior concludes that Prota is joking.
P: Liar bursts into a roar of laughter. "Hahaha! That's definitely the best joke I've heard this year." A second later you can hear strange sounds outside the wall. First running feet. Then some scratching on stone.
T:
The loud laughter attracted the attention of the zombie army in the woods. They have very good ears.
[Undead have poor vision, but at least zombies have great hearing.]
Since zombies are fast, they reach the walls quickly.
Zombies, as undead, can’t jump the full height of the wall. But they might be able to climb the wall. (This would increase plot tension a lot.)
[Zombies are good climbers and climb the vertical wall with ease.]
The zombies are starting to climb the walls.

comment by gwern · 2021-12-01T03:20:09.563Z · LW(p) · GW(p)

Some background reading: https://www.gwern.net/docs/ai/gpt/inner-monologue/index

comment by Aron (aron) · 2022-01-18T22:58:32.249Z · LW(p) · GW(p)

I find this project very interesting and thought a lot about it in the last 2 weeks. The way I understand the main goal of the project is the following:

providing us (AI researchers) with a model that has an additional output dimension (the "thoughts")
training the model in such a way that this new dimension is semantically linked directly to the primary output dimension (the "prompt")
especially linked in some kind of temporal causality ("early" thoughts producing the prompt), not too close the the primary output (so that it contains semantic meaning that cannot be induced by interpreting the prompt alone), but not too far away either (so that it actually "causes" the prompt - as accurately as we can get it. Hence @Eliezer's authoring technique of one person writing the thoughts and another writing the prompt)
such a model could then be analyzed and experimented with in several ways. One obvious study: intervention on the thought-level and observation of the effect on the prompt level. With the big alignment goal in mind: If we can put safe guards on an AI's thoughts, before they lead to action, we are safer than if we put guards on only the actions.
I understand that the project does not aim at creating a "true" interpretation of the inner workings of a neural network. ("true" in the sense of a map reflecting the territory in a way that actually helps navigating the world / predicting the behavior of the model / helping to create an AI from scratch without training).

Upon reflecting on this goal, I noticed myself coming back to several points that I identified as the following three somewhat distinct topics:

(1) internal thought structure

I believe that - because it is stated clearly that you want the actual thoughts of a human DM to be recorded - you wish the new model to provide "thoughts" that are similar to how "we would think" (probably so that we can then later interpret these thoughts easier). From this I draw the following conclusion:

We should record the "thoughts" in such a way that the structure of the recording (=the written thoughts = the training data) matches our actual thoughts as close as possible.

When I try to introspect my thoughts when I play the role of a DM in a classical PnP RPG, I find myself not thinking a "bullet list with bracket annotations", though. Actual thoughts are often non-verbal in nature. Of course we need to press them into English sentences (to be able to process them with our current tech neural networks, and to be able to record them in the first place). But beside that, I think that there is more structure in the typical human DM's thoughts than this project tries to make use of. The different types of brackets and parentheses capture quite a bit already, but not all of the structure that I believe to observe.

I elaborate:

object level thoughts:

I keep some kind of "world model" alive in my RAM (the [square brackets] try to implement this structure a little)
I update this model "over in-game time" - and not only in reaction to player actions
My world model is not complete. There are explicit and implicit "white spots". When a player action or a plot twist leads into one of these blank areas, I consciously generate new world data.
The world model especially contains NPCs, which are models of other agents in the world that have their own agenda and may also act on their own.

(plot-level thoughts):

Each RGP starts with a lot of plot ideas in the DM's mind. This is different for dungeon runs, but in any case, there is this separate "storage" for the current laid-out plot ideas.
There is a constant or at least regular check-up-thought on how far we are in the current plot arc, and whether we are drifting of and I need to either get the player on track or adjust the plot planning.
In these cases, I make a decision (more or less conscious) which is a different kind of thought than both the object-level story-writing (which involves lots of small "creative" decisions) and the meta-level observation/reasoning on the plot/the player.
...

You see, this is just an attempt at grasping the internal thought structure, I'm nowhere near done, and it definitively does not contain a concrete proposal on how to write down the thoughts instead.

My question to the project team is:

Have you thought about the internal structure of a DM's thoughts and different possible ways of how to express these verbally? What are the reasons that you chose this bullet-list-with-bracket-annotations over some other formats?

(1b) Remark on the bracket-annotation:

When trying to write down some though-annotated dungeon run steps, I noticed myself having to re-evaluate the written thoughts in order to determine whether I should put them in parentheses, or brackets, or both. This evaluation is a separate thought, of course, - which I did not record, or course. But it slows me down and gets me out of the writing flow. Maybe this fades as you get used to writing though-annotations. I actually believe it does at some point. But if it doesn't, or it does too late: Maybe it's not worth it to have these bracket? Maybe rather generate 1.5 times the training data and let the model figure out which though belongs to which level? Unless of course, we need the brackets in the output for further research (only intervene on "(" and "{" thoughts, directly change the content of the "[" long term memory, ...)

(2) human creativity does not hover in space

Every author draws on the entirety of his experiences when writing stories. These include personal experiences as well as written works (both facts and fiction) that they've read. Human DM's similarly will often think of pre-existing content, both on the object-level and the plot-level. And this content is not included in the dungeon run itself.

I believe that the current AI dungeon model do the same, in some way. Their pool of experience is their training data set.

My question is:

Can we capture references to pre-existing content in such a way the new model will learn from it to explicitly reference their training data?

Or must (and can) we prevent the authors that generate the prized training set of 110 thought-annotated dungeon runs to draw from pre-existing content that is not implicitly available to the current models, too, (that should then be re-trained with the new thought-annotations)?

(3) AI-DM vs AI-player

Currently, when playing an AI dungeon, the human always takes the role of the player and the AI is the DM.

Does it have to be like this?
Could we train a model to perform as a player in a dungeon run with a human DM? (Or are there such models already that I don't know of?)

If yes, maybe we should ask the authors that contribute to this project to provide thought-annotations for the player, as well?

I see 3 advantages here:

This is probably done much faster now in one go instead of asking for another batch of 110 dungeon runs with thought-annotations for the player inputs in a later stage of the research. Especially when authors team up and take different roles each - so then both authors share the "workload" of generation the thought-annotations.
Thinking far ahead into the future, a smarter-than-human-AI would rather be a player (agent) in a dungeon run (the real world) than the other way round. It thus might especially be fruitful to investigate in how intervention on the thought-level of the player effects the player actions.
Having AIs for both roles let us play them "against" each other. This could speed up the process of generating more training data for even better models (probably with a human reviewing the AI-vs-AI dungeon runs)

Replies from: sd-marlow

↑ comment by SD Marlow (sd-marlow) · 2022-03-25T01:14:19.946Z · LW(p) · GW(p)

I was "archiving" the link to this page and thought I'd see what's been going on. Updates seem to only be on the discord. Anyway, since they allowed me to post longer thoughts there, figured it would be fine for me to drop it here as well. https://sd-marlow.medium.com/slaying-the-ml-dragon-7ce0a2e4e3a6

From your post, you're looking at this in much the same way I was when I attempted to do a short run (to work the bugs out and really understand whats involved). However, "actual thoughts of the DM" is the wrong explanation for what they want. The examples of of what they are accepting look to be nothing more than the "common sense" stuff current ML models fail to capture (thus, explicitly stated in the runs). Also, from comments in the discord, it seems like the info captured is post-process, despite the desire for pre-prompt thoughts. Not trying to discourage; just showing my thinking on the process, and that it wasn't what they wanted.

comment by Rob Bensinger (RobbBB) · 2022-01-12T20:34:10.582Z · LW(p) · GW(p)

In case you missed it: we now have an FAQ for this project, last updated Jan. 7.

comment by binary_doge · 2021-11-30T16:54:01.661Z · LW(p) · GW(p)

Not sure if it was suggested already or not, but one option is to look for “lets play” style videos for some game (gonna be hard to find one that’s simple enough, probably) and take the spoken text the youtuber says as thoughts. Some of them already have the transcript as subtitles.

On the same vein, looking for people who explain their choices in very clear-decision games, like chess. I once saw a booklet of chess games where the actual player explained most of his moves. If there is a way to get many of those, that might work.

Replies from: oge

↑ comment by oge · 2021-11-30T20:40:33.768Z · LW(p) · GW(p)

What if we use the commentary from chess games as thoughts?

Replies from: binary_doge, jj-hepburn

↑ comment by binary_doge · 2021-12-01T19:01:37.087Z · LW(p) · GW(p)

The problem with commentary not made by the players themselves is that, as far as I understand it, the project wants the general thoughts of the player and not just the motivations for every specific move. Like, ideally, they want some stream of consciousness commentary style "oh look, that guy looks kind of tough, I'll go see if I can agro him. Oh no! he's way too strong... lets go hide behind this tree it looks kinda safe [...]". That's why I suggested the lets plays and not e-sports in general.

If they're ok with just noise-free motivational analysis, anything with good commentators might work, and chess is indeed a pretty clean option.

↑ comment by JJ Hepburn (jj-hepburn) · 2021-12-01T04:10:34.995Z · LW(p) · GW(p)

Could do Go, Poker or some E-Sports with commentary. Poker unlike chess has the advantage that the commentators can see all of the players hands but the players can only see their own. Commentators often will talk about what a player must be thinking in this situation and account for what is observable to the player or not.

This would certainly be easier to scale but not as good quality.

comment by plex (ete) · 2021-11-30T15:03:25.980Z · LW(p) · GW(p)

I've set up a Discord server for discussing collaborations and thinking about mechanism design for sharing out credit (current top idea is borrowing Rob Miles's Discord eigenkarma system with modifications, but liable to change), please join if you're considering becoming a run author (no commitment to being part of this effort).

I don't need the money and won't be skimming off any funds for my contributions to the project, but am very open to people turning up with a bunch of great ideas and making everything work smoother and taking a management fee as compensation, so please also join if you're interested in becoming a project leader or organizational assistant.

Replies from: hyje

↑ comment by hyje · 2021-12-06T12:40:47.926Z · LW(p) · GW(p)

I'm definitely interested, but your invite's expired, did it do that automatically or have you been overwhelmed with responses?

Replies from: ete

↑ comment by plex (ete) · 2021-12-06T23:17:27.821Z · LW(p) · GW(p)

Oh, my bad, it was a 7 day invite by Discord default, made it everlasting now.

comment by Holly_Elmore · 2021-12-01T02:27:11.981Z · LW(p) · GW(p)

Practical question: Can the DM and players switch roles in the course of one "run" or does the DM have to remain the same individual? What else has to be continuous or uniform about the run? Does there have to be one overarching plot or just continuous narrative?

Replies from: Eliezer_Yudkowsky, Holly_Elmore

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-12-01T05:12:28.468Z · LW(p) · GW(p)

My coauthor and myself generated the sample run by taking turns on Action, Thought, Prompt. That is, I wrote an Action, she wrote a Thought, I wrote a Prompt, she wrote an Action, I wrote a Thought, she wrote a Prompt. This also helped show up immediately when a Thought underspecified a Prompt, because it meant the Thought and Prompt were never written by the same person.

More coherent overall plot is better - that current systems are terrible at this is all the more reason to try to show a dataset of it being done better. There doesn't necessarily need to be an advance-planned endpoint which gets foreshadowed; that is demanding a bit much of the author when they're dealing with somebody else's Actions or when people are taking turns on the Thoughts.

↑ comment by Holly_Elmore · 2021-12-01T02:28:29.451Z · LW(p) · GW(p)

(These questions affect whether I could even consider attempting this. If I can I'll apply and talk directly with MIRI people about it.)

comment by justinpombrio · 2021-11-30T19:08:32.443Z · LW(p) · GW(p)

I have an idea for testing this approach, before getting authors to write tens of thousands of pages of annotated dungeon tests.

It's hard to generate explanations of prose, but easy, for a computer, to generate explanations of particular subsets of math. For example, WolframAlpha can explain its reasoning for finding the derivative of a polynomial (click "step by step solution", then "show all steps"): Wolfram Alpha derivative example

There's a wide variety of math problems which we can programmatically solve, and can therefore programmatically generate explanations for:

Arithmetic, like step-by-step long division
Derivatives over a large set of operations (but not integrals; those are harder)
Subsets of logic
Subsets of integer programming
Some varieties of logic puzzles, like "knights and knaves" and "Alice, Beth, and Cara live in houses 1, 2, and 3, and have favorite colors Red, Green, and Blue non-respectively; here are some clues to figure out which is which".
Simple algebra, like multiplying polynomials

(Actually, most of these are probably too hard to learn. Should focus on the really simple ones like long division.)

The idea is to:

Programmatically generate a large quantity of a small variety of math problems with explanations; then
Train one transformer on just the problem and final answer; and
Train another transformer on the problem, explanation, and final answer.

This is a very different domain than English prose, so it won't tell you anything definitive about that more important domain. But it's easier to do, and it shouldn't carry any risk of advancing AI capabilities, since the training set is already (by definition) something we can already solve more accurately by other means.

I imagine you could learn a few things about how the explanations influence the AI:

You can see whether the explanation helps teach the AI, by checking whether the second transformer outperforms the first.
You can see whether the AI actually "uses" the explanation, by looking at the pattern of mistakes. If the AI frequently bungles the explanation while writing down the correct final answer, it must be generating the explanation and answer separately. This would be a bad sign for "visible thought" alignment.
You can see whether the AI naturally "hides" mistakes in its reasoning. I wouldn't be surprised to frequently see a chain of reasoning "A -> B -> C -> D -> E -> F", where A, B, E, and F are right and C and D are wrong, since it's often easier to check the beginning and end of a proof. For example, students do this sometimes.

Replies from: pseudobison

↑ comment by gabrielrecc (pseudobison) · 2021-11-30T22:10:24.974Z · LW(p) · GW(p)

Relevant: From OpenAI's "Training Verifiers To Solve Math Word Problems": "We also note that it is important to allow the model to generate the full natural language solution before outputting a final answer. If we instead finetune a 6B model to directly output the final answer without any intermediate steps, performance drops drastically from 20.6% to 5.2%." Also the "exploration" linked in the post, as well as my own little exploration restricted to modulo operations on many-digit numbers (via step-by-step long division!), on which LMs do very poorly without generating intermediate steps. (But see also Hendryks et al: "We also experiment with using step-by-step solutions. We find that having models generate their own step-by-step solutions before producing an answer actually degrades accuracy. We qualitatively assess these generated solutions and find that while many steps remain illogical, they are often related to the question. Finally, we show that step-by-step solutions can still provide benefits today. We find that providing partial ground truth step-by-step solutions can improve performance, and that providing models with step-by-step solutions at training time also increases accuracy.")

comment by delton137 · 2021-11-30T03:33:17.403Z · LW(p) · GW(p)

(cross posting this comment from E. S. Yudkowksy's Facebook with some edits / elaboration)

Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.

Note this is similar to the "self explaining AI" idea I explored in early 2020, which I threw together a paper on (I am hesitant to link to it because it's not that great of a paper and much of the discussion there is CNN specific, but here it is.). I can see how producing "thoughts" could help us trust/determine how much a model really understands what's going on or how to make a good story.

However I also could see the "thoughts" output misleading people - people might mistake the model's explanations as mapping onto the calculations going on inside the model to produce an output. The way GPT-3 works, I suspect, is very far from how humans think. GPT-3 is very bad at a lot of common sense and physics-based reasoning, for instance, yet based on the thoughts output the user might be misled into thinking the model understands common sense notions or physics since it's spouting off a version of some stuff it got from it's training data.

Any work along these lines would definitely need empirical testing / studies to show that the extra "thoughts" output is useful to end-users in some way (like predicting failure modes or helping debug failures).

Also, I'm unclear on what constitutes a "run"... roughly how long does the text have to be, in words, to have a chance at getting $20,000?

Replies from: Eliezer_Yudkowsky, Joe_Collman, nostalgebraist, StellaAthena

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-30T04:23:01.940Z · LW(p) · GW(p)

We're guessing 1000 steps per reasonably-completed run (more or less, doesn't have to be exact) and guessing maybe 300 words per step, mostly 'thought'. Where 'thoughts' can be relatively stream-of-consciousness once accustomed (we hope) and the dungeon run doesn't have to be Hugo quality in its plotting, so it's not like we're asking for a 300,000-word edited novel.

Replies from: WilliamKiely

↑ comment by WilliamKiely · 2021-11-30T07:08:16.485Z · LW(p) · GW(p)

The sample Nate linked is 30 pages and 12,267 words. So that works out to ~730 pages for a run.

$20,000/300,000 words = $1 per 15 words. If an author writing it manually could average 15 wpm, that would be $60/hour.

Replies from: delton137

↑ comment by delton137 · 2021-11-30T15:03:20.427Z · LW(p) · GW(p)

Sorry, I missed that somehow. Thanks.

↑ comment by Joe Collman (Joe_Collman) · 2021-11-30T03:48:13.292Z · LW(p) · GW(p)

However I also could see the "thoughts" output misleading people - people might mistake the model's explanations as mapping onto the calculations going on inside the model to produce an output.

I think the key point on avoiding this is the intervening-on-the-thoughts part:
"An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts".

So the idea is that you train things in such a way that the thoughts do map onto the calculations going on inside the model.

↑ comment by nostalgebraist · 2021-12-01T17:33:23.279Z · LW(p) · GW(p)

Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.

I've fine-tuned GPT models on a bunch of different datasets of different sizes, although not this particular dataset (which doesn't exist yet).

Below I list some key things to note. Also see here [LW · GW] for related discussion. These points hold true for typical tasks/datasets, though a few unusual ones like arithmetic behave differently.

GPT performance tends to scale smoothly and gradually with data/model size, over multiple orders of magnitude.
In terms of subjective response, you don't need much data to get GPTs to the level of "hey, it kinda gets it!".
You may need several orders of magnitude more data to reach the point of saturation where the model can't improve with additional data.
Incomplete mastery usually looks more like "randomly failing X% of the time" than "understanding X% of the content of the task," which can make it difficult to assess quality (or quality differences) at a glance.

For a concrete example, here is a data scaling experiment I did with GPT-J (6.1B params) on the tumblr post dataset I use for my tumblr bot. My full dataset is roughly 4 times as large as the 30M word dataset proposed here, i.e. the 30M word dataset would be roughly as big as the 25% subsample shown in the report.

The linked report only shows val loss, which is not very interpretable, but at least conveys that I haven't reached diminishing returns yet. This seems plausible from subjective evidence, as the model still sometimes misunderstands tumblr lingo / the conversational structure of the data / etc.

↑ comment by StellaAthena · 2021-11-30T12:44:12.734Z · LW(p) · GW(p)

Also, I'm unclear on what constitutes a "run"... roughly how long does the text have to be, in words, to have a chance at getting $20,000?

Using the stated length estimates per section, a single run would constitute approximately 600 pages of single spaced text. This is a lot of writing.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-02-02T02:08:14.674Z · LW(p) · GW(p)

Came across this today on r/mlscaling and thought I'd put it here since it's relevant: https://arxiv.org/abs/2201.11903#google

This paper explores the ability of language models to generate a coherent chain of thought—a series of short sentences that mimic the reasoning process a person might have when responding to a question. Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.

comment by MichaelLowe · 2021-11-30T17:11:09.217Z · LW(p) · GW(p)

This looks exciting! I wonder about the proposed training setup: If one model produces the thoughts, and another one takes those as input to the prompts, are we actually learning anything about the internal state of either model? What is the advantage (beyond scalability) of this training setup vs just using the second model to produce continuations conditional on thoughts?

comment by michaelwheatley · 2021-12-01T03:46:10.072Z · LW(p) · GW(p)

I know next to nothing about AI, so please correct me if I'm wrong, but it seems like the thought process of a dungeon master is a difficult starting point, since they're balancing out multiple levels of considerations. They're simulating a world, but also trying to shape a story (plus modelling the players & writing decent prose). The data would seem to be simpler to understand if you're dealing with a pure simulationist DM, or player who's 100% roleplayer (or munchkin), as the chains of reasoning would be focused on maximizing a single clear metric.

I ask because if so, some of those options might be also be easier to produce than a true AI Dungeon run.

comment by Ryan Mather (ryan-mather) · 2021-12-01T22:49:39.374Z · LW(p) · GW(p)

I think there a lot of amazing people in the roleplaying games community that could help meet this project's goals. That said I'm worried this document would be hard to understand for most of that community, which doesn't overlap that much with the AI community. I'd suggest rephrasing the ask in plain english.

"We're looking to pay dungeon masters to submit transcripts of games with a documentation of their thought process, so we can train algorithms to think the way dungeon masters do. Here's the format we need them in and steps for how to apply, and how much you can get paid".

comment by RomanS · 2021-11-30T19:57:21.161Z · LW(p) · GW(p)

A possible way to scale it: "collaborative fanfic dungeons":

a publicly accessible website where users can
- write dungeon runs
- write new steps to the existing runs
- rate the runs / steps (perhaps with separate ratings for thoughts, actions etc)
- only selected users can rate (initially - only the admins, then - top users etc)
could be as technically simple as a wiki (at least in the first iterations)
- could go way beyond that. E.g.:
  - automatic generation of playable text adventures
  - play as the DM with real people
the target audience: fanfic writers / readers
- (it's much easier to write runs in well known fictional worlds. e.g. HP)
the user earns money if their work is good

Replies from: ete

↑ comment by plex (ete) · 2021-11-30T20:24:52.129Z · LW(p) · GW(p)

I think the MVP way to do this would be a Discord server with non-public channels for individual runs and using the threads feature to give feedback to each other. If anyone would like to do that and is looking for collaborators, drop by the Visible Thoughts Discord and let us know.

comment by Ronny Fernandez (ronny-fernandez) · 2021-11-30T10:46:02.198Z · LW(p) · GW(p)

Can we apply for consultation as a team of two? We only want remote consultation of the resources you are offering because we are not based in bay area.

Replies from: So8res

↑ comment by So8res · 2021-11-30T12:04:39.535Z · LW(p) · GW(p)

Yep!

comment by Alicorn · 2023-01-10T04:40:03.736Z · LW(p) · GW(p)

I appreciate this post, though mostly secondhand. It's special to me because it provided me with a way to participate more-or-less directly in an alignment project: one of my glowfic buddies decided to rope me in to write a glowfic thread in this format for the project [here](https://glowfic.com/posts/5726). I'd like to hear more updates about how it's gone in the last year, though!

comment by Beth Barnes (beth-barnes) · 2022-05-13T03:15:04.986Z · LW(p) · GW(p)

It seems to me like this should be pretty easy to do and I'm disappointed there hasn't been more action on it yet. Things I'd try:
- reach out to various human-data-as-a-service companies like SurgeHQ, Scale, Samasource
- look for people on upwork
- find people who write fiction on the internet (e.g. post on fanfiction forums) and offer to pay them to annotate their existing stories (not a dungeon run exactly, but I don't see why the dungeon setting is important)

I'd be interested to hear if anyone has tried these things and run into roadblocks.

I'm also interested if anyone has an explanation of why the focus is on the dungeon thing in particular rather than e.g. fiction generally.

One concern I'd have with this dataset is that the thoughts are post-hoc rationalizations for what is written rather than actually the thought process that went into it. To reduce this, you could do something like split it so one person writes the thoughts, and someone else writes the next step, without other communication.

comment by Dr_Manhattan · 2021-12-02T14:45:48.500Z · LW(p) · GW(p)

Related work:
Show Your Work: Scratchpads for Intermediate Computation with Language Models
https://arxiv.org/abs/2112.00114

(from very surface-level perusal) Prompting the model resulted in
1) Model outputting intermediate thinking "steps"

2) Capability gain

comment by SD Marlow (sd-marlow) · 2021-11-30T19:58:58.276Z · LW(p) · GW(p)

I'm just "importing" my twitter thread and adding some additional thoughts.

If some model could spit out 100 of these annotated adventures, then the challenge would have already been solved.

Not sure about that 300,000 word count document idea though... A word-dump focused "result" plays into the strength of LLM's while providing none of the structure that is missing.

The more I work on this, the more I think you want something different. Perhaps use existing choose your own adventure books as a starting point, and work on deconstructing them; expanding on all of the reasoning, mechanics, story elements, etc.

The example given is heavy with exposition, and no real mechanics. That seems to rule-out any desire for explicit replies to a prompt (implication that player goes thru door is enough, not needing "walk thru door").

I get that an algo doesn't care, but example is hard to parse. It fails as an adventure (very on-rails) but also like having director commentary track play over a movie you've never seen, and then get tested on dialog and plot points.

The "thoughts" related to the 4 page sample just look like answers to multiple choice questions about the body of text. This says nothing about the process of crafting the narrative, which is the point, right? Examples of how to craft story structure? Why something was done?

There is a kind of "other minds" problem, in that the story should be constructed with player expectations in mind. Rather than just generating copious amounts of "story text," the adventure is more of a dialog where the DM moves the player thru a story, but also "entertains" with traps and dead-ends. What will happen next feels like ground that is already covered by LLM's, but anticipation of actions is where the dynamic feel comes from (so at the very least, an algo needs to create branching story structure).

30M word dataset's wont do anything to "train creativity" into the system, such as understanding why a small white rabbit isn't a real threat.. until it fly's at your neck.

Edit: Would it not just be easier to craft a framework since all of the questions/considerations required when building a story are going to be the same regardless of player inputs? I'm going to continue-on with the "adventure" track I've already started since the end of act annotations still explain the reasoning, and help point toward future story elements. There is no pre-planned arc, so there is the same level of "real-time" construction as the game progresses. Really not clear how annotating a few copies of War and Peace is useful while also having to write such a story. As stated, after 12k-15k words, you would have discovered a framework that works for the next 15M words.

Replies from: oge

↑ comment by oge · 2021-11-30T20:54:52.713Z · LW(p) · GW(p)

Yeah, and let's not build a machine that can lie very well.

Replies from: sd-marlow

↑ comment by SD Marlow (sd-marlow) · 2021-11-30T22:31:00.997Z · LW(p) · GW(p)

This is a relevant point: An AI that can craft some misdirection into a game or story is showing a deeper level of understanding, but as it's within a context (game/story), that isn't really a lie. The question for MIRI is, does that kind of "knowledge about misdirection" serve as a dual-use technology, where said ability could be used in other circumstances?

comment by Thomas Kwa (thomas-kwa) · 2022-03-25T22:45:44.634Z · LW(p) · GW(p)

I think the bounty amount is too low to attract skilled writers. The rate of ~3.3 cents/word is substantially less than the 6-10 cents per word most publications pay. Though it is stated in this post that a run "does not need to be published-novel-quality literature", this project is sufficiently weird that I'd imagine most skilled writers would rather write traditional short fiction, especially when considering that this project wouldn't move writers towards either career development or their passions.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2022-03-25T23:34:05.833Z · LW(p) · GW(p)

For a nearby datapoint, amounts I've paid to solid editors (not writers) for work include 1.5 cents/word and $50/hour.

comment by michaelwheatley · 2021-12-01T03:21:25.426Z · LW(p) · GW(p)

Are you familiar with forum rps (role play)s? A group of people would collectively write a story, each playing the role of one character. (Looking it up now, looks like there are some Choose-Your-Own-Adventure variants akin to AI Dungeon) It was more popular back in the day, but it looks like there are some still extant.

These people are already doing something like what you're asking for, so it might be worth someone dropping in and offering to pay them in exchange for them taking copious notes.

Funnily enough, the first one I found when googling has the gimmick of "your comments are the protagonist's thoughts," which is not exactly what you're looking for, but pretty darn close.... except for the fact that the damn commenters can't stay in character!
[link removed; it tripped the spam filter]

Anyway, I have a writer friend who used to do this, I'll ask her about suitability. As I mentioned, the popular style is collaborative storytelling. It's not exactly AI Dungeon, but it seems to check the main boxes. (The difference being that it's multiple DMs squaring off). Would something like that work for you, or would the format incompatibility with regular AI Dungeons be a dealbreaker?

comment by SD Marlow (sd-marlow) · 2021-12-07T03:22:04.933Z · LW(p) · GW(p)

I started with something more "contained" and easier to manage because actual users will go off script every chance they get, and this is basically like playing chess against yourself while reading a book on how to play chess. But, I may have found a kind of working compromise in terms of format and what needs to be captured. Will need a few days to see how it holds up, but right now, this is the basic idea:

Initial PROMPT to get the story started, followed by THOUGHTS that examine them from a gaming perspective, an ACTION, my THOUGHTS, another PROMPT, and.. this is where I was having a tough time because some of the mechanics were not being captured in the THOUGHTS prior. It was only as I wrote the PROMPT that I figured-out certain details or actions that needed to be in play. So when I write a PROMPT that contains these other elements, I write a LOGIC section below them to explain why I "prompted" the way I did.

In crafting the story as you go, the PROMPT is also part of the THOUGHT process! I'm sure anyone giving this a try will be writing and re-writing their prompt as part of the process. Having this extra LOGIC step seems to clean that up, but I don't think any ML algo will ever keep track of story elements, have ideas on where to take the story next, and then backtrack. Perhaps the "prompt" is some adversarial output from the thoughts, but still internal to process, leading to more thoughts (aka the logic), which leads to the actual output.

Just my 2 cents.

Replies from: sd-marlow, sd-marlow

↑ comment by SD Marlow (sd-marlow) · 2021-12-08T22:45:50.969Z · LW(p) · GW(p)

Found a rhythm using PLAT (Prompt. Logic. Action. Thought.) but am only averaging 185 words per step. That would be about 18,000 words for 100 steps, or 54,000 words for 300 (which is the very bottom end of book territory). Agree that 100 steps is no story, but waiting to reach 100 steps before checking-in is waiting to long.

Would recommend anyone near the 20 step or 10 pages mark send that in for feedback before going further. I'm going to ignore my own advice because I'd like to complete the first 3 scenes, which is closer to 10% of the full story.

↑ comment by SD Marlow (sd-marlow) · 2021-12-07T22:05:40.746Z · LW(p) · GW(p)

People are concerned about upfront time commitment, while also being focused on 100 step minimum. In another comment I went over how 250 steps works as a better minimum, but to keep all the numbers aligned, perhaps every story should be in 3 acts of 100 steps each (with at least 300 steps being a requirement; handing-in 299 steps would seem sloppy and rude). That would make each "short" story worth $6k, and each act $2k, which is the same 10% of 1,000 steps. Except, handing in the first act should only reserve your $6k payout, not result in getting $2k at a time (desire for finished products and not having to burden anyone with increased tracking/management). There could also be an $18k cap (for 3 short stories of at least 300 steps each) to both limit the number of short stories submitted and let people know there is no "dominating" of the short story space.

Replies from: sd-marlow

↑ comment by SD Marlow (sd-marlow) · 2021-12-09T01:35:07.981Z · LW(p) · GW(p)

With no idea what the arc of the run/story will be, it's really hard to plan for 3 acts, so maybe not so useful. But did want to leave another comment about scenes. With 4 scenes being about 50 steps, just as a reference, we can look at the number of scenes in a movie to figure each run could be 500 to 750 steps in total length. I just don't see 1,000 steps as being anything other than an arbitrary dataset requirement. 250-300 steps as a playable run. 500 to 600 steps as a "movie length" representation. And then to double that?

The mental requirement to "film" a Lord of the Rings trilogy while also "filming" the behind the scenes of that filming and also "filming" the real-time documentary required to keep track of everything... while not being clear on how that extra "run time" translates into being better training data.

Is there going to be a "THIS" post, using sample work that you really like and "demanding" all other entries follow that exact format? How will variations in formatting be addressed? Does it need to be?
If you get something that checks all the right boxes, with one exception that leads to a rejection, I think we'd all like to know what that one must-have is.

Replies from: sd-marlow

↑ comment by SD Marlow (sd-marlow) · 2021-12-09T20:10:31.664Z · LW(p) · GW(p)

Using scenes as a marker has some added benefit as I find myself leaving high level comments about some of the next scenes (I had nothing planned beyond the start, but the natural progression leads to speculation about future events or details). This is some of that looking ahead data that this project wanted to capture. Perhaps there should be a FUTURE keyword to wrap these things under? It would basically be a THOUGHT for world building ideas, but not specific to the current part of the story/narrative.

Anything that goes into writing or crafting needs to be captured in "real time" which means dumping it right in the middle of whatever you are doing.

comment by Bart Bussmann (Stuckwork) · 2024-11-28T15:32:34.669Z · LW(p) · GW(p)

Three years later, and we actually got LLMs with visible thoughts, such as Deepseek, QwQ, and (although partially hidden from the user) o1-preview.

I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs.

Good call!

Replies from: martin-randall

↑ comment by Martin Randall (martin-randall) · 2024-11-28T15:58:43.290Z · LW(p) · GW(p)

But I don't think these came about through training on synthetic thought-annotated texts.

comment by Utilop · 2021-12-01T00:19:14.280Z · LW(p) · GW(p)

Some naive thoughts in case useful:

A) Is the structured annotation format more useful than a gamemaster/writer thinking aloud while recording themselves (possibly with an audience)?

That could be the closest thing to a full transcript of the human process which downstream tasks could condense as needed. An adopted annotation format (prescribed or not) could potentially cause thoughts to be filtered, reinterpreted, or even steer human generation?

One key example against a fixed-format annotation, I think is that human gamemasters and writers do not spend approximate constant effort per player action. They will do a lot of up-front work to have a plan for the story, can go on auto-pilot for many of the interactions, while thinking hard about critical parts of the story. Language models which generate stories today notoriously seem to lack this red thread and filling out a form summarizing the writers' thoughts may fail to capture this process.

The unstructured approach may also be closer to what pretrained models have learned and therefore require less data.

It could perhaps also provide a highly interesting dataset for another task relevant to the application - metareasoning in generation - should the agent output the next part of the story or keep thinking about the generation?

Alternatively, one could record all thoughts as they come, but follow up each output with some standardized questions - if there are some critical to the application?

B) I am curious whether sufficiently strong language models wouldn't be able to fake the explanations post-hoc.

At least, looking at the forms, I am not sure whether I could tell competent explanations apart. If that is the case, it could be that the dataset does not get us that far in interpretability and lead to more specific needs. It might be worth trying to answer that question too.

E.g. before the dataset is made public, you could hide the thoughts in a crafted run and let another team fill in thoughts post-hoc. They could be rewarded for swaying evaluators to accept theirs as the original. This could also answer whether even humans are able to tell apart genuine motivations behind a decision vs made-up explanations; and provide another task dataset.

( C) Probably clear already but models like GPT3 can generate responses/stories while reflecting/talking to itself, and some already use it this way and only output the end results. Although that is probably not operating at the desired level. Fine-tuning is also fairly cheap so don't think one has to settle for GPT2. If the goal was interpretability of each generated token, perhaps the thoughts should also be derived from intermediate layers rather than being part of the sequence)

comment by Scott Emmons · 2022-01-19T20:25:48.021Z · LW(p) · GW(p)

It seems to me that the comments in code provide "visible thoughts" for what the programmer intends. What do you hope to learn from training language models on thought-annotated dungeons that you couldn't learn from language models that have already been trained on commented code?

comment by rokosbasilisk · 2021-12-06T16:27:51.960Z · LW(p) · GW(p)

silly idea: instead of thought-annotating ai-dungeon plays, we can start with annotating thoughts for akinator gameruns.

pros: much more easier and faster way to build a dataset, with less ambiguity

cons: somewhat restricted than the original idea.

comment by Vlad Loweren · 2021-12-03T15:14:31.606Z · LW(p) · GW(p)

As a photographer, I got excited at first by the inclusion of the word "visible", but I guess today is not my day. Is there any chance for me to participate in training ML models by collecting a dataset of photos? I'm in the process of relocating to Singapore, but getting a work visa takes a while so I have a lot of free time now.

comment by cultureulterior · 2021-11-30T15:21:39.731Z · LW(p) · GW(p)

I don't understand why showing the thinking of the DM/Author is important for this problem. To me it feels sufficient to show the thinking of the characters alone?

Replies from: oge

↑ comment by oge · 2021-11-30T22:25:39.978Z · LW(p) · GW(p)

I think we'd like a summary of how the decisions were arrived at

comment by oge · 2021-12-01T00:00:28.252Z · LW(p) · GW(p)

Hey Nate, what's the first thing you'd ask the Friendly AI to do if you knew it existed?

Visible Thoughts Project and Bounty Announcement

Contents

The Project

The Machine Learning Experiment

Motivation for this project

Notes on Closure

Motivation for the public appeal

The Payouts

Support

Application

106 comments