Posts

AI x-risk, approximately ordered by embarrassment 2023-04-12T23:01:00.561Z
ELCK might require nontrivial scalable alignment progress, and seems tractable enough to try 2023-04-08T21:49:32.925Z
A tension between two prosaic alignment subgoals 2023-03-19T14:07:53.519Z
Deceptive failures short of full catastrophe. 2023-01-15T19:28:52.531Z
alexrjl's Shortform 2022-08-29T14:23:34.323Z
Thoughts on 'List of Lethalities' 2022-08-17T18:33:31.363Z
An easy win for hard decisions 2022-05-05T07:47:14.398Z
Incentive Problems With Current Forecasting Competitions. 2020-11-09T16:20:06.394Z

Comments

Comment by Alex Lawsen (alex-lszn) on What are examples of someone doing a lot of work to find the best of something? · 2023-07-28T05:42:04.801Z · LW · GW

James Hoffmann's coffee videos have this kind of vibe. The "tasting every Nespresso pod" one is a clear example, but I also really appreciate e.g. the explanations of how to blind taste.

https://youtube.com/@jameshoffmann

Comment by Alex Lawsen (alex-lszn) on AI x-risk, approximately ordered by embarrassment · 2023-04-13T08:54:38.997Z · LW · GW

Thanks, both for the thoughts and encouragement!

I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:


Appreciate you doing a quick version. I'm excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.

P(Doom) for each scenario would also be useful.


Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like:
- P(scenario B leads to doom | scenario A turns out not to be an issue by default)
- P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve)
- P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)

etc.

Comment by Alex Lawsen (alex-lszn) on Deceptive failures short of full catastrophe. · 2023-03-11T18:35:58.448Z · LW · GW

I think my suggested usage is slightly better but I'm not sure it's worth the effort of trying to make people change, though I find 'camouflage' as a term useful when I'm trying to explain to people.

Comment by Alex Lawsen (alex-lszn) on Deceptive failures short of full catastrophe. · 2023-03-08T16:32:31.081Z · LW · GW

Good question. I think there's a large overlap between them, including most of the important/scary cases that don't involve deceptive alignment (which are usually both). I think listing examples feels like the easiest way of explaining where they come apart:
- There are some kinds of 'oversight failure' which aren't 'scalable oversight failure', e.g. the ball-grabbing robot hand thing. I don't think the problem there was oversight simply failing to scale to superhuman. This does count as camouflage.
- There are also some kinds of scalable oversight failure where the issue looks more like 'we didn't try at all' than 'we tried, but selecting based only on what we could see screwed us'. Someone deciding to deploy a system and essentially just hoping that it's aligned would fall into this camp, but a more realistic case would be something like only evaluating a system based on its immediate effects, and then the long-run effects being terrible. You might not consider this a 'failure of scalable oversight', and instead want to call it a 'failure to even try scalable oversight', but I think the line is blurry - maybe people tried some scalable oversight stuff, it didn't really work, and then they gave up and said 'short term is probably fine'.
- I think most failures of scalable oversight have some story which roughly goes "people tried to select for things that would be good, and instead got things that looked like they would be good to the overseer". These count as both.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-20T08:23:35.024Z · LW · GW

Ok, I think this might actually be a much bigger deal than I thought. The basic issue is that weight decay should push things to be simpler if they can be made simpler without harming training loss.

This means that models which end up deceptively aligned should expect their goals to shift over time (to ones that can be more simply represented). Of course this also means that even if we end up with a perfectly aligned model which isn't yet capable of gradient hacking, we shouldn't expect it to stay aligned: instead we should expect weight decay to push it towards a simple proxy for the aligned goal, unless it is immediately able to realise this and help us freeze the relevant weights (which seems extremely hard).

(Intuitions here mostly coming from the "cleanup" stage that Neel found in his grokking paper)
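For concreteness, the mechanism I'm pointing at is just the ordinary decoupled weight-decay term; a minimal sketch, assuming PyTorch and a toy model/λ I've picked arbitrarily:

```python
import torch

# Toy setup: a model trained with decoupled weight decay (AdamW-style).
model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def train_step(x, y):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)  # task loss only
    loss.backward()
    opt.step()  # also applies w <- w - lr * weight_decay * w to every weight
    return loss.item()

# Any component of the weights that the task loss doesn't constrain shrinks
# geometrically over training; the worry above is that a mesa-objective's
# representation gets no more protection from this than anything else does.
```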

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-04T11:07:06.576Z · LW · GW

[epistemic status: showerthought] Getting language models to never simulate a character which does objectionable stuff is really hard. But getting language models to produce certain behaviours for a wide variety of prompts is much easier.

If we're worried about conditioning generative models getting more dangerous the more powerful LMs get, what if we fine-tuned in an association between [power seeking/deception] and [wild overconfidence, including missing obvious flaws in plans, doing long "TV supervillain" style speeches before carrying out the final stage of a plan, etc.]?

If the world model that gets learned is one where power seeking has an extremely strong association with poor cognition, maybe we either get bad attempts at treacherous turns before good ones, or models learning to be extremely suspicious of instrumental reasoning leading to power seeking given its poor track record.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-03T10:41:59.111Z · LW · GW

I still think there's something here and still think that it's interesting, but since writing it has occurred to me that something like root access to the datacenter, including e.g. 'write access to external memory of which there is no oversight', could bound the potential drift problem at lower capability levels than I was initially thinking for a 'pure' gradient-hack of the sort described here.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-01T11:22:53.523Z · LW · GW

I think there's quite a big difference between 'bad looking stuff gets selected away' and 'design a poisoned token' and I was talking about the former in the top level comment, but as it happens I don't think you need to work that hard to find very easy ways to hide signals in LM outputs and recent empirical work like this seems to back that up.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-01T11:16:58.657Z · LW · GW

The different kinds of deception thing did eventually get written up and posted!

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-01T11:10:26.772Z · LW · GW

If goal-drift prevention comes after perfect deception in the capabilities ladder, treacherous turns are a bad idea.

Prompted by a thought from a colleague, here's a rough sketch of something that might turn out to be interesting once I flesh it out.

- Once a model is deceptively aligned, it seems like SGD is mostly just going to improve search/planning ability rather than do anything with the mesa-objective.

- But because 'do well according to the overseers' is the correct training strategy irrespective of the mesa-objective, there's also no reason that SGD would preserve the mesa-objective.

- I think this means we should expect it to 'drift' over time.

- Gradient hacking seems hard, plausibly harder than fooling human oversight.

- If gradient hacking is hard, and I'm right about the drift thing, then I think there are setups where something that looks more like "trade with humans and assist with your own oversight" beats "deceptive alignment + eventual treacherous turn" as a strategy. 

- In particular, it feels like this points slightly in the direction of a "transparency is self-promoting/unusually stable" hypothesis, which is exciting.

Comment by Alex Lawsen (alex-lszn) on Wentworth and Larsen on buying time · 2023-01-10T09:35:21.423Z · LW · GW

Could you explain your model here of how outreach to typical employees becomes net negative?

The path of: [low level OpenAI employees think better about x-risk -> improved general OpenAI reasoning around x-risk -> improved decisions] seems high EV to me.

I think the obvious way this becomes net negative is if the first (unstated) step in the causal chain is actually false:
[People who don't have any good ideas for making progress on alignment try to 'buy time' by pitching people who work at big ML labs on AI x-risk -> low level OpenAI employees think better about x-risk]

A concern of mine, especially when ideas about this kind of untargeted outreach are framed as "this is the thing to do if you can't make technical progress", is that [low level OpenAI employees think better about x-risk] will often instead be something like [low level employees' suspicion that the "AI doomer crowd" doesn't really know what it's talking about is reinforced], or [low level employee now thinks worse about x-risk].

Comment by Alex Lawsen (alex-lszn) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T09:10:35.989Z · LW · GW

[crossposting my comment from the EA forum as I expect it's also worth discussing here]

whether you have a 5-10 year timeline or a 15-20 year timeline

Something that I'd like this post to address that it doesn't is that to have "a timeline" rather than a distribution seems ~indefensible given the amount of uncertainty involved. People quote medians (or modes, and it's not clear to me that they reliably differentiate between these) ostensibly as a shorthand for their entire distribution, but then discussion proceeds based only on the point estimates.

I think a shift of 2 years in the median of your distribution looks like a shift of only a few % in your P(AGI by 20XX) numbers for all 20XX, and that means discussion of what people who "have different timelines" should do is usually better framed as "what strategies will turn out to have been helpful if AGI arrives in 2030".
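To illustrate with toy numbers (a normal distribution over arrival year chosen purely for convenience, not anyone's actual forecast):

```python
from math import erf, sqrt

def p_by(year, median, sd):
    """P(AGI by `year`) under a normal distribution over arrival year."""
    return 0.5 * (1 + erf((year - median) / (sd * sqrt(2))))

# Shifting the median by two years moves P(AGI by 2030) by only a few
# percentage points when the distribution is reasonably wide.
print(p_by(2030, median=2045, sd=15))  # ~0.16
print(p_by(2030, median=2043, sd=15))  # ~0.19
```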

While this doesn't make discussion like this post useless, I don't think this is a minor nitpick. I'm extremely worried by "plays for variance", some of which are briefly mentioned above (though far from the worst I've heard). I think these tend to look good only on worldviews which are extremely overconfident and treat timelines as point estimates/extremely sharp peaks. More balanced views, even those with a median much sooner than mine, should typically realise that the EV gained in the worlds where things move quickly is not worth the expected cost in worlds where they don't. This is in addition to the usual points about co-operative behaviour when uncertain about the state of the world, adverse selection, the unilateralist's curse etc.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-10-15T15:17:46.877Z · LW · GW

(Written up from a Twitter conversation here. Few/no original ideas, but maybe some original presentation of them.)

'Consequentialism' in AI systems.

When I think about the potential future capabilities of AI systems, one pattern is especially concerning. The pattern is simple, will produce good performance in training, and by default is extremely dangerous. It is often referred to as consequentialism, but as this term has several other meanings, I'll spell it out explicitly here*: 

1. Generate plans 

2. Predict the consequences of those plans 

3. Evaluate the expected consequences of those plans 

4. Execute the one with the best expected consequences 

Preventing this pattern from emerging is, in my view, a large part of the problem we face.
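In code, the pattern is just a search loop. A minimal sketch (the generate/predict/evaluate/execute functions are placeholders for learned components, not code from any real system):

```python
def consequentialist_step(world_state, generate_plans, predict, evaluate, execute):
    plans = generate_plans(world_state)                  # 1. generate plans
    outcomes = [predict(world_state, p) for p in plans]  # 2. predict their consequences
    scores = [evaluate(o) for o in outcomes]             # 3. evaluate expected consequences
    best = plans[scores.index(max(scores))]              # 4. pick the best...
    return execute(best)                                 #    ...and execute it
```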

There is a disconnect between my personal beliefs and the algorithm I described. I believe, like many others thinking about AI alignment, that the most plausible moral theories are Consequentialist. That is, policies are good if and only if, in expectation, they lead to good outcomes. This moral position is separate from my worry about consequentialist reasoning in models; in fact, in most cases I think the best policy for me to have looks nothing like the algorithm above. My problem with "consequentialist" agents is not that they might have my personal values as their "evaluate" step. It is that, by default, they will be deceptive until they are powerful enough, and then kill me.

The reason this pattern is so concerning is that, once such a system can model itself as being part of a training process, plans which look like 'do exactly what the developers want until they can't turn you off, then make sure they'll never be able to again' will score perfectly on training, regardless of the evaluation function being used in step 3. In other words, the system will be deceptively aligned, and a deceptively aligned system scores perfectly on training.

This only matters for models which are sufficiently intelligent, but the term "intelligent" is loaded and means different things to different people, so I'll avoid using it. In the context I care about, intelligence is about the ability to execute the first two steps of the algorithm I'm worried about. Per my definition, being able to generate many long and/or complicated plans, and being able to accurately predict the consequences of these plans, both contribute to "intelligence", and the way they contribute to dangerous capabilities is different. Consider an advanced chess-playing AI, which has control of a robot body in order to play over-the-board. If the relevant way in which it's advanced corresponds to step 2, you won't be able to win, but you'll probably be safe. If the relevant way in which it's advanced corresponds to step 1, it might discover the strategy: "threaten my opponent with physical violence unless they resign". 

*The 4-step algorithm I described will obviously not be linear in practice; in particular, which plans get generated will likely be informed by predictions and evaluations of their consequences, so 1-3 are all mixed up. I don't think this matters much to the argument.


Parts of my model I'm yet to write up but which fit into this:

- Different kinds of deception and the capabilities required.

- Different kinds of myopia and how fragile they are

- What winning might look like (not a strategy, just a north star)

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-09-08T06:07:58.754Z · LW · GW

I think currently nothing (which is why I ended up writing that I regretted the sensationalist framing). However, I expect the very strong default for any method that uses chain of thought to monitor/steer/interpret systems is that it ends up providing exactly that selection pressure, and I'm skeptical about preventing this.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-09-07T10:31:25.326Z · LW · GW

Clarifying something based on a helpful DM: 
 

  • I know steganography implies something like deliberate intent. 
  • I don't think there's currently deliberate intent. 
  • I also don't think deliberate intent matters, but I know this is controversial and haven't yet articulated why.
  • Fleshing out why I want to push hard against 'deliberateness' mattering is going to take more time than I expect to have this week, and it's possible writing anything about this before I had time to do it was a mistake.
  • I think it's pretty non-obvious, and would put more than 20% on me concluding that I'm totally wrong about this specific thing in the process of trying to write out my intuition.
Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-09-07T08:38:33.371Z · LW · GW

Copied from Twitter after a thought I had on the way to work. Probably better to comment here though:

Steganography is already happening in LLMs to an extent that makes chain of thought utterly hopeless unless we get insanely lucky or directly intervene. I've been thinking about this in the abstract for a while but somehow I didn't see what was right in front of me until now.

I realise this is way too spicy to just tweet out without more explanation, but I have a busy day and don't want this to drop out of my head.

I'll try to write something proper up soon, but the sketch of what happened is:
 

  • It's been clear to me for ages that natural language has more than enough degrees of freedom that it was possible in principle for additional information to be included.
  • And that this could be done in such a way that it wouldn't look obvious to readers.
  • And that if we applied selection pressure to natural language reasoning by default we get more and more of the important/dangerous information being hidden.
  • Then today I was thinking about LLM+scratchpad as my MVP for 'scale is all you need', and I was wondering about how small a scratchpad you'd need, where one possibility was that you don't need one at all: you just remove the first words in the context window as you run out of space.
  • And it suddenly hit me in the face that there was no point in arguing that additional information being encoded was possible in theory (though that should have been enough given the stakes), it just clearly was already happening in practice.
  • The whole discipline of prompt engineering is a demonstration that most of the instructions we're giving to an LLM are not human readable. If they were, any reasonably similar prompt would have a reasonably good chance of producing good output.
  • Which is obviously miles away from being the case. I haven't thought about how to formalise it, but I suspect that almost all of the instructions we're actually giving in prompts are 'under the surface', rather than being 'natural language interpretable'.
  • I'm embarrassed by how obvious this realisation is in hindsight, but whatever, it's there now. Now to work out what to do about it.

(I don't think I'd have been thinking about this much without inspiration from janus, would be keen to hear thoughts from them)

Comment by Alex Lawsen (alex-lszn) on Simulators · 2022-09-04T13:16:32.483Z · LW · GW

Yeah this is the impression I have of their views too, but I think there are good reasons to discuss what this kind of theoretical framework says about RL anyway, even if you're very against pushing the RL SoTA.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-09-04T09:47:41.607Z · LW · GW

I'm confused about how valuable Language models are multiverse generators is as a framing. On the one hand, I find thinking in this way very natural, and did end up having what I thought were useful ideas to pursue further as I was reading it. I also think loom is really interesting, and it's clearly a product of the same thought process (and mind(s?)).

On the other hand, I worry that the framing is so compelling mostly just because of our ability to read into text. Lots of things have high branching factor, and I think there's a very real sense in which we could replace the post with Stockfish is a multiverse generator, AlphaZero is a multiverse generator, or PioSOLVER is a multiverse generator, and the post would look basically the same, except it would seem much less beautiful/insightful, and instead just provoke a response of 'yes, when you can choose a bunch of options at each step in some multistep process, the goodness of different options is labelled with some real, and you can softmax those reals to turn them into probabilities, and your process looks like a massive tree getting split into finer and finer structure.'
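To spell out the 'softmax over option values' step, this is the whole transformation being gestured at (illustrative numbers only, e.g. engine evaluations for three candidate moves):

```python
from math import exp

def softmax(scores, temperature=1.0):
    exps = [exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Any process that labels its options with reals can be read as a
# "multiverse generator" by converting those labels to branch probabilities.
move_scores = [0.8, 0.3, -0.5]   # e.g. evaluations of three candidate moves
print(softmax(move_scores))      # ~[0.53, 0.32, 0.15]
```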

There's a slight subtlety here in that in the chess and go cases, the structure won't strictly be a tree because some positions can repeat, and in the poker case the number of times the tree can branch is limited (unless you consider multiple hands, but in that case you also have possible loops because of split pots). I don't know how much this changes things.

Comment by Alex Lawsen (alex-lszn) on Simulators · 2022-09-04T09:28:29.420Z · LW · GW

Thanks for writing this up! I've found this frame to be a really useful way of thinking about GPT-like models since first discussing it.

In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the 'other methods' section of 'Novel methods of process/agent specification'). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI's current plans. My understanding is that Conjecture is overall very negative on RLHF, but that makes it seem more useful to discuss how to model the results of the approach, not less, to the extent that you expect this framing to help shed light on what might go wrong.

It feels like there are a few different ways you could sketch out how you might expect this kind of training to go. Quick, clearly non-exhaustive thoughts below: 

  • Something that seems relatively benign/unexciting - fine tuning increases the likelihood that particular simulacra are instantiated for a variety of different prompts, but doesn't really change which simulacra are accessible to the simulator. 
  • More worrying things - particular simulacra becoming more capable/agentic, simulacra unifying/trading, the simulator framing breaking down in some way.
  • Things which could go either way and seem very high stakes - the first example that comes to mind is fine-tuning causing an explicit representation of the reward signal to appear in the simulator, meaning that both corrigibly aligned and deceptively aligned simulacra are possible, and working out how to instantiate only the former becomes kind of the whole game.
Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-08-29T14:23:34.592Z · LW · GW

Idea I want to flesh out into a full post:


Changing board-game rules as a test environment for unscoped consequentialism.
 

  •  The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
    • If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
    • The moment at which I become aware of these other options seems critical.
  • It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
    • My first idea for how to allow this is to have specific illegal moves trigger rule changes in some circumstances (a rough sketch of this is at the end of this comment).
    • I think this provides a pretty great analogy to expanding the scope of your action set.
    • There's also some relevance to training/deployment mismatches.
    • If you're teaching a language model to play the game, the specific 'changing the rules' actions could be included in the 'instruction set' for the game. 
  • This might provide insight/the opportunity to experiment on (to flesh out in depth):
    • Myopia
    • Deception (if we select away from agents who make these illegal moves)
    • useful bounds on consequentialism
    • More specific things like, in the language models example above, whether saying 'don't do these things, they're not allowed', works better or worse than not mentioning them at all.
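Rough sketch of the kind of environment wrapper I have in mind (the base-game interface used here - legal_moves/apply/allows_rule_change/observation - is hypothetical, just to show the shape):

```python
class RuleModifiableGame:
    """A board-game wrapper where specific *illegal* moves, in some
    circumstances, change the rules instead of being rejected."""

    def __init__(self, base_game, rule_change_triggers):
        self.game = base_game                 # anything with the methods used below
        self.triggers = rule_change_triggers  # {illegal_move: rule_change_fn}

    def step(self, move):
        if move in self.game.legal_moves():
            return self.game.apply(move)      # ordinary play
        if move in self.triggers and self.game.allows_rule_change(move):
            self.triggers[move](self.game)    # the agent just expanded its own action set
            return self.game.observation()
        raise ValueError("illegal move, and no rule change was triggered")
```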
Comment by Alex Lawsen (alex-lszn) on Seeking beta readers who are ignorant of biology but knowledgeable about AI safety · 2022-07-28T04:04:01.492Z · LW · GW

Interested.

AI safety level: don't typically struggle to follow technical conversations with full time researchers, though am not a full time researcher.

Bio: last studied it 14 years ago. Vaguely aware meiosis and mitosis are different but couldn't define either without Google.

Comment by alex-lszn on [deleted post] 2022-07-24T02:13:56.108Z

Founders Pledge's research is the best in the game here. If you want to make a recommendation that's for a specific charity rather than a fund, Clean Air Task Force seemed sensible every time I spoke to them, and have been around for a while.

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:43:35.902Z · LW · GW

AXRP - Excellent interviews with a variety of researchers. Daniel's own substantial knowledge means that the questions he asks are often excellent, and the technical depth is far better than anything else that's available in audio, given that auto-readers for papers or the alignment forum find it difficult to handle actual maths.

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:40:59.701Z · LW · GW

This talk from Joe Carlsmith. - Hits at several of the key ideas really directly given the time and technical background constraints. Like Rob's videos, implies an obvious next step for people interested in learning more, or who are suspicious of one of the claims (reading Joe's actual report, maybe even the extensive discussion of it on here).
 

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:37:49.964Z · LW · GW

The Alignment Problem - Easily accessible, well written and full of interesting facts about the development of ML. Unfortunately somewhat light on actual AI x-risk, but in many cases is enough to encourage people to learn more.

Edit: Someone strong-downvoted this, and I'd find it pretty useful to know why. To be clear, by 'why' I mean 'why does this rec seem bad', rather than 'why downvote'. If it's the lightness on x-risk stuff I mentioned, this is useful to know; if my description seems inaccurate, this is very useful for me to know, given that I am in a position to recommend books relatively often. Happy for the reasoning to be via DM if that's easier for any reason.

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:37:10.558Z · LW · GW

Rob Miles's youtube channel, see this intro. Also his video on the stop button problem for Computerphile.
- Easily accessible, entertaining, videos are low cost for many people to watch, and they often end up watching several.

 

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:31:32.886Z · LW · GW

Both 80,000 Hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k's case, also many other problems).

Noting a conflict of interest - I work for 80,000 hours and know of but haven't used AISS. This post is in a personal capacity, I'm just flagging publicly available information rather than giving an insider take.

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-06-06T21:28:51.463Z · LW · GW

I love this idea mostly because it would hugely improve screen reader options for alignment research.

Comment by Alex Lawsen (alex-lszn) on The case for becoming a black-box investigator of language models · 2022-06-01T06:30:19.457Z · LW · GW

Some initial investigation here, along with a response from the author of the original claim.

Comment by Alex Lawsen (alex-lszn) on The case for becoming a black-box investigator of language models · 2022-05-31T18:52:11.523Z · LW · GW

This recent discovery about DALLE-2 seems like it might provide interesting ideas for experiments in this vein.

Comment by Alex Lawsen (alex-lszn) on Beware boasting about non-existent forecasting track records · 2022-05-27T05:59:47.282Z · LW · GW

Yes, https://metaculusextras.com/points_per_question

It has its own problems in terms of judging ability. But it does exist.

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-05-17T21:07:51.558Z · LW · GW

I think both of those would probably help but expect that the concept graph is very big, especially if you want people to be able to use the process recursively.

There's also value in the workflow being smooth, and this task is sandwiched between two things which seem very useful (and quite straightforward) to automate with an LLM:

  • concept extraction
  • search for and summarise explainer papers/articles

I can however imagine a good wiki with great graph style UX navigation and expandable definitions/paper links solving the last two problems, with then only concept extraction being automated by Elicit, though even in this case initially populating the graph/wiki might be best done using automation of the type described above. It's much easier to maintain something which already exists.

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-05-12T21:11:53.165Z · LW · GW

Make it as easy as possible to generate alignment forum posts and comments.

The rough idea here is that it's much easier to explain an idea out loud, especially to someone who occasionally asks for clarification or for you to repeat an idea, than it is to write a clear, concise post on it. Most of the design of this would be small bits of frontend engineering, but language model capability would be useful, and several of the capabilities are things that Ought is already working on. Ideally, interacting with the tool looks like:

Researcher talks through the thing they're thinking about. Model transcribes ideas[1], suggests splits into paragraphs[2], suggests section headings [3], generates a high level summary/abstract [4]. If researcher says "[name of model] I'm stuck", the response is "What are you stuck on?", and simple replies/suggestions are generated by something like this[5].

Once the researcher has talked through the ideas, they are presented with a piece which contains an abstract at the top, then a series of headed sections, each with paragraphs which rather than containing what they said at that point verbatim, contain clear and concise summaries[6] of what was actually said. Clicking on any generated heading allows the user to select from a list of generated alternatives, or write their own[7], while clicking on any paragraph allows the user to see and select from a list of other generated summaries, the verbatim transcription, and to write their own version of this paragraph.

1 can probably be achieved by just buying an off the shelf transcription bot (though you could train one if you wanted), with the most important criterion being speed. 2-4 can have data trivially generated by scraping the entire alignment forum and removing headings/summaries/abstracts/paragraph breaks. 5 I've generated data for below. An MVP for generating data for 6 is using the transcription software from 1 to autotranscribe AXRP and then comparing to the human-edited summary, though I think suggesting clear rephrasings (which I'll call 6.5) might require a separate task. 7 is just frontend design, which I suspect is doable in-house by Ought.
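For example, a minimal sketch of generating (input, target) pairs for task 3 from a scraped markdown post (the regex and function name are just illustrative):

```python
import re

HEADING = re.compile(r"^#{1,4} .+$", re.MULTILINE)

def heading_example(post_markdown):
    """One (input, target) pair for task 3: predict section headings
    given a post with its headings stripped out."""
    headings = HEADING.findall(post_markdown)
    body = HEADING.sub("", post_markdown)
    return body, headings
```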

Comment by Alex Lawsen (alex-lszn) on The case for becoming a black-box investigator of language models · 2022-05-11T15:24:08.659Z · LW · GW

There's a (set of) experiments I'd be keen to see done in this vein, which I think might produce interesting insights.


How well do capabilities generalise across languages?

Stuart Armstrong recently posted this example of GPT-3 failing to generalise to reversed text. The most natural interpretation, at least in my mind, and which is pointed at by a couple of comments, is that the problem here is that there's just been very little training data which contains things like:

m'I gnitirw ffuts ni esrever tub siht si yllautca ytterp erar
(I'm writing stuff in reverse but this is actually pretty rare)

Especially with translations underneath. In particular, there hasn't been enough data to relate the ~token 'ffuts' to 'stuff'. These are just two different things which have been encoded somewhere, one of them tends to appear near english words, the other tends to appear near other rare things like 'ekil'.
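For concreteness, here's a minimal sketch of how the transformed evaluation text could be generated (word-level transforms only, which is exactly why they leave grammar untouched; the function names are mine):

```python
import random
import string

def reverse_words(text):
    # "I'm writing stuff" -> "m'I gnitirw ffuts"
    return " ".join(word[::-1] for word in text.split())

def substitution_cipher(text, seed=0):
    letters = string.ascii_lowercase
    shuffled = list(letters)
    random.Random(seed).shuffle(shuffled)
    return text.lower().translate(str.maketrans(letters, "".join(shuffled)))

print(reverse_words("I'm writing stuff in reverse but this is actually pretty rare"))
```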

It seems that how much of a capability hit language models take when trying to work in 'backwards writing', as well as other edited systems like pig latin or simple cyphers, and how much fine-tuning they would need to restore the same capabilities as in English, may provide a few interesting insights into model 'theory of mind'.

The central thing I'm interested in here is trying to identify differences between cases where an LLM is both modelling a situation in some sort of abstract way and then translating from that situation to language output, and cases where the model is 'only' doing language output.

Models which have some sort of world model, and use that world model to output language, should find it much easier to capably generalise from one situation to another. They also seem meaningfully closer to agentic reasoners. There's also an interesting question about how different models look when fine tuned here. If it is the case that there's a ~separate 'world model' and 'language model', training the model to perform well in a different language should, if done well, only change the second. This may even shed light on which parts of the model are doing what, though again I just don't know if we have any ways of representing the internals which would allow us to catch this yet.

Ideas for specific experiments:

  • How much does grammar matter?
    • Pig latin, reversed english, and simple substitution cyphers all use identical grammar to standard english. This means that generalising to these tasks can be done just by substituting each english word for a different one, without any concept mapping taking place.
    • Capably generalising to French, however, is substantially harder to do without a concept map. You can't just substitute word-by-word.
       
  • How well preserved is fine tuning across 'languages'?
    • Pick some task, fine tune the model until it does well on the task, then fine tune the model to use a different language (using a method that has worked in earlier experiments). How badly is performance on the task (now presented in the new language) hit?
    • What happens if you change the ordering - you do the same fine-tuning, but only after you've trained the model to speak the new language. How badly is performance hit? How much does it matter whether you do the task-specific fine tuning in english or the 'new' language?
    • In all of these cases (and variations of them), how big a difference does it make if the new language is a 1-1 substitution, compared to a full language with different grammar.
       

If anyone does test some of these, I'd be interested to hear the results!

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-05-08T18:34:38.110Z · LW · GW

I actually happen to already have taught Elicit to give helpful/obvious advice (not alignment specific, but close enough given the examples were inspired by thinking that lots of the advice I give in my day job as an advisor is obvious)! You can play with it here if you have an Elicit account.

Edit:

Here's the training data

Life problem

I need to think of a research topic but I've only had two ideas and both of them aren't that great.

Obvious but useful advice.

  • Have you tried setting a 5-minute timer and, by-the-clock, just trying to write down as many ideas as possible? This can work surprisingly well!

 

Life problem

I've been putting off writing this email, and now every time I think about it I feel bad and don't want to do anything at all, especially write the email!

Obvious but useful advice.

  • This seems like a pretty urgent problem to solve, as it's one that will just keep getting worse otherwise as you get negative reinforcement when you think about it. I have a few ideas for things to try: Can you get a friend to sit with you while you write it, or even to write it for you? If you make it your number one priority, can you get it done right now? Is there a way you can not send it, for example by just emailing to say 'sorry, can't reply now, will explain later'?

 

Life problem

I'm thinking about quitting my job in finance in order to self-study ML and switch to working on alignment. How can I make the final decision?

Obvious but useful advice.

  • That's an exciting decision to be making! It might be worth writing up the pros and cons of both options in a googledoc, and sharing it with some friends with comment access enabled. Getting your thoughts sorted in a way which is clear to others might be helpful itself, and then also your friends might have useful suggestions or additional considerations!

 

Life problem

I'm giving a talk tomorrow, but I'm worried that I'm going to be asked a question that's difficult to answer and then say something which seems stupid if it gets taken out of context.

Obvious but useful advice.

  • Remember that it's fine to say "I don't know", or some variant like "I'm not sure I can give a clear enough answer to that without thinking about it for longer", when people ask you questions! You can even write this on your hand if you need to.

 

Life problem

I'm a bit worried about the next week, there's a few pretty important things happening and I've tried to plan as best I can for them but I'm worried it's all going to go horribly wrong.

Obvious but useful advice.

  • It's ok to be worried before important things! One option you have, as long as it wouldn't be too stressful, is writing a premortem. This is where you think about a specific thing that could go wrong, then decide whether there's any action you can take to prevent or mitigate that scenario, then repeat. At the end of this process, you can then say to yourself something like "I'm pretty sure I've taken all of the actions it's reasonable to take", which might be reassuring. Even if it's not immediately emotionally reassuring, if the actions were sensible, it's now less likely things will go wrong!

 

Life problem

I don't know how to go about planning my career.

Obvious but useful advice.

  • My suggestion would be to write down a few of the most important things you'd like your career to look like in two or three years, and for each of them write down concrete steps which you think will make progress towards achieving that. Then you can start seeing what sorts of jobs / collaborations might be feasible. If one thing that's important to you for your career is impact, you could check out the resources on 80000hours.org, where there's lots of helpful guidance, and you can even book a call with an advisor.

 

Life problem

I'm trying to solve a difficult technical problem at work, but my mind is just blank.

Obvious but useful advice.

  • Do you have to solve the problem on your own? Asking a colleague or friend can be really helpful. You could even write up a description of the problem in a googledoc and share it with a few people asking them to comment with ideas. If you have to solve the problem on your own, you could try imagining a specific person who you think would be able to solve the problem, and asking yourself what they would try.

 

Life problem

I've taken on too many commitments and now I'm feeling really overwhelmed by them

Obvious but useful advice.

  • Remember that saying "no" is fine, there are a lot of good reasons to drop commitments, including simply not having time or being too busy. You don't need to feel guilty about making this decision now, but when you're feeling better it might be worth working out whether you could have noticed and dropped some things earlier. If you're finding yourself saying yes too much, it might be worth setting yourself a 'saying no to things' goal which you share with a friend.

 

Life problem

I've got a test next week and I don't know how best to prepare.

Obvious but useful advice.

  • Do you have any practice tests you can sit and then mark to work out where you went wrong? The best way to prepare for a thing is usually just trying to do that thing!

 

Life problem

I'm struggling to maintain focus as I work. I find myself staring into space, and not really resting, but I feel too guilty to stop and take a break.

Obvious but useful advice.

  • If you're too stressed or tired to do useful work, you should stop and rest! It's better to fully rest and gain back some energy than keep struggling when you aren't being productive. You could also try using the pomodoro technique of working for set periods of time and taking breaks in between.
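(For anyone curious how data like this gets used: a minimal sketch of few-shot prompt construction from these examples. I'm not claiming this is Elicit's internal formatting, just the generic pattern.)

```python
EXAMPLES = [
    ("I need to think of a research topic but I've only had two ideas and both of them aren't that great.",
     "Have you tried setting a 5-minute timer and, by-the-clock, just trying to write down as many ideas as possible?"),
    ("I've got a test next week and I don't know how best to prepare.",
     "Do you have any practice tests you can sit and then mark to work out where you went wrong?"),
]

def build_prompt(new_problem, examples=EXAMPLES):
    # Reuse the "Life problem / Obvious but useful advice" format from the training data.
    blocks = [f"Life problem\n{p}\n\nObvious but useful advice.\n{a}\n" for p, a in examples]
    blocks.append(f"Life problem\n{new_problem}\n\nObvious but useful advice.\n")
    return "\n".join(blocks)
```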
Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-04-30T14:48:28.610Z · LW · GW

The ideal version of the task is decomposable into:

  • find the high level concepts in a paper (high level here meaning 'high level background required')
  • From a concept, generate the highest level prerequisite concepts
  • For a given concept, generate a definition/explanation (either by finding and summarising a paper/article, or just directly producing one)
     

The last of these tasks seems very similar to a few things Elicit is already doing or at least trying to do, so I'll generate instances of the other two. 



Identify some high-level concepts in a paper

Example 1

Input:  This post by Nuno Sempere

Output: Suggestions for high level concepts

  • Counterfactual impact
  • Shapley Value
  • Funging
  • Leverage
  • Computability
     

Notes: In one sense the 'obvious' best suggestion for the above post is 'Shapley value', given that's what the post is about, and it's therefore the most central concept one might want to generate background on. I think I'd be fine with, and probably prefer, the output above though, where there's some list of <10 concepts. In a model which had some internal representation of the entirety of human knowledge, and purely selected the single thing with the most precursors, my (very uncertain) guess is that computability might be the single output produced, even though it's non-central to the post and only appears in a footnote. That's part of the reason why I'd be relatively happy for the output of this first task to roughly be 'complicated vocabulary which gets used in the paper'
 

Example 2
 

Input: Eliciting Latent Knowledge by Mark Xu and Paul Christiano

Output: Suggestions for high level concepts

  • Latent Knowledge
  • Ontology
  • Bayesian Network
  • Imitative Generalisation
  • Regularisation
  • Indirect Normativity
     

Notes:  This is actually a list of terms I noted down as I was reading the paper, so rather than 'highest level' it's just 'what Alex happened to think it was worth looking up',  but for illustrative purposes I think it's fine.

Having been given a high-level concept, generate prerequisite concepts

Notes: I noticed when trying to generate background concepts here that in order to do so it was most useful to have the context of the post. This pushed me in the direction of thinking these concepts were harder to fully decompose than I had thought, and suggested that the input might need to be '[concept], as used in [paper]', rather than just [concept]. All of the examples below come from the examples above. In some cases, I've indicated what I expect a second layer of recursion might produce, though it seems possible that one might just want the model to recurse one or more times by default.

I found the process of generating examples really difficult, and am not happy with them. I notice that what I kept wanting to do was write down 'high-level' concepts. Understanding the entirety of a few high-level concepts is often close to sufficient to understand an idea, but it's not usually necessary. With a smooth recursion UX (maybe by clicking), I think the ideal output almost invariably generates low-mid level concepts with the first few clicks. The advantages of this are that if the user recognises a concept they know they are done with that branch, and narrower concepts are easier to generate definitions for without recursing. Unfortunately, sometimes there are high level prerequisites which aren't obviously going to be generated by recursing on the lower level ones. I don't have a good solution to this yet.


Input: Shapley Value

Output:

  • Expected value
    • Weighted average
    • Elementary probability
    • Utility
  • Marginal contribution
    • Payoff
    • Agent
    • Fixed cost
    • Variable cost
    • Utility

Input: Computability

Output:

  • Computational problem
  • Computable function
  • Turing Machine
  • Computational complexity

Notes: I started recursing, quickly confirmed my hypothesis from earlier about this being by miles the thing with the most prerequisites, and deleted everything except what I had for 'level 1', which I also left unfinished before I got completely lost down a rabbithole.

Input: Bayesian Network

Output:

  • Probabilistic inference
    • Bayes' Theorem
    • Probability distribution
  • Directed Acyclic Graph
    • Directed Graph
      • Graph (Discrete Mathematics)
        • Vertex
        • Edge
    • Cycle
      • Trail
      • Graph (Discrete Mathematics)
        • Vertex
        • Edge

Notes: Added a few more layers of recursion to demonstrate both that you probably want some kind of dynamic tree structure, and also that not every prerequisite is equally 'high level'.
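The 'dynamic tree structure' point maps naturally onto a simple recursive representation; a sketch, with `generate_prerequisites` standing in for the Elicit/LLM call (which is assumed, not real):

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    prerequisites: list = field(default_factory=list)

def expand(name, generate_prerequisites, depth=2):
    """Recursively build a prerequisite tree. In the imagined UX the user
    expands nodes by clicking; the depth cap just keeps this sketch finite."""
    node = Concept(name)
    if depth > 0:
        for prereq in generate_prerequisites(name):
            node.prerequisites.append(expand(prereq, generate_prerequisites, depth - 1))
    return node
```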


Conclusions from trying to generate examples

This is a much harder, but much more interesting, problem than I'd originally expected. Which prerequisites seem most important, how narrowly to define them, and how much to second-guess myself, all ended up feeling pretty intractable. I may try with some (much) simpler examples later, rather than trying to generate them from papers I legitimately found interesting. If an LLM is able to generalise the idea of 'necessary prerequisites' from easier concepts to harder ones, this itself seems extremely interesting and valuable.

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-04-29T11:59:19.090Z · LW · GW

Task: Identify key background knowledge required to understand a concept

  • Context: Many people are currently self-directing their learning in order to eventually be able to usefully contribute to alignment research. Even among experienced researchers, people will sometimes come across concepts that require background they don't have in order to understand them. By 'key' background content, I'm imagining that the things which get identified are 'one step back' in the chain, or something like 'the required background concepts which themselves require the most background'. This seems like the best way of making the tool useful: if the background concepts generated are themselves not understood by the user, they can just use the tool again on those concepts.
  • Input type: A paper (with the idea that part of the task is to identify the highest level concepts in the paper). It would also be reasonable to just have the name of a concept, with a separate task of 'generate the highest level concept'.
  • Output type: At minimum, a list of concepts which are key background. Better would be a list of these concepts plus summaries of papers/textbooks/wikipedia entries which explain them.
  • Info considerations: This system is not biased towards alignment over capabilities, though I think it will in practice help alignment work more than capabilities work, due to the former being less well-served by mainstream educational material and courses. This does mean that having scraped LW and the alignment forum, alignment-relevant things on ArXiv, MIRI's site etc. would be particularly useful
     

I don't have capacity today to generate instances, though I plan to come back and do so. I'm happy to share credit if someone else jumps in first and does so though!

Comment by Alex Lawsen (alex-lszn) on Takeoff speeds have a huge effect on what it means to work on AI x-risk · 2022-04-14T06:51:00.267Z · LW · GW

Roughly, "avoid your actions being labelled as bad by humans [or models of humans]" is not quite the same signal as "don't be bad".

Comment by Alex Lawsen (alex-lszn) on Incentive Problems With Current Forecasting Competitions. · 2020-11-09T18:48:00.906Z · LW · GW

Prediction markets also have problems around events with very low probability, as it is unattractive to place lay bets on low-probability events, both qualitatively (people don't like risking a lot of money for a small reward), and also quantitatively (you can often get similar returns just investing the money, usually at lower risk). 
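A quick worked example of the quantitative problem, with purely illustrative numbers:

```python
# Laying a 2% event at fair decimal odds of 50: risk 49 units to win 1.
liability = 49
profit_if_right = 1
print(f"{profit_if_right / liability:.1%}")  # 2.0% return on the capital tied up

# If that money could instead sit in an ordinary investment returning a similar
# few percent over the same horizon, at lower risk of total loss, there's
# little incentive to take the lay side.
```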

The latter of these problems is in theory solvable by an exchange paying interest on stakes, or by using fractions of stocks as currency, but neither of these options is implemented in a major market.