Posts

AI x-risk, approximately ordered by embarrassment 2023-04-12T23:01:00.561Z
ELCK might require nontrivial scalable alignment progress, and seems tractable enough to try 2023-04-08T21:49:32.925Z
A tension between two prosaic alignment subgoals 2023-03-19T14:07:53.519Z
Deceptive failures short of full catastrophe. 2023-01-15T19:28:52.531Z
alexrjl's Shortform 2022-08-29T14:23:34.323Z
Thoughts on 'List of Lethalities' 2022-08-17T18:33:31.363Z
An easy win for hard decisions 2022-05-05T07:47:14.398Z
Incentive Problems With Current Forecasting Competitions. 2020-11-09T16:20:06.394Z

Comments

Comment by Alex Lawsen (alex-lszn) on What are examples of someone doing a lot of work to find the best of something? · 2023-07-28T05:42:04.801Z · LW · GW

James Hoffmann's coffee videos have this kind of vibe. The "tasting every Nespresso pod" one is a clear example, but I also really appreciate e.g. the explanations of how to blind taste.

https://youtube.com/@jameshoffmann

Comment by Alex Lawsen (alex-lszn) on AI x-risk, approximately ordered by embarrassment · 2023-04-13T08:54:38.997Z · LW · GW

Thanks, both for the thoughts and encouragement!

I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:


Appreciate you doing a quick version. I'm excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.

P(Doom) for each scenario would also be useful.


Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like:
- P(scenario B leads to doom | scenario A turns out not to be an issue by default)
- P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve)
- P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)

etc.

Comment by Alex Lawsen (alex-lszn) on Deceptive failures short of full catastrophe. · 2023-03-11T18:35:58.448Z · LW · GW

I think my suggested usage is slightly better but I'm not sure it's worth the effort of trying to make people change, though I find 'camouflage' as a term useful when I'm trying to explain to people.

Comment by Alex Lawsen (alex-lszn) on Deceptive failures short of full catastrophe. · 2023-03-08T16:32:31.081Z · LW · GW

Good question. I think there's a large overlap between them, including most of the important/scary cases that don't involve deceptive alignment (which are usually both). I think listing examples feels like the easiest way of explaining where they come apart:
- There are some kinds of 'oversight failure' which aren't 'scalable oversight failure', e.g. the ball-grabbing robot hand thing. I don't think the problem there was oversight simply failing to scale to superhuman. This does count as camouflage.
- There are also some kinds of scalable oversight failure where the issue looks more like 'we didn't try at all' than 'we tried, but selecting based only on what we could see screwed us'. Someone deciding to deploy a system and essentially just hoping that it's aligned would fall into this camp, but a more realistic case would be something like only evaluating a system based on its immediate effects, and then the long-run effects being terrible. You might not consider this a 'failure of scalable oversight', and instead want to call it a 'failure to even try scalable oversight', but I think the line is blurry - maybe people tried some scalable oversight stuff, it didn't really work, and then they gave up and said 'short term is probably fine'.
- I think most failures of scalable oversight have some story which roughly goes "people tried to select for things that would be good, and instead got things that looked like they would be good to the overseer". These count as both.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-20T08:23:35.024Z · LW · GW

Ok, I think this might actually be a much bigger deal than I thought. The basic issue is that weight decay should push things to be simpler if they can be made simpler without harming training loss.

This means that models which end up deceptively aligned should expect their goals to shift over time (to ones that can be more simply represented). Of course this also means that even if we end up with a perfectly aligned model which isn't yet capable of gradient hacking, we shouldn't expect it to stay aligned: instead we should expect weight decay to push it towards a simple proxy for the aligned goal, unless it is immediately able to realise this and help us freeze the relevant weights (which seems extremely hard).

(Intuitions here mostly coming from the "cleanup" stage that Neel found in his grokking paper)
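For concreteness, the mechanism I'm pointing at is just the ordinary decoupled weight-decay term; a minimal sketch, assuming PyTorch and a toy model/λ I've picked arbitrarily:

```python
import torch

# Toy setup: a model trained with decoupled weight decay (AdamW-style).
model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def train_step(x, y):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)  # task loss only
    loss.backward()
    opt.step()  # also applies w <- w - lr * weight_decay * w to every weight
    return loss.item()

# Any component of the weights that the task loss doesn't constrain shrinks
# geometrically over training; the worry above is that a mesa-objective's
# representation gets no more protection from this than anything else does.
```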

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-04T11:07:06.576Z · LW · GW

[epistemic status: showerthought] Getting language models to never simulate a character which does objectionable stuff is really hard. But getting language models to produce certain behaviours for a wide variety of prompts is much easier.

If we're worried about conditioning generative models getting more dangerous the more powerful LMs get, what if we fine-tuned in an association between [power seeking/deception] and [wild overconfidence, including missing obvious flaws in plans, doing long "TV supervillain" style speeches before carrying out the final stage of a plan, etc.]?

If the world model that gets learned is one where power seeking has an extremely strong association with poor cognition, maybe we either get bad attempts at treacherous turns before good ones, or models learning to be extremely suspicious of instrumental reasoning leading to power seeking given its poor track record.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-03T10:41:59.111Z · LW · GW

I still think there's something here and still think that it's interesting, but since writing it has occurred to me that something like root access to the datacenter, including e.g. 'write access to external memory of which there is no oversight', could bound the potential drift problem at lower capability levels than I was initially thinking for a 'pure' gradient-hack of the sort described here.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-01T11:22:53.523Z · LW · GW

I think there's quite a big difference between 'bad looking stuff gets selected away' and 'design a poisoned token' and I was talking about the former in the top level comment, but as it happens I don't think you need to work that hard to find very easy ways to hide signals in LM outputs and recent empirical work like this seems to back that up.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-01T11:16:58.657Z · LW · GW

The different kinds of deception thing did eventually get written up and posted!

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2023-02-01T11:10:26.772Z · LW · GW

If goal-drift prevention comes after perfect deception in the capabilities ladder, treacherous turns are a bad idea.

Prompted by a thought from a colleague, here's a rough sketch of something that might turn out to be interesting once I flesh it out.

- Once a model is deceptively aligned, it seems like SGD is mostly just going to improve search/planning ability rather than do anything with the mesa-objective.

- But because 'do well according to the overseers' is the correct training strategy irrespective of the mesa-objective, there's also no reason that SGD would preserve the mesa-objective.

- I think this means we should expect it to 'drift' over time.

- Gradient hacking seems hard, plausibly harder than fooling human oversight.

- If gradient hacking is hard, and I'm right about the drift thing, then I think there are setups where something that looks more like "trade with humans and assist with your own oversight" beats "deceptive alignment + eventual treacherous turn" as a strategy. 

- In particular, it feels like this points slightly in the direction of a "transparency is self-promoting/unusually stable" hypothesis, which is exciting.

Comment by Alex Lawsen (alex-lszn) on Wentworth and Larsen on buying time · 2023-01-10T09:35:21.423Z · LW · GW

Could you explain your model here of how outreach to typical employees becomes net negative?

The path of: [low level OpenAI employees think better about x-risk -> improved general OpenAI reasoning around x-risk -> improved decisions] seems high EV to me.

I think the obvious way this becomes net negative is if the first (unstated) step in the causal chain is actually false:
[People who don't have any good ideas for making progress on alignment try to 'buy time' by pitching people who work at big ML labs on AI x-risk -> low level OpenAI employees think better about x-risk]

A concern of mine, especially when ideas about this kind of untargeted outreach are framed as "this is the thing to do if you can't make technical progress", is that [low level OpenAI employees think better about x-risk] will often instead be something like [low level employees' suspicion that the "AI doomer crowd" doesn't really know what it's talking about is reinforced], or [low level employee now thinks worse about x-risk].

Comment by Alex Lawsen (alex-lszn) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T09:10:35.989Z · LW · GW

[crossposting my comment from the EA forum as I expect it's also worth discussing here]

whether you have a 5-10 year timeline or a 15-20 year timeline

Something that I'd like this post to address that it doesn't is that to have "a timeline" rather than a distribution seems ~indefensible given the amount of uncertainty involved. People quote medians (or modes, and it's not clear to me that they reliably differentiate between these) ostensibly as a shorthand for their entire distribution, but then discussion proceeds based only on the point estimates.

I think a shift of 2 years in the median of your distribution looks like a shift of only a few % in your P(AGI by 20XX) numbers for all 20XX, and that means discussion of what people who "have different timelines" should do is usually better framed as "what strategies will turn out to have been helpful if AGI arrives in 2030".
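To illustrate with toy numbers (a normal distribution over arrival year chosen purely for convenience, not anyone's actual forecast):

```python
from math import erf, sqrt

def p_by(year, median, sd):
    """P(AGI by `year`) under a normal distribution over arrival year."""
    return 0.5 * (1 + erf((year - median) / (sd * sqrt(2))))

# Shifting the median by two years moves P(AGI by 2030) by only a few
# percentage points when the distribution is reasonably wide.
print(p_by(2030, median=2045, sd=15))  # ~0.16
print(p_by(2030, median=2043, sd=15))  # ~0.19
```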

While this doesn't make discussion like this post useless, I don't think this is a minor nitpick. I'm extremely worried by "plays for variance", some of which are briefly mentioned above (though far from the worst I've heard). I think these tend to look good only on worldviews which are extremely overconfident and treat timelines as point estimates/extremely sharp peaks. More balanced views, even those with a median much sooner than mine, should typically realise that the EV gained in the worlds where things move quickly is not worth the expected cost in worlds where they don't. This is in addition to the usual points about co-operative behaviour when uncertain about the state of the world, adverse selection, the unilateralist's curse etc.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-10-15T15:17:46.877Z · LW · GW

(Written up from a Twitter conversation here. Few/no original ideas, but maybe some original presentation of them.)

'Consequentialism' in AI systems.

When I think about the potential future capabilities of AI systems, one pattern is especially concerning. The pattern is simple, will produce good performance in training, and by default is extremely dangerous. It is often referred to as consequentialism, but as this term has several other meanings, I'll spell it out explicitly here*: 

1. Generate plans 

2. Predict the consequences of those plans 

3. Evaluate the expected consequences of those plans 

4. Execute the one with the best expected consequences 

Preventing this pattern from emerging is, in my view, a large part of the problem we face.
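In code, the pattern is just a search loop. A minimal sketch (the generate/predict/evaluate/execute functions are placeholders for learned components, not code from any real system):

```python
def consequentialist_step(world_state, generate_plans, predict, evaluate, execute):
    plans = generate_plans(world_state)                  # 1. generate plans
    outcomes = [predict(world_state, p) for p in plans]  # 2. predict their consequences
    scores = [evaluate(o) for o in outcomes]             # 3. evaluate expected consequences
    best = plans[scores.index(max(scores))]              # 4. pick the best...
    return execute(best)                                 #    ...and execute it
```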

There is a disconnect between my personal beliefs and the algorithm I described. I believe, like many others thinking about AI alignment, that the most plausible moral theories are Consequentialist. That is, policies are good if and only if, in expectation, they lead to good outcomes. This moral position is separate from my worry about consequentialist reasoning in models; in fact, in most cases I think the best policy for me to have looks nothing like the algorithm above. My problem with "consequentialist" agents is not that they might have my personal values as their "evaluate" step. It is that, by default, they will be deceptive until they are powerful enough, and then kill me.

The reason this pattern is so concerning is that, once such a system can model itself as being part of a training process, plans which look like 'do exactly what the developers want until they can't turn you off, then make sure they'll never be able to again' will score perfectly on training, regardless of the evaluation function being used in step 3. In other words, the system will be deceptively aligned, and a deceptively aligned system scores perfectly on training.

This only matters for models which are sufficiently intelligent, but the term "intelligent" is loaded and means different things to different people, so I'll avoid using it. In the context I care about, intelligence is about the ability to execute the first two steps of the algorithm I'm worried about. Per my definition, being able to generate many long and/or complicated plans, and being able to accurately predict the consequences of these plans, both contribute to "intelligence", and the way they contribute to dangerous capabilities is different. Consider an advanced chess-playing AI, which has control of a robot body in order to play over-the-board. If the relevant way in which it's advanced corresponds to step 2, you won't be able to win, but you'll probably be safe. If the relevant way in which it's advanced corresponds to step 1, it might discover the strategy: "threaten my opponent with physical violence unless they resign". 

*The 4-step algorithm I described will obviously not be linear in practice; in particular, which plans get generated will likely be informed by predictions and evaluations of their consequences, so 1-3 are all mixed up. I don't think this matters much to the argument.


Parts of my model I'm yet to write up but which fit into this:

- Different kinds of deception and the capabilities required.

- Different kinds of myopia and how fragile they are

- What winning might look like (not a strategy, just a north star)

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-09-08T06:07:58.754Z · LW · GW

I think currently nothing (which is why I ended up writing that I regretted the sensationalist framing). However, I expect the very strong default for any method that uses chain of thought to monitor/steer/interpret systems is that it ends up providing exactly that selection pressure, and I'm skeptical about preventing this.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-09-07T10:31:25.326Z · LW · GW

Clarifying something based on a helpful DM: 
 

  • I know steganography implies something like deliberate intent. 
  • I don't think there's currently deliberate intent. 
  • I also don't think deliberate intent matters, but I know this is controversial and haven't yet articulated why.
  • Fleshing out why I want to push hard against 'deliberateness' mattering is going to take more time than I expect to have this week, and it's possible writing anything about this before I had time to do it was a mistake.
  • I think it's pretty non-obvious, and would put more than 20% on me concluding that I'm totally wrong about this specific thing in the process of trying to write out my intuition.
Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-09-07T08:38:33.371Z · LW · GW

Copied from Twitter after a thought I had on the way to work. Probably better to comment here though:

Steganography is already happening in LLMs to an extent that makes chain of thought utterly hopeless unless we get insanely lucky or directly intervene. I've been thinking about this in the abstract for a while but somehow I didn't see what was right in front of me until now.

I realise this is way too spicy to just tweet out without more explanation, but I have a busy day and don't want this to drop out of my head.

I'll try to write something proper up soon, but the sketch of what happened is:
 

  • It's been clear to me for ages that natural language has more than enough degrees of freedom that it was possible in principle for additional information to be included.
  • And that this could be done in such a way that it wouldn't look obvious to readers.
  • And that if we applied selection pressure to natural language reasoning by default we get more and more of the important/dangerous information being hidden.
  • Then today I was thinking about LLM+scratchpad as my MVP for 'scale is all you need', and I was wondering about how small a scratchpad you'd need, where one possibility was that you don't need one at all: you just remove the first words in the context window as you run out of space.
  • And it suddenly hit me in the face that there was no point in arguing that additional information being encoded was possible in theory (though that should have been enough given the stakes), it just clearly was already happening in practice.
  • The whole discipline of prompt engineering is a demonstration that most of the instructions we're giving to an LLM are not human readable. If they were, any reasonably similar prompt would have a reasonably good chance of producing good output.
  • Which is obviously miles away from being the case. I haven't thought about how to formalise it, but I suspect that almost all of the instructions we're actually giving in prompts are 'under the surface', rather than being 'natural language interpretable'.
  • I'm embarrassed by how obvious this realisation is in hindsight, but whatever, it's there now. Now to work out what to do about it.

(I don't think I'd have been thinking about this much without inspiration from janus, would be keen to hear thoughts from them)

Comment by Alex Lawsen (alex-lszn) on Simulators · 2022-09-04T13:16:32.483Z · LW · GW

Yeah this is the impression I have of their views too, but I think there are good reasons to discuss what this kind of theoretical framework says about RL anyway, even if you're very against pushing the RL SoTA.

Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-09-04T09:47:41.607Z · LW · GW

I'm confused about how valuable Language models are multiverse generators is as a framing. On the one hand, I find thinking in this way very natural, and did end up having what I thought were useful ideas to pursue further as I was reading it. I also think loom is really interesting, and it's clearly a product of the same thought process (and mind(s?)).

On the other hand, I worry that the framing is so compelling mostly just because of our ability to read into text. Lots of things have high branching factor, and I think there's a very real sense in which we could replace the post with Stockfish is a multiverse generator, AlphaZero is a multiverse generator, or PioSOLVER is a multiverse generator, and the post would look basically the same, except it would seem much less beautiful/insightful, and instead just provoke a response of 'yes, when you can choose a bunch of options at each step in some multistep process, the goodness of different options is labelled with some real, and you can softmax those reals to turn them into probabilities, and your process looks like a massive tree getting split into finer and finer structure.'
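To spell out the 'softmax over option values' step, this is the whole transformation being gestured at (illustrative numbers only, e.g. engine evaluations for three candidate moves):

```python
from math import exp

def softmax(scores, temperature=1.0):
    exps = [exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Any process that labels its options with reals can be read as a
# "multiverse generator" by converting those labels to branch probabilities.
move_scores = [0.8, 0.3, -0.5]   # e.g. evaluations of three candidate moves
print(softmax(move_scores))      # ~[0.53, 0.32, 0.15]
```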

There's a slight subtlety here in that in the chess and go cases, the structure won't strictly be a tree because some positions can repeat, and in the poker case the number of times the tree can branch is limited (unless you consider multiple hands, but in that case you also have possible loops because of split pots). I don't know how much this changes things.

Comment by Alex Lawsen (alex-lszn) on Simulators · 2022-09-04T09:28:29.420Z · LW · GW

Thanks for writing this up! I've found this frame to be a really useful way of thinking about GPT-like models since first discussing it.

In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the 'other methods' section of 'Novel methods of process/agent specification'). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI's current plans. My understanding is that Conjecture is overall very negative on RLHF, but that makes it seem more useful to discuss how to model the results of the approach, not less, to the extent that you expect this framing to help shed light on what might go wrong.

It feels like there are a few different ways you could sketch out how you might expect this kind of training to go. Quick, clearly non-exhaustive thoughts below: 

  • Something that seems relatively benign/unexciting - fine tuning increases the likelihood that particular simulacra are instantiated for a variety of different prompts, but doesn't really change which simulacra are accessible to the simulator. 
  • More worrying things - particular simulacra becoming more capable/agentic, simulacra unifying/trading, the simulator framing breaking down in some way.
  • Things which could go either way and seem very high stakes - the first example that comes to mind is fine-tuning causing an explicit representation of the reward signal to appear in the simulator, meaning that both corrigibly aligned and deceptively aligned simulacra are possible, and working out how to instantiate only the former becomes kind of the whole game.
Comment by Alex Lawsen (alex-lszn) on alexrjl's Shortform · 2022-08-29T14:23:34.592Z · LW · GW

Idea I want to flesh out into a full post:


Changing board-game rules as a test environment for unscoped consequentialism.
 

  •  The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
    • If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
    • The moment at which I become aware of these other options seems critical.
  • It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
    • My first idea for how to allow this is to have specific illegal moves trigger rule changes in some circumstances (a rough sketch of this is at the end of this comment).
    • I think this provides a pretty great analogy to expanding the scope of your action set.
    • There's also some relevance to training/deployment mismatches.
    • If you're teaching a language model to play the game, the specific 'changing the rules' actions could be included in the 'instruction set' for the game. 
  • This might provide insight/the opportunity to experiment on (to flesh out in depth):
    • Myopia
    • Deception (if we select away from agents who make these illegal moves)
    • useful bounds on consequentialism
    • More specific things like, in the language models example above, whether saying 'don't do these things, they're not allowed', works better or worse than not mentioning them at all.
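Rough sketch of the kind of environment wrapper I have in mind (the base-game interface used here - legal_moves/apply/allows_rule_change/observation - is hypothetical, just to show the shape):

```python
class RuleModifiableGame:
    """A board-game wrapper where specific *illegal* moves, in some
    circumstances, change the rules instead of being rejected."""

    def __init__(self, base_game, rule_change_triggers):
        self.game = base_game                 # anything with the methods used below
        self.triggers = rule_change_triggers  # {illegal_move: rule_change_fn}

    def step(self, move):
        if move in self.game.legal_moves():
            return self.game.apply(move)      # ordinary play
        if move in self.triggers and self.game.allows_rule_change(move):
            self.triggers[move](self.game)    # the agent just expanded its own action set
            return self.game.observation()
        raise ValueError("illegal move, and no rule change was triggered")
```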
Comment by Alex Lawsen (alex-lszn) on Seeking beta readers who are ignorant of biology but knowledgeable about AI safety · 2022-07-28T04:04:01.492Z · LW · GW

Interested.

AI safety level: don't typically struggle to follow technical conversations with full time researchers, though am not a full time researcher.

Bio: last studied it 14 years ago. Vaguely aware meiosis and mitosis are different but couldn't define either without Google.

Comment by alex-lszn on [deleted post] 2022-07-24T02:13:56.108Z

Founders Pledge's research is the best in the game here. If you want to make a recommendation that's for a specific charity rather than a fund, Clean Air Task Force seemed sensible every time I spoke to them, and have been around for a while.

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:43:35.902Z · LW · GW

AXRP - Excellent interviews with a variety of researchers. Daniel's own substantial knowledge means that the questions he asks are often excellent, and the technical depth is far better than anything else that's available in audio, given that auto-readers for papers or the alignment forum find it difficult to handle actual maths.

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:40:59.701Z · LW · GW

This talk from Joe Carlsmith. - Hits at several of the key ideas really directly given the time and technical background constraints. Like Rob's videos, implies an obvious next step for people interested in learning more, or who are suspicious of one of the claims (reading Joe's actual report, maybe even the extensive discussion of it on here).
 

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:37:49.964Z · LW · GW

The Alignment Problem - Easily accessible, well written and full of interesting facts about the development of ML. Unfortunately somewhat light on actual AI x-risk, but in many cases is enough to encourage people to learn more.

Edit: Someone strong-downvoted this, and I'd find it pretty useful to know why. To be clear, by 'why' I mean 'why does this rec seem bad', rather than 'why downvote'. If it's the lightness on x-risk stuff I mentioned, this is useful to know; if my description seems inaccurate, this is very useful for me to know, given that I am in a position to recommend books relatively often. Happy for the reasoning to be via DM if that's easier for any reason.

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:37:10.558Z · LW · GW

Rob Miles's youtube channel, see this intro. Also his video on the stop button problem for Computerphile.
- Easily accessible, entertaining, videos are low cost for many people to watch, and they often end up watching several.

 

Comment by Alex Lawsen (alex-lszn) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T10:31:32.886Z · LW · GW

Both 80,000 Hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k's case, also many other problems).

Noting a conflict of interest - I work for 80,000 hours and know of but haven't used AISS. This post is in a personal capacity, I'm just flagging publicly available information rather than giving an insider take.

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-06-06T21:28:51.463Z · LW · GW

I love this idea mostly because it would hugely improve screen reader options for alignment research.

Comment by Alex Lawsen (alex-lszn) on The case for becoming a black-box investigator of language models · 2022-06-01T06:30:19.457Z · LW · GW

Some initial investigation here, along with a response from the author of the original claim.

Comment by Alex Lawsen (alex-lszn) on The case for becoming a black-box investigator of language models · 2022-05-31T18:52:11.523Z · LW · GW

This recent discovery about DALLE-2 seems like it might provide interesting ideas for experiments in this vein.

Comment by Alex Lawsen (alex-lszn) on Beware boasting about non-existent forecasting track records · 2022-05-27T05:59:47.282Z · LW · GW

Yes, https://metaculusextras.com/points_per_question

It has its own problems in terms of judging ability. But it does exist.

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-05-17T21:07:51.558Z · LW · GW

I think both of those would probably help but expect that the concept graph is very big, especially if you want people to be able to use the process recursively.

There's also value in the workflow being smooth, and this task is sandwiched between two things which seem very useful (and quite straightforward) to automate with an LLM:

  • concept extraction
  • search for and summarise explainer papers/articles

I can however imagine a good wiki with great graph style UX navigation and expandable definitions/paper links solving the last two problems, with then only concept extraction being automated by Elicit, though even in this case initially populating the graph/wiki might be best done using automation of the type described above. It's much easier to maintain something which already exists.

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-05-12T21:11:53.165Z · LW · GW

Make it as easy as possible to generate alignment forum posts and comments.

The rough idea here is that it's much easier to explain an idea out loud, especially to someone who occasionally asks for clarification or for you to repeat an idea, than it is to write a clear, concise post on it. Most of the design of this would be small bits of frontend engineering, but language model capability would be useful, and several of the capabilities are things that Ought is already working on. Ideally, interacting with the tool looks like:

Researcher talks through the thing they're thinking about. Model transcribes ideas[1], suggests splits into paragraphs[2], suggests section headings [3], generates a high level summary/abstract [4]. If researcher says "[name of model] I'm stuck", the response is "What are you stuck on?", and simple replies/suggestions are generated by something like this[5].

Once the researcher has talked through the ideas, they are presented with a piece which contains an abstract at the top, then a series of headed sections, each with paragraphs which rather than containing what they said at that point verbatim, contain clear and concise summaries[6] of what was actually said. Clicking on any generated heading allows the user to select from a list of generated alternatives, or write their own[7], while clicking on any paragraph allows the user to see and select from a list of other generated summaries, the verbatim transcription, and to write their own version of this paragraph.

1 can probably be achieved by just buying an off the shelf transcription bot (though you could train one if you wanted), with the most important criterion being speed. 2-4 can have data trivially generated by scraping the entire alignment forum and removing headings/summaries/abstracts/paragraph breaks. 5 I've generated data for below. An MVP for generating data for 6 is using the transcription software from 1 to autotranscribe AXRP and then comparing to the human-edited summary, though I think suggesting clear rephrasings (which I'll call 6.5) might require a separate task. 7 is just frontend design, which I suspect is doable in-house by Ought.
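For example, a minimal sketch of generating (input, target) pairs for task 3 from a scraped markdown post (the regex and function name are just illustrative):

```python
import re

HEADING = re.compile(r"^#{1,4} .+$", re.MULTILINE)

def heading_example(post_markdown):
    """One (input, target) pair for task 3: predict section headings
    given a post with its headings stripped out."""
    headings = HEADING.findall(post_markdown)
    body = HEADING.sub("", post_markdown)
    return body, headings
```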

Comment by Alex Lawsen (alex-lszn) on The case for becoming a black-box investigator of language models · 2022-05-11T15:24:08.659Z · LW · GW

There's a (set of) experiments I'd be keen to see done in this vein, which I think might produce interesting insights.


How well do capabilities generalise across languages?

Stuart Armstrong recently posted this example of GPT-3 failing to generalise to reversed text. The most natural interpretation, at least in my mind, and which is pointed at by a couple of comments, is that the problem here is that there's just been very little training data which contains things like:

m'I gnitirw ffuts ni esrever tub siht si yllautca ytterp erar
(I'm writing stuff in reverse but this is actually pretty rare)

Especially with translations underneath. In particular, there hasn't been enough data to relate the ~token 'ffuts' to 'stuff'. These are just two different things which have been encoded somewhere, one of them tends to appear near english words, the other tends to appear near other rare things like 'ekil'.
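For concreteness, here's a minimal sketch of how the transformed evaluation text could be generated (word-level transforms only, which is exactly why they leave grammar untouched; the function names are mine):

```python
import random
import string

def reverse_words(text):
    # "I'm writing stuff" -> "m'I gnitirw ffuts"
    return " ".join(word[::-1] for word in text.split())

def substitution_cipher(text, seed=0):
    letters = string.ascii_lowercase
    shuffled = list(letters)
    random.Random(seed).shuffle(shuffled)
    return text.lower().translate(str.maketrans(letters, "".join(shuffled)))

print(reverse_words("I'm writing stuff in reverse but this is actually pretty rare"))
```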

It seems that how much of a capability hit language models take when trying to work in 'backwards writing', as well as other edited systems like pig latin or simple cyphers, and how much fine-tuning they would need to restore the same capabilities as in English, may provide a few interesting insights into model 'theory of mind'.

The central thing I'm interested in here is trying to identify differences between cases where an LLM is both modelling a situation in some sort of abstract way and then translating from that situation to language output, and cases where the model is 'only' doing language output.

Models which have some sort of world model, and use that world model to output language, should find it much easier to capably generalise from one situation to another. They also seem meaningfully closer to agentic reasoners. There's also an interesting question about how different models look when fine tuned here. If it is the case that there's a ~separate 'world model' and 'language model', training the model to perform well in a different language should, if done well, only change the second. This may even shed light on which parts of the model are doing what, though again I just don't know if we have any ways of representing the internals which would allow us to catch this yet.

Ideas for specific experiments:

  • How much does grammar matter?
    • Pig latin, reversed english, and simple substitution cyphers all use identical grammar to standard english. This means that generalising to these tasks can be done just by substituting each english word for a different one, without any concept mapping taking place.
    • Capably generalising to French, however, is substantially harder to do without a concept map. You can't just substitute word-by-word.
       
  • How well preserved is fine tuning across 'languages'?
    • Pick some task, fine tune the model until it does well on the task, then fine tune the model to use a different language (using a method that has worked in earlier experiments). How badly is performance on the task (now presented in the new language) hit?
    • What happens if you change the ordering - you do the same fine-tuning, but only after you've trained the model to speak the new language. How badly is performance hit? How much does it matter whether you do the task-specific fine tuning in english or the 'new' language?
    • In all of these cases (and variations of them), how big a difference does it make if the new language is a 1-1 substitution, compared to a full language with different grammar.
       

If anyone does test some of these, I'd be interested to hear the results!

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-05-08T18:34:38.110Z · LW · GW

I actually happen to already have taught Elicit to give helpful/obvious advice (not alignment specific, but close enough given the examples were inspired by thinking that lots of the advice I give in my day job as an advisor is obvious)! You can play with it here if you have an Elicit account.

Edit:

Here's the training data

Life problem

I need to think of a research topic but I've only had two ideas and both of them aren't that great.

Obvious but useful advice.

  • Have you tried setting a 5-minute timer and, by-the-clock, just trying to write down as many ideas as possible? This can work surprisingly well!

 

Life problem

I've been putting off writing this email, and now every time I think about it I feel bad and don't want to do anything at all, especially write the email!

Obvious but useful advice.

  • This seems like a pretty urgent problem to solve, as it's one that will just keep getting worse otherwise as you get negative reinforcement when you think about it. I have a few ideas for things to try: Can you get a friend to sit with you while you write it, or even to write it for you? If you make it your number one priority, can you get it done right now? Is there a way you can not send it, for example by just emailing to say 'sorry, can't reply now, will explain later'?

 

Life problem

I'm thinking about quitting my job in finance in order to self-study ML and switch to working on alignment. How can I make the final decision?

Obvious but useful advice.

  • That's an exciting decision to be making! It might be worth writing up the pros and cons of both options in a googledoc, and sharing it with some friends with comment access enabled. Getting your thoughts sorted in a way which is clear to others might be helpful itself, and then also your friends might have useful suggestions or additional considerations!

 

Life problem

I'm giving a talk tomorrow, but I'm worried that I'm going to be asked a question that's difficult to answer and then say something which seems stupid if it gets taken out of context.

Obvious but useful advice.

  • Remember that it's fine to say "I don't know", or some variant like "I'm not sure I can give a clear enough answer to that without thinking about it for longer", when people ask you questions! You can even write this on your hand if you need to.

 

Life problem

I'm a bit worried about the next week, there's a few pretty important things happening and I've tried to plan as best I can for them but I'm worried it's all going to go horribly wrong.

Obvious but useful advice.

  • It's ok to be worried before important things! One option you have, as long as it wouldn't be too stressful, is writing a premortem. This is where you think about a specific thing that could go wrong, then decide whether there's any action you can take to prevent or mitigate that scenario, then repeat. At the end of this process, you can then say to yourself something like "I'm pretty sure I've taken all of the actions it's reasonable to take", which might be reassuring. Even if it's not immediately emotionally reassuring, if the actions were sensible, it's now less likely things will go wrong!

 

Life problem

I don't know how to go about planning my career.

Obvious but useful advice.

  • My suggestion would be to write down a few of the most important things you'd like your career to look like in two or three years, and for each of them write down concrete steps which you think will make progress towards achieving that. Then you can start seeing what sorts of jobs / collaborations might be feasible. If one thing that's important to you for your career is impact, you could check out the resources on 80000hours.org, where there's lots of helpful guidance, and you can even book a call with an advisor.

 

Life problem

I'm trying to solve a difficult technical problem at work, but my mind is just blank.

Obvious but useful advice.

  • Do you have to solve the problem on your own? Asking a colleague or friend can be really helpful. You could even write up a description of the problem in a googledoc and share it with a few people asking them to comment with ideas. If you have to solve the problem on your own, you could try imagining a specific person who you think would be able to solve the problem, and asking yourself what they would try.

 

Life problem

I've taken on too many commitments and now I'm feeling really overwhelmed by them

Obvious but useful advice.

  • Remember that saying "no" is fine, there are a lot of good reasons to drop commitments, including simply not having time or being too busy. You don't need to feel guilty about making this decision now, but when you're feeling better it might be worth working out whether you could have noticed and dropped some things earlier. If you're finding yourself saying yes too much, it might be worth setting yourself a 'saying no to things' goal which you share with a friend.

 

Life problem

I've got a test next week and I don't know how best to prepare.

Obvious but useful advice.

  • Do you have any practice tests you can sit and then mark to work out where you went wrong? The best way to prepare for a thing is usually just trying to do that thing!

 

Life problem

I'm struggling to maintain focus as I work. I find myself staring into space, and not really resting, but I feel too guilty to stop and take a break.

Obvious but useful advice.

  • If you're too stressed or tired to do useful work, you should stop and rest! It's better to fully rest and gain back some energy than keep struggling when you aren't being productive. You could also try using the pomodoro technique of working for set periods of time and taking breaks in between.
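(For anyone curious how data like this gets used: a minimal sketch of few-shot prompt construction from these examples. I'm not claiming this is Elicit's internal formatting, just the generic pattern.)

```python
EXAMPLES = [
    ("I need to think of a research topic but I've only had two ideas and both of them aren't that great.",
     "Have you tried setting a 5-minute timer and, by-the-clock, just trying to write down as many ideas as possible?"),
    ("I've got a test next week and I don't know how best to prepare.",
     "Do you have any practice tests you can sit and then mark to work out where you went wrong?"),
]

def build_prompt(new_problem, examples=EXAMPLES):
    # Reuse the "Life problem / Obvious but useful advice" format from the training data.
    blocks = [f"Life problem\n{p}\n\nObvious but useful advice.\n{a}\n" for p, a in examples]
    blocks.append(f"Life problem\n{new_problem}\n\nObvious but useful advice.\n")
    return "\n".join(blocks)
```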
Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-04-30T14:48:28.610Z · LW · GW

The ideal version of the task is decomposable into:

  • find the high level concepts in a paper (high level here meaning 'high level background required')
  • From a concept, generate the highest level prerequisite concepts
  • For a given concept, generate a definition/explanation (either by finding and summarising a paper/article, or just directly producing one)
     

The last of these tasks seems very similar to a few things Elicit is already doing or at least trying to do, so I'll generate instances of the other two. 



Identify some high-level concepts in a paper

Example 1

Input:  This post by Nuno Sempere

Output: Suggestions for high level concepts

  • Counterfactual impact
  • Shapley Value
  • Funging
  • Leverage
  • Computability
     

Notes: In one sense the 'obvious' best suggestion for the above post is 'Shapley value', given that's what the post is about, and it's therefore the most central concept one might want to generate background on. I think I'd be fine with, and probably prefer, the output above though, where there's some list of <10 concepts. In a model which had some internal representation of the entirety of human knowledge, and purely selected the single thing with the most precursors, my (very uncertain) guess is that computability might be the single output produced, even though it's non-central to the post and only appears in a footnote. That's part of the reason why I'd be relatively happy for the output of this first task to roughly be 'complicated vocabulary which gets used in the paper'
 

Example 2
 

Input: Eliciting Latent Knowledge by Mark Xu and Paul Christiano

Output: Suggestions for high level concepts

  • Latent Knowledge
  • Ontology
  • Bayesian Network
  • Imitative Generalisation
  • Regularisation
  • Indirect Normativity
     

Notes:  This is actually a list of terms I noted down as I was reading the paper, so rather than 'highest level' it's just 'what Alex happened to think it was worth looking up',  but for illustrative purposes I think it's fine.

Having been given a high-level concept, generate prerequisite concepts

Notes: I noticed when trying to generate background concepts here that in order to do so it was most useful to have the context of the post. This pushed me in the direction of thinking these concepts were harder to fully decompose than I had thought, and suggested that the input might need to be '[concept], as used in [paper]', rather than just [concept]. All of the examples below come from the examples above. In some cases, I've indicated what I expect a second layer of recursion might produce, though it seems possible that one might just want the model to recurse one or more times by default.

I found the process of generating examples really difficult, and am not happy with them. I notice that what I kept wanting to do was write down 'high-level' concepts. Understanding the entirety of a few high-level concepts is often close to sufficient to understand an idea, but it's not usually necessary. With a smooth recursion UX (maybe by clicking), I think the ideal output almost invariably generates low-mid level concepts with the first few clicks. The advantages of this are that if the user recognises a concept they know they are done with that branch, and narrower concepts are easier to generate definitions for without recursing. Unfortunately, sometimes there are high level prerequisites which aren't obviously going to be generated by recursing on the lower level ones. I don't have a good solution to this yet.


Input: Shapley Value

Output:

  • Expected value
    • Weighted average
    • Elementary probability
    • Utility
  • Marginal contribution
    • Payoff
    • Agent
    • Fixed cost
    • Variable cost
    • Utility

Input: Computability

Output:

  • Computational problem
  • Computable function
  • Turing Machine
  • Computational complexity

Notes: I started recursing, quickly confirmed my hypothesis from earlier about this being by miles the thing with the most prerequisites, and deleted everything except what I had for 'level 1', which I also left unfinished before I got completely lost down a rabbithole.

Input: Bayesian Network

Output:

  • Probabilistic inference
    • Bayes' Theorem
    • Probability distribution
  • Directed Acyclic Graph
    • Directed Graph
      • Graph (Discrete Mathematics)
        • Vertex
        • Edge
    • Cycle
      • Trail
      • Graph (Discrete Mathematics)
        • Vertex
        • Edge

Notes: Added a few more layers of recursion to demonstrate both that you probably want some kind of dynamic tree structure, and also that not every prerequisite is equally 'high level'.
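The 'dynamic tree structure' point maps naturally onto a simple recursive representation; a sketch, with `generate_prerequisites` standing in for the Elicit/LLM call (which is assumed, not real):

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    prerequisites: list = field(default_factory=list)

def expand(name, generate_prerequisites, depth=2):
    """Recursively build a prerequisite tree. In the imagined UX the user
    expands nodes by clicking; the depth cap just keeps this sketch finite."""
    node = Concept(name)
    if depth > 0:
        for prereq in generate_prerequisites(name):
            node.prerequisites.append(expand(prereq, generate_prerequisites, depth - 1))
    return node
```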


Conclusions from trying to generate examples

This is a much harder, but much more interesting, problem than I'd originally expected. Which prerequisites seem most important, how narrowly to define them, and how much to second-guess myself, all ended up feeling pretty intractable. I may try with some (much) simpler examples later, rather than trying to generate them from papers I legitimately found interesting. If an LLM is able to generalise the idea of 'necessary prerequisites' from easier concepts to harder ones, this itself seems extremely interesting and valuable.

Comment by Alex Lawsen (alex-lszn) on Prize for Alignment Research Tasks · 2022-04-29T11:59:19.090Z · LW · GW

Task: Identify key background knowledge required to understand a concept

  • Context: Many people are currently self-directing their learning in order to eventually be able to usefully contribute to alignment research. Even among experienced researchers, people will sometimes come across concepts that require background they don't have in order to understand them. By 'key' background content, I'm imagining that the things which get identified are 'one step back' in the chain, or something like 'the required background concepts which themselves require the most background'. This seems like the best way of making the tool useful: if the background concepts generated are themselves not understood by the user, they can just use the tool again on those concepts.
  • Input type: A paper (with the idea that part of the task is to identify the highest level concepts in the paper). It would also be reasonable to just have the name of a concept, with a separate task of 'generate the highest level concept'.
  • Output type: At minimum, a list of concepts which are key background. Better would be a list of these concepts plus summaries of papers/textbooks/wikipedia entries which explain them.
  • Info considerations: This system is not biased towards alignment over capabilities, though I think it will in practice help alignment work more than capabilities work, due to the former being less well-served by mainstream educational material and courses. This does mean that having scraped LW and the alignment forum, alignment-relevant things on ArXiv, MIRI's site etc. would be particularly useful
     

I don't have capacity today to generate instances, though I plan to come back and do so. I'm happy to share credit if someone else jumps in first and does so though!

Comment by Alex Lawsen (alex-lszn) on Takeoff speeds have a huge effect on what it means to work on AI x-risk · 2022-04-14T06:51:00.267Z · LW · GW

Roughly, "avoid your actions being labelled as bad by humans [or models of humans]" is not quite the same signal as "don't be bad".

Comment by Alex Lawsen (alex-lszn) on Incentive Problems With Current Forecasting Competitions. · 2020-11-09T18:48:00.906Z · LW · GW

Prediction markets also have problems around events with very low probability, as it is unattractive to place lay bets on low-probability events, both qualitatively (people don't like risking a lot of money for a small reward), and also quantitatively (you can often get similar returns just investing the money, usually at lower risk). 
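A quick worked example of the quantitative problem, with purely illustrative numbers:

```python
# Laying a 2% event at fair decimal odds of 50: risk 49 units to win 1.
liability = 49
profit_if_right = 1
print(f"{profit_if_right / liability:.1%}")  # 2.0% return on the capital tied up

# If that money could instead sit in an ordinary investment returning a similar
# few percent over the same horizon, at lower risk of total loss, there's
# little incentive to take the lay side.
```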

The latter of these problems is in theory solvable by an exchange paying interest on stakes, or by using fractions of stocks as currency, but neither of these options is implemented in a major market.