Thanks for writing this up-- at least for myself, I think I agree with the majority of this, and it articulates some important parts of how I live my life in ways that I hadn't previously made explicit for myself.
I think your first point basically covers why-- people are worried about alignment difficulties in superhuman systems in particular (because those are the dangerous systems that can cause existential failures). I think a lot of current RLHF work is focused on providing reward signals to current systems in ways that don't directly address the problem of "how do we reward systems whose behaviors have consequences too complicated for humans to understand".
Chris Olah wrote this topic prompt (with some feedback from me (Asya) and Nick Beckstead). We didn’t want to commit him to being responsible for this post or responding to comments on it, so we submitted this on his behalf. (I've changed the by-line to be more explicit about this.)
Thanks for writing this! Would "fine-tune on some downstream task and measure the accuracy on that task before and after fine-tuning" count as measuring misalignment as you're imagining it? My sense is that there might be a bunch of existing work like that.
This RFP is an experiment for us, and we don't yet know if we'll be doing more of them in the future. I think we'd be open to including research directions that we think are promising and apply equally well to both DL and non-DL systems-- I'd be interested in hearing any particular suggestions you have.
(We'd also be happy to fund particular proposals in the research directions we've already listed that apply to both DL and non-DL systems, though we will be evaluating them on how well they address the DL-focused challenges we've presented.)
Getting feedback in the next week would be ideal; September 15th will probably be too late.
Different request for proposals!
Thank you so much for writing this! I've been confused about this terminology for a while and I really like your reframing.
An additional terminological point that I think it would be good to solidify is what people mean when they refer to "inner alignment" failures. As you allude to, my impression is that some people use it to refer to objective robustness failures broadly, whereas others (e.g. Evan) use it to refer to failures that involve mesa optimization. There is then additional confusion around whether we should think that "inner alignment" failures that don't involve mesa optimization will be catastrophic and, relatedly, around whether humans count as mesa optimizers.
I think I'd advocate for letting "inner alignment" failures refer to objective robustness failures broadly, talking about "mesa optimization failures" as such, and then leaving the question about whether there are problematic inner alignment failures that aren't mesa optimization-related on the table.
I feel pretty bad about both of your current top two choices (Bellingham or Peekskill) because they seem too far from major cities. I worry this distance will seriously hamper your ability to hire good people, which is arguably the most important thing MIRI needs to be able to do. [Speaking personally, not on behalf of Open Philanthropy.]
Announcement: "How much hardware will we need to create AGI?" was actually inspired by a conversation I had with Ronny Fernandez and forgot about, credit goes to him for the original idea of using 'the weights of random objects' as a reference class.
https://i.imgflip.com/1xvnfi.jpg
I think it would be kind of cool if LessWrong had built-in support for newsletters. I would love to see more newsletters about various tech developments, etc. from LessWrongers.
Planned summary for the Alignment Newsletter:
This post describes the author’s insights from extrapolating the performance of GPT on the benchmarks presented in the <@GPT-3 paper@>(@Language Models are Few-Shot Learners@). The author compares cross-entropy loss (which measures how good a model is at predicting the next token) with benchmark performance normalized to the difference between random performance and the maximum possible performance. Since <@previous work@>(@Scaling Laws for Neural Language Models@) has shown that cross-entropy loss scales smoothly with model size, data, and FLOP requirements, we can then look at the overall relationship between those inputs and benchmark performance.
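For readers unfamiliar with this kind of normalization, here is a minimal sketch (the function name and example numbers are mine, not the post's):

```python
def normalize_benchmark(score, random_baseline, max_score):
    """Rescale a raw benchmark score so that random performance maps to 0
    and the best possible performance maps to 1."""
    return (score - random_baseline) / (max_score - random_baseline)

# e.g. on a 4-way multiple-choice task, random guessing gives 25% accuracy:
print(normalize_benchmark(0.62, 0.25, 1.0))  # ~0.49, i.e. about halfway to perfect
```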
The author finds that most of the benchmarks scale smoothly and similarly with respect to cross-entropy loss. Three exceptions are arithmetic, scramble (shuffling letters around the right way), and ANLI (a benchmark generated adversarially against transformer-based language models), which don't improve until the very end of the cross-entropy loss range. The author fits linear and s-shaped curves to these relationships (a rough curve-fitting sketch follows the list below), and guesses that:
- Performance improvements are likely to slow down closer to maximum performance, making s-curves a better progress estimate than linear curves.
- Machine learning models may use very different reasoning from humans to get good performance on a given benchmark, so human-level performance on any single benchmark would likely not be impressive, but human-level performance on almost all of them with few examples might be.
- We might care about the point where we can achieve human-level performance on all tasks with a 1 token "horizon length"-- i.e., all tasks where just 1 token is enough of a training signal to understand how a change in the model affects its performance. (See <@this AI timelines report draft@>(@Draft report on AI timelines@) for more on horizon length.) Achieving this milestone is likely to be _more_ difficult than getting to human-level performance on these benchmarks, but since scaling up GPT is just one way to do these tasks, the raw number of parameters required for this milestone could just as well be _less_ than the number of parameters that GPT needs to beat the benchmarks.
- Human-level performance on these benchmarks would likely be enough to automate lots of particular short horizon length tasks, such as customer service, PA and RA work, and writing routine sections of code.
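Here is the rough curve-fitting sketch referenced above. It is my own illustration, not the author's code: the logistic functional form, the placeholder (loss, performance) points, and the use of scipy are all assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def s_curve(loss, k, x0):
    """Logistic curve mapping cross-entropy loss to normalized benchmark
    performance (0 = random, 1 = max); performance rises as loss falls."""
    return 1.0 / (1.0 + np.exp(k * (loss - x0)))

# Placeholder (cross-entropy loss, normalized performance) points standing in
# for benchmark measurements at increasing model scales.
losses = np.array([3.0, 2.6, 2.3, 2.0, 1.8])
perf = np.array([0.05, 0.15, 0.35, 0.60, 0.75])

(k, x0), _ = curve_fit(s_curve, losses, perf, p0=[5.0, 2.1])

# Extrapolate: what loss would this fit imply is needed for ~95% of max performance?
target_loss = x0 + np.log(1 / 0.95 - 1) / k
print(k, x0, target_loss)
```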
The author augments his s-curves graph with references to certain data, FLOP, and parameter levels, including the number of words in common crawl, the number of FLOPs that could currently be bought for $1B, the point where reading or writing one word would cost 1 cent, and the number of parameters in a transformative model according to <@this AI timelines report draft@>(@Draft report on AI timelines@). (I recommend looking at the graph of these references to see their relationship to the benchmark trends.)
Overall, the author concludes that:
- GPT-3's benchmark performance is in line with the smooth trends predicted by smaller models. It sharply increases performance on arithmetic and scramble tasks, which the author thinks is because the tasks are 'narrow' in that they are easy once you understand their one trick. The author now finds it less likely that a small amount of scaling will suddenly lead to a large jump in performance on a wide range of tasks.
- Close to optimal performance on these benchmarks seems like it's at least ~3 orders of magnitude away ($1B at current prices). The author thinks that, more likely than not, we'd get there after increasing the training FLOP by ~5-6 orders of magnitude ($100B - $1T at current prices, $1B - $10B given estimated hardware and software improvements over the next 5 - 10 years). The author thinks this would probably not be enough to be transformative, but thinks we should prepare for that eventuality anyway.
- The number of parameters estimated for human-equivalent performance on these benchmarks (~1e15) is close to the median number of parameters given in <@this AI timelines report draft@>(@Draft report on AI timelines@), which is generated via comparison to the computation done in the human brain.
Planned opinion:
Ask and ye shall receive! In my <@last summary@>(@Scaling Laws for Autoregressive Generative Modeling@), I mentioned that I was uncertain about how cross-entropy loss translates to transformative progress that we care about, and here is an excellent post exploring just that question. I'm sure I'll end up referencing this many times in the future.
The post discusses both what benchmarks might suggest for forecasting "human equivalence" and how benchmarks might relate to economic value via concrete task automation. I agree with the tasks the author suggests for the latter, and continuing my "opinions as calls for more work" trend, I'd be interested in seeing even more work on this-- i.e. attempts to decompose tasks into a set of concrete benchmark performances which would be sufficient for economically valuable automation. This comment thread discusses whether current benchmarks are likely to capture a substantial portion of what is necessary for economic value, given that many jobs end up requiring a diverse portfolio of skills and reasoning ability. It seems plausible to me that AI-powered automation will be "discontinuous" in that a lot of it will be unlocked only when we have a system that's fairly general.
It seems quite noteworthy that the parameter estimates here and in the AI timelines report draft are close together, even though one is anchored to human-level benchmark performance, and the other is anchored to brain computation. That updates me in the direction of those numbers being in the right range for human-like abilities.
People interested in this post may also be interested in [BIG-bench](https://github.com/google/BIG-Bench), a project to crowdsource the mother of all benchmarks for language models.
AI Impacts now has a 2020 review page so it's easier to tell what we've done this year-- this should be more complete / representative than the posts listed above. (I appreciate how annoying the continuously updating wiki model is.)
From Part 4 of the report:
> Nonetheless, this cursory examination makes me believe that it’s fairly unlikely that my current estimates are off by several orders of magnitude. If the amount of computation required to train a transformative model were (say) ~10 OOM larger than my estimates, that would imply that current ML models should be nowhere near the abilities of even small insects such as fruit flies (whose brains are 100 times smaller than bee brains). On the other hand, if the amount of computation required to train a transformative model were ~10 OOM smaller than my estimate, our models should be as capable as primates or large birds (and transformative AI may well have been affordable for several years).
I'm not sure I totally follow why this should be true-- is this predicated on already assuming that the computation to train a neural network equivalent to a brain with N neurons scales in some particular way with respect to N?
So exciting that this is finally out!!!
I haven't gotten a chance to play with the models yet, but thought it might be worth noting the ways I would change the inputs (though I haven't thought about it very carefully):
- I think I have a lot more uncertainty about neural net inference FLOP/s vs. brain FLOP/s, especially given that the brain is significantly more interconnected than the average 2020 neural net-- probably closer to 3 - 5 OOM standard deviation.
- I think I also have a bunch of uncertainty about algorithmic efficiency progress-- I could imagine e.g. that the right model would be several independent processes, all of which constrain progress, so I'd probably make that some kind of broad distribution as well (a rough sketch of the kind of thing I mean is below).
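To gesture at the kind of input distributions I have in mind, here is a toy Monte Carlo sketch. All of the specific numbers, and the modeling of algorithmic progress as bottlenecked by the slowest of several independent processes, are placeholder assumptions of mine rather than anything from the report.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Brain FLOP/s vs. neural-net-inference FLOP/s gap, in orders of magnitude
# relative to one's central estimate: a wide normal in log10 space
# (placeholder standard deviation of ~4 OOM).
flops_gap_oom = rng.normal(loc=0.0, scale=4.0, size=n)

# Algorithmic efficiency progress (in OOM) modeled as bottlenecked by the
# slowest of three independent improvement processes, each lognormal with a
# placeholder median of ~2 OOM; taking the minimum yields a broad distribution
# shifted toward slower progress.
processes = rng.lognormal(mean=np.log(2.0), sigma=0.7, size=(n, 3))
alg_progress_oom = processes.min(axis=1)

print(np.percentile(flops_gap_oom, [10, 50, 90]))
print(np.percentile(alg_progress_oom, [10, 50, 90]))
```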
I'm a bit confused about this as a piece of evidence-- naively, it seems to me like not carrying the 1 would be a mistake that you would make if you had memorized the pattern for single-digit arithmetic and were just repeating it across the number. I'm not sure if this counts as "memorizing a table" or not.
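To make the error pattern I'm describing concrete, here is a toy sketch of digit-wise addition that applies the memorized single-digit table to each column but drops the carry (my illustration of the mistake, not a claim about GPT-3's actual mechanism):

```python
def add_without_carrying(a: int, b: int) -> int:
    """Add two numbers column by column using single-digit addition,
    but keep only the ones digit of each column (i.e. never carry the 1)."""
    a_digits, b_digits = str(a)[::-1], str(b)[::-1]
    result = []
    for i in range(max(len(a_digits), len(b_digits))):
        da = int(a_digits[i]) if i < len(a_digits) else 0
        db = int(b_digits[i]) if i < len(b_digits) else 0
        result.append(str((da + db) % 10))  # drop the carry
    return int("".join(result[::-1]))

print(add_without_carrying(48, 37))  # 75, instead of the correct 85
```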
This recent post by OpenAI is trying to shed some light on this question: https://openai.com/blog/ai-and-efficiency/
I really like this post.
Self-driving cars are currently illegal, I assume largely because of these unresolved tail risks. But setting illegality aside, I'm not sure their economic value is zero-- I could imagine cases where people would use self-driving cars if they wouldn't be caught doing it. Does this seem right to people?
Intuitively it doesn't seem like economic value tails and risk tails should necessarily go together, which makes me concerned about cases similar to self-driving cars that are harder to regulate legally.
What's the corresponding story here for trading bots? Are they designed in a sufficiently high-assurance way that new tail problems don't come up, or do they not operate in the tails?
I rewrote the question-- I think I meant 'counterfactual' in that this isn't a super promising idea if in fact we are just taking medical supplies from one group of people and transferring them to another.
I don't know anything about maintenance/cleaning; I was thinking it would be particularly useful if we straight up run out of ICU space-- i.e., for cases where there is no alternative to going to an ICU. (Maybe this is a super unlikely class of scenarios?)
You're totally not obligated to do this, but I think it might be cool if you generated a 3D picture of hills representing your loss function-- I think it would make the intuition for what's going on clearer.
We're not going to do this because we weren't planning on making these public when we conducted the conversations, so we want to give people a chance to make edits to transcripts before we send them out (which we can't do with audio).
The claim is a personal impression that I have from conversations, largely with people concerned about AI risk in the Bay Area. (I also don't like information cascades, and may edit the post to reflect this qualification.) I'd be interested in data on this.