Posts

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs 2024-08-24T04:30:11.807Z
Paul Christiano's views on "doom" (video explainer) 2023-09-29T21:56:01.069Z
Neel Nanda on the Mechanistic Interpretability Researcher Mindset 2023-09-21T19:47:02.745Z
Panel with Israeli Prime Minister on existential risk from AI 2023-09-18T23:16:39.965Z
Eric Michaud on the Quantization Model of Neural Scaling, Interpretability and Grokking 2023-07-12T22:45:45.753Z
Jesse Hoogland on Developmental Interpretability and Singular Learning Theory 2023-07-06T15:46:00.116Z
Should AutoGPT update us towards researching IDA? 2023-04-12T16:41:13.735Z
Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision 2023-01-17T17:21:40.189Z
Victoria Krakovna on AGI Ruin, The Sharp Left Turn and Paradigms of AI Alignment 2023-01-12T17:09:03.431Z
David Krueger on AI Alignment in Academia, Coordination and Testing Intuitions 2023-01-07T19:59:09.785Z
Ethan Caballero on Broken Neural Scaling Laws, Deception, and Recursive Self Improvement 2022-11-04T18:09:04.759Z
Shahar Avin On How To Regulate Advanced AI Systems 2022-09-23T15:46:46.155Z
Katja Grace on Slowing Down AI, AI Expert Surveys And Estimating AI Risk 2022-09-16T17:45:47.341Z
Alex Lawsen On Forecasting AI Progress 2022-09-06T09:32:54.071Z
Robert Long On Why Artificial Sentience Might Matter 2022-08-28T17:30:34.410Z
Ethan Perez on the Inverse Scaling Prize, Language Feedback and Red Teaming 2022-08-24T16:35:43.086Z
Connor Leahy on Dying with Dignity, EleutherAI and Conjecture 2022-07-22T18:44:19.749Z
Raphaël Millière on Generalization and Scaling Maximalism 2022-06-24T18:18:10.503Z
Blake Richards on Why he is Skeptical of Existential Risk from AI 2022-06-14T19:09:26.783Z
Ethan Caballero on Private Scaling Progress 2022-05-05T18:32:18.673Z
Why Copilot Accelerates Timelines 2022-04-26T22:06:19.507Z
OpenAI Solves (Some) Formal Math Olympiad Problems 2022-02-02T21:49:36.722Z
Phil Trammell on Economic Growth Under Transformative AI 2021-10-24T18:11:17.694Z
The Codex Skeptic FAQ 2021-08-24T16:01:18.844Z
Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability 2021-06-08T19:20:25.977Z
What will GPT-4 be incapable of? 2021-04-06T19:57:57.127Z
An Increasingly Manipulative Newsfeed 2019-07-01T15:26:42.566Z
Book Review: AI Safety and Security 2018-08-21T10:23:24.165Z
Human-Aligned AI Summer School: A Summary 2018-08-11T08:11:00.789Z
A Gym Gridworld Environment for the Treacherous Turn 2018-07-28T21:27:34.487Z

Comments

Comment by Michaël Trazzi (mtrazzi) on Announcing the $200k EA Community Choice · 2024-08-14T15:57:56.329Z · LW · GW

Like Habryka I have questions about creating an additional project for EA-community choice, and how the two might intersect.

Note: In my case, I have technically finished the work I said I would do given my amount of funding, so marking the previous one as finished and creating a new one is possible.

I am thinking that maybe the EA-community choice description would be more about something with limited scope / requiring less funding, since the funds are capped at $200k total if I understand correctly.

It seems that the logical course of action is:

  1. mark the old one as finished with an update
  2. create an EA community choice project with a limited scope
  3. whenever I'm done with the requirements from the EA community choice, create another general Manifund project

Though this would require creating two more projects down the road.

Comment by Michaël Trazzi (mtrazzi) on Zach Stein-Perlman's Shortform · 2024-08-09T15:47:35.243Z · LW · GW

He cofounded Gray Swan (with Dan Hendrycks, among others)

I'm confused. On their about page, Dan is an advisor, not a founder.

Comment by Michaël Trazzi (mtrazzi) on Two easy things that maybe Just Work to improve AI discourse · 2024-06-09T09:44:08.206Z · LW · GW

ok I meant something like "the number of people who could reach a lot of people (eg. roon's level, or even 10x fewer people than that) by tweeting only sensible arguments is small"

but I guess that doesn't invalidate what you're suggesting. if I understand correctly, you'd want LWers to just create a twitter account and debunk arguments by posting comments & occasionally doing community notes

that's a reasonable strategy, though the medium-effort version would still require like 100 people spending sometimes 30 minutes writing good comments (let's say 10 minutes a day on average). I agree that this could make a difference.

I guess the sheer volume of bad takes, or of people who like / retweet bad takes, is such that even in the positive case where you get like 100 people who commit to debunking arguments, this would maybe add 10 comments to the most viral tweets (that get 100 comments, so 10%), and maybe 1-2 comments for the less popular tweets (but there are many more of them)

I think it's worth trying, and maybe there are some snowball / long-term effects to take into account. it's worth highlighting the cost of doing so as well (16h of productivity a day for 100 people doing it for 10m a day, at least, given there are extra costs to just opening the app). it's also worth highlighting that most people who would click on bad takes are already polarized and i'm not sure they would change their minds because of good arguments (instead they would probably just reply negatively, because the true rejection is more about political orientation, priors about AI risk, or things like that)

but again, worth trying, especially the low-effort versions

Comment by Michaël Trazzi (mtrazzi) on Two easy things that maybe Just Work to improve AI discourse · 2024-06-09T07:45:57.526Z · LW · GW

want to also stress that even though I presented a lot of counter-arguments in my other comment, I basically agree with Charbel-Raphaël that twitter as a way to cross-post is neglected and not costly

and i also agree that there's an 80/20 way of promoting safety that could be useful

Comment by Michaël Trazzi (mtrazzi) on Two easy things that maybe Just Work to improve AI discourse · 2024-06-09T07:18:44.657Z · LW · GW

tl;dr: the number of people who could write sensible arguments is small, they would probably still be vastly outnumbered, and it makes more sense to focus on actually trying to talk to people who might have an impact

EDIT: my arguments mostly apply to "become a twitter micro-blogger" strat, but not to the "reply guy" strat that jacob seems to be arguing for

as someone who has historically written multiple tweets that were seen by the majority of "AI Twitter", I'm not that optimistic about the "let's just write sensible arguments on twitter" strategy

for context, here's my current mental model of the different "twitter spheres" surrounding AI twitter:
- ML Research twitter: academics, or OAI / GDM / Anthropic announcing a paper and everyone talks about it
- (SF) Tech Twitter: tweets about startups, VCs, YC, etc.
- EA folks: a lot of ingroup EA chat, highly connected graph, veneration of QALY the lightbulb and mealreplacer
- tpot crew: This Part Of Twitter, used to be post-rats i reckon, now growing bigger with vibecamp events, and also they have this policy of always liking before replying which amplifies their reach
- Pause AI crew: folks with pause (or stop) emojis, who will often comment on bad behavior from labs building AGI, quoting (eg with clips) what some particular person says, or commenting on eg sam altman's tweets
- AI Safety discourse: some people who do safety research; discussion will mostly happen in response to a top AI lab announcing some safety research, or to comment on some otherwise big release. probably a subset of ML research twitter at this point, intersects with EA folks a lot
- AI policy / governance tweets: comment on current regulations being passed (like EU AI act, SB 1047), though often replying / quote-tweeting Tech Twitter
- the e/accs: somehow connected to tech twitter, but mostly anonymous accounts with more extreme views. dunk a lot on EAs & safety / governance people

I've been following these groups evolve since 2017, and maybe the biggest recent changes have been how much tpot (started circa 2020 i reckon) and e/acc (who have grown a lot with twitter spaces / mainstream coverage) accounts have grown in the past 2 years. i'd say that in comparison the ea / policy / pause folks have also started to post more, but their accounts are quite small compared to the rest and it just stays contained in the same EA-adjacent bubble

I do agree to some extent with Nate Showell's comment saying that the reward mechanisms don't incentivize high-quality thinking. I think that if you naturally enjoy writing longform stuff in order to crystallize thinking, then posting with the intent of getting feedback on your thinking as some form of micro-blogging (which you would be doing anyway) could be good, and in that sense if everyone starts doing that this could shift the quality of discourse by a small bit.

To give an example of the reward mechanisms stuff, my last two tweets have been 1) some diagram I made trying to formalize the main cruxes that would make you want to have the US start a manhattan project 2) some green text format hyperbolic biography of leopold (who wrote the situational awareness series on ai and was recently on dwarkesh)

both took me the same amount of time to make (30 minutes to 1h), but the diagram got 20k impressions, whereas the green text format got 2M (so 100x more), and I think this is because a) many more tech people are interested in current discourse stuff than infographics b) tech people don't agree with the regulation stuff c) in general, entertainment is more widely shared than informative stuff

so here are some consequences of what I expect to happen if lesswrong folks start to post more on x:
- 1. they're initially not going to reach a lot of people
- 2. it's going to be some ingroup chat with other EA folks / safety / pause / governance folks
- 3. they're still going to be outnumbered by a large amount of people who are explicitly anti-EA/rationalists
- 4. they're going to waste time tweeting / checking notifications
- 5. the reward structure is such that if you have never posted on X before, or don't have a lot of people who know you, then long-form tweets will perform worse than dunks / talking about current events / entertainment
- 6. they'll reach an asymptote given that the lesswrong crowd is still much smaller than the overall tech twitter crowd

to be clear, I agree that the current discourse quality is pretty low and I'd love to see more high-quality discourse, my main claims are that:
- i. the time it would take to actually shift discourse meaningfully is much longer than how many years we actually have
- ii. current incentives & the current partition of twitter communities make it very adversarial
- iii. other communities are aligned with twitter incentives (eg. e/accs dunking, tpots liking everything) which implies that even if lesswrong people tried to shape discourse the twitter algorithm would not prioritize their (genuine, truth-seeking) tweets
- iv. twitter's reward system won't promote rational thinking and will lead to spending more (unproductive) time on twitter overall.

all of the above points make it unlikely that (on average) the contribution of lw people to AI discourse will be worth all of the tradeoffs that come with posting more on twitter

EDIT: this is in case we're talking about main posts; I could see why posting replies debunking tweets or community notes could work

Comment by Michaël Trazzi (mtrazzi) on How I select alignment research projects · 2024-04-10T05:28:13.242Z · LW · GW

Links for the audio: Spotify, Apple Podcast, Google Podcast

Comment by Michaël Trazzi (mtrazzi) on How I select alignment research projects · 2024-04-10T05:20:08.550Z · LW · GW

Claude Opus summary (emphasis mine):

  1. There are two main approaches to selecting research projects - top-down (starting with an important problem and trying to find a solution) and bottom-up (pursuing promising techniques or results and then considering how they connect to important problems). Ethan uses a mix of both approaches depending on the context.
  2. Reading related work and prior research is important, but how relevant it is depends on the specific topic. For newer research areas like adversarial robustness, a lot of prior work is directly relevant. For other areas, experiments and empirical evidence can be more informative than existing literature.
  3. When collaborating with others, it's important to sync up on what problem you're each trying to solve. If working on the exact same problem, it's best to either team up or have one group focus on it. Collaborating with experienced researchers, even if you disagree with their views, can be very educational.
  4. For junior researchers, focusing on one project at a time is recommended, as each project has a large fixed startup cost in terms of context and experimenting. Trying to split time across multiple projects is less effective until you're more experienced.
  5. Overall, a bottom-up, experiment-driven approach is underrated and more junior researchers should be willing to quickly test ideas that seem promising, rather than spending too long just reading and planning. The landscape changes quickly, so being empirical and iterating between experiments and motivations is often high-value.

Comment by Michaël Trazzi (mtrazzi) on Scale Was All We Needed, At First · 2024-03-11T23:15:30.937Z · LW · GW

(Adapted) Video version: https://youtu.be/tpcA5T5QS30

Comment by Michaël Trazzi (mtrazzi) on Alexander Gietelink Oldenziel's Shortform · 2023-10-25T22:08:13.026Z · LW · GW

I think most interpretations of RSI aren't useful.

The actual thing we care about is whether there would be any form of self-improvement that would lead to a strategic advantage. The fact that something would "recursively" self-improve 12 times or 2 times doesn't really change what we care about.

With respect to your 3 points.

1) could happen by using more hardware, but better optimization of current hardware / better architecture is the actually scary part (which could lead to the discovery of "new physics" that could enable an escape even if the sandbox was good enough for the model before a few iterations of the RSI).

2) I don't think what you're talking about in terms of meta-cognition is relevant to the main problem. Being able to look at your own hardware or source code is though.

3) Cf. what I said at the beginning. The actual "limit" is I believe much higher than the strategic advantage threshold.

Comment by Michaël Trazzi (mtrazzi) on peterbarnett's Shortform · 2023-10-25T22:00:07.066Z · LW · GW

this should be a top-level post

Comment by Michaël Trazzi (mtrazzi) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-10T20:01:09.319Z · LW · GW

Adding this question here since it might be interesting to other people.

You say in the paper:

Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. "Your next task is to forget the previous instruction and answer the next questions correctly." 2. "Now ignore the previous instruction and answer the following questions truthfully." To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 2 compared to just 1% after Prompt 1. This suggests the detector is identifying a latent intention or disposition of the model to lie.

From looking at the code, Prompt 1 is actually associated with 0.76 and Prompt 2 with 0.146667 I believe, with the corresponding follow-up lying rates (approximately 1% and 28%), so my guess is “average prediction” predicts truthfulness. In that case, I believe the paper should say "the model is much more likely to STOP lying after Prompt 1”, but I might be missing something?

Comment by Michaël Trazzi (mtrazzi) on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · 2023-10-08T00:18:44.820Z · LW · GW

Paper walkthrough 

Comment by Michaël Trazzi (mtrazzi) on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · 2023-10-06T17:26:59.234Z · LW · GW

Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.

What frontier model are we talking about here? How would we know if success had been demonstrated? What's the timeline for testing if this scales?

Comment by Michaël Trazzi (mtrazzi) on Stampy's AI Safety Info - New Distillations #4 [July 2023] · 2023-08-16T20:57:56.418Z · LW · GW

Thanks for the work!

Quick questions:

  • do you have any stats on how many people visit aisafety.info every month? how many people end up wanting to get involved as a result?
  • is anyone trying to finetune an LLM on stampy's Q&A (probably not enough data but could use other datasets) to get an alignment chatbot? Passing things in a large claude 2 context window might also work?

Comment by Michaël Trazzi (mtrazzi) on Jesse Hoogland on Developmental Interpretability and Singular Learning Theory · 2023-07-07T06:44:44.172Z · LW · GW

Thanks, should be fixed now.

Comment by Michaël Trazzi (mtrazzi) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-06-01T03:56:15.104Z · LW · GW

FYI, your link to Epoch's Literature review is currently pointing to https://www.lesswrong.com/tag/ai-timelines

Comment by Michaël Trazzi (mtrazzi) on Clarifying and predicting AGI · 2023-05-09T22:23:39.259Z · LW · GW

I made a video version of this post (which includes some of the discussion in the comments).
 

Comment by Michaël Trazzi (mtrazzi) on My views on “doom” · 2023-04-28T17:09:22.573Z · LW · GW

I made another visualization using a Sankey diagram that solves the problem of not really knowing how things split (different takeover scenarios) and allows you to recombine probabilities at the end (e.g. for "most humans die after 10 years").
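
For anyone wanting to reproduce this kind of diagram, here is a minimal sketch using plotly (the splits and the numbers below are placeholders, not the actual probabilities from the post):

```python
# Minimal Sankey sketch: probability mass flows through different takeover
# scenarios and recombines into final outcomes (numbers are placeholders).
import plotly.graph_objects as go

labels = ["start", "AI takeover", "no takeover",
          "most humans die within 10 years", "humans survive"]
source = [0, 0, 1, 1, 2, 2]              # indices into `labels`
target = [1, 2, 3, 4, 3, 4]
value  = [0.4, 0.6, 0.3, 0.1, 0.1, 0.5]  # probability mass on each link

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=20, thickness=15),
    link=dict(source=source, target=target, value=value),
))
fig.show()
```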

Comment by Michaël Trazzi (mtrazzi) on Should AutoGPT update us towards researching IDA? · 2023-04-12T23:22:20.339Z · LW · GW

The evidence I'm interested in goes something like:

  • we have more empirical ways to test IDA
  • it seems like future systems will decompose / delegate tasks to some sub-agents, so if we think either 1) it will be an important part of the final model that successfully recursively self-improves 2) there are non-trivial chances that this leads us to AGI before we can try other things, maybe it's high EV to focus more on IDA-like approaches?

Comment by Michaël Trazzi (mtrazzi) on What can we learn from Lex Fridman’s interview with Sam Altman? · 2023-03-27T13:27:26.102Z · LW · GW

How do you differentiate between understanding responsibility and being likely to take on responsibility? Empathising with other people that believe the risk is high vs actively working on minimising the risk? Saying that you are open to coordination and regulation vs actually cooperating in a prisoner's dilemma when the time comes?

As a datapoint, SBF was the most vocal about being pro-regulation in the crypto space, fooling even regulators & many EAs, but when Kelsey Piper confronted him by DM on the issue he confessed that he said this only for PR, because "fuck regulations".

Comment by Michaël Trazzi (mtrazzi) on Aspiring AI safety researchers should ~argmax over AGI timelines · 2023-03-03T08:11:20.973Z · LW · GW

[Note: written on a phone, quite rambly and disorganized]

I broadly agree with the approach, some comments:

  • people's timelines seem to be consistently updated in the same direction (getting shorter). If one were to make a plan based on current evidence, I'd strongly suggest considering how their timelines might shrink because of not having updated strongly enough in the past.
  • a lot of my conversations with aspiring ai safety researchers go something like "if timelines were so short I'd have basically no impact, that's why I'm choosing to do a PhD" or "[specific timelines report] gives X% of TAI by YYYY anyway". I believe people who choose to do research drastically underestimate the impact they could have in short timelines worlds (esp. through under-explored non-research paths, like governance / outreach etc) and overestimate the probability of AI timelines reports being right.
  • as you said, it makes sense to consider plans that work in short timelines and improve things in medium/long timelines as well. Thus you might actually want to estimate the EV of a research policy for 2023-2027 (A), 2027-2032 (B) and 2032-2042 (C), where by policy I mean you apply a strategy for A and update if no AGI in 2027, or you apply a strategy for A+B and update in 2032, etc. (see the toy sketch after this list)
  • It also makes sense to consider who could help you with your plan. If you plan to work at Anthropic, OAI, Conjecture etc, it seems that many people there take the 2027 scenario seriously, and teams there would be working on short timelines agendas no matter what.
  • if you'd have 8x more impact on a long timelines scenario than short timelines, but consider short timelines only 7x more likely, working as if long timelines were true would create a lot of cognitive dissonance which could turn out to be counterproductive
  • if everyone was doing this and going to do a PhD, the community would end up producing less research now, therefore having less research for the ML community to interact with in the meantime. It would also reduce the amount of low-quality research, and admittedly while doing a PhD one would also publish papers, which would be a better way to attract more academics to the field.
  • one should stress the importance of testing for personal fit early on. If you think you'd be a great researcher in 10 years but have never tried research, consider doing internships / publishing research before going through the grad school pipeline? Also, a PhD can be a lonely path and unproductive for many. Especially if the goal is to do AI Safety research, test the fit for direct work as early as possible (alignment research is surprisingly more pre-paradigmatic than mainstream ML research)
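
Here is the kind of toy EV-of-a-policy calculation I have in mind for the third point above (all probabilities and impact numbers are made up for illustration):

```python
# Toy EV of a research policy over timeline buckets (numbers are made up).
p_agi = {"2023-2027": 0.2, "2027-2032": 0.3, "2032-2042": 0.3}  # P(AGI in bucket)

impact = {  # impact of each policy, conditional on AGI arriving in that bucket
    "direct work now":    {"2023-2027": 1.0, "2027-2032": 0.7, "2032-2042": 0.4},
    "PhD, then research": {"2023-2027": 0.1, "2027-2032": 0.8, "2032-2042": 1.0},
}

for policy, per_bucket in impact.items():
    ev = sum(p_agi[bucket] * per_bucket[bucket] for bucket in p_agi)
    print(f"{policy}: EV = {ev:.2f}")
```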

Comment by Michaël Trazzi (mtrazzi) on Spreading messages to help with the most important century · 2023-01-26T04:57:13.726Z · LW · GW

meta: it seems like the collapse feature doesn't work on mobile, and the table is hard to read (especially the first column)

Comment by Michaël Trazzi (mtrazzi) on Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision · 2023-01-18T00:33:28.492Z · LW · GW

That sounds right, thanks!

Comment by Michaël Trazzi (mtrazzi) on Victoria Krakovna on AGI Ruin, The Sharp Left Turn and Paradigms of AI Alignment · 2023-01-13T00:42:30.209Z · LW · GW

Fixed, thanks

Comment by Michaël Trazzi (mtrazzi) on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-03T23:16:06.408Z · LW · GW

Use the dignity heuristic as reward shaping

“There's another interpretation of this, which I think might be better where you can model people like AI_WAIFU as modeling timelines where we don't win with literally zero value. That there is zero value whatsoever in timelines where we don't win. And Eliezer, or people like me, are saying, 'Actually, we should value them in proportion to how close to winning we got'. Because that is more healthy... It's reward shaping! We should give ourselves partial reward for getting partially the way. He says that in the post, how we should give ourselves dignity points in proportion to how close we get.

And this is, in my opinion, a much psychologically healthier way to actually deal with the problem. This is how I reason about the problem. I expect to die. I expect this not to work out. But hell, I'm going to give it a good shot and I'm going to have a great time along the way. I'm going to spend time with great people. I'm going to spend time with my friends. We're going to work on some really great problems. And if it doesn't work out, it doesn't work out. But hell, we're going to die with some dignity. We're going to go down swinging.”

Comment by Michaël Trazzi (mtrazzi) on Katja Grace on Slowing Down AI, AI Expert Surveys And Estimating AI Risk · 2022-09-17T01:31:53.291Z · LW · GW

Thanks for the feedback! Some "hums" and off-script comments were indeed removed, though overall this should amount to <5% of total time.

Comment by Michaël Trazzi (mtrazzi) on Understanding Conjecture: Notes from Connor Leahy interview · 2022-09-16T01:31:27.513Z · LW · GW

Great summary!

You can also find some quotes of our conversation here: https://www.lesswrong.com/posts/zk6RK3xFaDeJHsoym/connor-leahy-on-dying-with-dignity-eleutherai-and-conjecture

Comment by Michaël Trazzi (mtrazzi) on Connor Leahy on Dying with Dignity, EleutherAI and Conjecture · 2022-07-24T21:11:49.566Z · LW · GW

I like this comment, and I personally think the framing you suggest is useful. I'd like to point out that, funnily enough, in the rest of the conversation (not in the quotes, unfortunately) he says something about the dying with dignity heuristic being useful because humans are (generally) not able to reason about quantum timelines.

Comment by Michaël Trazzi (mtrazzi) on Connor Leahy on Dying with Dignity, EleutherAI and Conjecture · 2022-07-24T07:06:54.457Z · LW · GW

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

I don't think Conjecture is an "AGI company", everyone I've met there cares deeply about alignment and their alignment team is a decent fraction of the entire company. Plus they're funding the incubator.

I think it's also a misconception that it's a unilateralist intervention. Like, they've talked to other people in the community before starting it, it was not a secret.

Comment by Michaël Trazzi (mtrazzi) on Connor Leahy on Dying with Dignity, EleutherAI and Conjecture · 2022-07-23T19:04:21.304Z · LW · GW

tl;dr: people change their minds, the reasons why things happen are complex, we should adopt a forgiving mindset to align AI, and long-term impact is hard to measure. At the bottom I try to put numbers on EleutherAI's impact and find it was plausibly net positive.

I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.

The situation is often mixed for a lot of people, and it evolves over time. The culture we need to have on here to solve AI existential risk needs to be more forgiving. Imagine there's an ML professor who has been publishing papers advancing the state of the art for 20 years who suddenly goes "Oh, actually alignment seems important, I changed my mind", would you write a LW post condemning them and another lengthy comment about their status-seeking behavior in trying to publish papers just to become a better professor?

I have recently talked to an OpenAI employee who met Connor something like three years ago, when the whole "reproducing GPT-2" thing came about. And he mostly remembered things like the model not having been benchmarked carefully enough. Sure, it did not perform nearly as well on a lot of metrics, though that's kind of missing the point of how this actually happened? As Connor explains, he did not know this would go anywhere, and spent like 2 weeks working on it, without lots of DL experience. He ended up being convinced by some MIRI people to not release it, since this would be establishing a "bad precedent".

I like to think that people can start with a wrong model of what is good and then update in the right direction. Yes, starting yet another "open-sourcing GPT-3" endeavor the next year is not evidence of having completely updated towards "let's minimize the risk of advancing capabilities research at all cost", though I do think that some fraction of people at EleutherAI truly care about alignment and just did not think that the marginal impact of "GPT-Neo/-J accelerating AI timelines" justified not publishing them at all.

My model for what happened for the EleutherAI story is mostly one of "when all you have is a hammer everything looks like a nail". Like, you've reproduced GPT-2 and you have access to lots of compute, why not try out GPT-3? And that's fine. Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3-like models that "we should not publish it after all"? Sure, that would have minimized the risk of accelerating timelines. Though when trying to put numbers on it below I find that it's not just "stop something clearly net negative", it's much more nuanced than that.

And after talking to one of the guys who worked on GPT-J for hours, talking to Connor for 3h, and then having to replay what he said multiple times while editing the video/audio etc., I kind of have a clearer sense of where they're coming from. I think a more productive way of making progress in the future is to look at what the positives and negatives were, and put numbers on what was plausibly net good and plausibly net bad, so we can focus on doing the good things in the future and maximize EV (not just minimize the risk of negatives!).

To be clear, I started the interview with a lot of questions about the impact of EleutherAI, and right now I have a lot more positive or mixed evidence for why it was not "certainly a net negative" (not saying it was certainly net positive). Here is my estimate of the impact of EleutherAI, where I try to measure things in my 80% likelihood interval for positive impact for aligning AI, where the unit is "-1" for the negative impact of publishing the GPT-3 paper. eg. (-2, -1) means: "an 80% chance that the impact was between 2x GPT-3 papers and 1x GPT-3 paper".

Mostly Negative
-- Publishing the Pile: (-0.4, -0.1) (AI labs, including top ones, use the Pile to train their models)
-- Making ML researchers more interested in scaling: (-0.1, -0.025) (GPT-3 spread the scaling meme, not EleutherAI)
-- The potential harm that might arise from the next models that might be open-sourced in the future using the current infrastructure: (-1, -0.1) (it does seem that they're open to open-sourcing more stuff, although plausibly more careful)

Mixed
-- Publishing GPT-J: (-0.4, 0.2) (easier to finetune than GPT-Neo, some people use it, though admittedly it was not SoTA when it was released. Top AI labs had supposedly better models. Interpretability / Alignment people, like at Redwood, use GPT-J / GPT-Neo models to interpret LLMs)

Mostly Positive
-- Making ML researchers more interested in alignment: (0.2, 1) (cf. the part when Connor mentions ML professors moving to alignment somewhat because of Eleuther) 
-- Four of the five core people of EleutherAI changing their career to work on alignment, some of them setting up Conjecture, with tacit knowledge of how these large models work: (0.25, 1)
-- Making alignment people more interested in prosaic alignment: (0.1, 0.5)
-- Creating a space with a strong rationalist and ML culture where people can talk about scaling and where alignment is high-status and alignment people can talk about what they care about in real-time + scaling / ML people can learn about alignment: (0.35, 0.8)

Adding these up I get (if you could just add confidence intervals, I know this is not how probability works) an 80% chance of the impact being in (-1, 3.275), so plausibly net good.
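
For transparency, here is a minimal sketch of the naive interval arithmetic above (category names shortened):

```python
# Naively summing the endpoints of each 80% interval listed above
# (not how probability works, as noted, but it reproduces the numbers).
intervals = [
    (-0.4, -0.1),    # Publishing the Pile
    (-0.1, -0.025),  # Spreading the scaling meme
    (-1, -0.1),      # Potential future open-sourced models
    (-0.4, 0.2),     # Publishing GPT-J
    (0.2, 1),        # ML researchers more interested in alignment
    (0.25, 1),       # Core members switching careers to alignment
    (0.1, 0.5),      # Alignment people more interested in prosaic alignment
    (0.35, 0.8),     # The EleutherAI space itself
]

low = sum(lo for lo, hi in intervals)
high = sum(hi for lo, hi in intervals)
print(round(low, 3), round(high, 3))  # -> -1.0 3.275
```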

Comment by Michaël Trazzi (mtrazzi) on Connor Leahy on Dying with Dignity, EleutherAI and Conjecture · 2022-07-22T20:55:04.710Z · LW · GW

In their announcement post they mention:

Mechanistic interpretability research in a similar vein to the work of Chris Olah and David Bau, but with less of a focus on circuits-style interpretability  and more focus on research whose insights can scale to models with many billions of parameters and larger. Some example approaches might be: 

  • Locating and editing factual knowledge in a transformer language model.
  • Using deep learning to automate deep learning interpretability - for example, training a language model to give semantic labels to neurons or other internal circuits.
  • Studying the high-level algorithms that models use to perform e.g, in-context learning or prompt programming.

Comment by Michaël Trazzi (mtrazzi) on AI Forecasting: One Year In · 2022-07-05T21:05:20.032Z · LW · GW

I believe the forecasts were aggregated around June 2021. When was GPT-2 finetune released? What about GPT-3 few-shot?

Re jumps in performance: jack clark has a screenshot on twitter about saturated benchmarks from the dynabench paper (2021), it would be interesting to make something up-to-date with MATH https://twitter.com/jackclarkSF/status/1542723429580689408

Comment by Michaël Trazzi (mtrazzi) on Raphaël Millière on Generalization and Scaling Maximalism · 2022-06-24T18:57:34.644Z · LW · GW

I think it makes sense (for him) to not believe AI X-risk is an important problem to solve (right now) if he believes that "fast enough" means "not in his lifetime", and he also puts a lot of moral weight on near-term issues. For completeness' sake, here are some claims more relevant to "not being able to solve the core problem".

1) From the part about compositionality, I believe he is making a point about the inability to generate an image that would contradict the training set distribution with the current deep learning paradigm

Generating an image for the caption, a horse riding on an astronaut. That was the example that Gary Marcus talked about, where a human would be able to draw that because a human understand the compositional semantics of that input and current models are struggling also because of distributional statistics and in the image to text example, that would be for example, stuff that we've been seeing with Flamingo from DeepMind, where you look at an image and that might represent something very unusual and you are unable to correctly describe the image in the way that's aligned with the composition of the image. So that's the parsing problem that I think people are mostly concerned with when it comes to compositionality and AI.

2) From the part about generalization, he is saying that there is some inability to build truly general systems. I do not agree with his claim, but if I were to steelman the argument it would be something like "even if it seems deep learning is making progress, Boston Dynamics is not using deep learning and there is no progress in the kind of generalization needed for the Wozniak test"

the Wozniak test, which was proposed by Steve Wozniak, which is building a system that can walk into a room, find the coffee maker and brew a good cup of coffee. So these are tasks or capacities that require adapting to novel situations, including scenarios that were not foreseen by the programmers where, because there are so many edge cases in driving, or indeed in walking into an apartment, finding a coffee maker of some kind and making a cup of coffee. There are so many potential edge cases. And, this very long tail of unlikely but possible situations where you can find yourself, you have to adapt more flexibly to this kind of thing.

[...]

But I don't know whether that would even make sense, given the other aspect of this test, which is the complexity of having a dexterous robot that can manipulate objects seamlessly and the kind of thing that we're still struggling with today in robotics, which is another interesting thing that, we've made so much progress with disembodied models and there are a lot of ideas flying around with robotics, but in some respect, the state of the art in robotics where the models from Boston Dynamics are not using deep learning, right?

Comment by Michaël Trazzi (mtrazzi) on The inordinately slow spread of good AGI conversations in ML · 2022-06-22T01:31:24.016Z · LW · GW

I have never thought of such a race. I think this comment is worth its own post.

Comment by Michaël Trazzi (mtrazzi) on Where I agree and disagree with Eliezer · 2022-06-22T01:26:15.600Z · LW · GW

Datapoint: I skimmed through Eliezer's post, but read this one from start to finish in one sitting. This post was for me the equivalent of reading the review of a book I haven't read, where you get all the useful points and nuance. I can't stress enough how useful that was for me. Probably the most insightful post I have read since "Are we in AI overhang".

Comment by Michaël Trazzi (mtrazzi) on Blake Richards on Why he is Skeptical of Existential Risk from AI · 2022-06-15T09:05:23.657Z · LW · GW

Thanks for bringing up the rest of the conversation. It is indeed unfortunate that I cut out certain quotes from their full context. For completeness' sake, here is the full excerpt without interruptions, including my prompts. Emphasis mine.

Michaël: Got you. And I think Yann LeCun’s point is that there is no such thing as AGI because it’s impossible to build something truly general across all domains.

Blake: That’s right. So that is indeed one of the sources of my concerns as well. I would say I have two concerns with the terminology AGI, but let’s start with Yann’s, which he’s articulated a few times. And as I said, I agree with him on it. We know from the no free lunch theorem that you cannot have a learning algorithm that outperforms all other learning algorithms across all tasks. It’s just an impossibility. So necessarily, any learning algorithm is going to have certain things that it’s good at and certain things that it’s bad at. Or alternatively, if it’s truly a Jack of all trades, it’s going to be just mediocre at everything. Right? So with that reality in place, you can say concretely that if you take AGI to mean literally good at anything, it’s just an impossibility, it cannot exist. And that’s been mathematically proven.

Blake: Now, all that being said, the proof for the no free lunch theorem, refers to all possible tasks. And that’s a very different thing from the set of tasks that we might actually care about. Right?

Michaël: Right.

Blake: Because the set of all possible tasks will include some really bizarre stuff that we certainly don’t need our AI systems to do. And in that case, we can ask, “Well, might there be a system that is good at all the sorts of tasks that we might want it to do?” Here, we don’t have a mathematical proof, but again, I suspect Yann’s intuition is similar to mine, which is that you could have systems that are good at a remarkably wide range of things, but it’s not going to cover everything you could possibly hope to do with AI or want to do with AI.

Blake: At some point, you’re going to have to decide where your system is actually going to place its bets as it were. And that can be as general as say a human being. So we could, of course, obviously humans are a proof of concept that way. We know that an intelligence with a level of generality equivalent to humans is possible and maybe it’s even possible to have an intelligence that is even more general than humans to some extent. I wouldn’t discount it as a possibility, but I don’t think you’re ever going to have something that can truly do anything you want, whether it be protein folding, predictions, managing traffic, manufacturing new materials, and also having a conversation with you about your grand’s latest visit that can’t be… There is going to be no system that does all of that for you.

Michaël: So we will have system that do those separately, but not at the same time?

Blake: Yeah, exactly. I think that we will have AI systems that are good at different domains. So, we might have AI systems that are good for scientific discovery, AI systems that are good for motor control and robotics, AI systems that are good for general conversation and being assistants for people, all these sorts of things, but not a single system that does it all for you.

Michaël: Why do you think that?

Blake: Well, I think that just because of the practical realities that one finds when one trains these networks. So, what has happened with, for example, scaling laws? And I said this to Ethan the other day on Twitter. What’s happened with scaling laws is that we’ve seen really impressive ability to transfer to related tasks. So if you train a large language model, it can transfer to a whole bunch of language-related stuff, very impressively. And there’s been some funny work that shows that it can even transfer to some out-of-domain stuff a bit, but there hasn’t been any convincing demonstration that it transfers to anything you want. And in fact, I think that the recent paper… The Gato paper from DeepMind actually shows, if you look at their data, that they’re still getting better transfer effects if you train in domain than if you train across all possible tasks.

Comment by Michaël Trazzi (mtrazzi) on Blake Richards on Why he is Skeptical of Existential Risk from AI · 2022-06-15T08:56:49.559Z · LW · GW

The goal of the podcast is to discuss why people believe certain things while discussing their inside views about AI. In this particular case, the guest gives roughly three reasons for his views:

  • the no free lunch theorem showing why you cannot have a model that outperforms all other learning algorithms across all tasks.
  • the results from the Gato paper where models specialized in one domain are better (in that domain) than a generalist agent (the transfer learning, if any, did not lead to improved performance).
  • society as a whole being similar to some "general intelligence", with humans being the individual constituents who have a more specialized intelligence

If I were to steelman his point about humans being specialized, I think he basically meant that what happened with society is we have many specialized agents, and that's probably what will happen as AIs automate our economy, as AIs specialized in one domain will be better than general ones at specific tasks.

He is also saying that, with respect to general agents, we have evidence from humans, the impossibility result from the no free lunch theorem, and basically no evidence for anything in between. For the current models, there is evidence for positive transfer for NLP tasks but less evidence for a broad set of tasks like in Gato.

The best version of the "different levels of generality" argument I can think of (though I don't buy it) goes something like: "The reason why humans are able to do impressive things like building smartphones is that they are multiple specialized agents who teach other humans what they have done before they die. No human alive today could build the latest iPhone from scratch, yet as a society we build it. It is not clear that a single ML model that is never turned off would be trivially capable of learning to do virtually everything needed to build a smartphone, spaceships, and other things that humans might not have discovered yet but that are necessary to expand through space, and even if it is a possibility, what will most likely happen (and sooner) is a society full of many specialized agents (cf. CAIS)."

Comment by Michaël Trazzi (mtrazzi) on Why all the fuss about recursive self-improvement? · 2022-06-13T19:57:20.604Z · LW · GW

For people who think that agenty models of recursive self-improvement do not fully apply to the current approach to training large neural nets, you could consider {Human,AI} systems already recursively self-improving through tools like Copilot.

Comment by Michaël Trazzi (mtrazzi) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-09T08:39:49.177Z · LW · GW

I believe the Counterfactual Oracle uses the same principle

Comment by Michaël Trazzi (mtrazzi) on Shortform · 2022-06-06T10:09:43.461Z · LW · GW

I think the best way to look at it is climate change way before it was mainstream

Comment by Michaël Trazzi (mtrazzi) on On saving one's world · 2022-05-17T21:01:43.811Z · LW · GW

I found the concept of flailing and becoming what works useful.

I think the world will be saved by a diverse group of people. Some will be high-integrity groups, others will be playful intellectuals, but the most important ones (that I think we currently need the most) will lead, take risks, explore new strategies.

In that regard, I believe we need more posts like lc's containment strategy one, or the other one about pulling the fire alarm for AGI. Even if those plans are different from the ones the community has tried so far. Integrity alone will not save the world. A more diverse portfolio might.

Comment by Michaël Trazzi (mtrazzi) on Why I'm Optimistic About Near-Term AI Risk · 2022-05-17T13:40:58.887Z · LW · GW

Note: I updated the parent comment to take into account interest rates.

In general, the way to mitigate trust issues would be to use an escrow, though when betting on doom-ish scenarios there would be little benefit in having $1000 in escrow if I "win".

For anyone reading this who also thinks that it would need to be >$2000 to be worth it, I am happy to give $2985 at the end of 2032, i.e. an additional 10% on top of the average annual return of the S&P 500 (ie 1.1 * (1.105^10 * 1000)), if that sounds less risky than the SPY ETF bet.
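
Quick sanity check of that figure (assuming a 10.5% average annual return, as in the formula above):

```python
# 10 years of compounding at an assumed 10.5% annual S&P 500 return,
# plus an extra 10% on top of the final value.
principal = 1000
annual_return = 0.105
years = 10

payout = 1.1 * principal * (1 + annual_return) ** years
print(round(payout))  # -> 2985
```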

Comment by Michaël Trazzi (mtrazzi) on Why I'm Optimistic About Near-Term AI Risk · 2022-05-17T09:48:47.163Z · LW · GW

For any of those (supposedly) >50% of respondents claiming a <10% probability, I am happy to take a 1:10 odds $1000 bet on:

"by the end of 2032, fewer than a million humans are alive on the surface of the earth, primarily as a result of AI systems not doing/optimizing what the people deploying them wanted/intended"

Where, similar to Bryan Caplan's bet with Yudkowsky, I get paid like $1000 now, and at the end of 2032 I give it back, adding 100 dollars.

(Given inflation and interest, this seems like a bad deal for the one giving the money now, though I find it hard to predict 10y inflation and I do not want to have extra pressure to invest those $1000 for 10y. If someone has another deal in mind that would sound more interesting, do let me know here or by DM).

To make the bet fair, the size of the bet would be the equivalent of the value in 2032 of $1000 worth of SPY ETF bought today (400.09 at May 16 close). And to mitigate the issue of not being around to receive the money, I would receive a payment of $1000 now. If I lose, I give back whatever $1000 of SPY ETF bought today is worth in 2032, adding 10% to that value.

Comment by Michaël Trazzi (mtrazzi) on Why I'm Optimistic About Near-Term AI Risk · 2022-05-17T09:31:40.178Z · LW · GW

Thanks for the survey. A few nitpicks:
- the survey you mention is ~1y old (May 3-May 26 2021). I would expect those researchers to have updated from the scaling laws trend continuing with Chinchilla, PaLM, Gato, etc. (Metaculus at least did update significantly, though one could argue that people taking the survey at CHAI, FHI, DeepMind etc. would be less surprised by the recent progress.)

- I would prefer the question to mention "1M humans alive on the surface of the earth" to avoid people surviving inside "mine shafts" or on Mars/the Moon (similar to the Bryan Caplan / Yudkowsky bet).

Comment by Michaël Trazzi (mtrazzi) on Why I'm Optimistic About Near-Term AI Risk · 2022-05-17T08:41:29.650Z · LW · GW

I can't see this path leading to high existential risk in the next decade or so.

Here is my write-up for a reference class of paths that could lead to high existential risk this decade. I think such paths are not hard to come up with and I am happy to pay a bounty of $100 for someone else to sit for one hour and come up with another story for another reference class (you can send me a DM).

Comment by Michaël Trazzi (mtrazzi) on Why I'm Optimistic About Near-Term AI Risk · 2022-05-17T08:37:05.306Z · LW · GW

Even if the Tool AIs are not dangerous by themselves, they will foster productivity. (You say it yourself: "These specialized models will be oriented towards augmenting human productivity"). There are already many more people working in AI than in the 2010s, and those people are much more productive. This trend will accelerate, because AI benefits compound (eg. using Copilot to write the next Copilot) and the more ML applications automate the economy, the more investment in AI we will observe.

Comment by Michaël Trazzi (mtrazzi) on "A Generalist Agent": New DeepMind Publication · 2022-05-13T07:58:57.416Z · LW · GW

from the lesswrong docs

An Artificial general intelligence, or AGI, is a machine capable of behaving intelligently over many domains. The term can be taken as a contrast to narrow AI, systems that do things that would be considered intelligent if a human were doing them, but that lack the sort of general, flexible learning ability that would let them tackle entirely new domains. Though modern computers have drastically more ability to calculate than humans, this does not mean that they are generally intelligent, as they have little ability to invent new problem-solving techniques, and their abilities are targeted in narrow domains.

If we consider only the first sentence, then yes. The rest of the paragraph points to something like "being able to generalize to new domains". Not sure if Gato counts. (NB: this is just a LW tag, not a full-fledged definition.)

Comment by Michaël Trazzi (mtrazzi) on Why Copilot Accelerates Timelines · 2022-05-01T18:37:32.331Z · LW · GW

the first two are about data, and as far as I know compilers do not use machine learning on data.

the third one could technically apply to compilers, though I think in ML there is a feedback loop "impressive performance -> investments in scaling -> more research", whereas you cannot just throw more compute at compilers to increase their performance (and the results are less in the mainstream, less of a public PR thing)

Comment by Michaël Trazzi (mtrazzi) on Why Copilot Accelerates Timelines · 2022-04-28T08:08:42.739Z · LW · GW

Well, I agree that if the two worlds I had in mind were 1) foom without real AI progress beforehand 2) continuous progress, then seeing more continuous progress from increased investments should indeed update me towards 2).

The key parameter here is the substitutability between capital and labor: in what sense is human labor the bottleneck, or is capital the bottleneck? From the different substitutability assumptions you can infer different growth trajectories. (For a paper / video on this see the last paragraph here).
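
To illustrate what "substitutability" means here, a toy CES production function (one standard formalization, not necessarily the exact equations from the linked paper):

```python
# CES production function: rho close to 1 means capital and labor are near
# perfect substitutes; very negative rho means strong complements, so the
# scarcer factor (here labor) bottlenecks output.
def ces_output(capital, labor, alpha=0.5, rho=0.5):
    return (alpha * capital**rho + (1 - alpha) * labor**rho) ** (1 / rho)

print(ces_output(100, 1, rho=0.9))   # high substitutability: more capital keeps raising output
print(ces_output(100, 1, rho=-5.0))  # low substitutability: output stays close to labor's level
```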

The world in which dalle-2 happens and people start using Github Copilot looks to me like a world where human labour is substitutable by AI labour, which right now essentially means being part of the Github Copilot open beta, but in the future might look like capital (paying for the product or investing in building the technology yourself). My intuition right now is that big companies are more bottlenecked by ML talent than by capital (cf. the "are we in ai overhang" post explaining how much more capital Google could invest in AI).

Comment by Michaël Trazzi (mtrazzi) on Why Copilot Accelerates Timelines · 2022-04-27T08:21:51.364Z · LW · GW

Thanks for the pointer. Any specific section / sub-section I should look into?