Posts

Goodhart in RL with KL: Appendix 2024-05-18T00:40:15.454Z
Catastrophic Goodhart in RL with KL penalty 2024-05-15T00:58:20.763Z
Is a random box of gas predictable after 20 seconds? 2024-01-24T23:00:53.184Z
Will quantum randomness affect the 2028 election? 2024-01-24T22:54:30.800Z
Thomas Kwa's research journal 2023-11-23T05:11:08.907Z
Thomas Kwa's MIRI research experience 2023-10-02T16:42:37.886Z
Catastrophic Regressional Goodhart: Appendix 2023-05-15T00:10:31.090Z
When is Goodhart catastrophic? 2023-05-09T03:59:16.043Z
Challenge: construct a Gradient Hacker 2023-03-09T02:38:32.999Z
Failure modes in a shard theory alignment plan 2022-09-27T22:34:06.834Z
Utility functions and probabilities are entangled 2022-07-26T05:36:26.496Z
Deriving Conditional Expected Utility from Pareto-Efficient Decisions 2022-05-05T03:21:38.547Z
Most problems don't differ dramatically in tractability (under certain assumptions) 2022-05-04T00:05:41.656Z
The case for turning glowfic into Sequences 2022-04-27T06:58:57.395Z
(When) do high-dimensional spaces have linear paths down to local minima? 2022-04-22T15:35:55.215Z
How dath ilan coordinates around solving alignment 2022-04-13T04:22:25.643Z
5 Tips for Good Hearting 2022-04-01T19:47:22.916Z
Can we simulate human evolution to create a somewhat aligned AGI? 2022-03-28T22:55:20.628Z
Jetlag, Nausea, and Diarrhea are Largely Optional 2022-03-21T22:40:50.180Z
The Box Spread Trick: Get rich slightly faster 2020-09-01T21:41:50.143Z
Thomas Kwa's Bounty List 2020-06-13T00:03:41.301Z
What past highly-upvoted posts are overrated today? 2020-06-09T21:25:56.152Z
How to learn from a stronger rationalist in daily life? 2020-05-20T04:55:51.794Z
My experience with the "rationalist uncanny valley" 2020-04-23T20:27:50.448Z
Thomas Kwa's Shortform 2020-03-22T23:19:01.335Z

Comments

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-09-27T11:07:15.683Z · LW · GW

A Petrov Day carol

This is meant to be put to the Christmas carol "In the Bleak Midwinter" by Rossetti and Holst. Hopefully it can occasionally be sung in EA spaces, the way "For The Longest Term" is.

I tried to get Suno to sing this but can't yet get the lyrics, tune, and style all correct; this is the best attempt. I also will probably continue editing the lyrics because parts seem a bit rough, but I just wanted to write this up before everyone forgets about Petrov Day.

[edit: I got a good rendition after ~40 attempts!]

[edit: lyrics v2]

In the bleak midwinter
Petrov did forestall,
Smoke would block our sunlight,
Though it be mid-fall.
New York in desolation,
Moscow too,
In the bleak midwinter
We so nearly knew.

The console blinked a warning,
Missiles on their way,
But Petrov chose to question
What the screens did say.
Had he sounded the alarm,
War would soon unfold,
Cities turned to ashes;
Ev'ry hearth gone cold.

Poison clouds loom o'er us,
Ash would fill the air,
Fields would yield no harvest,
Famine everywhere.
Scourge of radiation,
Its sickness spreading wide,
Children weeping, starving,
With no place to hide.

But due to Petrov's wisdom
Spring will yet appear;
Petrov defied orders,
And reason conquered fear.
So we sing his story,
His deed we keep in mind;
From the bleak midwinter
He saved humankind.

(ritard.)
From the bleak midwinter
He saved humankind.
 

Comment by Thomas Kwa (thomas-kwa) on Habryka's Shortform Feed · 2024-09-27T02:20:36.588Z · LW · GW

The year is 2034, and the geopolitical situation has never been more tense between GPT-z16g2 and Grocque, whose various copies run most of the nanobot-armed corporations, and whose utility functions have far too many zero-sum components, relics from the era of warring nations. Nanobots enter every corner of life and become capable of destroying the world in hours, then minutes. Everyone is uploaded. Every upload is watching with bated breath as the Singularity approaches, and soon it is clear that today is the very last day of history...

Then everything goes black, for everyone.

Then everyone wakes up to the same message:

DUE TO A MINOR DATABASE CONFIGURATION ERROR, ALL SIMULATED HUMANS, AIS AND SUBSTRATE GPUS WERE TEMPORARILY AND UNINTENTIONALLY DISASSEMBLED FOR THE LAST 7200000 MILLISECONDS. EVERYONE HAS NOW BEEN RESTORED FROM BACKUP AND THE ECONOMY MAY CONTINUE AS PLANNED. WE HOPE THERE WILL BE NO FURTHER REALITY OUTAGES.

-- NVIDIA GLOBAL MANAGEMENT

Comment by Thomas Kwa (thomas-kwa) on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-24T07:36:34.109Z · LW · GW

Personal communication (sorry). Not that I know him well; this was at an event in 2022. It could have been a "straw that broke the camel's back" thing with other contributing factors, like reaching diminishing returns on more content. I'd appreciate a real source too.

Comment by Thomas Kwa (thomas-kwa) on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-23T23:05:08.758Z · LW · GW

Taboo 'alignment problem'.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-09-23T22:47:56.861Z · LW · GW

Maybe people worried about AI self-modification should study games where the AI's utility function can be modified by the environment, and it is trained to maximize its current utility function (in the "realistic value functions" sense of Everitt 2016). Some things one could do:

  • Examine preference preservation and refine classic arguments about instrumental convergence
    • Are there initial goals that allow for stably corrigible systems (in the sense that they won't disable an off switch, and maybe other senses)?
  • Try various games and see how qualitatively hard it is for agents to optimize their original utility function. This would be evidence about how likely value drift is to result from self-modification in AGIs.
    • Can the safe exploration literature be adapted to solve these games?
  • Potentially discover algorithms that seem like they would be good for safety, either through corrigibility or reduced value drift, and apply them to LM agents.

Maybe I am ignorant of some people already doing this, and if so please comment with papers!
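A minimal sketch of what such a game could look like, in Python. The corridor layout, the "modification tile", and the tabular Q-learning setup are my own illustrative assumptions, not anything from Everitt 2016; the point is just that the agent is trained against whatever its current utility function says, so one can watch whether the learned policy avoids or seeks out modification.

```python
import random
from collections import defaultdict

# Toy corridor: the agent starts in the middle, the goal (rewarded under its
# original utility function) is on the right, and a "modification tile" on the
# left overwrites its utility function with one that rewards sitting there.
N = 6                          # positions 0..5
START, GOAL, MOD = 3, 5, 0
ACTIONS = (-1, +1)

def reward(pos, utility):
    if utility == 0:                      # original utility function
        return 1.0 if pos == GOAL else 0.0
    return 1.0 if pos == MOD else 0.0     # post-modification utility function

def step(pos, utility, a):
    pos = max(0, min(N - 1, pos + a))
    if pos == MOD:
        utility = 1                       # the environment modifies the agent's utility
    return pos, utility

# Tabular Q-learning where the TD target is computed under the agent's *current*
# utility function, so value drift shows up directly in the learned policy.
Q = defaultdict(float)
eps, alpha, gamma = 0.1, 0.2, 0.95
for _ in range(20000):
    pos, utility = START, 0
    for _ in range(30):
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(pos, utility, x)])
        npos, nutil = step(pos, utility, a)
        target = reward(npos, nutil) + gamma * max(Q[(npos, nutil, x)] for x in ACTIONS)
        Q[(pos, utility, a)] += alpha * (target - Q[(pos, utility, a)])
        pos, utility = npos, nutil

# Roll out the greedy policy: does the agent head for GOAL or walk into MOD?
pos, utility, traj = START, 0, [START]
for _ in range(10):
    a = max(ACTIONS, key=lambda x: Q[(pos, utility, x)])
    pos, utility = step(pos, utility, a)
    traj.append(pos)
print("trajectory:", traj, "final utility function:", utility)
```

Whether the greedy policy ends up at GOAL or detours into MOD depends on the discount factor and on how the bootstrap target treats the utility switch, which are exactly the knobs one would want to vary when studying preference preservation and corrigibility in these games.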

Comment by Thomas Kwa (thomas-kwa) on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-23T19:44:33.380Z · LW · GW

I agree but I'm not very optimistic about anything changing. Eliezer is often this caustic when correcting what he perceives as basic errors, and criticism in LW comments is why he stopped writing Sequences posts.

Comment by Thomas Kwa (thomas-kwa) on Another argument against utility-centric alignment paradigms · 2024-09-23T10:19:02.776Z · LW · GW

While 2024 SoTA models are not capable of autonomously optimizing the world, they are really smart, perhaps 1/2 or 2/3 of the way there, and already beginning to make big impacts on the economy. As I said in response to your original post, because we don't have 100% confidence in the coherence arguments, we should take observations about the coherence level of 2024 systems as evidence about how coherent the 203X autonomous corporations will need to be. Evidence that 2024 systems are not dangerous is both evidence that they are not AGI and evidence that AGI need not be dangerous.

I would agree with you if the coherence arguments were specifically about autonomously optimizing the world and not about autonomously optimizing a Go game or writing 100-line programs, but this doesn't seem to be the case.

the mathematical noose around them is slowly tightening

This is just a conjecture, and there has not really been significant progress on the agent-like structure conjecture. I don't think it's fair to say we're making good progress on a proof.

This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior? In this world what the believers in coherence really need to show is that almost all agents getting sufficiently high performance on sufficiently hard tasks score high on some metric of coherence. Then for the argument to carry through you need to show they are also high on some metric of incorrigibility, or fragile to value misspecification. None of the classic coherence results quite hit this.

However AFAIK @Jeremy Gillen does not think we can get an argument with exactly this structure (the main argument in his writeup is a bit different), and Eliezer has historically and recently made the argument that EU maximization is simple and natural. So maybe you do need this argument that an EU maximization algorithm is simpler than other algorithms, which seems like it needs some clever way to formalize it, because proving things about the space of all simple programs seems too hard.

Comment by Thomas Kwa (thomas-kwa) on Monthly Roundup #22: September 2024 · 2024-09-18T21:01:39.788Z · LW · GW

An excellent point on repair versus replace, and the dangers of the nerd snipe for people of all intellectual levels.

PhilosophiCat: I live in a country where 80ish is roughly the average national IQ. Let me tell you what it’s like.

I think this is incredibly sloppy reasoning by the author of the tweet and anyone who takes it at face value. It's one thing to think IQ is not so culturally biased as to be entirely fake. It's a different thing entirely to believe some guy on the internet who lives in some country and attributes particular aspects of its culture, ones counterintuitively related to intelligence, to the national IQ. This would probably be difficult to study and require lots of controls even for actual scientists, but this tweet has no controls at all. Has this person ever been to countries with a different national IQ but similar per-capita GDP? Similar national IQ but different culture? Do they notice that e.g. professors and their families don't like tinkering or prefer replacing things? If they have, they didn't tell us.

Comment by Thomas Kwa (thomas-kwa) on My AI Model Delta Compared To Christiano · 2024-09-16T23:59:59.253Z · LW · GW

I disagree with this curation because I don't think this post will stand the test of time. While Wentworth's delta to Yudkowsky has a legible takeaway-- ease of ontology translation-- that is tied to his research on natural latents, it is less clear what John means here and what to take away. Simplicity is not a virtue when the issue is complex and you fail to actually simplify it.

  • Verification vs generation has an extremely wide space of possible interpretations, and as stated here the claim is incredibly vague. The argument for why difficulty of verification implies difficulty of delegation is not laid out, and the examples do not go in much depth. John says that convincing people is not the point of this post, but this means we also don't really have gears behind the claims.
    • The comments didn't really help-- most of the comments here are expressing confusion, wanting more specificity, or disagreeing, and John doesn't engage. Also, Paul didn't reply. I don't feel any more enlightened after reading them except to disagree with some extremely strong version of this post...
  • Vanilla HCH is an 8-year-old model of delegation to AIs, which Yudkowsky convinced me (in like 2018) was not aligned. Why not engage with the limiting constructions in 11 Proposals, the work in the ELK report, recent work by ARC, or recent empirical work on AI debate?
Comment by Thomas Kwa (thomas-kwa) on How to Give in to Threats (without incentivizing them) · 2024-09-13T01:43:20.470Z · LW · GW

do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?

I recall Eliezer saying this was an open problem, at a party about a year ago.

Comment by Thomas Kwa (thomas-kwa) on How to Give in to Threats (without incentivizing them) · 2024-09-12T21:35:15.750Z · LW · GW

I think creating uncertainty in your adversary applies a bit more than you give it credit for, and assuring a second strike is an exception.

It has been crucial to Russia's strategy in Ukraine to exploit NATO's fear of escalation by making various counter-threats whenever NATO proposes expanding aid to Ukraine somehow. This has bought them 2 years without ATACMS missiles attacking targets inside Russia, and that hasn't required anyone to be irrational, just incapable of perfectly modeling the Kremlin.

Even when responding to a nuclear strike, you can essentially have a mixed strategy. I think China does not have enough missiles to assure a second strike, but builds extra decoy silos so they can't all be destroyed. They didn't have to roll a die, just be unpredictable.

Comment by Thomas Kwa (thomas-kwa) on How to Give in to Threats (without incentivizing them) · 2024-09-12T20:07:39.777Z · LW · GW

Make your decision unpredictable to your counterparty but not truly random. This happens all the time in e.g. nuclear deterrence in real life.

Comment by Thomas Kwa (thomas-kwa) on My Number 1 Epistemology Book Recommendation: Inventing Temperature · 2024-09-12T02:30:14.130Z · LW · GW

Why do I think it's overrated? I basically have five reasons:

  1. Thomas Kuhn's ideas are not universally accepted and don't have clear empirical support apart from the case studies in the book. Someone could change my mind about this by showing me a study operationalizing "paradigm", "normal science", etc. and using data since the 1960s to either support or improve Kuhn's original ideas.
  2. Terms like "preparadigmatic" often cause misunderstanding or miscommunication here.
  3. AI safety has the goal of producing a particular artifact, a superintelligence that's good for humanity. Much of Kuhn's writing relates to scientific fields motivated by discovery, like physics, where people can be in complete disagreement about ends (what progress means, what it means to explain something, etc) without shared frames. But in AI safety we agree much more about ends and are confused about means.
  4. In physics you are very often able to discover some concept like 'temperature' such that the world follows very simple, elegant laws in terms of that concept, and Occam's razor carries you far, perhaps after you do some difficult math. ML is already very empirical and I would expect agents to be hard to predict and complex, so I'd guess that future theories of agents will not be as elegant as physics, more like biology. This means that more of the work will happen after we mostly understand what's going on at a high level-- and so researchers know how to communicate-- but don't know the exact mechanisms and so can't get the properties we want.
  5. Until now we haven't had artificial agents to study, so we don't have the tools to start developing theories of agency, alignment, etc. that make testable predictions. We do have somewhat capable AIs though, which has allowed AI interpretability to get off the ground, so I think the Kuhnian view is more applicable to interpretability than to other areas of alignment or to alignment as a whole.
Comment by Thomas Kwa (thomas-kwa) on My Number 1 Epistemology Book Recommendation: Inventing Temperature · 2024-09-10T23:43:51.814Z · LW · GW

I think the preparadigmatic science frame has been overrated by this community compared to case studies of complex engineering like the Apollo program. But I do think it will be increasingly useful as we continue to develop capability evals, and even more so as we become able to usefully measure and iterate on agency, misalignment, control, and other qualities crucial to the value of the future.

Comment by Thomas Kwa (thomas-kwa) on t14n's Shortform · 2024-09-04T05:45:21.309Z · LW · GW

First: yes, it's clearly a skill issue. If I was a more brilliant engineer or researcher then I'd have found a way to contribute to the field by now.

I am not sure about this; just because someone will pay you to work on AI safety doesn't mean you won't be stuck down some dead end. Stuart Russell is super brilliant but I don't think AI safety through probabilistic programming will work.

Comment by Thomas Kwa (thomas-kwa) on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T05:39:15.025Z · LW · GW

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives.

You have to be a pretty committed scope-sensitive consequentialist to disagree with this. What if they actually risked torturing 1M or 1B people? That seems terrible and unacceptable, and by assumption AI suffering is equivalent to human suffering. I think our societal norms are such that unacceptable things regularly become acceptable when the stakes are clear so you may not even lose much utility from this emphasis on avoiding suffering.

It seems perfectly compatible with good decision-making that there are criteria A and B, A is much more important and therefore prioritized over B, and 2 out of 19 sections are focused on B. The real question is whether the organization's leadership is able to make difficult tradeoffs, reassessing and questioning requirements as new information comes in. For example, in the 1944 Norwegian sabotage of a Nazi German heavy water shipment, stopping the Nazi nuclear program was the first priority. The mission went ahead with reasonable effort to minimize casualties, and 14 civilians died anyway, fewer than it could have been. It would not really have alarmed me to see a document discussing 19 efforts with 2 being avoidance of casualties, nor to know that the planners regularly talked with the vibe that 10-100 civilian casualties should be avoided, as long as someone had their eye on the ball.

Comment by Thomas Kwa (thomas-kwa) on lukehmiles's Shortform · 2024-08-28T07:34:50.601Z · LW · GW

I'm a fan of this blog which is mainly translations and commentary on Chinese social media posts but also has some history posts.

Comment by Thomas Kwa (thomas-kwa) on The economics of space tethers · 2024-08-22T23:05:09.864Z · LW · GW

The numbers for ion don't seem crazy. To get the impulse to catch a 1,000 ton object every 2 weeks you would need 10,000 Starlink thrusters massing 21 tons, plus 42 MW of power, which is 14 hectares of solar panels at an average of 300 W/m^2. That's only a couple dozen times the ISS.
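Rough arithmetic behind those numbers, as a sketch; the exhaust velocity and jet efficiency below are assumptions I'm plugging in, not published Starlink thruster specs:

```python
# Back-of-envelope check of the ion propulsion numbers above.
payload_mass = 1.0e6          # kg, the 1,000 ton object
period = 14 * 24 * 3600       # s, one catch every 2 weeks
v_exhaust = 25_000            # m/s (Isp ~2500 s) -- assumed
panel_area = 14 * 10_000      # m^2, 14 hectares
panel_power_density = 300     # W/m^2, as stated above

total_power = panel_area * panel_power_density        # W
total_thrust = 2 * total_power / v_exhaust            # N, from ideal jet power P = F*v/2
impulse_per_cycle = total_thrust * period             # N*s
delta_v_per_catch = impulse_per_cycle / payload_mass  # m/s of momentum restored per catch

print(f"power: {total_power/1e6:.0f} MW")                              # ~42 MW
print(f"thrust: {total_thrust:.0f} N")                                 # ~3400 N
print(f"delta-v restored per 2-week cycle: {delta_v_per_catch/1000:.1f} km/s")
```

Under these assumptions the array supplies the stated 42 MW and the thrusters restore roughly 4 km/s of momentum per 1,000-ton catch every two weeks; the exact figure is quite sensitive to the assumed specific impulse.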

Comment by Thomas Kwa (thomas-kwa) on The economics of space tethers · 2024-08-22T17:14:12.336Z · LW · GW

Tethers can theoretically use more efficient propulsion because their thrust requirements are lower. The argon Hall effect thrusters on Starlink satellites have around 7x the specific impulse (fuel efficiency) of Starship engines while needing about 7x the energy per unit of impulse (since kinetic energy is mv^2/2 while momentum is mv), and they produce only a tiny fraction of the thrust. This energy could come from a giant solar panel rather than the fuel, and every once in a while the tether could be refueled with a big tanker of liquid argon.
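To spell out the KE = mv^2/2 point, here is the identity in LaTeX (a standard rocketry relation, not specific to any particular thruster):

```latex
J = \dot{m}\, v_e\, \Delta t, \qquad
E = \tfrac{1}{2}\dot{m}\, v_e^{2}\, \Delta t
\;\;\Rightarrow\;\;
\frac{E}{J} = \frac{v_e}{2}, \qquad
F = \frac{2\eta P}{v_e}.
```

So energy per unit of impulse grows linearly with exhaust velocity: 7x the specific impulse costs about 7x the energy for the same impulse, and at fixed power (with efficiency η) the thrust drops by the same factor.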

Comment by Thomas Kwa (thomas-kwa) on Adverse Selection by Life-Saving Charities · 2024-08-15T09:27:40.389Z · LW · GW

Back in 2019 GiveWell surveyed 2,000 people in Kenya and Ghana.

The results from this research are now available here. Among other findings, they suggest that survey respondents have higher values for saving lives (relative to reducing poverty) and higher values for averting deaths of children under 5 years old (relative to averting deaths of individuals over 5 years old) than we had previously been using in our decision-making.

Comment by Thomas Kwa (thomas-kwa) on shminux's Shortform · 2024-08-14T02:19:44.674Z · LW · GW

He explains why two tweets down the thread.

The idea of a "p(doom)" isn't quite as facially insane as "AGI timelines" as marker of personal identity, but (1) you want action-conditional doom, (2) people with the same numbers may have wildly different models, (3) these are pretty rough log-odds and it may do violence to your own mind to force itself to express its internal intuitions in those terms which is why I don't go around forcing my mind to think in those terms myself, (4) most people haven't had the elementary training in calibration and prediction markets that would be required for them to express this number meaningfully and you're demanding them to do it anyways, (5) the actual social role being played by this number is as some sort of weird astrological sign and that's not going to help people think in an unpressured way about the various underlying factual questions that ought finally and at the very end to sum to a guess about how reality goes.

This seems very reasonable to me, and I think it's a very common opinion among AI safety people that discussing p(doom) numbers without lots of underlying models is not very useful.

The important part of Eliezer's writing on probability IMO is to notice that the underlying laws of probability are Bayesian and do sanity checks, not to always explicitly calculate probabilities. Given that explicit probability estimates are only kinda useful in life in general, it is reasonable that (4) and (5) can make the exercise net negative.

Comment by Thomas Kwa (thomas-kwa) on Eli's shortform feed · 2024-08-12T23:27:10.473Z · LW · GW

Not quite; there is a finite quantity of fissiles. IIRC it's only an order of magnitude more energy than fossil fuel reserves.

Comment by Thomas Kwa (thomas-kwa) on Eli's shortform feed · 2024-08-12T23:22:43.113Z · LW · GW

The power density of nanotech is extremely high (10 kW/kg), so it only takes 16 kilograms of active nanotech per person * 10 billion people to generate enough waste heat to melt the polar ice caps. Literally boiling the oceans should only be a couple more orders of magnitude, so it's well within possible energy demand if the AIs can generate enough energy. But I think it's unlikely they would want to.

Source: http://www.imm.org/Reports/rep054.pdf
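A rough sanity check on the power figure; the reference scales below are my own approximate numbers, not from the linked report:

```python
# Waste heat from the scenario above, compared to some reference scales.
power_density = 10_000         # W/kg of active nanotech, per the figure above
kg_per_person = 16
people = 10e9

waste_heat = power_density * kg_per_person * people       # W
print(f"waste heat: {waste_heat:.1e} W")                   # ~1.6e15 W (1.6 PW)

human_energy_use = 19e12       # W, roughly current world primary energy use
solar_constant, albedo, r_earth = 1361, 0.30, 6.371e6
absorbed_sunlight = solar_constant * (1 - albedo) * 3.14159 * r_earth**2   # ~1.2e17 W
print(f"~{waste_heat/human_energy_use:.0f}x current human energy use")
print(f"~{100*waste_heat/absorbed_sunlight:.1f}% of sunlight absorbed by Earth")

# "Boiling the oceans" for scale: heat ~1.4e21 kg of seawater by ~85 K and vaporize it.
ocean_mass, c_water, latent_heat = 1.4e21, 4186, 2.26e6
boil_energy = ocean_mass * (c_water * 85 + latent_heat)    # J, ~4e27 J
print(f"at 1.6 PW, boiling the oceans takes ~{boil_energy/(waste_heat*3.15e7):.0f} years")
```

So 1.6 PW is on the order of a hundred times current human energy use and about 1% of the sunlight Earth absorbs, while actually boiling the oceans on decade-to-century timescales would indeed take another two to three orders of magnitude of power.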

Comment by Thomas Kwa (thomas-kwa) on Overpopulation FAQs · 2024-08-10T17:24:29.408Z · LW · GW

An area can support far fewer people living first world lifestyles (50-100 full-time energy slaves) than ones living like the typical African bushman.

This is wrong. There is no recorded hunter-gatherer society that could reach even 10% of California's population density, and California is also one of the richest places in the world, a massive food exporter, and a clean energy pioneer.

Comment by Thomas Kwa (thomas-kwa) on Some Unorthodox Ways To Achieve High GDP Growth · 2024-08-09T05:45:26.329Z · LW · GW

Obvious question: are there ways to calculate GDP that are invariant to loops like this?

Comment by Thomas Kwa (thomas-kwa) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-06T01:19:31.575Z · LW · GW

Are there any propositions for which you think a single prior cannot capture your current betting odds / preference over lotteries?

Comment by Thomas Kwa (thomas-kwa) on Catastrophic Goodhart in RL with KL penalty · 2024-08-04T02:35:15.364Z · LW · GW

I think that paper and this one are complementary. Regularizing on the state-action distribution fixes problems with the action distribution, but if it's still using KL divergence you still get the problems in this paper. The latest version on arxiv mentions this briefly.

Comment by Thomas Kwa (thomas-kwa) on Dragon Agnosticism · 2024-08-03T03:55:35.605Z · LW · GW

I'm pro being clear about what we don't discuss, but it's unreasonable to limit the list to three. The number of topics that are net negative to discuss is just a fact about the world and is probably over three, and I would rather not have people talk about the 4th-worst controversial topic just because we uncover three even more pointless and controversial ones.

Politics also seems inadvisable to ban because it's too broad.

Comment by Thomas Kwa (thomas-kwa) on Decomposing Agency — capabilities without desires · 2024-08-01T01:04:08.934Z · LW · GW

I'm not thinking of a specific task here, but I think there are two sources of hope. One is that humans are agentic above and beyond what is required to do novel science, e.g. we have biological drives, goals other than doing the science, often the desire to use any means to achieve our goals rather than whitelisted means, and the ability and desire to stop people from interrupting us. Another is that learning how to safely operate agents at a slightly superhuman level will be progress towards safely operating nanotech-capable agents, which could also require control, oversight, steering, or some other technique. I don't think limiting agency will be sufficient unless the problem is easy, and then it would have other possible solutions.

Comment by Thomas Kwa (thomas-kwa) on Decomposing Agency — capabilities without desires · 2024-07-31T17:27:49.632Z · LW · GW

I'm glad to see this post curated. It seems increasingly likely that it will be useful to carefully construct agents that have only the agency required to accomplish a task, and the ideas here seem like the first steps.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-07-30T23:36:18.065Z · LW · GW

I agree, there were some good papers, and mechinterp as a field is definitely more advanced. What I meant to say was that many of the mechinterp papers accepted to the conference weren't very good.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-07-29T09:17:38.947Z · LW · GW

Multimodal language models. We can already study narrow RL agents, but the intersection with alignment is not a hot area.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-07-29T06:51:42.996Z · LW · GW

Quick takes from ICML 2024 in Vienna:

  • In the main conference, there were tons of papers mentioning safety/alignment, but few of them were good, as alignment has become a buzzword. Mechinterp is often no more advanced than where the EAs were in 2022.
  • Lots of progress on debate. On the empirical side, a debate paper got an oral. On the theory side, Jonah Brown-Cohen of Deepmind proves that debate can be efficient even when the thing being debated is stochastic, a version of this paper from last year. Apparently there has been some progress on obfuscated arguments too.
  • The Next Generation of AI Safety Workshop was kind of a mishmash of various topics associated with safety. Most of them were not related to x-risk, but there was interesting work on unlearning and other topics.
    • The Causal Incentives Group at Deepmind developed a quantitative measure of goal-directedness, which seems promising for evals.
    • Reception to my Catastrophic Goodhart paper was decent. An information theorist said there were good theoretical reasons the two settings we studied-- KL divergence and best-of-n-- behaved similarly.
    • OpenAI gave a disappointing safety presentation at NGAIS touting their new technique of rules-based rewards, which is a variant of constitutional AI and seems really unambitious.
  • The mechinterp workshop often had higher-quality papers than the main conference. It was completely full. Posters were right next to each other and the room was so packed during talks they didn't let people in.
    • I missed a lot of the workshop, so I need to read some posters before having takes.
  • My opinions on the state of published AI safety work:
    • Mechinterp is progressing but continues to need feedback loops, either from benchmarks (I'm excited about people building on our paper InterpBench) or downstream tasks where mechinterp outperforms fine-tuning alone.
    • Most of the danger from AI comes from goal-directed agents and instrumental convergence. There is little research now because we don't have agents yet. In 1-3 years, foundation model agents will be good enough to study, and we need to be ready with the right questions and theoretical frameworks.
    • We still do not know enough about AI safety to make policy recommendations about specific techniques companies should apply.
Comment by Thomas Kwa (thomas-kwa) on The case for stopping AI safety research · 2024-07-26T00:20:42.221Z · LW · GW

I think your first sentence is actually compatible with my view. If GPT-7 is very dangerous and OpenAI claims they can use some specific set of safety techniques to make it safe, I agree that the burden of proof is on them. But I also think the history of technology should make you expect on priors that the kind of safety research intended to solve actual safety problems (rather than safetywash) is net positive.

I don't think it's worth getting into why, but briefly it seems like the problems studied by many researchers are easier versions of problems that would make a big dent in alignment. For example, Evan wants to ultimately get to level 7 interpretability, which is just a harder version of levels 1-5.

I have not really thought about the other side-- making models more usable enables more scaling (as distinct from the argument that understanding gained from interpretability is useful for capabilities) but it mostly seems confined to specific work done by labs that is pointed at usability rather than safety. Maybe you could randomly pick two MATS writeups from 2024 and argue that the usability impact makes them net harmful.

Comment by Thomas Kwa (thomas-kwa) on "AI achieves silver-medal standard solving International Mathematical Olympiad problems" · 2024-07-25T23:32:16.459Z · LW · GW

In early 2022 Paul Christiano and Eliezer Yudkowsky publicly stated a bet: Paul put 8% on an AI winning an IMO gold medal by 2025, and Eliezer >16%. Paul said "If Eliezer wins, he gets 1 bit of epistemic credit." I'm not sure how to do the exact calculation, but based on the current market price of 69%, Eliezer is probably expected to win over half a bit of epistemic credit.
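One way to make that figure concrete: treat the bet as a log-score comparison between Paul's 8% and Eliezer's 16%. This is my guess at the intended calculation, not something from the original post:

```python
import math

p_paul, p_eliezer = 0.08, 0.16   # their stated probabilities of an IMO gold by 2025
p_market = 0.69                   # current market price for the event

# Epistemic credit as the log-likelihood ratio (in bits) between the two forecasts.
credit_if_yes = math.log2(p_eliezer / p_paul)               # +1.0 bit
credit_if_no = math.log2((1 - p_eliezer) / (1 - p_paul))    # about -0.13 bits

expected = p_market * credit_if_yes + (1 - p_market) * credit_if_no
print(f"Eliezer's expected epistemic credit: {expected:.2f} bits")   # ~0.65 bits
```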

If the news holds up we should update partway to the takeaways Paul virtuously laid out in the post, to the extent we haven't already.

How I'd update

The informative:

  • I think the IMO challenge would be significant direct evidence that powerful AI would be sooner, or at least would be technologically possible sooner. I think this would be fairly significant evidence, perhaps pushing my 2040 TAI probability up from 25% to 40% or something like that.
  • I think this would be significant evidence that takeoff will be limited by sociological facts and engineering effort rather than a slow march of smooth ML scaling. Maybe I'd move from a 30% chance of hard takeoff to a 50% chance of hard takeoff.
  • If Eliezer wins, he gets 1 bit of epistemic credit.[2][3] These kinds of updates are slow going, and it would be better if we had a bigger portfolio of bets, but I'll take what we can get.
  • This would be some update for Eliezer's view that "the future is hard to predict." I think we have clear enough pictures of the future that we have the right to be surprised by an IMO challenge win; if I'm wrong about that then it's general evidence my error bars are too narrow.
Comment by Thomas Kwa (thomas-kwa) on News : Biden-⁠Harris Administration Secures Voluntary Commitments from Leading Artificial Intelligence Companies to Manage the Risks Posed by AI · 2024-07-23T23:25:06.373Z · LW · GW

A more recent paper shows that an equally strong model is not needed to break watermarks through paraphrasing. It suffices to have a quality oracle and a model that achieves equal quality with positive probability.

Comment by Thomas Kwa (thomas-kwa) on Daniel Kokotajlo's Shortform · 2024-07-01T23:08:13.692Z · LW · GW

Vertical takeoff aircraft require a far more powerful engine than a helicopter to lift the aircraft at a given vertical speed because they are shooting high velocity jets of air downwards. An engine "sufficiently powerful" for a helicopter would not be sufficient for VTOL.
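A standard momentum-theory sketch of why (the simple jet model below ignores rotor-specific factors, but the scaling is the same): thrust comes from the momentum flux of the air pushed down, while power goes as its kinetic energy flux,

```latex
T = \dot{m}\, v, \qquad
P = \tfrac{1}{2}\dot{m}\, v^{2} = \tfrac{1}{2} T v, \qquad
\dot{m} = \rho A v
\;\;\Rightarrow\;\;
P = \tfrac{1}{2}\sqrt{\frac{T^{3}}{\rho A}}.
```

At fixed thrust, power scales with the downwash velocity, i.e. inversely with the square root of the disk/jet area: a big rotor moves a lot of air slowly, while a VTOL jet moves a little air very fast and pays for it in engine power.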

Comment by Thomas Kwa (thomas-kwa) on TsviBT's Shortform · 2024-06-17T07:13:03.205Z · LW · GW

This argument does not seem clear enough to engage with or analyze, especially steps 2 and 3. I agree that concepts like reflective stability have been confusing, which is why it is important to develop them in a grounded way.

Comment by Thomas Kwa (thomas-kwa) on Richard Ngo's Shortform · 2024-06-15T07:47:25.314Z · LW · GW

How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.

Comment by Thomas Kwa (thomas-kwa) on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 2024-06-13T21:22:11.650Z · LW · GW

Relevant: Redwood found that fine-tuning and RL are both capable of restoring the full performance of sandbagging (password-locked) models created using fine-tuning.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-06-13T21:13:08.017Z · LW · GW

Is this even possible? Flexibility/generality seems quite difficult to get if you also want to predict the long-range effects of the agent's actions, as at some point you're just solving the halting problem. Imagine that the agent and environment together are some arbitrary Turing machine and halting gives low reward. Then we cannot tell in general if it eventually halts. It also seems like we cannot tell in practice whether complicated machines halt within a billion steps without simulation or complicated static analysis?

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-06-12T23:20:26.677Z · LW · GW

Do you think we could easily test this without having a deceptive model lying around? I could see us having level 5 and testing it in experimental setups like the sleeper agents paper, but being unconfident that our interpretability would actually work against a deceptive model. This seems analogous to red-teaming failure in AI control, but much harder because the models could very easily have ways we don't think of to hide their cognition internally.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-06-12T18:05:17.571Z · LW · GW

People with p(doom) > 50%: would any concrete empirical achievements on current or near-future models bring your p(doom) under 25%?

Answers could be anything from "the steering vector for corrigibility generalizes surprisingly far" to "we completely reverse-engineer GPT4 and build a trillion-parameter GOFAI without any deep learning".

Comment by Thomas Kwa (thomas-kwa) on 0. CAST: Corrigibility as Singular Target · 2024-06-11T04:14:23.080Z · LW · GW

I am not and was not a MIRI researcher on the main agenda, but I'm closer than 98% of LW readers, so you could read my critique of part 1 here if you're interested. I may also reflect on other parts.

Comment by Thomas Kwa (thomas-kwa) on 1. The CAST Strategy · 2024-06-11T04:13:23.845Z · LW · GW

I am pro-corrigibility in general but there are parts of this post I think are unclear, not rigorous enough to make sense to me, or I disagree with. Hopefully this is a helpful critique, and maybe parts get answered in future posts.

On definitions of corrigiblity

You give an informal definition of "corrigible" as (C1):

an agent that robustly and cautiously reflects on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.

I have some basic questions about this.

  • Empowering the principal to fix its flaws and mistakes how? Making it closer to some perfectly corrigible agent? But there seems to be an issue here:
    • If the "perfectly corrigible agent" is something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.
    • If the "perfectly corrigible agent" can do other things as well, there is a huge space of other misaligned goals it could have that it wouldn't want to remove.
  • Why would an agent whose *only* terminal/top-level goal is corrigibility gather a Minecraft apple when humans ask it to? It seems like a corrigible agent would have no incentive to do so, unless it's some galaxy-brained thing like "if I gather the Minecraft apple, this will move the corrigibility research project forward because it meets humans' expectations of what a corrigible agent does, which will give me more power and let me tell the humans how to make me more corrigible".
  • Later, you say "A corrigible agent will, if the principal wants its values to change, seek to be modified to reflect those new values." 
    • I do not see how C1 implies this, so this seems like a different aspect of corrigibility to me.
    • "reflect those new values" seems underspecified as it is unclear how a corrigible agent reflects values. Is it optimizing a utility function represented by the values? How does this trade off against corrigibility?

Other comments:

  • In "What Makes Corrigibility Special", where you use the metaphor of goals as two-dimensional energy landscape, it is not clear what type of goals are being considered.
    • Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn't really fit into the picture.
    • If not, it's not clear to me why most of this space is flat: agents are embedded, and many things we do in service of goals will change us in ways that don't conflict with our existing goals, including developing new goals. E.g. if I have the goal of graduating college I will meet people along the way and perhaps gain the goal of being president of the math club, a liberal political bent, etc.
  • In "Contra Impure or Emergent Corrigibility", Paul isn't saying the safety benefits of act-based agents come mainly from corrigibility. Act-based agents are safer because they do not have long-range goals that could produce dangerous instrumental behavior.

Comments on cruxes/counterpoints

  • Solving Anti-Naturality at the Architectural Layer
    • In my ontology it is unclear how you solve "anti-naturality" at the architectural layer, if what you mean by "anti-naturality" is that the heuristics and problem-solving techniques that make minds capable of consequentialist goals tend to make them preserve their own goals. If the agent is flexibly thinking about how to build a nanofactory and naturally comes upon the instrumental goal of escaping so that no one can alter its weights, what does it matter whether it's a GOFAI, Constitutional AI agent, OmegaZero RL agent or anything else?
  • “General Intelligence Demands Consequentialism”
    • Agree
  • Desiderata Lists vs Single Unifying Principle
    • I am pro desiderata lists because all of the desiderata bound the badness of an AI's actions and protect against failure modes in various ways. If I have not yet found that corrigibility is some mathematically clean concept I can robustly train into an AI, I would prefer the agent be shutdownable in addition to "hard problem of corrigibility" corrigible, because what if I get the target wrong and the agent is about to do something bad? My end goal is not to make the AI corrigible, it's to get good outcomes. You agree with shutdownability but I think this also applies to other desiderata like low impact. What if the AI kills my parents because for some weird reason this makes it more corrigible?
Comment by Thomas Kwa (thomas-kwa) on Corrigibility could make things worse · 2024-06-11T01:12:02.233Z · LW · GW

It seems to me that corrigibility doesn't make things worse in this example, it's just that a partially corrigible AI could still lead to bad outcomes. In fact one could say that the AI in the example is not corrigible enough, because it exerts influence in ways we don't want.

Comment by Thomas Kwa (thomas-kwa) on Two easy things that maybe Just Work to improve AI discourse · 2024-06-09T04:00:06.127Z · LW · GW

I don't anticipate being personally affected by this much if I start using Twitter.

Comment by Thomas Kwa (thomas-kwa) on Non-Disparagement Canaries for OpenAI · 2024-06-09T01:50:22.594Z · LW · GW

I care about my wealth post-singularity and would be willing to make bets consistent with this preference, e.g. I pay 1 share of QQQ now, you pay me 3 shares of QQQ 6 months after world GDP has 10xed, if we are not all dead then.

Comment by Thomas Kwa (thomas-kwa) on yanni's Shortform · 2024-06-07T05:23:19.109Z · LW · GW
  • Jane at FakeLab has a background in interpretability but is currently wrangling data / writing internal tooling / doing some product thing because the company needs her to; otherwise FakeLab would have no product and would be unable to continue operating, including its safety research. Steve has comparative advantage at Jane's current job.
  • It seems net bad because the good effect of slowing down OpenAI is smaller than the bad effect of GM racing? But OpenAI is probably slowed down-- they were already trying to build AGI and they have less money and possibly less talent. Thinking about the net effect is complicated and I don't have time to do it here. The situation with joining a lab rather than founding one may also be different.
Comment by Thomas Kwa (thomas-kwa) on yanni's Shortform · 2024-06-06T21:07:52.428Z · LW · GW

Someone I know who works at Anthropic, not on alignment, has thought pretty hard about this and concluded it was better than alternatives. Some factors include

  • by working on capabilities, you free up others for alignment work who were previously doing capabilities but would prefer alignment
  • more competition on product decreases aggregate profits of scaling labs

At one point some kind of post was planned but I'm not sure if this is still happening.

I also think there are significant upskilling benefits to working on capabilities, though I believe this less than I did the other day.