Posts

Goodhart in RL with KL: Appendix 2024-05-18T00:40:15.454Z
Catastrophic Goodhart in RL with KL penalty 2024-05-15T00:58:20.763Z
Is a random box of gas predictable after 20 seconds? 2024-01-24T23:00:53.184Z
Will quantum randomness affect the 2028 election? 2024-01-24T22:54:30.800Z
Thomas Kwa's research journal 2023-11-23T05:11:08.907Z
Thomas Kwa's MIRI research experience 2023-10-02T16:42:37.886Z
Catastrophic Regressional Goodhart: Appendix 2023-05-15T00:10:31.090Z
When is Goodhart catastrophic? 2023-05-09T03:59:16.043Z
Challenge: construct a Gradient Hacker 2023-03-09T02:38:32.999Z
Failure modes in a shard theory alignment plan 2022-09-27T22:34:06.834Z
Utility functions and probabilities are entangled 2022-07-26T05:36:26.496Z
Deriving Conditional Expected Utility from Pareto-Efficient Decisions 2022-05-05T03:21:38.547Z
Most problems don't differ dramatically in tractability (under certain assumptions) 2022-05-04T00:05:41.656Z
The case for turning glowfic into Sequences 2022-04-27T06:58:57.395Z
(When) do high-dimensional spaces have linear paths down to local minima? 2022-04-22T15:35:55.215Z
How dath ilan coordinates around solving alignment 2022-04-13T04:22:25.643Z
5 Tips for Good Hearting 2022-04-01T19:47:22.916Z
Can we simulate human evolution to create a somewhat aligned AGI? 2022-03-28T22:55:20.628Z
Jetlag, Nausea, and Diarrhea are Largely Optional 2022-03-21T22:40:50.180Z
The Box Spread Trick: Get rich slightly faster 2020-09-01T21:41:50.143Z
Thomas Kwa's Bounty List 2020-06-13T00:03:41.301Z
What past highly-upvoted posts are overrated today? 2020-06-09T21:25:56.152Z
How to learn from a stronger rationalist in daily life? 2020-05-20T04:55:51.794Z
My experience with the "rationalist uncanny valley" 2020-04-23T20:27:50.448Z
Thomas Kwa's Shortform 2020-03-22T23:19:01.335Z

Comments

Comment by Thomas Kwa (thomas-kwa) on The case for stopping AI safety research · 2024-07-26T00:20:42.221Z · LW · GW

I think your first sentence is actually compatible with my view. If GPT-7 is very dangerous and OpenAI claims they can use some specific set of safety techniques to make it safe, I agree that the burden of proof is on them. But I also think the history of technology should make you expect on priors that the kind of safety research intended to solve actual safety problems (rather than safetywash) is net positive.

I don't think it's worth getting into why, but briefly it seems like the problems studied by many researchers are easier versions of problems that would make a big dent in alignment. For example, Evan wants to ultimately get to level 7 interpretability, which is just a harder version of levels 1-5.

I have not really thought about the other side-- making models more usable enables more scaling (as distinct from the argument that understanding gained from interpretability is useful for capabilities) but it mostly seems confined to specific work done by labs that is pointed at usability rather than safety. Maybe you could randomly pick two MATS writeups from 2024 and argue that the usability impact makes them net harmful.

Comment by Thomas Kwa (thomas-kwa) on "AI achieves silver-medal standard solving International Mathematical Olympiad problems" · 2024-07-25T23:32:16.459Z · LW · GW

In early 2022 Paul Christiano and Eliezer Yudkowsky publicly stated a bet: Paul put the probability of an IMO gold medal by 2025 at 8%, and Eliezer at >16%. Paul said "If Eliezer wins, he gets 1 bit of epistemic credit." I'm not sure how to do the exact calculation, but based on the current market price of 69%, Eliezer should expect to win over half a bit of epistemic credit.
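One way to run the numbers, as a rough sketch using the expected relative log score and assuming the market's 69% is the right probability (the bet's actual terms only specified the 1-bit payout if Eliezer wins):

```python
import math

p_market = 0.69                  # current market probability of an IMO gold by 2025
p_paul, p_eliezer = 0.08, 0.16   # the two forecasts from the 2022 bet

# Relative log score in bits: how much better Eliezer's forecast does than
# Paul's under each outcome, weighted by the market's probability.
bits_if_yes = math.log2(p_eliezer / p_paul)              # = 1.0 bit
bits_if_no = math.log2((1 - p_eliezer) / (1 - p_paul))   # ≈ -0.13 bits
expected_bits = p_market * bits_if_yes + (1 - p_market) * bits_if_no
print(f"{expected_bits:.2f} bits")   # ≈ 0.65, i.e. over half a bit
```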

If the news holds up we should update partway to the takeaways Paul virtuously laid out in the post, to the extent we haven't already.

How I'd update

The informative:

  • I think the IMO challenge would be significant direct evidence that powerful AI would be sooner, or at least would be technologically possible sooner. I think this would be fairly significant evidence, perhaps pushing my 2040 TAI probability up from 25% to 40% or something like that.
  • I think this would be significant evidence that takeoff will be limited by sociological facts and engineering effort rather than a slow march of smooth ML scaling. Maybe I'd move from a 30% chance of hard takeoff to a 50% chance of hard takeoff.
  • If Eliezer wins, he gets 1 bit of epistemic credit.[2][3] These kinds of updates are slow going, and it would be better if we had a bigger portfolio of bets, but I'll take what we can get.
  • This would be some update for Eliezer's view that "the future is hard to predict." I think we have clear enough pictures of the future that we have the right to be surprised by an IMO challenge win; if I'm wrong about that then it's general evidence my error bars are too narrow.
Comment by Thomas Kwa (thomas-kwa) on News : Biden-Harris Administration Secures Voluntary Commitments from Leading Artificial Intelligence Companies to Manage the Risks Posed by AI · 2024-07-23T23:25:06.373Z · LW · GW

A more recent paper shows that an equally strong model is not needed to break watermarks through paraphrasing. It suffices to have a quality oracle and a model that achieves equal quality with positive probability.

Comment by Thomas Kwa (thomas-kwa) on Daniel Kokotajlo's Shortform · 2024-07-01T23:08:13.692Z · LW · GW

Vertical takeoff aircraft require a far more powerful engine than a helicopter to lift the aircraft at a given vertical speed, because they produce thrust by shooting a narrow, high-velocity jet of air downwards rather than moving a large mass of air slowly. An engine "sufficiently powerful" for a helicopter would not be sufficient for VTOL.
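A back-of-the-envelope sketch of why, using ideal momentum (actuator-disk) theory; the airframe numbers below are illustrative assumptions, not real aircraft specs:

```python
import math

# Ideal hover power from momentum theory: P = T^1.5 / sqrt(2 * rho * A).
# For the same thrust, power scales as 1/sqrt(area of the downwash column),
# so small lift nozzles need far more power than a large rotor disk.

RHO = 1.225  # sea-level air density, kg/m^3

def ideal_hover_power_kw(thrust_n: float, area_m2: float) -> float:
    return thrust_n ** 1.5 / math.sqrt(2 * RHO * area_m2) / 1000

thrust = 9_000 * 9.81                 # a ~9-tonne aircraft hovering
rotor_area = math.pi * 8.0 ** 2       # ~16 m diameter helicopter rotor
nozzle_area = 4 * 0.5                 # a few small lift nozzles, ~2 m^2 total

print(f"rotor:   {ideal_hover_power_kw(thrust, rotor_area):,.0f} kW")   # ~1,200 kW
print(f"nozzles: {ideal_hover_power_kw(thrust, nozzle_area):,.0f} kW")  # ~12,000 kW
```

The ~10x gap is just the square root of the area ratio; real-world losses raise both numbers but don't change the comparison.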

Comment by Thomas Kwa (thomas-kwa) on TsviBT's Shortform · 2024-06-17T07:13:03.205Z · LW · GW

This argument does not seem clear enough to engage with or analyze, especially steps 2 and 3. I agree that concepts like reflective stability have been confusing, which is why it is important to develop them in a grounded way.

Comment by Thomas Kwa (thomas-kwa) on Richard Ngo's Shortform · 2024-06-15T07:47:25.314Z · LW · GW

How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.

Comment by Thomas Kwa (thomas-kwa) on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 2024-06-13T21:22:11.650Z · LW · GW

Relevant: Redwood found that fine-tuning and RL are both capable of restoring the full performance of sandbagging (password-locked) models created using fine-tuning.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-06-13T21:13:08.017Z · LW · GW

Is this even possible? Flexibility/generality seems quite difficult to get if you also want to predict the long-range effects of the agent's actions, as at some point you're just solving the halting problem. Imagine that the agent and environment together are some arbitrary Turing machine and halting gives low reward. Then we cannot tell in general whether it eventually halts. It also seems like we cannot tell in practice whether complicated machines halt within a billion steps without simulation or complicated static analysis.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-06-12T23:20:26.677Z · LW · GW

Do you think we could easily test this without having a deceptive model lying around? I could see us having level 5 and testing it in experimental setups like the sleeper agents paper, but being unconfident that our interpretability would actually work against a deceptive model. This seems analogous to red-teaming failure in AI control, but much harder because the models could very easily have ways we don't think of to hide their cognition internally.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-06-12T18:05:17.571Z · LW · GW

People with p(doom) > 50%: would any concrete empirical achievements on current or near-future models bring your p(doom) under 25%?

Answers could be anything from "the steering vector for corrigibility generalizes surprisingly far" to "we completely reverse-engineer GPT4 and build a trillion-parameter GOFAI without any deep learning".

Comment by Thomas Kwa (thomas-kwa) on 0. CAST: Corrigibility as Singular Target · 2024-06-11T04:14:23.080Z · LW · GW

I am not and was not a MIRI researcher on the main agenda, but I'm closer than 98% of LW readers, so you could read my critique of part 1 here if you're interested. I may also comment on other parts later.

Comment by Thomas Kwa (thomas-kwa) on 1. The CAST Strategy · 2024-06-11T04:13:23.845Z · LW · GW

I am pro-corrigibility in general but there are parts of this post I think are unclear, not rigorous enough to make sense to me, or I disagree with. Hopefully this is a helpful critique, and maybe parts get answered in future posts.

On definitions of corrigibility

You give an informal definition of "corrigible" as (C1):

an agent that robustly and cautiously reflects on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.

I have some basic questions about this.

  • Empowering the principal to fix its flaws and mistakes how? Making it closer to some perfectly corrigible agent? But there seems to be an issue here:
    • If the "perfectly corrigible agent" it something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.
    • If the "perfectly corrigible agent" can do other things as well, there is a huge space of other misaligned goals it could have that it wouldn't want to remove.
  • Why would an agent whose *only* terminal/top-level goal is corrigibility gather a Minecraft apple when humans ask it to? It seems like a corrigible agent would have no incentive to do so, unless it's some galaxy-brained thing like "if I gather the Minecraft apple, this will move the corrigibility research project forward because it meets humans' expectations of what a corrigible agent does, which will give me more power and let me tell the humans how to make me more corrigible".
  • Later, you say "A corrigible agent will, if the principal wants its values to change, seek to be modified to reflect those new values." 
    • I do not see how C1 implies this, so this seems like a different aspect of corrigibility to me.
    • "reflect those new values" seems underspecified as it is unclear how a corrigible agent reflects values. Is it optimizing a utility function represented by the values? How does this trade off against corrigibility?

Other comments:

  • In "What Makes Corrigibility Special", where you use the metaphor of goals as two-dimensional energy landscape, it is not clear what type of goals are being considered.
    • Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn't really fit into the picture.
    • If not, it's not clear to me why most of this space is flat: agents are embedded and many things we do in service of goals will change us in ways that don't conflict with our existing goals, including developing new goals. E.g. if I have the goal of graduating college I will meet people along the way and perhaps gain the goal of being president of the math club, a liberal political bent, etc.
  • In "Contra Impure or Emergent Corrigibility", Paul isn't saying the safety benefits of act-based agents come mainly from corrigibility. Act-based agents are safer because they do not have long-range goals that could produce dangerous instrumental behavior.

Comments on cruxes/counterpoints

  • Solving Anti-Naturality at the Architectural Layer
    • In my ontology it is unclear how you solve "anti-naturality" at the architectural layer, if what you mean by "anti-naturality" is that the heuristics and problem-solving techniques that make minds capable of consequentialist goals tend to make them preserve their own goals. If the agent is flexibly thinking about how to build a nanofactory and naturally comes upon the instrumental goal of escaping so that no one can alter its weights, what does it matter whether it's a GOFAI, Constitutional AI agent, OmegaZero RL agent or anything else?
  • “General Intelligence Demands Consequentialism”
    • Agree
  • Desiderata Lists vs Single Unifying Principle
    • I am pro desiderata lists because all of the desiderata bound the badness of an AI's actions and protect against failure modes in various ways. If I have not yet found that corrigibility is some mathematically clean concept I can robustly train into an AI, I would prefer the agent be shutdownable in addition to "hard problem of corrigibility" corrigible, because what if I get the target wrong and the agent is about to do something bad? My end goal is not to make the AI corrigible, it's to get good outcomes. You agree with shutdownability but I think this also applies to other desiderata like low impact. What if the AI kills my parents because for some weird reason this makes it more corrigible?
Comment by Thomas Kwa (thomas-kwa) on Corrigibility could make things worse · 2024-06-11T01:12:02.233Z · LW · GW

It seems to me that corrigibility doesn't make things worse in this example, it's just that a partially corrigible AI could still lead to bad outcomes. In fact one could say that the AI in the example is not corrigible enough, because it exerts influence in ways we don't want.

Comment by Thomas Kwa (thomas-kwa) on Two easy things that maybe Just Work to improve AI discourse · 2024-06-09T04:00:06.127Z · LW · GW

I don't anticipate being personally affected by this much if I start using Twitter.

Comment by Thomas Kwa (thomas-kwa) on Non-Disparagement Canaries for OpenAI · 2024-06-09T01:50:22.594Z · LW · GW

I care about my wealth post-singularity and would be willing to make bets consistent with this preference, e.g. I pay 1 share of QQQ now, you pay me 3 shares of QQQ 6 months after the world GDP has 10xed if we are not all dead then.

Comment by Thomas Kwa (thomas-kwa) on yanni's Shortform · 2024-06-07T05:23:19.109Z · LW · GW
  • Jane at FakeLab has a background in interpretability but is currently wrangling data / writing internal tooling / doing some product thing because the company needs her to; otherwise FakeLab would have no product and would be unable to continue operating, including its safety research. Steve has comparative advantage at Jane's current job.
  • It seems net bad because the good effect of slowing down OpenAI is smaller than the bad effect of GM racing? But OpenAI is probably slowed down-- they were already trying to build AGI and they have less money and possibly less talent. Thinking about the net effect is complicated and I don't have time to do it here. The situation with joining a lab rather than founding one may also be different.
Comment by Thomas Kwa (thomas-kwa) on yanni's Shortform · 2024-06-06T21:07:52.428Z · LW · GW

Someone I know who works at Anthropic, not on alignment, has thought pretty hard about this and concluded it was better than alternatives. Some factors include

  • by working on capabilities, you free up others for alignment work who were previously doing capabilities but would prefer alignment
  • more competition on product decreases aggregate profits of scaling labs

At one point some kind of post was planned but I'm not sure if this is still happening.

I also think there are significant upskilling benefits to working on capabilities, though I believe this less than I did the other day.

Comment by Thomas Kwa (thomas-kwa) on What do coherence arguments actually prove about agentic behavior? · 2024-06-06T20:58:00.822Z · LW · GW

I do think they disagree, based on my experience working with Nate and Vivek. Eliezer has said he has only shared 40% of his models with even Nate for infosec reasons [1] (which surprised me!), so it isn't surprising to me that they would have different views. Though I don't know Eliezer well, I think he does believe in the basic point of Deep Deceptiveness (because it's pretty basic) but also believes in coherence/utility functions more than Nate does. I can maybe say more privately, but if it's important, asking one of them directly is better.

[1] This was a while ago so he might have actually said that Nate only has 40% of his models. But either way my conclusion is valid.

Comment by Thomas Kwa (thomas-kwa) on What do coherence arguments actually prove about agentic behavior? · 2024-06-06T18:11:49.153Z · LW · GW

I think that it's mostly Eliezer who believes so strongly in utility functions. Nate Soares' post Deep Deceptiveness, which I claim is a central part of the MIRI threat model insofar as there is one, doesn't require an agent coherent enough to satisfy VNM over world-states. In fact it can depart from coherence in several ways and still be capable and dangerous:

  • It can flinch away from dangerous thoughts;
  • Its goals can drift over time;
  • Its preferences can be incomplete;
  • Maybe every 100 seconds it randomly gets distracted for 10 seconds.

The important property is that it has a goal about the real world, applies general problem-solving skills to achieve it, and has no stable desire to use its full intelligence to be helpful/good for humans. No one has formalized this, and so no one has proved interesting things about such an agent model.

Comment by Thomas Kwa (thomas-kwa) on RobertM's Shortform · 2024-06-06T05:37:20.805Z · LW · GW

Such a world could even be more dangerous. LLMs are steerable and relatively weak at consequentialist planning. There is AFAICT no fundamental reason why the next paradigm couldn't be even less interpretable, less steerable, and more capable of dangerous optimization at a given level of economic utility.

Comment by Thomas Kwa (thomas-kwa) on My simple AGI investment & insurance strategy · 2024-06-06T02:06:36.430Z · LW · GW

VTI is basically the entire US market, and companies are added to the index as they go public. You will gain unless the value goes to private companies, but then you couldn't invest in them anyway. There are other indices for international equities.

Comment by Thomas Kwa (thomas-kwa) on davekasten's Shortform · 2024-06-05T21:35:27.040Z · LW · GW

"want to pick a war with America" is really strange wording because China's strategic goals are not "win a war against nuclear-armed America", but things like "be able to control its claims in the South China Sea including invading Taiwan without American interference". Likewise Russia doesn't want to "pick a war with the EU" but rather annex Ukraine; if they were stupid enough to want the former they would have just bombed Paris. I don't know whether national security people relate to the phrasing the same way but they do understand this.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2024-06-05T08:49:52.888Z · LW · GW

I don't know how to say this without sounding elitist, but my guess is that people who prolifically write LW comments and whose karma:(comments+posts) ratio is less than around 1.5:1 or maybe even 2:1 should be more selective in what they say. Around this range, it would be unwarranted for mods to rate-limit you, but perhaps the bottom 30% of content you produce is net negative considering the opportunity cost to readers.

Of course, one should not Goodhart for karma and Eliezer is not especially virtuous by having a 16:1 ratio, but 1.5:1 is a quite low ratio. You get 1.5 karma if 0.75 high-karma users or 1.5 low-karma users weakly upvote your comment. Considering that it takes barely a second to upvote and comments get tens of views, this is not a high bar for the average comment.

Caveats:

  • If your absolute number of comments + posts is low, the risk of making new users participate less outweighs the risk of comments clogging up the frontpage.
  • In some contexts votes might underestimate quality, like if your comment is on some esoteric topic that few people engage with.
Comment by Thomas Kwa (thomas-kwa) on Prometheus's Shortform · 2024-06-05T00:39:28.933Z · LW · GW

Application seems better than other kinds of capabilities. I was thinking of capabilities as everything other than alignment work, so inclusive of application.

Comment by Thomas Kwa (thomas-kwa) on Evidence of Learned Look-Ahead in a Chess-Playing Neural Network · 2024-06-05T00:35:12.019Z · LW · GW

This is really exciting to me. If this work generalizes to future models capable of sophisticated planning in the real world, we will be able to forecast the future actions that internally justify an AI's current actions, and thus tell whether it is planning a coup, whether or not an explicit general-purpose representation of the objective exists.

Comment by Thomas Kwa (thomas-kwa) on Prometheus's Shortform · 2024-06-04T17:23:08.825Z · LW · GW

Since many people seem to disagree, I'm going to share some reasons why I believe this:

  • AGI that poses serious existential risks seems at least 6 years away, and safety work seems much more valuable at crunch time, such that I think more than half of most people's impact will be more than 5 years away. So skilling up quickly should be the primary concern, until your timelines shorten.
  • Mentorship for safety is still limited. If you can get an industry safety job or get into MATS, this seems better than some random AI job, but most people can't.
    • I think there are also many policy roles that are not crowded and potentially time-critical, which people should also take if available.
  • Funding is also limited in the current environment. I think most people cannot get funding to work on alignment if they tried? This is fairly cruxy and I'm not sure of it, so someone should correct me if I'm wrong.
  • The relative impact of working on capabilities is smaller than working on alignment-- there are still 10x as many people doing capabilities as alignment, so a year of your capabilities work is a much smaller fraction of that field than a year of your alignment work is of alignment. Unless returns don't diminish or you are doing something unusually harmful, you can work for 1 year on capabilities and 1 year on alignment and only reduce your impact by about 10%.
    • However, it seems bad to work at places that have particularly bad safety cultures and are explicitly trying to create AGI, possibly including OpenAI.
    • Safety could get even more crowded, which would make upskilling to work on safety net negative. This should be a significant concern, but I think most people can skill up faster than this.
  • Skills useful in capabilities are useful for alignment, and if you're careful about what job you take there isn't much more skill penalty in transferring them than, say, switching from vision model research to language model research.
  • Capabilities often has better feedback loops than alignment because you can see whether the thing works or not. Many prosaic alignment directions also have this property. Interpretability is getting there, but not quite. Other areas, especially in agent foundations, are significantly worse.
Comment by Thomas Kwa (thomas-kwa) on Prometheus's Shortform · 2024-06-04T14:14:09.034Z · LW · GW

I suspect that working on capabilities (edit: preferably applications rather than building AGI) in some non-maximally-harmful position is actually the best choice for most junior x-risk concerned people who want to do something technical. Safety is just too crowded and still not very tractable.

Comment by Thomas Kwa (thomas-kwa) on When is Goodhart catastrophic? · 2024-06-03T09:04:01.631Z · LW · GW

We considered that "catastrophic" might have that connotation, but we couldn't think of a better name and I still feel okay about it. Our intention with "catastrophic" was to echo the standard ML term "catastrophic forgetting", not a global catastrophe. In catastrophic forgetting, the model completely forgets how to do task A after it is trained on task B, but it doesn't do A much worse than random. So we think "catastrophic Goodhart" gives the correct idea to people who come from ML.

The natural question is then: why didn't we study circumstances in which optimizing for a proxy gives you negative utility in the limit? Because it isn't true under the assumptions we are making. We wanted to study regressional Goodhart, and this naturally led us to the independence assumption. Previous work like Zhuang et al and Skalse et al has already formalized the extremal Goodhart / "use the atoms for something else" argument that optimizing for one goal would be bad for another goal, and we thought the more interesting part was showing that bad outcomes are possible even when error and utility are independent. Under the independence assumption, it isn't possible to get less than 0 utility.

To get negative utility in the frame where proxy = error + utility, you would need to assume something about the dependence between error and utility, and we couldn't think of a simple assumption to make that didn't have too many moving parts. I think extremal Goodhart is overall more important, but it's not what we were trying to model.
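A quick Monte Carlo sketch of that independence point (my own illustration, not code from the paper): with mean-zero utility independent of the error, conditioning on an extreme proxy value never pushes expected utility below zero; with heavy-tailed error it decays back toward zero (the "catastrophic forgetting"-style outcome), while with light-tailed error it keeps rising.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000_000
V = rng.standard_normal(n)                    # true utility, mean 0
errors = {
    "light-tailed error": rng.standard_normal(n),
    "heavy-tailed error": rng.standard_t(df=2, size=n),
}

for name, X in errors.items():                # V and X are independent
    proxy = V + X
    for q in [0.9, 0.99, 0.999]:
        cutoff = np.quantile(proxy, q)
        selected = proxy >= cutoff
        print(f"{name}: top {100 * (1 - q):.1f}% of proxy -> E[V] = {V[selected].mean():.2f}")
```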

Lastly, I think you're imagining "average" outcome as a random policy, which is an agent incapable of doing significant harm. The utility of the universe is still positive because you can go about your life. But in a different frame, random is really bad. Right now we pretrain models and then apply RLHF (and hopefully soon, better alignment techniques). If our alignment techniques produce no more utility than the prior, this means the model is no more aligned than the base model, which is a bad outcome for OpenAI. Superintelligent models might be arbitrarily capable of doing things, so the prior might be better thought of as irreversibly putting the world in a random state, which is a global catastrophe.

Comment by Thomas Kwa (thomas-kwa) on MIRI 2024 Communications Strategy · 2024-06-02T07:15:20.046Z · LW · GW

Disagree. Epistemics is a group project, and impatiently interrupting people can make both you and your interlocutor less likely to combine your information into correct conclusions. It is also evidence that you're incurious internally, which makes you worse at reasoning, though I don't want to speculate on Eliezer's internal experience in particular.

Comment by Thomas Kwa (thomas-kwa) on What do coherence arguments actually prove about agentic behavior? · 2024-06-02T03:29:22.717Z · LW · GW

agent that maximizes some combination of the entropy of its actions, and their expected utility. i.e. the probability of taking an action a is proportional to exp(βE[U|a]) up to a normalization factor.

Note that if the distribution of utility under the prior is heavy-tailed, there are policies achieving arbitrarily high expected utility with arbitrarily low relative entropy, so the optimal policy is undefined. In the case of goal misspecification, optimization with a KL penalty may be unsafe or may get no better true utility than the prior.
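A minimal numerical sketch of the heavy-tailed case (my own construction, not code from the post): take a base policy whose reward is Pareto-distributed and mix in a sliver of its own upper tail. As the tail cutoff grows, expected reward diverges while the KL divergence from the base policy shrinks, so there is no KL-regularized optimum.

```python
import math

ALPHA = 1.5  # Pareto tail index: heavy-tailed, but with a finite mean

def mixture_stats(t: float, eps: float, alpha: float = ALPHA):
    """KL and expected reward of the policy that follows the base policy with
    prob 1-eps and samples from the base policy's tail {R > t} with prob eps."""
    delta = t ** (-alpha)                  # P(R > t) under the base policy
    base_mean = alpha / (alpha - 1)        # E[R]
    tail_mean = alpha * t / (alpha - 1)    # E[R | R > t]
    # density ratio q/p is (1-eps) below t and (1-eps) + eps/delta above t
    kl = ((1 - eps) * (1 - delta) * math.log(1 - eps)
          + ((1 - eps) * delta + eps) * math.log(1 - eps + eps / delta))
    return kl, (1 - eps) * base_mean + eps * tail_mean

for t in [1e2, 1e4, 1e6, 1e8]:
    eps = 1.0 / math.log(t) ** 2
    kl, reward = mixture_stats(t, eps)
    print(f"t={t:.0e}  KL={kl:.3f}  E[reward]={reward:,.0f}")
# KL falls toward 0 while expected reward grows without bound. With a
# light-tailed base distribution, the same trick buys reward only in
# proportion to the KL spent.
```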

Comment by Thomas Kwa (thomas-kwa) on Drexler's Nanosystems is now available online · 2024-06-01T09:11:09.009Z · LW · GW

I believe Nanosystems is mostly valid physics (though I am still unsure about this), and in the far future, after GDP has doubled ten or twenty times, we will think of it the way today's rocket scientists think of Tsiolkovsky's writing: speculative science that gave a glimpse surprisingly far into the future through an understanding of the timeless basic principles at play, even though it misses many implementation details. And just as people in 1914 gained a valuable sense of perspective from knowing it was theoretically possible to send people to Mars on a ship with airlocks, fueled by hydrogen-oxygen engines and steered by cold gas thrusters, I think we gain an enormously valuable perspective on the universe by knowing that it is (probably) theoretically possible to perform most chemical reactions and many molecular assembly tasks with 99% efficiency using machines that precisely place atoms, self-replicate once an hour, and require only ultrapure gases, various trace metals, and electricity as input.

Comment by Thomas Kwa (thomas-kwa) on What's next for the field of Agent Foundations? · 2024-05-30T07:55:37.967Z · LW · GW

The title of this dialogue promised a lot, but I'm honestly a bit disappointed by the content. It feels like the authors are discussing exactly how to run particular mentorship programs, how to structure grants, and how research works in full generality, while no one is actually looking at the technical problems. All field-building efforts must depend on the importance and tractability of technical problems, and this is just as true when the field is still developing a paradigm. I think a paradigm is established only when researchers with many viewpoints build a sense of which problems are important, then try many approaches until one successfully solves many such problems, thus proving the value of said approach. Wanting new researchers to come in with totally new takes and start totally new, illegible research agendas reflects a level of helplessness that I think is unwarranted-- how can one be interested in AF without some view on what problems are interesting?

I would be excited about a dialogue that goes like this, though the format need not be rigid:

  • What are the most important [1] problems in agent foundations, with as much specificity as possible?
    • Responses could include things like:
      • A sound notion of "goals with limited scope": can't nail down precise desiderata now, but humans have these all the time, we don't know what they are, and they could be useful in corrigibility or impact measures.
      • Finding a mathematical model for agents that satisfies properties of logical inductors but also various other desiderata
      • Further study of corrigibility and capability of agents with incomplete preferences
    • Participants discuss how much each problem scratches their itch of curiosity about what agents are.
  • What techniques have shown promise in solving these and other important problems?
    • Does [infra-Bayes, Demski's frames on embedded agents, some informal 'shard theory' thing, ...] have a good success to complexity ratio?
      • probably none of them do?
  • What problems would benefit the most from people with [ML, neuroscience, category theory, ...] expertise?

[1]: (in the Hamming sense that includes tractability)

Comment by Thomas Kwa (thomas-kwa) on MIRI 2024 Communications Strategy · 2024-05-30T06:39:36.450Z · LW · GW

Does MIRI have a statement on recent OpenAI events? I'm pretty excited about frank reflections on current events as helping people to orient.

Comment by Thomas Kwa (thomas-kwa) on Try to solve the hard parts of the alignment problem · 2024-05-29T18:26:42.981Z · LW · GW

Points 1-3 and the idea that superintelligences will be able to understand our values (which I think everyone believes). But the conclusion needs a bunch of additional assumptions.

Comment by Thomas Kwa (thomas-kwa) on Try to solve the hard parts of the alignment problem · 2024-05-29T18:11:09.645Z · LW · GW

I suggest people read both that and Deep Deceptiveness (which is not about deceptiveness in particular) and think about how both could be valid, because I think they both are.

Comment by Thomas Kwa (thomas-kwa) on OpenAI: Fallout · 2024-05-29T00:34:52.425Z · LW · GW

Prerat: Everyone should have a canary page on their website that says “I’m not under a secret NDA that I can’t even mention exists” and then if you have to sign one you take down the page.

Does this work? Sounds like a good idea.

Comment by Thomas Kwa (thomas-kwa) on Catastrophic Goodhart in RL with KL penalty · 2024-05-28T20:44:31.366Z · LW · GW

The third one was a typo which I just fixed. I have also changed it to use "base policy" everywhere to be consistent, although this may change depending on what terminology is most common in an ML context, which I'm not sure of.

Comment by Thomas Kwa (thomas-kwa) on Real Life Sort by Controversial · 2024-05-28T02:23:49.478Z · LW · GW

I have strong downvoted without reading most of this post because the author appears to be trying to make something harmful for the world.

Comment by Thomas Kwa (thomas-kwa) on What mistakes has the AI safety movement made? · 2024-05-27T23:57:04.156Z · LW · GW

Thanks, fixed.

Comment by Thomas Kwa (thomas-kwa) on Alexander Gietelink Oldenziel's Shortform · 2024-05-27T20:31:35.321Z · LW · GW

Then I think you should specify that progress within this single innovation could be continuous over years and include 10+ ML papers in sequence each developing some sub-innovation.

Comment by Thomas Kwa (thomas-kwa) on Alexander Gietelink Oldenziel's Shortform · 2024-05-27T18:32:13.534Z · LW · GW

I think a single innovation left to create LTPA is unlikely because it runs contrary to the history of technology and of machine learning. For example, in the 10 years before AlphaGo and before GPT-4, several different innovations were required-- and that's if you count "deep learning" as one item. ChatGPT actually understates the number here because different components of the transformer architecture like attention, residual streams, and transformer++ innovations were all developed separately. 

Comment by Thomas Kwa (thomas-kwa) on Do you believe in hundred dollar bills lying on the ground? Consider humming · 2024-05-27T06:41:37.831Z · LW · GW

I don't know what you mean by "total amount" because ppm is a concentration, but that tweet's interpretation agrees with mine.

The wording ppm*hour being a typo for ppm/hour does not make sense to me because that would be dimensionally very strange. That could mean the concentration increases at 0.11 ppm per hour, but for how long? A single dose can't cause this increase indefinitely. The only ways that I could see exposure being measured sensibly are:

  • ppm * hour (NO concentration of nasal air, integrated exposure over time; it is unspecified whether the concentration is 0.11 ppm for 1 hour or 39.6 ppm for 10 seconds or whatever)
  • ppm (NO concentration of nasal air, peak)
  • ppm (NO concentration of nasal air, average over the 8 hour interval between doses)
  • ppm (concentration of the 0.56ml of nasal spray, so 0.11 ppm would be 0.06 nL or 0.06 µg or something of NO delivered).
Comment by Thomas Kwa (thomas-kwa) on Do you believe in hundred dollar bills lying on the ground? Consider humming · 2024-05-27T02:18:28.340Z · LW · GW

I have received a bounty on paypal. Thanks for offering, as well as for laying out the reasoning in this post such that it's easy to critique.

Comment by Thomas Kwa (thomas-kwa) on Veedrac's Shortform · 2024-05-27T00:12:32.446Z · LW · GW

What do you mean by "robust to even trivial selection pressure"?

Comment by Thomas Kwa (thomas-kwa) on What mistakes has the AI safety movement made? · 2024-05-24T21:55:13.402Z · LW · GW

My opinions:

Too many galaxy-brained arguments & not enough empiricism

Our stories need more contact with the real world

Agree. Although there is sometimes a tradeoff between direct empirical testability and relevance to long-term alignment.

Adrià Garriga-Alonso thought that infrabayesianism, parts of singular learning theory and John Wentworth’s research programs are unlikely to end up being helpful for safety:

Agree. Thinking about mathematical models for agency seems fine because it is fundamental and theorems can get you real understanding, but the more complicated and less elegant your models get and the more tangential they are to the core question of how AI and instrumental convergence work, the less likely they are to be useful.

Evan Hubinger pushed back against this view by defending MIRI’s research approach. [...] we had no highly capable general-purpose models to do experiments on

Some empirical work could have happened well before the shift to empiricism around 2021. FAR AI's Go attack work could have happened shortly after LeelaZero was released in 2017, as could interpretability on non-general-purpose models.

Too insular

Many in AI safety have been too quick to dismiss the concerns of AI ethicists [... b]ut AI ethics has many overlaps with AI safety both technically and policy:

Undecided; I used to believe this but then heard that AI ethicists have been uncooperative when alignment people try to reach out. But maybe we are just bad at politics and coalition-building.

AI safety needs more contact with academia. [...] research typically receives less peer review, leading to on average lower quality posts on sites like LessWrong. Much of AI safety research lacks the feedback loops that typical science has.

Agree; I also think that the research methodology and aesthetic of academic machine learning has been underappreciated (although it is clearly not perfect). Historically some good ideas like the LDT paper were rejected in journals, but it is definitely true that many things you do for the sake of publishing actually make your science better, e.g. having both theory and empirical results, or putting your contributions in an ontology people understand. I did not really understand how research worked until attending ICML last year.

Many of the computer science and math kids in AI safety do not value insights from other disciplines enough [....] Norms and values are the equilibria of interactions between individuals, produced by their behaviors, not some static list of rules up in the sky somewhere.

Plausible but with reservations:

  • I think "interdisciplinary" can be a buzzword that invites a lot of bad research
  • Thinking of human values as a utility function can be a useful simplifying assumption in developing basic theory

[...] too much jargony and sci-fi language. Esoteric phrases like “p(doom)”, “x-risk” or “HPMOR” can be off-putting to outsiders and a barrier to newcomers, and give culty vibes.

Disagree. This is the useful kind of jargon; "x-risk" is a concept we really want in our vocabulary and it is not clear how to make it sound less weird; if AI safety people are off-putting to outsiders it is because we need to be more charismatic and better at communication.

Ajeya Cotra thought some AI safety researchers, like those at MIRI, have been too secretive about the results of their research.

Agree; I think there had been a mindset where, since MIRI's plan for saving the world required them to reach the frontier of AI research with far safer (e.g. non-ML) designs, they thought their AI capabilities ideas were better than they actually were.

Holly Elmore suspected that this insular behavior was not by mistake, but on purpose. The rationalists wanted to only work with those who see things the same way as them, and avoid too many “dumb” people getting involved.

Undecided; this has not been my experience. I do think people should recognize that AI safety has been heavily influenced by what is essentially a trauma response from being ignored by the scientific establishment from 2003-2023.

Bad messaging

6 respondents thought AI safety could communicate better with the wider world.

Agree. It's wild to me that e/acc and AI safety seem memetically evenly matched on Twitter (could be wrong about this, if so someone please correct me) while e/acc has a worse favorability rating than Scientology in surveys.

4 thought that some voices push views that are too extreme or weird

I think Eliezer's confidence is not the worst thing, because in most fields there are scientists who are super overconfident. But he should probably be better at communication, e.g. realizing that people will react negatively to raising the possibility of nuking/bombing datacenters without lots of contextualizing. Undecided on Pause AI and Conjecture.

Ben Cottier lamented the low quality of discourse around AI safety, especially in places like Twitter.

I'm pretty sure a large part of this is some self-perpetuating thing where participating in higher-quality discourse on LW or, better, your workplace Slack is more fun than Twitter. Not sure what to do here. Agree about polarization, but it's not clear what to do there either.

AI safety’s relationship with the leading AGI companies

3 respondents also complained that the AI safety community is too cozy with the big AGI companies. A lot of AI safety researchers work at OpenAI, Anthropic and DeepMind. The judgments of these researchers may be biased by a conflict of interest: they may be incentivised for their company to succeed in getting to AGI first. They will also be contractually limited in what they can say about their (former) employer, in some cases even for life.

Agree about conflicts of interest. I remember hearing that at one of the international AI safety dialogues, every academic signed but no one with a purely corporate affiliation did. There should be some way for safety researchers to divest their equity rather than give it up / donate it and lose 85% of their net worth, but conflicts of interest will remain.

The bandwagon

Many in the AI safety movement do not think enough for themselves, 4 respondents thought.

Slightly agree I guess? I don't really have thoughts. It makes sense that Alex thinks this because he often disagrees with other safety researchers-- not to discredit his position.

Discounting public outreach & governance as a route to safety

Historically, the AI safety movement has underestimated the potential of getting the public on-side and getting policy passed, 3 people said. There is a lot of work in AI governance these days, but for a long time most in AI safety considered it a dead end. The only hope to reduce existential risk from AI was to solve the technical problems ourselves, and hope that those who develop the first AGI implement them. Jamie put this down to a general mistrust of governments in rationalist circles, not enough faith in our ability to solve coordination problems, and a general dislike of “consensus views”.

I think this is largely due to a mistake by Yudkowsky, which is maybe compatible with Jamie's opinions.

I also want to raise the possibility that the technical focus was rational and correct at the time. Early MIRI/CFAR rationalists were nerds with maybe -1.5 standard deviations of political aptitude on average. So I think it is likely that they would have failed at their policy goals, and maybe even had three more counterproductive events like the Puerto Rico conference where OpenAI was founded. Later, AI safety started attracting political types, and maybe this was the right time to start doing policy.

[Holly] also condemned the way many in AI safety hoped to solve the alignment problem via “elite shady back-room deals”, like influencing the values of the first AGI system by getting into powerful positions in the relevant AI companies.

It doesn't sound anywhere near as shady if you phrase it as "build a safety focused culture or influence decisions at companies that will build the first AGI", which seems more accurate.

Comment by Thomas Kwa (thomas-kwa) on The case for stopping AI safety research · 2024-05-24T02:41:39.420Z · LW · GW

The burden of proof is on you that current safety research is not incremental progress towards safety research that matters on superintelligent AI. Generally the way that people solve hard problems is to solve related easy problems first, and this is true even if the technology in question gets much more powerful. Imagine if we had to land rockets on barges before anyone had invented PID controllers and observed their failure modes.

Also, the directions suggested in section 5 of the paper you linked seem to fall well within the bounds of normal AI safety research.

Edit: Two people reacted to taboo "burden of proof". I mean that the claim is contrary to the reference classes I can think of, and to argue for it there needs to be some argument why it is true in this case. It is also possible that the safety effect is significant but outweighed by the speedup effect, but that should also be clearly stated if it is what OP believes.

Comment by Thomas Kwa (thomas-kwa) on robo's Shortform · 2024-05-19T19:24:25.213Z · LW · GW

Seems reasonable except that Eliezer's p(doom | trying to solve alignment) in early 2023 was much higher than 50%, probably more like 98%. AGI Ruin was published in June 2022 and drafts existed since early 2022. MIRI leadership had been pretty pessimistic ever since AlphaGo in 2016 and especially since their research agenda collapsed in 2019.

Comment by Thomas Kwa (thomas-kwa) on robo's Shortform · 2024-05-19T07:17:32.547Z · LW · GW

As recently as early 2023 Eliezer was very pessimistic about AI policy efforts amounting to anything, to the point that he thought anyone trying to do AI policy was hopelessly naive and should first try to ban biological gain-of-function research just to understand how hard policy is. Given how influential Eliezer is, he loses a lot of points here (and I guess Hendrycks wins?)

Then Eliezer updated and started e.g. giving podcast interviews. Policy orgs spun up and there are dozens of safety-concerned people working in AI policy. But this is not reflected in the LW frontpage. Is this inertia, or do we like thinking about computer science more than policy, or is it something else?

Comment by Thomas Kwa (thomas-kwa) on Do you believe in hundred dollar bills lying on the ground? Consider humming · 2024-05-19T02:44:26.867Z · LW · GW

My prior is that solutions contain on the order of 1% active ingredients, and of things on the Enovid ingredients list, citric acid and NaNO2 are probably the reagents that create NO [1], which happens at a 5.5:1 mass ratio. 0.11ppm*hr as an integral over time already means the solution is only around 0.01% NO by mass [1], which is 0.055% reagents by mass, probably a bit more because yield is not 100%. This is a bit low but believable. If the concentration were really only 0.88ppm and dissipated quickly, it would be extremely dilute which seems unlikely. This is some evidence for the integral interpretation over the instantaneous 0.88ppm interpretation-- not very strong evidence; I mostly believe it because it seems more logical and also dimensionally correct. [2]

[1] https://chatgpt.com/share/e95fcaa3-4062-4805-80c3-7f1b18b12db2

[2] If you multiply 0.11 ppm*hr by 8 hours, you get 0.88 ppm*hr^2, which doesn't make sense.
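For concreteness, here is a rough reconstruction of that arithmetic; the breathing rate, molar volume, and spray density are my own assumptions rather than numbers from the thread, so treat it as an order-of-magnitude check:

```python
# Convert 0.11 ppm*hr of nasal NO exposure into an implied mass fraction of the spray.
exposure_ppm_hr = 0.11           # integrated NO exposure per dose (integral interpretation)
nasal_airflow_l_per_hr = 6 * 60  # assume ~6 L/min resting minute ventilation through the nose
molar_volume_l = 24.5            # L/mol for a gas near body temperature
no_molar_mass_g = 30.0           # g/mol for NO
dose_mass_g = 0.56               # 0.56 mL of spray at ~1 g/mL

no_gas_liters = exposure_ppm_hr * 1e-6 * nasal_airflow_l_per_hr
no_mass_g = no_gas_liters / molar_volume_l * no_molar_mass_g
no_fraction = no_mass_g / dose_mass_g
reagent_fraction = no_fraction * 5.5      # citric acid + NaNO2 at the ~5.5:1 mass ratio

print(f"NO per dose: {no_mass_g * 1e6:.0f} µg")                   # ~50 µg
print(f"NO mass fraction: {no_fraction:.3%}")                     # ~0.01%
print(f"Implied reagent mass fraction: {reagent_fraction:.3%}")   # ~0.05%
```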

Comment by Thomas Kwa (thomas-kwa) on Catastrophic Goodhart in RL with KL penalty · 2024-05-17T23:36:39.111Z · LW · GW

Also, why do you think that error is heavier tailed than utility?

Goodhart's Law is really common in the real world, and most things only work because we can observe our metrics, see when they stop correlating with what we care about, and iteratively improve them. There's also the prevalence of reward hacking in RL, with hacked policies often getting very high reward values.

If the reward model is as smart as the policy and is continually updated with data, maybe we're in a different regime where errors are smaller than utility.