Catastrophic Regressional Goodhart: Appendix 2023-05-15T00:10:31.090Z
When is Goodhart catastrophic? 2023-05-09T03:59:16.043Z
Challenge: construct a Gradient Hacker 2023-03-09T02:38:32.999Z
Failure modes in a shard theory alignment plan 2022-09-27T22:34:06.834Z
Utility functions and probabilities are entangled 2022-07-26T05:36:26.496Z
Deriving Conditional Expected Utility from Pareto-Efficient Decisions 2022-05-05T03:21:38.547Z
Most problems don't differ dramatically in tractability (under certain assumptions) 2022-05-04T00:05:41.656Z
The case for turning glowfic into Sequences 2022-04-27T06:58:57.395Z
(When) do high-dimensional spaces have linear paths down to local minima? 2022-04-22T15:35:55.215Z
How dath ilan coordinates around solving alignment 2022-04-13T04:22:25.643Z
5 Tips for Good Hearting 2022-04-01T19:47:22.916Z
Can we simulate human evolution to create a somewhat aligned AGI? 2022-03-28T22:55:20.628Z
Jetlag, Nausea, and Diarrhea are Largely Optional 2022-03-21T22:40:50.180Z
The Box Spread Trick: Get rich slightly faster 2020-09-01T21:41:50.143Z
Thomas Kwa's Bounty List 2020-06-13T00:03:41.301Z
What past highly-upvoted posts are overrated today? 2020-06-09T21:25:56.152Z
How to learn from a stronger rationalist in daily life? 2020-05-20T04:55:51.794Z
My experience with the "rationalist uncanny valley" 2020-04-23T20:27:50.448Z
Thomas Kwa's Shortform 2020-03-22T23:19:01.335Z


Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2023-09-24T02:14:44.363Z · LW · GW

I'm looking for AI safety projects with people with some amount of experience. I have 3/4 of a CS degree from Caltech, one year at MIRI, and have finished the WMLB and ARENA bootcamps. I'm most excited about activation engineering, but willing to do anything that builds research and engineering skill.

If you've published 2 papers in top ML conferences or have a PhD in something CS related, and are interested in working with me, send me a DM.

Comment by Thomas Kwa (thomas-kwa) on Science of Deep Learning more tractably addresses the Sharp Left Turn than Agent Foundations · 2023-09-21T01:52:23.150Z · LW · GW

Upvoted with the caveat that I think no one has clearly articulated the "sharp left turn" failure mode, including Nate Soares, and so anything attempting to address it might not succeed.

Comment by Thomas Kwa (thomas-kwa) on leogao's Shortform · 2023-09-15T09:31:29.369Z · LW · GW


Comment by Thomas Kwa (thomas-kwa) on D0TheMath's Shortform · 2023-09-14T01:34:12.359Z · LW · GW

Not in isolation, but that's just because characterizing the ultimate goal / optimization target of a system is way too difficult for the field right now. I think the important question is whether interp brings us closer such that in conjunction with more theory and/or the ability to iterate, we can get some alignment and/or corrigibility properties.

I haven't read the paper and I'm not claiming that this will be counterfactual to some huge breakthrough, but understanding in-context learning algorithms definitely seems like a piece of the puzzle. To give a fanciful story from my skim, the paper says that the model constructs an internal training set. Say we have a technique to excise power-seeking behavior from models by removing the influence of certain training examples. If the model's mesa-optimization algorithms operate differently, our technique might not work until we understand this and adapt the technique. Or we can edit the internal training set directly rather than trying to indirectly influence it. 

Comment by Thomas Kwa (thomas-kwa) on D0TheMath's Shortform · 2023-09-14T01:01:40.126Z · LW · GW

This might do more for alignment. Better that we understand mesa-optimization and can engineer it than have it mysteriously emerge.

Comment by Thomas Kwa (thomas-kwa) on Progress links digest, 2023-09-08: The Conservative Futurist, cargo airships, and more · 2023-09-09T03:26:19.923Z · LW · GW

Thanks, I think the crux is economies of scale, both in size of the airship and production rate. That paper assumes the same cost as the Zeppelin NT, and its cost estimate is 93% airframe cost. The assumption is that the airframe cost per kg is the same as the Zeppelin NT, of which only 7 have ever been built. The Eli Dourado article imagines huge economies of scale, with airships 5x larger than the study, 25,000 units and a cost around $100 million per unit or ~12x lower per kg.

I would guess scaling is unlikely to work because it requires hydrogen, but there's so much money available to solve at least the technical problems.

Comment by Thomas Kwa (thomas-kwa) on Progress links digest, 2023-09-08: The Conservative Futurist, cargo airships, and more · 2023-09-09T00:19:44.114Z · LW · GW

Whose "competent analysis"?

Comment by Thomas Kwa (thomas-kwa) on Alexander Gietelink Oldenziel's Shortform · 2023-09-07T18:58:44.097Z · LW · GW

I'm not too concerned about this. ML skills are not sufficient to do good alignment work, but they seem to be very important for like 80% of alignment work and make a big difference in the impact of research (although I'd guess still smaller than whether the application to alignment is good).

  • Primary criticisms of Redwood involve their lack of experience in ML
  • The explosion of research in the last ~year is partially due to an increase in the number of people in the community who work with ML. Maybe you would argue that lots of current research is useless, but it seems a lot better than only having MIRI around
  • The field of machine learning at large is in many cases solving easier versions of problems we have in alignment, and therefore it makes a ton of sense to have ML research experience in those areas. E.g. safe RL is how to get safe policies when you can optimize over policies and know which states/actions are safe; alignment can be stated as a harder version of this where we also need to deal with value specification, self-modification, instrumental convergence etc.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2023-09-04T22:07:13.775Z · LW · GW

I think the framing "alignment research is preparadigmatic" might be heavily misunderstood. The term "preparadigmatic" of course comes from Thomas Kuhn's The Structure of Scientific Revolutions. My reading of this says that a paradigm is basically an approach to solving problems which has been proven to work, and that the correct goal of preparadigmatic research should be to do research generally recognized as impressive.

For example, Kuhn says in chapter 2 that "Paradigms gain their status because they are more successful than their competitors in solving a few problems that the group of practitioners has come to recognize as acute." That is, lots of researchers have different ontologies/approaches, and paradigms are the approaches that solve problems that everyone, including people with different approaches, agrees to be important. This suggests that to the extent alignment is still preparadigmatic, we should try to solve problems recognized as important by, say, people in each of the five clusters of alignment researchers (e.g. Nate Soares, Dan Hendrycks, Paul Christiano, Jan Leike, David Bau). 

I think this gets twisted in some popular writings on LessWrong. John Wentworth writes that a researcher in a preparadigmatic field should spend lots of time explaining their approaches:

Because the field does not already have a set of shared frames - i.e. a paradigm - you will need to spend a lot of effort explaining your frames, tools, agenda, and strategy. For the field, such discussion is a necessary step to spreading ideas and eventually creating a paradigm.

I think this is misguided. A paradigm is not established by ideas diffusing between researchers with different frames until they all agree on some weighted average of the frames. A paradigm is established by research generally recognized as impressive, which proves the correctness of (some aspect of) someone's frames. So rather than trying to communicate one's frame to everyone, one should communicate with other researchers to get an accurate sense of what problems they think are important, and then work on those problems using one's own frames. (edit: of course, before this is possible one should develop one's frames to solve some problems)

Comment by Thomas Kwa (thomas-kwa) on Brain Efficiency Cannell Prize Contest Award Ceremony · 2023-08-31T17:11:29.272Z · LW · GW

What about Drexler himself?

Comment by Thomas Kwa (thomas-kwa) on Open Call for Research Assistants in Developmental Interpretability · 2023-08-30T16:28:35.528Z · LW · GW

I would apply if the rate were more like USD $50/hour, considering the high cost of living in Berkeley. As it is, this would be something like a 76% pay cut compared to the minimum ML engineer pay at Google and below typical grad student pay, and it would require an uncommon level of frugality.

Comment by Thomas Kwa (thomas-kwa) on Eliezer Yudkowsky Is Frequently, Confidently, Egregiously Wrong · 2023-08-29T16:57:23.989Z · LW · GW

This piece should be a test case for LW's rate-limiting policies. If a post gets -24 karma on 75 votes and lots of engagement, implying something like 40-44% positive reception, should its author be able to reply in the comments and keep posting? My guess is yes in this case, for the sake of promoting healthy disagreement.

Comment by Thomas Kwa (thomas-kwa) on Shortform · 2023-08-29T01:47:33.520Z · LW · GW

They value different things, but this does not mean uniformly less effort on every physical activity. More than one person from my very rationalist workplace climbs V8 boulders.

Comment by Thomas Kwa (thomas-kwa) on What AI Posts Do You Want Distilled? · 2023-08-25T17:41:08.860Z · LW · GW

Academic papers seem more valuable, as posts are often already distilled (except for things like Paul Christiano blog posts) and the x-risk space is something of an info bubble. There is a list of safety-relevant papers from ICML here, but I don't totally agree with it; two papers I think it missed are

  • HarsanyiNet, an architecture for small neural nets that basically restricts features such that you can easily calculate Shapley value contributions of inputs
  • This other paper on importance functions, which got an oral presentation.

If you want to get a sense of how to do this, first get fast at understanding papers yourself, then read Rohin Shah's old Alignment Newsletters and the technical portions of Dan Hendrycks's AI Safety Newsletters.

To get higher value technical distillations than this, you basically have to talk to people in person and add detailed critiques, which is what Lawrence did with distillations of shard theory and natural abstractions.

Edit: Also most papers are low quality or irrelevant; my (relatively uninformed) guess is that 92% of big 3 conference papers have little relevance to alignment, and of the remainder, 2/3 of posters and 1/3 of orals are too low quality to be worth distilling. So you need to have good taste.

Comment by Thomas Kwa (thomas-kwa) on Ideas for improving epistemics in AI safety outreach · 2023-08-22T16:09:59.953Z · LW · GW

Stay in touch with the broader ML community (e.g., by following them on Twitter, attending AI events)

I got a lot of value out of attending ICML and would probably recommend attending an ML conference to anyone who has the resources. You actually get to talk to authors about their field and research process, which gets you a lot more than reading papers or reading Twitter.

Anyway, I think you missed one of the best ideas: actually trying to understand the arguments yourself and only using correct ones. An argument isn't correct just because it's "grounded" or "contemporary" although it is good to have supporting evidence. The steps all have to be locally valid and you have to make valid assumptions. Sometimes an argument needs to be slightly changed from the most common version to be valid [1], but this only makes it more important.

Community builders often don't do technical research themselves, so my guess is it's easy to underinvest, but sometimes the required steps are as simple as listing out an argument in great detail and looking at it skeptically, or checking with someone who knows ML whether a current ML system has some property that we assume, and if not, whether we expect it to arise anytime soon.

[1]: two examples: making various arguments compatible with Reward is not the optimization target, and making coherence arguments work even though AI systems will not necessarily have a single fixed utility function

Comment by Thomas Kwa (thomas-kwa) on Direct Preference Optimization in One Minute · 2023-08-18T01:56:53.427Z · LW · GW

Maybe the reward models are expressive enough to capture all patterns in human preferences, but it seems nice to get rid of this assumption if we can. Scaling laws suggest that larger models perform better (in the Gao paper there is a gap between the 3B and 6B reward models), so it seems reasonable that even the current largest reward models are not optimal.

I guess it hasn't been tested whether DPO scales better than RLHF. I don't have enough experience with these techniques to have a view on whether it does.

Comment by Thomas Kwa (thomas-kwa) on Direct Preference Optimization in One Minute · 2023-08-18T00:35:18.956Z · LW · GW

DPO seems like a step towards better and more fine-grained control over models than RLHF, because it removes the possibility that the reward model underfits.
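
To make concrete why no separate reward model is involved, here is a minimal sketch of the per-example DPO objective from the DPO paper (Rafailov et al.). The function name, argument names, and the default `beta` are assumptions for illustration; the inputs are summed log-probabilities of the preferred and dispreferred completions under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy favors the chosen completion
    than the reference model does, relative to the rejected completion.
    All names and the default beta are illustrative assumptions.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy matches the reference model, the loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# When the policy shifts probability toward the chosen completion, the loss drops.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The preference data enters the policy's loss directly, so there is no intermediate reward model whose capacity could limit what is learned.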

Comment by Thomas Kwa (thomas-kwa) on Stampy's AI Safety Info - New Distillations #4 [July 2023] · 2023-08-16T22:53:08.569Z · LW · GW

The answer on ARC doesn't cover mechanistic anomaly detection, which is the technical approach to ELK-like problems they've spent the last ~year on.

Comment by Thomas Kwa (thomas-kwa) on riceissa's Shortform · 2023-08-13T08:41:31.585Z · LW · GW

Perhaps there's enough uncertainty in how durable products are that consumers' purchase decisions don't really depend on perceived durability, and so companies don't focus on it? I like durable products but can't be so confident in a product's durability before trying it unless there's a very strong brand or I know something verifiable about how the product is constructed.

Comment by Thomas Kwa (thomas-kwa) on Seeking Input to AI Safety Book for non-technical audience · 2023-08-11T00:41:18.446Z · LW · GW

I might have thoughts if I can read the draft.

Comment by Thomas Kwa (thomas-kwa) on Yann LeCun on AGI and AI Safety · 2023-08-07T05:11:19.779Z · LW · GW

How do we know this? If it is "well-established", then by whom and what is their evidence?

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2023-07-29T20:42:34.323Z · LW · GW

After working at MIRI (loosely advised by Nate Soares) for a while, I now have more nuanced views and also takes on Nate's research taste. It seems kind of annoying to write up so I probably won't do it unless prompted.

Comment by Thomas Kwa (thomas-kwa) on Underwater Torture Chambers: The Horror Of Fish Farming · 2023-07-27T19:39:00.177Z · LW · GW

I say trollish because a decision procedure like this strikes me as likely to swamp and overwhelm you with way too many different considerations pointing in all sorts of crazy directions and to be just generally unworkable so I feel like something has to be going wrong here.


I feel like this decision procedure is difficult but necessary, in that I can't think of any other decision procedure you can follow that won't cause you to pass up on enormous amounts of utility, cause you to violate lots of deontological constraints, or whatever you decide morality is made of on reflection. Surely if you actually think some consideration is 1% likely to be true, you should act on it?

Comment by Thomas Kwa (thomas-kwa) on Slowing down AI progress is an underexplored alignment strategy · 2023-07-25T01:37:30.233Z · LW · GW

There are other reasons why nukes haven't killed us yet: all the known mechanisms for destruction are too small, including nuclear winter.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2023-07-24T21:31:51.402Z · LW · GW

Edited, thanks

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2023-07-24T21:09:09.750Z · LW · GW

People say it's important to demonstrate alignment problems like goal misgeneralization. But now, OpenAI, DeepMind, and Anthropic have all had leaders sign the CAIS statement on extinction risk and are doing substantial alignment research. The gap between the 90th-percentile alignment-concerned people at labs and the MIRI worldview is now more about security mindset. Security mindset is present in cybersecurity because it is useful in the everyday, practical environment researchers work in. So perhaps a large part of the future hinges on whether security mindset becomes useful for solving problems in applied ML.

  • Will it become clear that the presence of one bug in an AI system implies that there are probably five more?
  • Will architectures that we understand with fewer moving parts be demonstrated to have better robustness than black-box systems or systems that work for complicated reasons?
  • Will it become tractable to develop these kinds of simple transparent systems so that security mindset can catch on?

Comment by Thomas Kwa (thomas-kwa) on All AGI Safety questions welcome (especially basic ones) [July 2023] · 2023-07-21T20:13:55.487Z · LW · GW

The usual analogy is that evolution (in the case of early hominids) is optimizing for genetic fitness on the savanna, but evolution did not manage to create beings who mostly intrinsically desire to optimize their genetic fitness. Instead the desires of humans are a mix of decent proxies (valuing success of your children) and proxies that have totally failed to generalize (sex drive causes people to invent condoms, desire for savory food causes people to invent Doritos, desire for exploration causes people to explore Antarctica, etc.).

The details of gradient descent are not important. All that matters for the analogy is that both gradient descent and evolution optimize for agents that score highly, and might not create agents that intrinsically desire to maximize the "intended interpretation" of the score function before they create highly capable agents. In the AI case, we might train them on human feedback or whatever safety property, and they might play the training game while having some other random drives. It's an open question whether we can make use of other biases of gradient descent to make progress on this huge and vaguely defined problem, which is called inner alignment.

Off the top of my head, some ways the biases differ; again, none of these affects the basic analogy afaik.

Only in evolution

  • Sex
  • Genetic hitchhiking
  • Junk DNA

Only in SGD

  • Continuity; neural net parameters are continuous whereas base pairs are discrete
  • Gradient descent is able to optimize all parameters at once as long as the L2 distance is small, whereas biological evolution has to increase the rate of individual mutations
  • All of these
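
A toy sketch of the "all parameters at once" difference, under heavy assumptions: the quadratic loss, the step sizes, and the accept-if-not-worse mutation rule are all made up for illustration, and the mutation step is a cartoon of evolution rather than a model of it.

```python
import random


def loss(params):
    """Hypothetical toy loss: squared distance from the origin."""
    return sum(x * x for x in params)


def gradient_step(params, lr=0.1):
    """One gradient descent step: every coordinate moves at once."""
    grads = [2 * x for x in params]  # gradient of the toy loss
    return [x - lr * g for x, g in zip(params, grads)]


def mutation_step(params, rng, scale=0.1):
    """Mutate one randomly chosen coordinate; keep it only if the loss does not get worse."""
    i = rng.randrange(len(params))
    candidate = list(params)
    candidate[i] += rng.gauss(0, scale)
    return candidate if loss(candidate) <= loss(params) else params


params = [1.0] * 10
after_gd = gradient_step(params)
after_mutation = mutation_step(params, random.Random(0))
print(sum(a != b for a, b in zip(params, after_gd)))        # coordinates changed by one gradient step
print(sum(a != b for a, b in zip(params, after_mutation)))  # coordinates changed by one mutation step
```
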
Comment by Thomas Kwa (thomas-kwa) on mental number lines · 2023-07-20T05:37:33.476Z · LW · GW

Someone who didn't have a sense for the magnitude of large numbers would just place them based on the number of digits, which would be roughly logarithmic. Does the paper give any evidence that second graders make logarithmic placements beyond this, say within numbers with the same number of digits?
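
To make the digit-count point concrete, here is a small sketch (the function name is hypothetical). Placing a number purely by how many digits it has reproduces floor(log10(n)) + 1, so it looks logarithmic across magnitudes but assigns the same position to every number with the same digit count, which is exactly where a genuinely logarithmic placement would differ.

```python
import math


def digit_count_position(n: int) -> int:
    """Position on the line if a positive integer is placed by digit count alone."""
    return len(str(n))


for n in [5, 50, 500, 5000, 200, 900]:
    print(n, digit_count_position(n), round(math.log10(n), 2))
```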

Comment by Thomas Kwa (thomas-kwa) on Direct effects matter! · 2023-07-18T17:41:03.462Z · LW · GW

Saw an example of this on Hacker News today. California and Cambridge, MA removed algebra from their middle school math curriculum for a stated reason of equity. Here's the Boston Globe challenging the equity argument with another equity argument:

Udengaard is one of dozens of parents who recently have publicly voiced frustration with a years-old decision made by Cambridge to remove advanced math classes in grades six to eight. [...]

“The students who are able to jump into a higher level math class are students from better-resourced backgrounds,” said Jacob Barandes, another district parent and a Harvard physicist. “They’re shortchanging a significant number of students, overwhelmingly students from less-resourced backgrounds, which is deeply inequitable.”

In contrast the blogger Noahpinion points out the direct effect, but then also ties it to equity. Later he (unfairly) accuses the proponents of removing math of hereditarian, un-progressive views on education.

When you ban or discourage the teaching of a subject like algebra in junior high schools, what you are doing is withdrawing state resources from public education. There is a thing you could be teaching kids how to do, but instead you are refusing to teach it. In what way is refusing to use state resources to teach children an important skill “progressive”? How would this further the goal of equity?

Why does Noahpinion mention equity even after bringing up the direct effect? I think it's because equity is a frame that everyone (in the debate among progressives) cares about, so your argument is guaranteed to at least land. Some people care more about equity than the direct effect of teaching advanced students more math. This is basically explanation 3 (domains of respectability), but it's less about status and more about being universally important vs maybe not important to the opponents of your argument.

Comment by Thomas Kwa (thomas-kwa) on Thomas Kwa's Shortform · 2023-07-18T01:20:23.803Z · LW · GW

I looked at Tetlock's Existential Risk Persuasion Tournament results, and noticed some oddities. The headline result is of course "median superforecaster gave a 0.38% risk of extinction due to AI by 2100, while the median AI domain expert gave a 3.9% risk of extinction." But all the forecasters seem to have huge disagreements from my worldview on a few questions:

  • They divided forecasters into "AI-Concerned" and "AI-Skeptic" clusters. The latter gave 0.0001% for AI catastrophic risk before 2030, and even lower than this (shows as 0%) for AI extinction risk. This is incredibly low, and I don't think you can have probabilities this low without a really good reference class.
  • Both the AI-Concerned and AI-Skeptic clusters gave low probabilities for space colony before 2030, 0.01% and "0%" medians respectively.
  • Both groups gave numbers I would disagree with for the estimated year of extinction: year 3500 for AI-Concerned, and 28000 for AI-Skeptic. Page 339 suggests that none of the 585 survey participants gave a number above 5 million years, whereas it seems plausible to me and probably many EA/LW people on the "finite time of perils" thesis that humanity survives for 10^12 years or more, likely giving an expected value well over 10^10. The justification given for the low forecasts even among people who believed the "time of perils" arguments seems to be that, conditional on surviving for millions of years, humanity will probably become digital, but even a 1% chance of the biological human population remaining above the "extinction" threshold of 5,000 still gives an expected value in the billions.
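
The arithmetic behind the last bullet can be sketched as follows. This is a lower bound that counts only the long-survival branch, and it treats the 1% figure as an overall probability of surviving the full 10^12 years, which is a simplifying assumption; the variable and function names are hypothetical.

```python
import math

p_long_survival = 0.01     # the 1% figure, read here as an overall probability (assumption)
years_if_survive = 1e12    # the 10^12-year horizon mentioned above
max_survey_answer = 5e6    # no participant reportedly answered above 5 million years


def expected_years_lower_bound(p, years):
    """Lower bound on expected years until extinction: ignore every other branch."""
    return p * years


ev = expected_years_lower_bound(p_long_survival, years_if_survive)
print(f"lower bound on expected years: {ev:.1e}")
print(f"ratio to the largest survey answer: {ev / max_survey_answer:.0f}")
```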

I am not a forecaster and would probably be soundly beaten in any real forecasting tournament, but perhaps there is a bias against outlandish-seeming forecasts, strongest in this last question, that also affects the headline results.

Comment by Thomas Kwa (thomas-kwa) on Build knowledge base first, or backchain? · 2023-07-17T09:36:21.732Z · LW · GW

My best guess after ~1 year upskilling is to learn interpretability (which is quickly becoming paradigmatic) and read ML papers that are as close to the area you plan to work in as possible; if it's conceptual work also learn how to do research first, e.g. do half a PhD or work with someone experienced, although I am pessimistic about most independent conceptual work. Learn all the math that seems like obvious prerequisites (for me this was linear algebra, basic probability, and algorithms, which I mostly already had). Then learn everything else you feel like you're missing lazily, as it comes up. Get to the point where you have a research loop of posing and solving small problems, so that you have some kind of feedback loop. 

The other possible approach, something like going through John Wentworth's study guide, seems like too much background before contact with the problem. One reason is that conceptual alignment research contains difficult steps even if you have the mathematical background. Suppose you want to design a way to reliably get a desirable cognitive property into your agent. How do you decide what cognitive properties you intuitively want the agent to have, before formalizing them? This is philosophy. How do you formalize the property of conservatism or decide whether to throw it out? More philosophy. Learning Jaynes and dynamical systems and economics from textbooks might give you ideas, but you don't get either general research experience or contact with the problem, and your theorems are likely to be useless. In a paradigmatic field, you learn exactly the fields of math that have been proven useful. In a preparadigmatic field, it might be necessary to learn all fields that might be useful, but it seems a bit perverse to do this before looking at subproblems other people have isolated, making some progress, and greatly narrowing down the areas you might need to study.

Comment by Thomas Kwa (thomas-kwa) on Why was the AI Alignment community so unprepared for this moment? · 2023-07-15T07:10:00.316Z · LW · GW

I have some sympathy for this sentiment, but I want to point out that the alignment community was tiny until last year and still is small, so many opportunities that are becoming viable now were well below the bar earlier. If you were Rohin Shah, could you do better than finishing your PhD, publishing the Alignment Newsletter starting 2018 and then joining DeepMind in 2020? If you were Rob Miles, could you do better than the YouTube channel starting 2017? As Jason Matheny, could you do better than co-chairing the task force that wrote the US National AI Strategic Plan in 2016, then founding CSET in 2018? As Kelsey Piper, could you do better than writing for Vox starting 2018? Or should we have diverted some purely technical researchers, who are mostly computer science nerds with no particular talent at talking to people, to come up with a media outreach plan?

Keep in mind that it's not clear in advance what your policy asks are or what arguments you need to counter, so your plan would go stale every couple of years, and that for the last 15 years the economy, health care, etc. have been orders of magnitude more salient to the average person than AGI risk, with no signs this would change as suddenly as it did with ChatGPT.

Comment by Thomas Kwa (thomas-kwa) on Wildfire of strategicness · 2023-07-04T13:04:25.131Z · LW · GW

It seems like there's some intuition underlying this post for why the wildfire spark of strategicness is possible, but there is no mechanism given. What is this mechanism, and in what toy cases do you see a wildfire of strategicness? My guess is something like

  • Suppose one part of your system contains a map from desired end-states to actions required to achieve those ends, another part has actuators, and a third part starts acting strategically. Then the third part needs only to hook together the other two parts with its goals to become an actualizing agent.

This doesn't really feel like a wildfire though, so I'm curious if you have something different in mind.

Comment by Thomas Kwa (thomas-kwa) on Why Not Subagents? · 2023-06-23T09:23:45.607Z · LW · GW

I commented on the original post last year regarding the economics angle:

Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent utility function in general. A bounty also hasn't found any such papers.

Based on this lit review and the Wikipedia page and ChatGPT [1], I'm 90% sure that "representative agent" in economics means the idea that the market's aggregate preferences are similar to the typical individual's preferences, and the general question of whether a market has any complete preference ordering does not fall within the scope of the term.

[1] GPT4 says "The representative agent is assumed to behave in a way that represents the average or typical behavior of the group in the aggregate. In macroeconomics, for instance, a representative agent might be used to describe the behavior of all households in an economy.

This modeling approach is used to reduce the complexity of economic models, making them more tractable, but it has also received criticism."

Comment by Thomas Kwa (thomas-kwa) on Brain Efficiency: Much More than You Wanted to Know · 2023-06-22T17:36:05.642Z · LW · GW

The link is for cat6e cable, not coax. Also, the capacitance goes down to zero as r -> R in the coaxial cable model, and the capacitance appears to increase logarithmically with wire radius for single wire or two parallel wires, with the logarithmic decrease being in distance between wires.

Comment by Thomas Kwa (thomas-kwa) on "Corrigibility at some small length" by dath ilan · 2023-06-22T10:21:28.522Z · LW · GW

This was previously posted (though not to AF) here:

Comment by Thomas Kwa (thomas-kwa) on Co-found an incubator for independent AI Safety researchers (rolling applications) · 2023-06-05T11:22:42.266Z · LW · GW

I didn't hit disagree, but IMO there are way more than "few research directions" that can be accessed without cutting-edge models, especially with all the new open-source LLMs.

  • All conceptual work: agent foundations, mechanistic anomaly detection, etc.
  • Mechanistic interpretability, which when interpreted broadly could be 40% of empirical alignment work
  • Model control like the nascent area of activation additions

I've heard that evals, debate, prosaic work into honesty, and various other schemes need cutting-edge models, but in the past few weeks transitioning from mostly conceptual work into empirical work, I have far more questions than I have time to answer using GPT-2 or AlphaStar sized models. If alignment is hard we'll want to understand the small models first.

Comment by Thomas Kwa (thomas-kwa) on A mind needn't be curious to reap the benefits of curiosity · 2023-06-02T20:48:24.095Z · LW · GW

Proposed exercise: write 5 other ways the AI could manage to robustly survive?

Comment by Thomas Kwa (thomas-kwa) on The case for turning glowfic into Sequences · 2023-06-02T20:20:54.975Z · LW · GW

The bounty remains open, but I'm no longer excited about this for three reasons:

  • lack of evidence for glowfic being an important positive influence on rationality
  • Eliezer is speaking in the public sphere (some would argue too much)
  • general increasing quality and decreasing weirdness of alignment research

Comment by Thomas Kwa (thomas-kwa) on Book Review: How Minds Change · 2023-06-02T14:44:14.216Z · LW · GW

Thanks, I agree. I would still make the weaker claim that more than half the people in alignment are very unlikely to change their career prioritization from Street Epistemology-style conversations, and that in general the person with more information / prior exposure to the arguments will be less likely to change their mind.

Comment by Thomas Kwa (thomas-kwa) on Think carefully before calling RL policies "agents" · 2023-06-02T13:52:29.237Z · LW · GW

How do you think "agent" should be defined?

Comment by Thomas Kwa (thomas-kwa) on Book Review: How Minds Change · 2023-05-28T21:17:16.388Z · LW · GW

It's not just his fiction. Recently he went on what he thought was a low-stakes crypto podcast and was surprised that the hosts wanted to actually hear him out when he said we were all going to die soon:

I don't think we can take this as evidence that Yudkowsky or the average rationalist "underestimates more average people". On the Bankless podcast, Eliezer was not trying to explore the beliefs of the podcast hosts, just explaining his views. And there have been attempts at outreach before. If Bankless was evidence towards "the world at large is interested in Eliezer's ideas and takes them seriously", then The Alignment Problem, Human Compatible, and the rejection of FDT from academic decision theory journals are stronger evidence against. It seems to me that the lesson we should gather is that alignment's time in the public consciousness has come sometime in the last ~6 months.

I'm also not sure the techniques are asymmetric.

  • Have people with false beliefs tried e.g. Street Epistemology and found it to fail?
  • I think few of us in the alignment community are actually in a position to change our minds about whether alignment is worth working on. With a p(doom) of ~35% I think it's unlikely that arguments alone push me below the ~5% threshold where working on AI misuse, biosecurity, etc. becomes competitive with alignment. And there are people with p(doom) of >85%.

That said, it seems likely that rationalists should be incredibly embarrassed for not realizing the potential asymmetric weapons in things like Street Epistemology. I'd make a Manifold market for it, but I can't think of a good operationalization.

Comment by Thomas Kwa (thomas-kwa) on When is Goodhart catastrophic? · 2023-05-23T00:34:10.252Z · LW · GW
Comment by Thomas Kwa (thomas-kwa) on Catastrophic Regressional Goodhart: Appendix · 2023-05-20T20:22:46.925Z · LW · GW

Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:

Comment by Thomas Kwa (thomas-kwa) on Godzilla Strategies · 2023-05-18T23:04:44.060Z · LW · GW

Downvoted; this is very far from a well-structured argument, and doesn't give me intuitions I can trust either.

Comment by Thomas Kwa (thomas-kwa) on $500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions · 2023-05-17T05:59:51.970Z · LW · GW

I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.

edit: The proof is easy. Let $f_1, f_2$ be two such indistinguishable functions that you place positive probability on, F be a random variable for the function, and F' be F but with all probability mass for $f_2$ replaced by $f_1$. Then […]. But this means […] and so […]. You don't lose any channel capacity switching to F'.

Comment by Thomas Kwa (thomas-kwa) on Steering GPT-2-XL by adding an activation vector · 2023-05-15T18:08:58.834Z · LW · GW

  • Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending from self-deception advantages weight methods, but these seem uncommon.
  • I thought briefly about the Ilharco et al paper and am very impressed by it as well.
  • Thanks for linking to the resources.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

Comment by Thomas Kwa (thomas-kwa) on Steering GPT-2-XL by adding an activation vector · 2023-05-15T01:20:47.752Z · LW · GW

I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want e.g. GPT-4 can answer questions with only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far OOD. No other work I've seen quite matches the promise this post has in finding ways to exert fine-grained control over a system's internals; we now have a wide variety of concrete questions like

  • how to find steering vectors for new behaviors e.g. speaking French?
  • how to make these techniques more robust?
  • What do steering vectors, especially multiple steering vectors, tell us about how the model combines concepts?
  • Can we decompose the effect of a prompt into steering vectors from simpler prompts, thereby understanding why complex prompts work?
  • Are the effects of steering vectors nonlinear for small coefficients? What does this mean about superposition?
  • What's the mechanism by which adding a steering vector with too large a coefficient breaks the model?
  • Adding steering vectors at different layers surely means you are intervening at different "stages of processing". What do the model's internal concepts look like at different stages?
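For concreteness, here is a minimal toy sketch of what "adding a steering vector" means. It is not the post's code: the two-layer network, its weights, and the names `hidden`, `forward`, and `steering_vector` are all hypothetical stand-ins, and a real experiment would add the vector to a transformer's residual stream at a chosen layer rather than to a tanh layer. The steering vector is taken as the difference between the hidden activations of two contrasting inputs, then added with a coefficient during a forward pass on another input.

```python
import math

# Hypothetical toy weights standing in for a real model.
W1 = [[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5]]  # 3 hidden units, 2 inputs
W2 = [[1.0, -0.5, 0.25], [0.0, 2.0, -1.0]]     # 2 outputs, 3 hidden units


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def hidden(x):
    """Activations at the layer where we intervene."""
    return [math.tanh(z) for z in matvec(W1, x)]


def forward(x, steer=None, coeff=0.0):
    """Forward pass, optionally adding coeff * steer to the hidden activations."""
    h = hidden(x)
    if steer is not None:
        h = [hi + coeff * si for hi, si in zip(h, steer)]
    return matvec(W2, h)


def steering_vector(x_pos, x_neg):
    """Difference of hidden activations between two contrasting inputs."""
    return [p - n for p, n in zip(hidden(x_pos), hidden(x_neg))]


x_pos, x_neg = [1.0, 0.0], [0.0, 1.0]
steer = steering_vector(x_pos, x_neg)

baseline = forward(x_neg)                    # unsteered output
steered = forward(x_neg, steer, coeff=1.0)   # output after the intervention
```

In this toy, steering `x_neg` with coefficient 1 reproduces the hidden state of `x_pos` exactly, so the output matches `forward(x_pos)`. Nothing this clean should be expected in a real model, where the vector is added on unrelated prompts; that gap is what the questions above about coefficients, layers, and robustness are asking about.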

Comparing this to other work, my sense is that

  • intervening on activations is better than training (including RLHF), because this builds towards understanding systems rather than steering a black box with a black-box reward model, and for the reasons the authors claim.
  • Debate, although important, seems less likely to be a counterfactual, robust way to steer models. The original debate agenda ran into serious problems, and neither it nor the current Bowman agenda tells us much about the internals of models.
  • steering a model with activation vectors is better than mechinterp (e.g. the IOI paper), because here you've proven you can make the AI do a wide variety of interesting things, plus mechinterp is slow.
  • I'm not up to date on the adversarial training literature (maybe academia has produced something more impressive), but I think this is more valuable than the Redwood paper, which didn't have a clearly positive result. I'm glad people are working on adversarial robustness.
  • steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness).

It's a judgement call whether this makes it the most impressive achievement, but I think this post is pretty clearly Pareto-optimal in a very promising direction. That said, I have a couple of reservations:

  • By "most impressive concrete achievement" I don't necessarily mean the largest single advance over SOTA. There have probably been bigger advances in the past (RLHF is a candidate), and the impact of ELK is currently unproven but will shoot to the top if mechanistic anomaly detection ever pans out.
  • I don't think we live in a world where you can just add a "be nice" vector to a nanotech-capable system and expect better consequences, again for deep deceptiveness-ish reasons. Therefore, we need advances in theory to convert our ability to make systems do things into true mastery of cognition.
  • I don't think we should call this "algebraic value editing" because it seems overly pretentious to say we're editing the model's values. We don't even know what values are! I don't think RLHF is editing values, in the sense that it does something different from even the weak version of instilling desires to create diamonds, and this seems even less connected to values. The only connection is that it's modifying something contextually activated, which is way too broad.
  • It's unclear that this works in a wide range of situations, or in the situations we need it to for future alignment techniques. The authors claim that cherry-picking was limited, but there are other uncertainties: when we need debaters that don't collude to mislead the judge, will we be able to use activation patching? What if we need an AI that doesn't self-modify to remove some alignment property?

Comment by Thomas Kwa (thomas-kwa) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T03:22:08.234Z · LW · GW

This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.

Edit: I explain this view in a reply.

Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions.

Comment by Thomas Kwa (thomas-kwa) on How should one feel morally about using chatbots? · 2023-05-11T01:31:10.874Z · LW · GW

Using chatbots and feeling ok about it seems like a no-brainer. It's technology that provides me a multiple percentage point productivity boost, it's used by over a billion people, and a boycott of chatbots is well outside the optimal or feasible space of actions to help the world.

I think the restaurant analogy fails because ChatGPT was not developed in malice, just recklessness. For the open source models, there's not even an element of greed.