RyanCarey's Shortform 2021-01-24T11:32:34.393Z
New paper: The Incentives that Shape Behaviour 2020-01-23T19:07:36.972Z
What are some good examples of incorrigibility? 2019-04-28T00:22:45.459Z
When to use quantilization 2019-02-05T17:17:12.169Z
Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence. 2018-10-21T12:03:12.064Z
USA v Progressive 1979 excerpt 2017-11-27T17:32:10.425Z
A combined analysis of genetically correlated traits identifies 107 loci associated with intelligence | bioRxiv 2017-07-18T06:30:14.977Z
Equilibria in adversarial supervised learning 2017-05-03T08:14:45.000Z
Call for Special Issue on Superintelligence - Informatica 2017-05-03T05:08:02.159Z
Online Learning 3: Adversarial bandit learning with catastrophes 2016-11-14T22:58:01.000Z
Online Learning 2: Bandit learning with catastrophes 2016-10-29T16:53:28.000Z
Online Learning 1: Bias-detecting online learners 2016-10-29T16:45:04.000Z
Improving long-run civilisational robustness 2016-05-10T11:15:55.694Z
Reducing Catastrophic Risks, A Practical Introduction 2015-09-09T22:39:40.212Z
The Effective Altruism Handbook 2015-04-24T00:30:08.040Z
One week left for CSER researcher applications 2015-04-17T00:40:00.901Z
SETI-related: fast radio bursts 2015-04-06T10:23:01.688Z
GCRI: Updated Strategy and AMA on EA Forum next Tuesday 2015-02-23T12:35:13.450Z
Certificates of Impact [Paul Christiano; Link] 2014-11-11T16:58:13.337Z
Recovery Manual for Civilization 2014-10-31T09:36:20.755Z
Announcing The Effective Altruism Forum 2014-08-24T08:07:31.980Z
Testing my cognition 2014-02-19T22:30:50.295Z
Mind Hacks 2014-02-03T19:32:22.512Z
Inferential credit history 2013-07-24T14:12:52.190Z
Meetup : Melbourne, social meetup: 31 May 2013 07:00PM 2013-05-27T11:10:23.014Z
Collating widely available time/money trades 2012-11-19T22:57:20.125Z
Online Meetup: The High Impact Network 2012-11-19T02:55:49.308Z


Comment by RyanCarey on The Apprentice Thread · 2021-07-05T17:58:08.991Z · LW · GW

Have you considered just doing some BTC/BTC-PERP arbitrage, or betting on politics and sports? You'd probably learn what skills they're looking for, gain some of them, and make money while you're at it...

Comment by RyanCarey on AMA: Paul Christiano, alignment researcher · 2021-05-14T11:02:03.368Z · LW · GW

Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order:

3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment. 

-> Agreed. We're starting to study "side-effect incentives" (improved name pending), which have this property. We're still figuring out whether we should just care about the union of SE incentives and control incentives, or whether, or when, SE incentives should be considered less dangerous. Whether the causal style of incentive analysis captures much of what we care about will, I think, be borne out by applying it and alternatives to a range of safety problems.

2) Sometimes we need more specific quantities than just "D affects A".

-> Agreed. We've privately discussed directional quantities like "do(D=d) causes A=a" as being more safety-relevant, and are happy to hear other ideas.

1) eliminating all control-incentives seems unrealistic

-> Strongly agree it's infeasible to remove CIs on all variables. My more modest goal would be that, for particular variables (or classes of variables) such as a shutdown button or a human's values, we can either: 1) prove how to remove control (+ side-effect) incentives, or 2) prove why this is impossible, given realistic assumptions. If (2), then that theoretical case could justify allocating resources to learning-oriented approaches.

Overall, I concede that we haven't engaged much with safety issues in the last year. Partly, that's because the projects have had to fit within people's PhDs, which will also be true this year. But with some of the framework work behind us, we should be able to study safety more, and gain a sense of how addressable concerns like these are, and of the extent to which causal decision problems/games are a really useful ontology for AI safety.

Comment by RyanCarey on MIRI location optimization (and related topics) discussion · 2021-05-12T14:40:32.297Z · LW · GW

I think moving to the country could possibly be justified despite harms to recruitment and the rationality community, but in the official MIRI explanations, the downsides are quite underdiscussed.

Comment by RyanCarey on What is going on in the world? · 2021-01-28T06:30:02.301Z · LW · GW

This list is pretty relevant too:

Comment by RyanCarey on What is going on in the world? · 2021-01-26T10:17:29.367Z · LW · GW

Interesting that about half of these "narratives" or "worldviews" are suffixed with "-ism": Malthusianism, Marxism, Georgism, effective altruism, transhumanism. But most of the (newer and less popular) rationalist narratives haven't yet been so named. This would be one heuristic for finding other worldviews.

More generally, if you want people to know and contrast a lot of these worldviews, it'd be useful to name them all in 1-2 words each.

Comment by RyanCarey on RyanCarey's Shortform · 2021-01-24T11:32:34.922Z · LW · GW

Causal prediction markets.

Prediction markets (and prediction tournaments more generally) may be useful for telling us not only what will happen, but which actions will achieve our goals. One proposal for getting prediction markets to help with this is to get users to make conditional predictions. For example, we can ask the question "if Biden wins the election, GDP will be higher than if Trump wins" and use that as evidence about who to elect, and so on. But conditional predictions only predict the effect of an action if the event (e.g. who is elected) is unconfounded with the outcome (GDP). It may be that higher GDP and Biden being elected have a common cause, even if electing Biden does not increase GDP directly. One way to address this would be to have the market only pay out if Biden barely wins, or Trump barely wins, so that the confounders can be assumed to be in a similar state. Another strategy for identifying the causal effect would be to randomise. We can't randomise the election result, but we can randomise other quantities. For instance, "we generate a number from 1-100, and audit company X if we generate 1. If we generate the number 1, how much tax evasion will we find?". In general, in order to design action-guiding prediction markets, it may be important to draw on identification strategies from the causal inference literature.
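
To make the confounding worry concrete, here is a minimal simulation (all variable names and numbers are my own invention, not from any real model): the conditional prediction overstates the election's effect on GDP, while randomising the "winner" recovers the true (null) causal effect.

```python
import random

random.seed(0)

# Toy model: a hidden confounder ("economy") raises both the chance
# that Biden wins and later GDP, while the election itself has no
# causal effect on GDP.

def observe():
    economy = random.random() < 0.5
    biden = random.random() < (0.8 if economy else 0.3)
    gdp = 2.0 if economy else 1.0  # GDP depends only on the economy
    return biden, gdp

def randomized():
    economy = random.random() < 0.5
    biden = random.random() < 0.5  # winner randomised: confounding removed
    gdp = 2.0 if economy else 1.0
    return biden, gdp

def mean_gdp_given_biden(draw, n=100_000):
    draws = [draw() for _ in range(n)]
    wins = [g for b, g in draws if b]
    return sum(wins) / len(wins)

conditional = mean_gdp_given_biden(observe)        # ~1.73: inflated by confounding
interventional = mean_gdp_given_biden(randomized)  # ~1.50: the true (null) effect
```

The gap between the two numbers is exactly what a naive conditional market would mislead us about, and what an identification strategy (regression discontinuity at "barely wins", or explicit randomisation) is meant to close.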

I haven't yet checked for existing literature on this topic. Does anyone know of any?

Comment by RyanCarey on A vastly faster vaccine rollout · 2021-01-14T21:41:06.030Z · LW · GW

A response from @politicalmath, based on Smallpox: The Death of a Disease by  DA Henderson:

1) There were no phases, it was just "show up, get poked"
2) There were plenty of vaccines to go around. Countries typically had millions of smallpox vaccine doses ready to go just in case
3) with no lockdowns, they could go to schools / churches / offices & line people up
4) the smallpox vaccine was incredibly heat-stable. There were batches that were still efficacious after being stored for a year at 113F
5) the public health infrastructure had a lot of practice with mass vaccinations (they did them all the time in other countries)
And it did cause a city-wide panic. Smallpox has like a 20% fatality rate so people were pretty motivated to get the vaccine. All this is from this book

Comment by RyanCarey on A vastly faster vaccine rollout · 2021-01-14T19:43:47.623Z · LW · GW

New York has apparently distributed 35% of the vaccine that it has. Maybe they are focusing on other bottlenecks? Though my naive guess would be that the main problems are that the staff at US agencies are more numerous, less competent, and more regulated, as part of the aging process of any bureaucracy, compounded by the declining prestige of government jobs.

Comment by RyanCarey on The Case for a Journal of AI Alignment · 2021-01-11T02:01:52.785Z · LW · GW

One alternative would be to try to raise funds (e.g. perhaps from the EA LTF fund) to pay reviewers to perform reviews.

Comment by RyanCarey on The Case for a Journal of AI Alignment · 2021-01-11T01:58:16.174Z · LW · GW

I don't (and perhaps shouldn't) have a guaranteed trigger - probably I will learn a lot more about what the trigger should be over the next couple years. But my current picture would be that the following are mostly true:

  • The AIS field is publishing 3-10x more papers per year than the causal inference field is publishing now.
  • We have ~3 highly aligned tenured professors at top-10 schools, and ~3 mostly-aligned tenured professors with ~10k citations, who want to be editors of the journal
  • The number of great papers that can't get into other top AI journals is >20 per year. I figure it's currently like ~2.
  • The chance that some other group creates a similar (worse) journal for safety in the subsequent 3 years is >20%.

Comment by RyanCarey on The Case for a Journal of AI Alignment · 2021-01-10T16:46:26.871Z · LW · GW

This idea has been discussed before. It's an important one, though, so I don't think it's a bad thing for us to bring it up again. My perspective, now as previously, is that this would be fairly bad at the moment, but might be good in a couple of years' time.

My background understanding is that the purpose of a conference or journal in this case (and in general) is primarily to certify the quality of some work (and to a lesser extent, the field of inquiry). This in turn helps with growing the AIS field, and the careers of AIS researchers.

This is only effective if the conference or journal is sufficiently prestigious. Presently, publishing AI safety papers in NeurIPS, AAAI, JMLR, or JAIR serves to certify the validity of the work and boosts the field of AI safety, whereas publishing in (for example) Futures or AGI doesn't. If you create a new publication venue, by default, its prestige would be comparable to, or less than, that of Futures or AGI, and so it wouldn't really serve the role of a journal.

Currently, the flow of AIS papers into the likes of NeurIPS and AAAI (and probably soon JMLR and JAIR) is rapidly improving. New keywords have been created at several conferences, along the lines of "AI safety and trustworthiness" (I forget the exact wording), so that you can nowadays expect, on average, to receive reviewers who average out to neutral, or even vaguely sympathetic to AIS research. Ten or so papers were published in such venues in the last year, and all these authors will become reviewers under that keyword when the conference comes around next year. Yes, things like "Logical Inductors" or "AI safety via debate" are very hard to publish. There's some pressure to write research that's more "normie". All of that sucks, but it's an acceptable cost for being in a high-prestige field. And overall, things are getting easier, fairly quickly.

If you create a journal with too little prestige, you can generate blowback. For example, there was some criticism on Twitter of Pearl's "Journal of Causal Inference", even though his field is somewhat more advanced than ours.

In 1.5-3 years' time, I think the risk-benefit calculus will probably change. The growth of AIS work (which has been fast) may outpace the virtuous cycle that's currently happening with AI conferences and journals, such that a lot of great papers are getting rejected. There could be enough tenure-track professors at top schools to make the journal decently high-status (more so than Futures and AGI). We might even be nearing the point where some unilateral actor will go and make a worse journal if we don't make one. I'd say when a couple of those things are true, that's when we should pull the trigger and make this kind of conference/journal.

Comment by RyanCarey on Covid 12/24: We’re F***ed, It’s Over · 2020-12-24T22:13:18.646Z · LW · GW

Yeah, I was thinking something similar. It seems the bottom line is we'll have to stay at home and receive deliveries for most of the next 4-8 months, while vaccines and infections bring the world toward herd immunity. So as individuals, we should make sure we're suitably located and supplied for that scenario.

Comment by RyanCarey on ESRogs's Shortform · 2020-11-29T22:58:45.404Z · LW · GW

Good year for this portfolio. Any new tips? :P

Comment by RyanCarey on The rationalist community's location problem · 2020-11-10T22:40:28.596Z · LW · GW

Follow-on post from mingyuan: Location Discussion Takeaways.

Comment by RyanCarey on Bet On Biden · 2020-10-28T11:52:14.068Z · LW · GW

There are three arguments: (1) polls underestimating Dems in Southern states, (2) benchmarking against the 2018 Senate race, and (3) some low-quality tweets.

It's weird to hold a lot of stock in (2), given noise from candidate selection and other variables.

If you place a lot of weight on (1), the actually sane bet would be Biden in AZ. It's rated the 2nd and 4th most likely state to go Dem by Cohn and Wasserman respectively.

Biden for AZ: 77% likely (Economist), priced at 54% on Election Betting Odds.

Comment by RyanCarey on Bet On Biden · 2020-10-27T08:47:06.392Z · LW · GW

The Texas bet (TX) seems EV neutral to me, and clearly far worse than the nationwide electoral college (EC) bet.

Biden for EC: 95% likely (The Economist model), priced at 62%

Biden for TX: 26% likely (The Economist), priced at 29%
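
A quick expected-return check, using the probabilities and prices quoted above and ignoring fees and the time value of money (a rough sketch, not betting advice):

```python
# Expected return of buying a binary contract at `price` when the
# true probability of it resolving yes (paying out 1) is `true_prob`.
def expected_return(true_prob, price):
    return true_prob / price - 1

ec = expected_return(0.95, 0.62)  # Biden for the Electoral College: ~ +53%
tx = expected_return(0.26, 0.29)  # Biden for Texas: ~ -10%
```

On these numbers the EC bet has a large positive edge while the TX bet is slightly negative, consistent with the comparison above.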

The two Twitter feeds are full of a lot of shitposting, and don't update me much.

Comment by RyanCarey on The bads of ads · 2020-10-23T19:41:23.997Z · LW · GW

It bears noting that ads can do good - they can spread important messages. They can encourage people to make purchases that they actually benefit from. And they can help especially with launching new projects that people aren't yet aware of. 

So ideally the advertiser would pay a price for inflicting these negatives, so that we would get the benefits with fewer of the costs.

Comment by RyanCarey on The rationalist community's location problem · 2020-10-12T10:06:51.020Z · LW · GW

The same is basically true for any niche interest - it will only be catered for where there's an adequate population to justify it. In my case, it's a particular kind of jazz music.

Probably a lot of people have different niche interests like that, even if they can't agree on one.

Comment by RyanCarey on Why Boston? · 2020-10-11T18:31:11.547Z · LW · GW

He is even more effusive in his essay "cities and ambition" (which incidentally is quite relevant for figuring where rationalists should want to live):

Great cities attract ambitious people. You can sense it when you walk around one. In a hundred subtle ways, the city sends you a message: you could do more; you should try harder. The surprising thing is how different these messages can be. New York tells you, above all: you should make more money. There are other messages too, of course. You should be hipper. You should be better looking. But the clearest message is that you should be richer. What I like about Boston (or rather Cambridge) is that the message there is: you should be smarter. You really should get around to reading all those books you've been meaning to.


As of this writing, Cambridge seems to be the intellectual capital of the world. I realize that seems a preposterous claim. What makes it true is that it's more preposterous to claim about anywhere else. American universities currently seem to be the best, judging from the flow of ambitious students. And what US city has a stronger claim? New York? A fair number of smart people, but diluted by a much larger number of neanderthals in suits. The Bay Area has a lot of smart people too, but again, diluted; there are two great universities, but they're far apart. Harvard and MIT are practically adjacent by West Coast standards, and they're surrounded by about 20 other colleges and universities. [1] Cambridge as a result feels like a town whose main industry is ideas, while New York's is finance and Silicon Valley's is startups.


When I moved to New York, I was very excited at first. It's an exciting place. So it took me quite a while to realize I just wasn't like the people there. I kept searching for the Cambridge of New York. It turned out it was way, way uptown: an hour uptown by air.

Comment by RyanCarey on The Rise and Fall of American Growth: A summary · 2020-10-06T15:41:08.547Z · LW · GW

Given that the policies are never going to be reverted, maybe better questions would be: which of the policies were the ones that mattered, are any of them politically feasible, and if none of them are feasible in the US, then where?

Comment by RyanCarey on Open & Welcome Thread - September 2020 · 2020-09-15T10:34:20.795Z · LW · GW

I guess it's because high-conviction ideologies outperform low-conviction ones, including nationalistic and political ideologies, and religions. Dennett's Gold Army/Silver Army analogy explains how conviction can build loyalty and strength, and a similar thing is probably true for movement-builders. Also, conviction might make adherents feel better, and therefore simply be more attractive.

Comment by RyanCarey on Comparing reward learning/reward tampering formalisms · 2020-06-10T16:54:51.759Z · LW · GW

It would be nice to draw out this distinction in more detail. One guess:

  • Uninfluencability seems similar to requiring zero individual treatment effect of D on R.
  • Riggability (from the paper) would then correspond to zero average treatment effect of D on R.

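
A toy potential-outcomes illustration of the distinction (the setup is hypothetical, not from the paper): D flips every unit's reward, so every individual treatment effect is nonzero, yet they cancel on average.

```python
# Reward R for each of four units, under D=0 and under D=1.
r_if_d0 = [1.0, 0.0, 1.0, 0.0]
r_if_d1 = [0.0, 1.0, 0.0, 1.0]

ite = [r1 - r0 for r0, r1 in zip(r_if_d0, r_if_d1)]  # [-1, 1, -1, 1]
ate = sum(ite) / len(ite)                            # 0.0

# Zero average treatment effect coexists with nonzero individual
# treatment effects: D still influences every unit's R.
```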
Comment by RyanCarey on New paper: The Incentives that Shape Behaviour · 2020-01-24T10:59:23.547Z · LW · GW


Comment by RyanCarey on Subscripting Typographic Convention For Citations/Dates/Sources/Evidentials: A Proposal · 2020-01-10T01:18:41.651Z · LW · GW

This is a cool idea. However, are you actually using the subscript in two confusingly different ways? In I_2020, it seems you're talking about yourself, indexed to the year 2020, whereas in {Abdul Bey}_2000, it seems you're citing a book. It would be pretty bad for people to see a bunch of the first kind of case, and then expect citations, but only get them half of the time.

Comment by RyanCarey on Defining AI wireheading · 2019-11-21T13:43:41.344Z · LW · GW

Seems like the idea is that wireheading denotes specification gaming that is egregious in its focus on the measurement channel. I'm inclined to agree.

Comment by RyanCarey on What technical prereqs would I need in order to understand Stuart Armstrong's research agenda? · 2019-09-21T12:28:38.127Z · LW · GW

You could check out Best Textbooks on Every Subject. But people usually recommend Linear Algebra Done Right for LinAlg. Understanding ML seems good for ML Theory. Sutton and Barto is an easy read for RL.

Comment by RyanCarey on What technical prereqs would I need in order to understand Stuart Armstrong's research agenda? · 2019-09-18T20:34:13.475Z · LW · GW

It may be that technical prereqs are missing. It could also be that you're missing a broader sense of "mathematical maturity", or that you're struggling because Stuart's work is simply hard to understand. That said, useful prereq areas (in which you could also gain overall mathematical maturity) would include:

  • Probability theory
  • Linear Algebra
  • Machine learning theory
  • Reinforcement Learning

It's probably overkill to go deep into these topics. Usually, what you need is in the first chapter.

Comment by RyanCarey on Where are people thinking and talking about global coordination for AI safety? · 2019-08-31T19:16:46.037Z · LW · GW

I would guess three main disagreements are:

i) are the kinds of transformative AI that we're reasonably likely to get in the next 25 years unalignable?

ii) how plausible are the extreme levels of cooperation Wei Dai wants?

iii) how important is career capital/credibility?

I'm perhaps midway between Wei Dai's view and the median governance view, so I may be an interesting example. I think we're ~10% likely to get transformative general AI in the next 20 years, ~6% likely to get an incorrigible one, and ~5.4% likely to get incorrigible general AI that's insufficiently philosophically competent. Extreme cooperation seems ~5% likely, and is correlated with having general AI. It would be nice if more people worked on that, or on whatever more-realistic solutions would work for the transformative unsafe AGI scenario, but I'm happy for some double-digit percentage of governance researchers to keep working on less extreme (and more likely) solutions to build credibility.

Comment by RyanCarey on Forum participation as a research strategy · 2019-08-03T11:38:42.237Z · LW · GW

I agree that some people can benefit from doing both, although getting everyone online is a hard collective action problem. I just claim that many researchers will be satisfied with OP. At MIRI/FHI/OpenAI there are ~30-150 researchers, who think about a wide range of areas, which seems broadly comparable to the researchers among LessWrong/AF's active users (depending on your definition of 'researcher', or 'active'). Idea-exchange is extended by workshops and by people moving jobs. Many in such a work environment will find that FP has an unacceptably low signal-to-noise ratio and will inevitably avoid FP...

Comment by RyanCarey on Forum participation as a research strategy · 2019-08-02T09:06:56.967Z · LW · GW

I would note that many of these factors apply as benefits of office-chat participation (OP) as well. The main benefit of FP absent from OP, I suppose, is preparing you for efficient written communication, but the rest seem to feature in both. The fact that their benefits overlap explains why remote researchers benefit so much more than others from FP.

Comment by RyanCarey on IRL in General Environments · 2019-07-25T09:22:23.386Z · LW · GW
Aside from yourself, the other CHAI grad students don't seem to have written up their perspectives of what needs to be done about AI risk. Are they content to just each work on their own version of the problem?

I think this is actually pretty strategically reasonable.

CHAI students get high returns from writing papers, via their increased probability of attaining a top professorship, which is quite beneficial for later recruiting top talent to work on AI safety, and quite structurally beneficial for the establishment of AI safety as a field of research. The time they might spend writing up their research strategy does not help with this, nor with recruiting help for their line of work (because other nearby researchers face similar pressures, and because academia is not structured to have PhD students lead large teams).

Moreover, if they are pursuing academic success, they face strong incentives to work on particular problems, and so their research strategies may be somewhat distorted by these incentives, decreasing the quality of a research agenda written in that context.

When I look at CHAI research students, I see some pursuing IRL, some pursuing game theory, some pursuing the research areas of their supervisors (all of which could lead to professorships), and some pursuing projects of other research leaders like MIRI or Paul. This seems healthy to me.

Comment by RyanCarey on Writing children's picture books · 2019-06-27T10:00:28.332Z · LW · GW

In general, imagining yourself communicating your ideas to a less intelligent and knowledgeable person could push you in the direction of confabulating freer-flowing stories, whereas imagining yourself communicating your ideas to a smarter person could push you in the direction of saying less, with higher rigour.

It seems like which one is desirable depends on the individual and the context (cf. the Law of Equal and Opposite Advice).

Comment by RyanCarey on [deleted post] 2019-06-12T16:02:42.940Z
Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random.

The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...

Comment by RyanCarey on [deleted post] 2019-06-11T22:28:20.888Z
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

I don't see why this has to be true, given that we get to choose the AI's value function. Why can't we just make the agent act-based?

My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ε) and might stumble upon an escape message/action then.

If the agent is model-based, then you should be able to gather a dataset of (prediction, world_state, accuracy) tuples with random actions (as random actions will practically never make the bad prediction) and a random decision of whether to read the response. And then just ask the agent to maximize the natural direct effect of its prediction, treating world_state as the mediator, and a null prediction as the default action. (This equates to asking what the world would be like if a null action were outputted - I'll release my current work on direct effects in AI safety soon, and feel free to ask for it in the meantime.) I don't see how this has this particular bad consequence (actually making the bad self-confirming prediction) in either training or deployment...
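
A minimal sketch of that objective (all function names and the toy "market crash" scenario are hypothetical): the prediction is scored against the world that would result from outputting a null prediction, so a self-confirming prediction gets no credit for the effects it itself causes.

```python
def world_state(prediction):
    # Toy dynamics: the world reacts to whatever prediction is published.
    return 1.0 if prediction == "market crash" else 0.0

def accuracy(prediction, state):
    predicted_state = 1.0 if prediction == "market crash" else 0.0
    return 1.0 - abs(predicted_state - state)

NULL = "no prediction"

def direct_effect_score(prediction):
    # Natural direct effect: hold the mediator (world_state) at the
    # value it takes under the default (null) action.
    frozen = world_state(NULL)
    return accuracy(prediction, frozen) - accuracy(NULL, frozen)

# Ordinary accuracy rewards the self-confirming prediction, since it
# causes the crash it predicts...
self_confirming = accuracy("market crash", world_state("market crash"))  # 1.0

# ...but the direct-effect objective penalizes it.
nde = direct_effect_score("market crash")  # -1.0
```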

The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.

In particular, shutting down the system is just a way of saying "only maximize reward in the current timestep", i.e. be an act-based agent. This can just be incorporated into the reward function.

Indeed, when reading the predictions of the counterfactual oracle we're not in the counterfactual world (=training distribution) anymore, so the predictions can get arbitrarily wrong (depending on how much the predictions are manipulative and how many people peek at it).

The hope is that since the agent is not trying to find self-confirming prophecies, then hopefully the accidental effects of self-confirmation are sufficiently small...

Comment by RyanCarey on Our plan for 2019-2020: consulting for AI Safety education · 2019-06-04T18:53:40.325Z · LW · GW

Should now be fixed

Comment by RyanCarey on Our plan for 2019-2020: consulting for AI Safety education · 2019-06-04T17:33:25.551Z · LW · GW

Hey! Thanks for sharing your experience with RAISE.

I'm sorry to say it, but I'm not convinced by this plan overall. Also, on the meta-level, I think you've got insufficient feedback on the idea before sharing it. Personally, my preferred format for giving inline feedback on a project idea is Google Docs, and so I've copied this post into a GDoc HERE and added a bunch of my thoughts there.

I don't mean to make you guys get discouraged, but I think that a bunch of aspects of this proposal are pretty ill-considered and need a bunch of revision. I'd be happy to provide further input.

Comment by RyanCarey on TAISU - Technical AI Safety Unconference · 2019-05-23T17:40:06.402Z · LW · GW

There is now, and it's this thread! I'll also go if a couple of other researchers do ;)

Comment by RyanCarey on Not Deceiving the Evaluator · 2019-05-10T01:42:56.039Z · LW · GW

Ok! That's very useful to know.

It seems pretty related to the Inverse Reward Design paper. I guess it's a variation. Your setup seems to be more specific about how the evaluator acts, but more general about the environment.

Comment by RyanCarey on Not Deceiving the Evaluator · 2019-05-09T22:52:03.747Z · LW · GW

As others have commented, it's difficult to understand what this math is supposed to say.

My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator's distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.

But this may be inaccurate, or there may be other material ideas here that I've missed.

Comment by RyanCarey on What are some good examples of incorrigibility? · 2019-05-01T09:40:39.751Z · LW · GW

yep, fixed.

Comment by RyanCarey on When to use quantilization · 2019-02-08T17:29:38.740Z · LW · GW

At least typically, we're talking about a strategy in the following sense. Q: Suppose you want to pick a teacher for a new classroom, how should you pick a teacher? A: you randomly sample from teachers above some performance threshold, in some base distribution. This works best given some fixed finite amount of "counterfeit performance" in that distribution.

If we treat the teachers as a bunch of agents, we don't yet have a game-theoretic argument that we should actually expect the amount of counterfeit performance (I) to be bounded. It might be that all of the teachers exploit the metric as far as they can, and counterfeit performance is unbounded...
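
For concreteness, a minimal q-quantilizer sketch of the teacher-selection story (the population and the single "exploiter" are invented for illustration):

```python
import random

random.seed(1)

# A q-quantilizer: sample uniformly from the top q-fraction of a base
# distribution, ranked by a (gameable) performance metric.
def quantilize(candidates, metric, q):
    ranked = sorted(candidates, key=metric, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return random.choice(top)

# 99 ordinary "teachers" plus one that counterfeits a huge score.
teachers = [{"skill": random.gauss(0, 1), "counterfeit": 0.0} for _ in range(99)]
teachers.append({"skill": 0.0, "counterfeit": 10.0})

def measured(t):
    return t["skill"] + t["counterfeit"]

# With q = 0.1 the exploiter is only one of the ~10 candidates sampled
# from, so its selection probability is capped at ~10%, however large
# its counterfeit score - the sense in which a bounded total amount of
# counterfeit performance keeps quantilization safe.
pick = quantilize(teachers, measured, q=0.1)
```

The game-theoretic worry in the comment is exactly the case where every candidate counterfeits, so that the counterfeit mass is not concentrated in a few entries of the top quantile.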

I don't fully understand the rest of the comment.

Comment by RyanCarey on When to use quantilization · 2019-02-05T21:46:01.259Z · LW · GW

This is a rough draft, so pointing out any errors by email or PM is greatly appreciated.

Comment by RyanCarey on Announcement: AI alignment prize round 4 winners · 2019-02-03T01:16:27.501Z · LW · GW

As another anecdata point, I considered writing more to pursue the prize pool but ultimately didn't do any more (counterfactual) work!

Comment by RyanCarey on Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence. · 2018-10-21T14:04:28.647Z · LW · GW

Note: This is bound to contain a bunch of errors and sources of confusion so please let me know about them here.

Comment by RyanCarey on Moloch in whom I sit alone · 2018-10-04T11:12:31.598Z · LW · GW

Maybe the new-conversation place is the bar or snack-bar. (Plausible deniability!)

Comment by RyanCarey on A brief note on factoring out certain variables · 2018-09-30T01:13:30.000Z · LW · GW

[Note: This comment is three years later than the post]

The "obvious idea" here unfortunately seems not to work, because it is vulnerable to so-called "infinite improbability drives". Suppose B is a shutdown button, and the utility function gives some weight to each setting of B. Then, the AI will benefit from selecting a Q such that it always chooses an action in which it enters a lottery, and if it does not win, then the button B is pushed. In this circumstance, the probability of B is unchanged, while both conditional distributions allocate almost all of their probability to great (lottery-winning) outcomes. So the approach will create an AI that wants to exploit its ability to determine the outcomes conditional on B.

Comment by RyanCarey on Corrigible but misaligned: a superintelligent messiah · 2018-04-03T17:02:04.076Z · LW · GW

I see. What I was trying to do was answer your terminology question by addressing simple extreme cases. E.g. if you ask an AI to disconnect its shutdown button, I don't think it's being incorrigible. If you ask an AI to keep you safe, and then it disconnects its shutdown button, it is being incorrigible.

I think the main way the religion case differs is that the AI system is interfering with our intellectual ability for strategizing about AI rather than our physical systems for redirecting AI, and I'm not sure how that counts. But if I ask an AI to keep me safe and it mind-controls me to want to propagate that AI, that's sure incorrigible. Maybe, as you suggest, it's just fundamentally ill-defined...

Comment by RyanCarey on Corrigible but misaligned: a superintelligent messiah · 2018-04-03T08:34:25.064Z · LW · GW

I could be wrong, but I feel like if I ask for education or manipulation and the AI gives it to me, and bad stuff happens, that's not a problem with the redirectability or corrigibility of the agent. After all, it just did what it was told. Conversely, if the AI system refuses to educate me, that seems rather more like a corrigibility problem. A natural divider is that with a corrigible AI we can still inflict harm on ourselves via our use of that AI as a tool.

Comment by RyanCarey on Corrigible but misaligned: a superintelligent messiah · 2018-04-03T04:22:09.687Z · LW · GW
Does this sound right?


A corrigible AI might not turn against its operators and might not kill us all, and the outcome can still be catastrophic. To prevent this, we'd definitely want our operators to be metaphilosophically competent, and we'd definitely want our AI to not corrupt them.

I agree with this.

a corrigible misaligned superintelligence is unlikely to lead to self-annihilation, but pretty likely to lead to astronomical moral waste.

There's a lot of broad model uncertainty here, but yes, I'm sympathetic to this position.

Does the new title seem better?


At this round of edits, my main objection would be to the remark that the AI wants us to act as yes-men, which seems dubious if the agent is (i) an Act-based agent or (ii) sufficiently broadly uncertain over values.

What I see to be the main message of the article as currently written is that humans controlling a very powerful tool (especially AI) could drive themselves into a suboptimal fixed point due to insufficient philosophical sophistication.

This I agree with.

Comment by RyanCarey on Corrigible but misaligned: a superintelligent messiah · 2018-04-01T23:45:20.338Z · LW · GW

It seems to me that for a corrigible, moderately superhuman AI, it is mostly the metaphilosophical competence of the human that matters, rather than that of the AI system. I think there are a bunch of confusions presented here, and I'll run through them, although let me disclaim that it's Eliezer's notion of corrigibility that I'm most familiar with, and so I'm arguing that your critiques fall flat against Eliezer's version.

"[The AI should] figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things..."

You omitted a key component of the quote that almost entirely reverses its meaning. The correct quote would read [emphasis added]: "[The AI should] help me figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...". i.e. the AI should help ensure that control continues to reside with the human, rather than with the AI itself.

The messiah would in his heart of hearts have the best of intentions for them, and everyone would know that.

To my understanding, the point of corrigibility is that a corrigible system is supposed to benefit its human operators even if its intentions are somewhat wrong, so it is rather a non sequitur to say that an agent is corrigible because it has the best of intentions in its heart of hearts. If it truly fully understood human intentions and values, corrigibility may even be unnecessary.

He might also think it's a good idea for his followers to all drink cyanide together, or murder some pregnant actresses, and his followers might happily comply.

Clearly you're right that corrigibility is not sufficient for safety. A corrigible agent can still be instructed by its human operators to make a decision that is irreversibly bad. But it seems to help, and to help a lot. The point of a corrigible AI is that once it takes a few murderous actions, you can switch it off, or tell it to pursue a different objective. So for the messiah example, a corrigible messiah might poison a few followers and then, when this is discovered, respond to an instruction to desist. An incorrigible messiah might be asked to stop murdering followers, but continue to do so anyway. So many of the more mundane existential risks would be mitigated by corrigibility.

And what about more exotic ones? I argue they would also be greatly (though not entirely) reduced. Consider that a corrigible messiah may still hide poison for all of the humans at once, leading to an irrevocably terrible outcome. But why should it? If it thinks it is doing well by the humans, then its harmful actions ought to be transparent. Perhaps the AI system's actions would not be transparent if its intelligence was so radically great that it was inclined to act in fast and incomprehensible ways. But it is hard to see how we could know with confidence that such a radically intelligent AI is the kind we will soon be dealing with. And even if we are going to deal with that kind of AI, there could be other remedies that would be especially helpful in such scenarios. For example, an AI that permits informed oversight of its activities could be superb if it was already corrigible. Then, it could not only provide truthful explanations of its future plans but also respond to feedback on them. Overall, if we had an AI system that was (1) only a little bit superhumanly smart, (2) corrigible, and (3) providing informative explanations of its planned behaviour, then it would seem that we are in a pretty good spot.

"This is absurd. Wouldn't they obviously have cared about animal suffering if they'd reflected on it, and chosen to do something about it before blissing themselves out?"
Yeah, but they never got around to that before blissing themselves out.

I think you're making an important point here, but here is how I would put it: If you have an AI system that is properly deferential to humans, you still need to rely on the humans not to give it any existentially catastrophic orders. But the corrigible/deferential behavior has changed the situation from one in which you're relying on the metaphilosophical competence of the AI, to one in which you're relying on the metaphilosophical competence of the human (albeit as filtered through the actions of the AI system). In the latter case, yes, you need to survive having a human's power increased by some N-fold. (Not necessarily 10^15 as in the more extreme self-improvement scenarios, but by some N>1). So when you get a corrigible AI, you still need to be very careful with what you tell it to do, but your situation is substantially improved. Note that what I'm saying is at least in some tension with the traditional story of indirect normativity. Rather than trying to give the AI very general instructions for its interpretation, I'm saying that we should in the first instance try to stabilize the world so that we can do more metaphilosophical reasoning ourselves before trying to program an AI system that can carry out the conclusions of that thinking or perhaps continue it.

Would it want to? I think yes, because it's incentivized not to optimize for human values, but to turn humans into yes-men... The only thing I can imagine that would robustly prevent this manipulation is to formally guarantee the AI to be metaphilosophically competent itself.

Yes, an approval-directed agent might reward-hack by causing the human to approve of things that it does not value. And it might compromise the humans' reasoning abilities while doing so. But why must the AI system's metaphilosophical competence be the only defeater? Why couldn't this be achieved by quantilizing, or otherwise throttling the agent's capabilities? By restricting the agent's activities to some narrow domain? By having the agent somehow be deeply uncertain about where the human's approval mechanism resides? None of these seems clearly viable, but neither do any of them seem clearly impossible, especially in cases where the AI system's capabilities are overall not far beyond those of its human operators.
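For concreteness, here is a minimal sketch of the quantilization idea mentioned above (the function name, the uniform base distribution, and the toy utility are illustrative assumptions, not part of the original discussion): instead of taking the argmax action, sample from the top q fraction of a base distribution over actions, which throttles how hard the agent optimizes its objective.

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Sample uniformly from the top q fraction of actions ranked by
    estimated utility, rather than argmaxing -- limiting the optimization
    pressure that could otherwise be spent on reward-hacking."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

# Toy example: 100 actions whose utility equals their index.
pick = quantilize(range(100), utility=lambda a: a, q=0.1)
print(pick)  # some action in 90..99, not necessarily the argmax 99
```

A real quantilizer would sample from a trusted base distribution of human-like actions rather than a uniform one; this only illustrates the throttling mechanism.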

Overall, I'd say superintelligent messiahs are sometimes corrigible, and they're more likely to be aligned if so.