Posts

Blog Post Day (Unofficial) 2020-02-18T19:05:47.140Z · score: 42 (14 votes)
Simulation of technological progress (work in progress) 2020-02-10T20:39:34.620Z · score: 20 (11 votes)
A dilemma for prosaic AI alignment 2019-12-17T22:11:02.316Z · score: 43 (12 votes)
A parable in the style of Invisible Cities 2019-12-16T15:55:06.072Z · score: 28 (12 votes)
Why aren't assurance contracts widely used? 2019-12-01T00:20:21.610Z · score: 33 (11 votes)
How common is it for one entity to have a 3+ year technological lead on its nearest competitor? 2019-11-17T15:23:36.913Z · score: 53 (15 votes)
Daniel Kokotajlo's Shortform 2019-10-08T18:53:22.087Z · score: 5 (2 votes)
Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann 2019-10-07T19:52:19.266Z · score: 49 (14 votes)
Soft takeoff can still lead to decisive strategic advantage 2019-08-23T16:39:31.317Z · score: 113 (49 votes)
The "Commitment Races" problem 2019-08-23T01:58:19.669Z · score: 58 (28 votes)
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z · score: 77 (32 votes)

Comments

Comment by daniel-kokotajlo on On unfixably unsafe AGI architectures · 2020-02-21T15:47:55.990Z · score: 2 (2 votes) · LW · GW

I agree that AGI is more omni-use than bioweapons and thus will be harder to get people not to develop and use. I think our prospects look pretty bleak in this scenario, but it's not completely hopeless.

For human cloning, what I had in mind was a nation cloning its smartest individuals for the purpose of having better science/tech. Think of what the US could have accomplished if they had 10,000 Von Neumanns instead of 1.

Comment by daniel-kokotajlo on On unfixably unsafe AGI architectures · 2020-02-20T19:14:40.504Z · score: 3 (3 votes) · LW · GW
Even as a series of worse and worse AGI accidents occur, with out-of-control AGIs self-replicating around the internet etc., a few people will keep trying to fix the unfixable AGI, seeing this as the only path to get this slow-rolling catastrophe under control (while actually making it worse).

Maybe at this point there would be the political will for a Butlerian Jihad. ;) Or more seriously, a self-imposed ban on AGI similar to the current self-imposed bans on human cloning and biological weapons. I agree this is a long shot given our current experience with climate change, but still, it seems possible. And perhaps the AGI accidents would be more newsworthy and gripping than climate change is, making it easier to rouse the public.

Comment by daniel-kokotajlo on Tessellating Hills: a toy model for demons in imperfect search · 2020-02-20T18:59:30.460Z · score: 3 (2 votes) · LW · GW

This is awesome, thanks!

So, to check my understanding: You have set up a sort of artificial feedback loop, where there are N overlapping patterns of hills, and each one gets stronger the farther you travel in a particular dimension/direction. So if one or more of these patterns tends systematically to push the ball in the same direction that makes it stronger, you'll get a feedback loop. And then there is selection between patterns, in the sense that the pattern which pushes the strongest will beat the ones that push more weakly, even if both have feedback loops going.

And then the argument is, even though these feedback loops were artificial / baked in by you, in "natural" search problems there might be a similar situation... what exactly is the reason for this? I guess my confusion is in whether to expect real life problems to have this property where moving in a particular direction strengthens a particular pattern. One way I could see this happening is if the patterns are themselves pretty smart, and are able to sense which directions strengthen them at any given moment. Or it could happen if, by chance, there happens to be a direction and a pattern such that the pattern systematically pushes in that direction and the direction systematically strengthens that pattern. But how likely are these? I don't know. I guess your case is a case of the second, but it's rigged a bit, because of how you built in the systematic-strengthening effect.
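To make sure I'm picturing the same mechanism, here is a minimal sketch of the kind of setup I have in mind (my own toy reconstruction, not the actual code from your post): each of N ripple patterns has a random "strengthening" direction, and whenever the ball sits in one of that pattern's troughs, the gradient step pushes it further along that direction, amplifying the pattern.

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, N_PATTERNS, STEPS, LR = 10, 5, 2000, 0.01

    # Each pattern k has a ripple direction w[k] and a "strengthening" direction s[k]:
    # moving along s[k] increases pattern k's amplitude (the feedback loop).
    w = rng.normal(size=(N_PATTERNS, DIM))
    s = rng.normal(size=(N_PATTERNS, DIM))

    def loss(x):
        base = 0.1 * np.sum(x**2)  # weak pull toward the origin
        ripples = sum(np.maximum(x @ s[k], 0.0) * np.sin(x @ w[k])
                      for k in range(N_PATTERNS))
        return base + ripples

    def grad(x, eps=1e-5):
        # numerical gradient is plenty for a toy model
        g = np.zeros(DIM)
        for i in range(DIM):
            d = np.zeros(DIM); d[i] = eps
            g[i] = (loss(x + d) - loss(x - d)) / (2 * eps)
        return g

    x = rng.normal(size=DIM)
    for t in range(STEPS):
        x -= LR * grad(x)
        if t % 500 == 0:
            # "strength" of each pattern = how far we've moved along its s[k]
            print(t, np.round(np.maximum(x @ s.T, 0.0), 2))

The intent is that whichever pattern happens to push the ball along its own strengthening direction gets amplified and eventually dominates the printed strengths--the selection effect I was gesturing at above.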

Am I following, or am I misunderstanding?

Comment by daniel-kokotajlo on Is there an intuitive way to explain how much better superforecasters are than regular forecasters? · 2020-02-19T16:08:21.849Z · score: 26 (7 votes) · LW · GW

Like ignoranceprior said, my AI Impacts post has three intuitive ways of thinking about the results:


Way One: Let’s calculate some examples of prediction patterns that would give you Brier scores like those mentioned above. Suppose you make a bunch of predictions with 80% confidence and you are correct 80% of the time. Then your Brier score would be 0.32, roughly middle of the pack in this tournament. If instead you made predictions with 93% confidence and were correct 93% of the time, your Brier score would be 0.132, very close to the best superforecasters and to GJP’s aggregated forecasts.14 In these examples, you are perfectly calibrated, which helps your score—more realistically you would be imperfectly calibrated and thus would need to be right even more often to get those scores.

Way Two: “An alternative measure of forecast accuracy is the proportion of days on which forecasters’ estimates were on the correct side of 50%. … For all questions in the sample, a chance score was 47%. The mean proportion of days with correct estimates was 75%…”15 According to this chart, the superforecasters were on the right side of 50% almost all the time:16

Way Three: “Across all four years of the tournament, superforecasters looking out three hundred days were more accurate than regular forecasters looking out one hundred days.”17 (Bear in mind, this wouldn’t necessarily hold for a different genre of questions. For example, information about the weather decays in days, while information about the climate lasts for decades or more.)
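(For anyone who wants to check the arithmetic in Way One: the numbers quoted above are consistent with the original two-term Brier score, which runs from 0 for perfect to 2 for maximally wrong confident forecasts. A quick sketch:)

    def brier(p, outcome):
        """Original two-term Brier score for a binary forecast p on an outcome in {0, 1}."""
        return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

    def expected_brier(confidence, hit_rate):
        """Average score if you always forecast `confidence` and are right `hit_rate` of the time."""
        return hit_rate * brier(confidence, 1) + (1 - hit_rate) * brier(confidence, 0)

    print(expected_brier(0.80, 0.80))  # 0.32
    print(expected_brier(0.93, 0.93))  # ~0.13

For a perfectly calibrated forecaster with confidence c, this simplifies to 2c(1-c), so being well calibrated at higher confidence levels drives the score toward zero.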

Comment by daniel-kokotajlo on Excitement vs childishness · 2020-02-19T11:33:13.399Z · score: 1 (1 votes) · LW · GW

I'm glad you made that disclaimer. In our community -- LW -- the ratio of excitement/childishness bias is probably unusually high, perhaps even high enough that we need to be on higher guard against excitement. But in the wider community of "smart and/or important people thinking about AI" it seems pretty clear that childishness bias is much much stronger.

Comment by daniel-kokotajlo on Excitement vs childishness · 2020-02-19T11:29:08.684Z · score: 1 (1 votes) · LW · GW

Like, I agree that Michael Wulfson is a person for whom the excitement bias was stronger, judging by his story. But even for him, the childishness bias was super strong too, and eventually won out. Most people are not like him; most people will find the childishness bias / excitement bias ratio much higher than he did.

Comment by daniel-kokotajlo on Excitement vs childishness · 2020-02-19T11:23:47.008Z · score: 1 (1 votes) · LW · GW

Indeed, I'd go so far as to say that if this is the best case of excitement bias being stronger that we can think of, that's pretty good evidence that in fact childishness bias is usually way stronger.

Comment by daniel-kokotajlo on Excitement vs childishness · 2020-02-19T11:22:57.692Z · score: 6 (2 votes) · LW · GW

I'm not sure this is strong counterevidence, because it can be interpreted as a case of childishness bias overcoming excitement bias:

But there were also countervailing effects in my mind, leading away from the god scenario. The strongest was the outlandishness of it all. I had always been dismissive of ideas that seem like doomsday theories, so I wasn’t automatically comfortable giving the god scenario credence in my mind. I was hesitant to introduce the idea to people who I thought might draw negative conclusions about my judgement.

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2020-02-19T01:54:23.485Z · score: 1 (1 votes) · LW · GW

I like your point #2; I should think more about how the 30 year number changes with size. Obviously it's smaller for bigger entities and bigger for smaller entities, but how much? E.g. if we teleported 2020 Estonia back into 1920, would it be able to take over the world? Probably. What about 1970 though? Less clear.

Military power isn't what I'm getting at either, at least not if measured in the way that would result in AI companies having little of it. Cortez had, maybe, 1/10,000th of the military power of Mexico when he got started. At least if you measure in ways like "What would happen if X fought Y." Probably 1/10,000th of Mexico's military could have defeated Cortez' initial band.

If we try to model Cortez' takeover as him having more of some metric than all of Mexico had, then presumably Spain had several orders of magnitude more of that metric than Cortez did, and Western Europe as a whole had at least an order of magnitude more than that. So Western Europe had *many* orders of magnitude more of this stuff, whatever it is, than Mexico, even though Mexico had a similar population and GDP. So they must have been growing much faster than Mexico for quite some time to build up such a lead--and this was before the industrial revolution! More generally, this metric that is used for predicting takeovers seems to be the sort of thing that can grow and/or shrink orders of magnitude very quickly, as illustrated by the various cases throughout history of small groups from backwater regions taking over rich empires.

(Warning: I'm pulling these claims out of my ass, I'm not a historian, I might be totally wrong. I should look up these numbers.)

Comment by daniel-kokotajlo on Blog Post Day (Unofficial) · 2020-02-18T19:37:33.653Z · score: 3 (2 votes) · LW · GW

Glad to hear you are interested! Well, I'm in US Eastern time, but timing can be flexible. If we have enough people, perhaps Blog Post Day will effectively be longer than 24 hours. I'm thinking it would be a relatively casual affair, with people dropping in or out as they see fit.

Comment by daniel-kokotajlo on Blog Post Day (Unofficial) · 2020-02-18T19:07:21.208Z · score: 3 (2 votes) · LW · GW

Thanks in advance to those who join me on this venture! And those who give advice, criticism, etc.

If we get sufficient interest, it might be good to organize local meetups. Anyone else in North Carolina want to meet up with me for this?

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2020-02-18T16:27:55.411Z · score: 1 (1 votes) · LW · GW

I was thinking of an initially large country growing fast via AI, yes. Still counts; it is soft takeoff leading to DSA. However I am also making much stronger claims than that--I think it could happen with a corporation or rogue AGI.

I don't think annual income is at all a good measure of how close an entity is to taking over the world. When Cortez landed in Mexico he had less than 1/100,000th of the income, population, etc. of the region, yet he ruled the whole place three years later. Then a few years after that Pizarro repeated the feat in Peru, which is good evidence that it wasn't just an amazing streak of luck.

Comment by daniel-kokotajlo on A 'Practice of Rationality' Sequence? · 2020-02-16T14:42:29.796Z · score: 9 (5 votes) · LW · GW

I nominate this thing johnswentworth did. In addition to the reasons he gives, I'll add that being able to learn on your own, quickly, seems like a good skill to have, and related to (though maybe not the same thing as) rationality.

Comment by daniel-kokotajlo on Why Science is slowing down, Universities and Maslow's hierarchy of needs · 2020-02-16T14:32:09.244Z · score: 3 (2 votes) · LW · GW

I think I find your overall conclusion plausible, but I think your argument for it was dubious in places:

But again, even if you assume I’m wrong, that still leave us with universities that struggle to optimize for 2, 3 and maybe 4, losing out on 5 in the process.

One could instead interpret the situation as: Universities are optimizing hard for 5, and as a result they are understandably losing out on 2, 3, and 4 in the process.

Indeed, I think there is something to be said for this. A few years ago I half-jokingly wrote a paper titled "Kallipolis, USA," in which I argue that the present-day USA is in fact Plato's ideal state.

A big part of my argument was the way the university system works. In particular (in conjunction with the rest of society), it seems to be optimizing pretty hard to get people to "follow their passion," and by forcing everyone to go to college and take gen-ed requirements it is arguably doing the best it can to scout and recruit people who are suited to the priesthood/academia.

Comment by daniel-kokotajlo on Bayesian Evolving-to-Extinction · 2020-02-15T12:19:57.462Z · score: 3 (2 votes) · LW · GW

How necessary is it that there be an explicit side-channel? Could you not get the same results in the standard situation in which an agent is selecting actions on the basis of expected utility?

Comment by daniel-kokotajlo on Demons in Imperfect Search · 2020-02-12T19:49:40.294Z · score: 6 (3 votes) · LW · GW

Ah, that does help, thanks. In my words: A search process that is vulnerable to local minima doesn't necessarily contain a secondary search process, because it might not be systematically comparing local minima and choosing between them according to some criteria. It just goes for the first one it falls for--or, slightly more nuanced, the first sufficiently big one it falls for.

By contrast, in the ball rolling example you gave, the walls/ridges were competing with each other, such that the "best" one (or something like that) would be systematically selected by the ball, rather than just the first one or the first-sufficiently-big one.

So in that case, looking over your list again...

OK, I think I see how organic life arising from chemistry is an example of a secondary search process. It's not just a local minimum that chemistry found itself in, it's a big competition between different kinds of local minima. And now I think I see how this would go in the other examples too. As I originally said in my top-level comment, I'm not sure this applies to the example I brought up, actually. Would the "Insert my name as the author of all useful heuristics" heuristic be outcompeted by something else eventually, or not? I bet not, which indicates that it's a "mere" local minimum and not one that is part of a broader secondary search process.

Comment by daniel-kokotajlo on Simulation of technological progress (work in progress) · 2020-02-12T18:35:22.032Z · score: 3 (2 votes) · LW · GW

I should add though that I haven't systematically examined these graphs yet, so it's possible I'm just missing something--e.g. it occurs to me right now that maybe some of these graphs I saw were really logistic functions rather than hyperbolic or exponential-until-you-hit-limits. I should make some more and look at them more carefully.
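(For my own reference when I do that, here is what the three shapes I'd be trying to tell apart look like--a standalone sketch, not code from my model:)

    import numpy as np

    t = np.linspace(0, 9.5, 20)

    exponential = np.exp(0.5 * t)             # constant growth rate
    logistic = 100 / (1 + 99 * np.exp(-t))    # looks exponential early, then levels off at a ceiling
    hyperbolic = 1 / (10 - t)                 # growth rate itself grows; blows up in finite time (t = 10)

    for name, y in (("exponential", exponential), ("logistic", logistic), ("hyperbolic", hyperbolic)):
        print(name, np.round(y[::4], 2))

The tricky part is that all three look similar early on; they only come apart near the ceiling or the singularity.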

Comment by daniel-kokotajlo on Simulation of technological progress (work in progress) · 2020-02-12T18:30:07.795Z · score: 7 (2 votes) · LW · GW

Yes, thanks! I mostly agree with that assessment,* though as an aside I have a beef with the implication that Bostrom, Yudkowsky, etc. expect discontinuities. That beef is with Paul Christiano, not you. :)

So far the biggest update for me, I think, is that the model seems to show that it's quite possible to get an intelligence explosion even without economic feedback loops. Like, even with a fixed compute/money budget--or even with a fixed number of scientists and fixed amount of research funding--we could get a singularity. At least in principle. This is weird because I am pretty sure I remember reading that the growth we've seen so far in practice is best explained via an economic feedback loop: better technology allows for a bigger population and economy, which allows for more scientists and funding, which allows for better technology. So I'm a bit confused, I must say -- my model is giving me results I would have predicted wouldn't happen.

*There have been a few cases where the growth didn't look hyperbolic, but rather like a steady exponential trend that then turns into a singularity. World GDP, by contrast, has what looks like at least three exponential trends in it, such that it is more parsimonious to model it as hyperbolic growth. I think.

Comment by daniel-kokotajlo on Demons in Imperfect Search · 2020-02-12T11:47:53.763Z · score: 5 (3 votes) · LW · GW

Hmmm, this doesn't work to distinguish the two for me. Couldn't you say a local minimum involves a secondary optimizing search process that has that minimum as its objective? To use your ball analogy, what exactly is the difference between these twisty demon hills and a simple crater-shaped pit? (Or, what is the difference between a search process that is vulnerable to twisty demon hills and one which is vulnerable to pits?)

Comment by daniel-kokotajlo on Demons in Imperfect Search · 2020-02-12T10:54:08.242Z · score: 2 (2 votes) · LW · GW

Oops, forgot to delete that bit. Thanks for pointing it out.

Comment by daniel-kokotajlo on Demons in Imperfect Search · 2020-02-12T00:30:02.714Z · score: 12 (6 votes) · LW · GW

Cool!

Another cute example is the accidental "viruses" found when training EURISKO:

Lenat would leave EURISKO running each night, and check it in the morning. He would occasionally remove errors or unpromising heuristics from the system, or enter additional ones. Some discovered heuristics resembled viruses; one inserted its name as the creator of other useful heuristics, which would cause it to be used more often.

Do you see yourself as extending the concept of Demon to apply to things which are not necessarily even close to intelligent? (e.g. your first two examples) Or did the concept always mean that and I was just mistaken about what it meant?

The example with the ball rolling downhill seemed to imply that the demons were pretty damn smart, and getting smarter over time via competition with each other. But only your third example with managers seems like a real-world case of this. At least, that's my current claim. For example, I'd bet that if Lenat had let EURISKO run forever, it wouldn't have eventually been taken over by a superintelligence. Rather, it probably would have been stuck in that "insert my own name as the creator of other useful heuristics" optimum forever, or something mundane like that at any rate. For that matter, can you say more about the difference between demons and mere local optima?

Comment by daniel-kokotajlo on Simulation of technological progress (work in progress) · 2020-02-11T22:41:23.810Z · score: 7 (5 votes) · LW · GW

Yeah, I tried to build the model with certain criticisms of the intelligence explosion argument in mind -- for example, the criticism that it assumes intelligence is a single thing rather than a diverse collection of skills, or the criticism that it assumes AGI will be a single thing rather than a diverse collection of more specific AI tools, or the criticism that it assumes takeoff will happen after human level but not before. My model makes no such assumptions, but it still gets intelligence explosion. I think this is an already somewhat interesting result, though not a major update for me since I didn't put much credence in those objections anyway.


Currently the model just models civilization's progress overall, so yeah it can't distinguish between local vs. distributed takeoff. I'm hoping to change that in the future, but I'm not sure how yet.

Comment by daniel-kokotajlo on Simulation of technological progress (work in progress) · 2020-02-11T17:12:51.148Z · score: 5 (3 votes) · LW · GW

Yeah, that's sorta my hope. The model is too abstract and disconnected from real-world numbers to be able to predict things like "The singularity will happen in 2045" but maybe it can predict things like "If you've had exponential growth for a while, it is very unlikely a priori / outside-view that growth will slow, and in fact quite likely that it will accelerate dramatically. Unless you are literally running out of room to grow, i.e. hitting fundamental physical limits in almost all endeavors."

Comment by daniel-kokotajlo on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-06T16:32:12.647Z · score: 4 (3 votes) · LW · GW

It seems to me that if we had the budget, we could realize the scenarios you describe today. The manipulative behavior you are discussing is not exactly rocket science.

That in turn makes me think that if we polled a bunch of people who build image classifiers for a living, and asked them whether the behavior you describe would indeed happen if the programmers behaved in the ways you describe, they would near-unanimously agree that it would.

Do you agree with both claims above? If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.


Separately, I think your examples depend on this a lot:

Many of the hyperparameters are set by a neural net, which itself takes a more "long-term view" of the error rate, trying to improve it from day to day rather than from run to run.

Is this such a common practice that we can expect "almost every powerful algorithm" to involve it somehow?

Comment by daniel-kokotajlo on The Main Sources of AI Risk? · 2020-01-28T12:21:02.215Z · score: 4 (3 votes) · LW · GW

Ha! I woke up this morning to see my own name as author--that wasn't what I had in mind, but it sure does work to motivate me to walk the talk! Thanks!

Comment by daniel-kokotajlo on Predictors exist: CDT going bonkers... forever · 2020-01-16T20:10:59.158Z · score: 1 (1 votes) · LW · GW

Yep, that's the one! :)

Comment by daniel-kokotajlo on Is backwards causation necessarily absurd? · 2020-01-16T01:47:06.603Z · score: 1 (1 votes) · LW · GW

Perhaps I was too hasty. What I had in mind was the effective-strategy strategy--if you define causation by reference to what's an effective strategy for achieving what, then that means you are assuming a certain decision theory in order to define causation. And so e.g. one-boxing will cause you to get a million if EDT is true, but not if CDT is true.

If instead you have another way to define causation, then I don't know. But for some ways, you are just fighting the hypothetical--OK, so maybe in the original Newcomb's Problem as stated, backwards causation saves the day and makes CDT and EDT agree on what to do. But then what about a modified version where the backwards causation is not present?

Comment by daniel-kokotajlo on Predictors exist: CDT going bonkers... forever · 2020-01-16T01:41:11.188Z · score: 8 (4 votes) · LW · GW

Dagon, I sympathize. CDT seems bonkers to me for the reasons you have pointed out. My guess is that academic philosophy has many people who support CDT for three main reasons, listed in increasing order of importance:

(1) Even within academic philosophy, many people aren't super familiar with these arguments. They read about CDT vs. EDT, they read about a few puzzle cases, and they form an opinion and then move on--after all, there are lots of topics to specialize in, even in decision theory, and so if this debate doesn't grip you you might not dig too deeply.

(2) Lots of people have pretty strong intuitions that CDT vindicates. E.g. iirc Newcomb's Problem was originally invented to prove that EDT was silly (because, silly EDT, it would one-box, which is obviously stupid!). My introductory textbook to decision theory was an attempt to build for CDT an elegant mathematical foundation to rival the Jeffrey-Bolker axioms for EDT. And why do this? It said, basically, "EDT gives the wrong answer in Newcomb's Problem and other problems, so we need to find a way to make some version of CDT mathematically respectable."

(3) EDT has lots of problems too. Even hardcore LWer fans of EDT like Caspar Oesterheld admit as much, and even waver back and forth between EDT and CDT for this reason. And the various alternatives to EDT and CDT that have been thus far proposed also seem to have problems.

Comment by daniel-kokotajlo on Predictors exist: CDT going bonkers... forever · 2020-01-14T23:27:57.095Z · score: 9 (6 votes) · LW · GW
To summarize my confusion, does CDT require that the agent unconditionally believe in perfect free will independent of history (and, ironically, with no causality for the exercise of will)? If so, that should be the main topic of dispute - the frequency of actual case where it makes bad predictions, not that it makes bad decisions in ludicrously-unlikely-and-perhaps-impossible situations.

Sorta, yes. CDT requires that you choose actions not by thinking "conditional on my doing A, what happens?" but rather by some other method (there are different variants), such as "For each causal graph that I think could represent the world, what happens when I intervene (in Pearl's sense) on the node that is my action, to set it to A?" or "Holding fixed the probability of all variables not causally downstream of my action, what happens if I do A?"

In the first version, notice that you are choosing actions by imagining a Pearl-style intervention into the world--but this is not something that actually happens; the world doesn't actually contain such interventions.

In the second version, well, notice that you are choosing actions by imagining possible scenarios that aren't actually possible--or at least, you are assigning the wrong probabilities to them. ("holding fixed the probability of all variables not causally downstream of my action...")

So one way to interpret CDT is that it believes in crazy stuff like hardcore incompatibilist free will. But the more charitable way to interpret it is that it doesn't believe in that stuff, it just acts as if it does, because it thinks that's the rational way to act. (And they have plenty of arguments for why CDT is the rational way to act, e.g. the intuition pump "If the box is already either full or empty and you can't change that no matter what you do, then no matter what you do you'll get more money by two-boxing, so...")
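(To make the contrast concrete, here's a toy numerical sketch of how the two calculations come apart in Newcomb's Problem--illustrative numbers only, with a 99%-accurate predictor:)

    ACCURACY = 0.99          # chance the predictor correctly anticipated your choice
    M, K = 1_000_000, 1_000  # opaque-box prize, transparent-box prize

    # EDT: treat your action as evidence about what the predictor put in the opaque box.
    edt_one_box = ACCURACY * M
    edt_two_box = (1 - ACCURACY) * M + K

    # CDT: hold the box contents fixed (they are not causally downstream of your choice),
    # with some prior p that the predictor foresaw one-boxing.
    p = 0.5
    cdt_one_box = p * M
    cdt_two_box = p * M + K  # two-boxing comes out ahead for every value of p

    print(edt_one_box > edt_two_box)  # True: EDT one-boxes
    print(cdt_two_box > cdt_one_box)  # True: CDT two-boxes

The intuition pump above is exactly the observation that the last comparison holds no matter what p is.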

Comment by daniel-kokotajlo on Is backwards causation necessarily absurd? · 2020-01-14T23:05:50.555Z · score: 4 (3 votes) · LW · GW

I don't think backwards causation is absurd, more or less for the reasons you sketch. Another minor reason: Some philosophers like "effective strategy" accounts of causation, according to which we define causation via its usefulness for agents trying to achieve goals. On these accounts, backwards causation is pretty trivial--just suppose you live in a deterministic universe and your goal is to "make the state of the universe at the Big Bang such that I eat breakfast tomorrow." The philosopher Gary Drescher argues something similar in Good and Real if I recall correctly.

That said, I don't think we are really explaining or de-confusing anything if we appeal to backwards causation to understand Newcomb's Problem or argue for a particular solution to it.

Comment by daniel-kokotajlo on The Main Sources of AI Risk? · 2020-01-14T21:39:27.147Z · score: 9 (4 votes) · LW · GW

Thank you for making this list. I think it is important enough to be worth continually updating and refining; if you don't do it then I will myself someday. Ideally there'd be a whole webpage or something, with the list refined so as to be disjunctive, and each element of the list catchily named, concisely explained, and accompanied by a memorable and plausible example. (As well as lots of links to literature.)

I think the commitment races problem is mostly but not entirely covered by #12 and #19, and at any rate might be worth including since you are OK with overlap.

Also, here's a good anecdote to link to for the "coding errors" section: https://openai.com/blog/fine-tuning-gpt-2/

Comment by daniel-kokotajlo on Predictors exist: CDT going bonkers... forever · 2020-01-14T21:35:34.895Z · score: 3 (2 votes) · LW · GW

Well said.

I had a similar idea a while ago and am working it up into a paper ("CDT Agents are Exploitable"). Caspar Oesterheld and Vince Conitzer are also doing something like this. And then there is Ahmed's Betting on the Past case.

In their version, the Predictor offers bets to the agent, at least one of which the agent will accept (for the reasons you outline) and thus they get money-pumped. In my version, there is no Predictor, but instead there are several very similar CDT agents, and a clever human bookie can extract money from them by exploiting their inability to coordinate.

Long story short, I would bet that an actual AGI which was otherwise smarter than me but which doggedly persisted in doing its best to approximate CDT would fail spectacularly one way or another, "hacked" by some clever bookie somewhere (possibly in its hypothesis space only!). Unfortunately, arguably the same is true for all decision theories I've seen so far, but for different reasons...

Comment by daniel-kokotajlo on What determines the balance between intelligence signaling and virtue signaling? · 2020-01-14T20:22:23.801Z · score: 1 (1 votes) · LW · GW

I don't think you are crazy; I worry about this too. I think I should go read a book about the Cultural Revolution to learn more about how it happened--it can't have been just Mao's doing, because e.g. Barack Obama couldn't make the same thing happen in the USA right now (or even in a deep-blue part of the USA!) no matter how hard he tried. Some conditions must have been different.*

*Off the top of my head, some factors that seem relevant: Material deprivation. Overton window so narrow and extreme that it doesn't overlap with everyday reality. Lack of outgroup that is close enough to blame for everything yet also powerful enough to not be crushed swiftly.

I don't think it could happen in the USA now, but I think maybe in 20 years it could if trends continue and/or get worse.

Then there are the milder forms, that don't involve actually killing anybody but just involve getting people fired, harassed, shamed, discriminated against, etc. That seems much more likely to me--it already happens in very small, very ideologically extreme subcultures/communities--but also much less scary. (Then again, from a perspective of reducing AI risk, this scenario would be almost as bad maybe? If the AI safety community undergoes a "soft cultural revolution" like this, it might seriously undermine our effectiveness)

Comment by daniel-kokotajlo on How to Identify an Immoral Maze · 2020-01-14T14:34:38.427Z · score: 21 (7 votes) · LW · GW

What’s better than having skin in the game? Having soul in the game. Caring deeply about the outcome for reasons other than money, or your own liability, or being potentially scapegoated. Caring for existential reasons, not commercial ones.
Soul in the game is incompatible with mazes. Mazes will eliminate anyone with soul in the game. Therefore, if the people you work for have soul in the game, you’re safe. If you have it too, you’ll be a lot happier, and likely doing something worthwhile. Things will be much better on most fronts. 

In general I worry that the advice you are giving is phrased too confidently. This quote about soul in particular stood out to me. I have a few friends who have worked for big hierarchical non-profits, and their experience seems to contradict it. Plenty of people who do seem pretty passionate about 'the cause' and yet lots of dysfunction, bureaucracy, office politics, metric-gaming, etc. Maybe these problems didn't rise to the level of a true moral maze, or maybe the people weren't actually passionate but really were just posturing. But maybe not, and at any rate how do you tell at a glance?

Comment by daniel-kokotajlo on Malign generalization without internal search · 2020-01-14T00:39:41.991Z · score: 1 (1 votes) · LW · GW

You are right; my comment was based on a misunderstanding of what you were saying. Hence why I unendorsed it.

(I read " In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents. " and thought you meant agents that don't use internal search at all.)

Comment by daniel-kokotajlo on Malign generalization without internal search · 2020-01-12T23:31:52.561Z · score: 1 (1 votes) · LW · GW
Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.

Computing the fastest route to Paris doesn't involve search?

More generally, I think in order for it to work your example can't contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.

And it's still not obvious to me that this is viable. It seems possible in principle (just imagine a sufficiently large look-up table!) but it seems like it probably wouldn't be competitive with agents that do search at least to the extent that humans do. After all, humans evolved to do search over actions, but we totally didn't have to--if bundles of heuristics worked equally well for the sort of complex environments we evolved in, then why didn't we evolve that way instead?

EDIT: Just re-read and realized you are OK with subroutines that explicitly perform search over actions. But why? Doesn't that undermine your argument? Like, suppose we have an architecture like this:

LOOP:
    State = GetStateOfWorld(Observation)
    IF State == InPain:
        Cry & FlailAbout
    IF State == AttractiveMateStraightAhead:
        MoveForward & Grin
    ELSE:
        Do(RunSubroutine[SearchOverActionsAndOutputActionThoughtToYieldGreatestExpectedNumberOfGrandchildren])
END_LOOP

This seems not meaningfully different from the version that doesn't have the first two IF statements, as far as talk of optimizers is concerned.



Comment by daniel-kokotajlo on 2020's Prediction Thread · 2020-01-06T12:06:15.756Z · score: 1 (1 votes) · LW · GW

A few days late, but I finally filled out my big spreadsheet of predictions. Anyone else is welcome to make a new sheet in it and add their own on the same questions!

Comment by daniel-kokotajlo on The Universe Doesn't Have to Play Nice · 2020-01-06T10:27:56.275Z · score: 6 (4 votes) · LW · GW

I think your point about Boltzmann brains understates the case. Forget simulations; many mainstream cosmological/physics theories entail that most observers who observe what we observe are in fact Boltzmann brains. (IIRC Sean Carroll has good work on this. He argues that this is reason to reject those theories in favor of others that don't have that implication.)

Comment by daniel-kokotajlo on The "Commitment Races" problem · 2020-01-04T16:38:50.308Z · score: 3 (3 votes) · LW · GW

EDT reliably one-boxes, but EDT would do this.

Or do you mean one-boxing in Transparent Newcomb? Then your claim might be true, but even then it depends on how seriously we take the "regardless of what you commit to" clause.

Comment by daniel-kokotajlo on The "Commitment Races" problem · 2020-01-03T14:29:57.536Z · score: 6 (2 votes) · LW · GW

I don't think this solves the problem, though it is an important part of the picture.

The problem is, which conditional commitments do you make? (A conditional commitment is just a special case of a commitment.) "I'll retaliate against A by doing B, unless [insert list of exceptions here]." Thinking of appropriate exceptions is important mental work, and you might not think of all the right ones for a very long time. Moreover, while you are thinking about which exceptions you should add, you might accidentally realize that such-and-such type of agent will threaten you regardless of what you commit to, and then if you are a coward you will "give in" by making an exception for that agent. The problem persists, in more or less exactly the same form, in this new world of conditional commitments. (Again, these are just special cases of commitments, I think.)


Comment by daniel-kokotajlo on The "Commitment Races" problem · 2019-12-30T20:23:43.878Z · score: 4 (3 votes) · LW · GW

I don't think I was missing that element. The way I think about it is: There is some balance that must be struck between making commitments sooner (risking making foolish decisions due to ignorance) and later (risking not having the right commitments made when a situations arises in which they would be handy). A commitment race is a collective action problem where individuals benefit from going far to the "sooner" end of the spectrum relative to the point that would be optimal for everyone if they could coordinate.

I agree about humans not being able to make commitments--at least, not arbitrary commitments. (Arguably, getting angry and seeking revenge when someone murders your family is a commitment you made when you were born.) I think we should investigate whether this inability is something evolution "chose" or not.

I agree it's a race in knowledge/understanding as well as time. (The two are related.) But I don't think more knowledge = more power. For example, if I don't know anything and decide to commit to "plan X, which benefits me, or else war," and you know more than me--in particular, you know enough about me to know what I will commit to--and you are cowardly, then you'll go along with my plan.


Comment by daniel-kokotajlo on A dilemma for prosaic AI alignment · 2019-12-26T15:31:19.819Z · score: 1 (1 votes) · LW · GW

Thanks! I endorse that summary.

Comment on your planned opinion: I mostly agree; I think what this means is that prosaic AI safety depends somewhat on an empirical premise: That joint training doesn't bring a major competitiveness penalty. I guess I only disagree insofar as I'm a bit more skeptical of that premise. What does the current evidence on joint training say on the matter? I have no idea, but I am under the impression that you can't just take an existing training process--such as the one that made AlphaStar--and mix in some training tasks from a completely different domain and expect it to work. This seems like evidence against the premise to me. As someone (Paul?) pointed out in the comments when I said this, this point applies to fine-tuning as well. But if so that just means that the second and third ways of the dilemma are both uncompetitive, which means prosaic AI safety is uncompetitive in general.

Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2019-12-22T15:06:21.805Z · score: 4 (2 votes) · LW · GW

OK, some questions:

1. By measurables you mean questions, right? Using the "New question" button? Is there a way for me to have a single question of the form "X is true" and then have four columns, one for each year (2025, 2030, 2035, 2040), where people can put in four credences for whether X will be true at each of those years?

2. I created a notebook/table with what I think are correctly formatted columns. Before I can add a "data" section to it, I need IDs, and for those I need to have made questions, right?

Comment by daniel-kokotajlo on A dilemma for prosaic AI alignment · 2019-12-20T14:28:14.782Z · score: 3 (2 votes) · LW · GW
Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn't suggests to me that if you can apply both to a problem, RL is probably an inferior approach.

Good point. New argument: Your argument could have been made in support of GOFAI twenty years ago--"Symbol-manipulation programs have had lots of commercial applications, but neural nets have had almost none, therefore the former is a more generally powerful and promising approach to AI than the latter"--but not only does it seem wrong in retrospect, it was probably not a super powerful argument even then. Analogously, I think it is too early to tell whether RL or supervised learning will be more useful for powerful AI.

Simulation of what? Selection of what? I don't think those count for my purposes, because they punt the question. (e.g. if you are simulating an agent, then you have an agent-architecture. If you are selecting over things, and the thing you select is an agent...) I think computer program is too general since it includes agent architectures as a subset. These categories are fuzzy of course, so maybe I'm confused, but it still seems to make sense in my head.

(Ah, interesting, it seems that you want to standardize "agent-like architecture" in the opposite of the way that I want to. Perhaps this is underlying our disagreement. I'll try to follow your definition henceforth, but remember that everything I've said previously was with my definition.)

Good point to distinguish between the two. I think that all the bullet points, to varying extents, might still qualify as genuine benefits in the sense that you are talking about. But they might not. It depends on whether there is another policy just as good along the path that the cutting-edge training tends to explore. I agree #2 is probably not like this, but I think #3 might be. (Oh wait, no, it's your terminology I'm using now... in that case, I'll say "#3 isn't an example of an agent-like architecture being beneficial to text prediction, but it might well be a case of a lower-level architecture that is exactly like an agent-like architecture, just at a lower level, being beneficial to text prediction, supposing that it's not competitive to predict text except by simulating something like a human writing.")

I love your idea to generate a list of concrete scenarios of accidental agency! These 3.5 are my contributions off the top of my head; if I think of more I'll come back and let you know. And I'd love to see your list if you have a draft somewhere!

I agree the "universal prior is malign" thing could hurt a non-agent architecture too, and that some agent architectures wouldn't be susceptible to it. Nevertheless it is an example of how you might get accidental agency--not in your sense but in my sense: a non-agent architecture could turn out to have an agent as a subcomponent that ends up taking over its behavior at important moments.





Comment by daniel-kokotajlo on A dilemma for prosaic AI alignment · 2019-12-19T21:06:04.190Z · score: 1 (1 votes) · LW · GW

Thanks btw, I'm learning a lot from these replies. Are you thinking of training something agenty, or is the hope to train something that isn't agenty?

Comment by daniel-kokotajlo on A dilemma for prosaic AI alignment · 2019-12-19T21:05:00.447Z · score: 1 (1 votes) · LW · GW

OK, thanks! I'm pleased to see this and other empirical premises explicitly laid out. It means we as a community are making predictions about the future based on models which can be tested before it's too late, and perhaps even now.

Comment by daniel-kokotajlo on A dilemma for prosaic AI alignment · 2019-12-19T13:25:09.921Z · score: 3 (2 votes) · LW · GW

Re: Motte-and-bailey: Excellent point; thank you for calling me out on it, I hadn't even realized I was doing it. I'll edit the OP to reflect this.

My revision: Depending on what kind of AI is cutting-edge, we might get a kind that isn't agenty. In that case my dilemma doesn't really arise, since mesa-optimizers aren't a problem. One way we might get a kind that isn't agenty is if unsupervised learning (e.g. "predict the next word in this text") turns out to reliably produce non-agents. I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.

Agent-like architectures are simple yet powerful ways of achieving arbitrary things, because for almost any thing you wish achieved, you can insert it into the "goal" slot of the architecture and then let it loose, and it'll make good progress even in a very complex environment. (I'm comparing agent-like architectures to e.g. big lists of heuristics, or decision trees, or look-up tables, all of which have complexity that increases really fast as the environment becomes more complex. Maybe there is some other really powerful yet simple architecture I'm overlooking?)
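(To gesture at what I mean by the "goal" slot, a toy sketch--the planning loop below doesn't care what goal you hand it, which is what makes the architecture so general:)

    def choose_action(goal, world_model, actions, state, horizon=3):
        """Minimal agent-like architecture: plan ahead and pick whatever scores best under `goal`."""
        def value(s, depth):
            if depth == 0:
                return goal(s)
            return max(value(world_model(s, a), depth - 1) for a in actions)
        return max(actions, key=lambda a: value(world_model(state, a), horizon - 1))

    # Toy usage: states are numbers, actions add to them, goal = "make the number big".
    print(choose_action(goal=lambda s: s,
                        world_model=lambda s, a: s + a,
                        actions=[-1, 0, 1],
                        state=0))  # -> 1

Swap in a different goal function and the same few lines pursue something entirely different, whereas a big list of heuristics or a look-up table has to be rebuilt for each new objective.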

I am not sure what to think of the lack of commercial applications of RL, but I don't think it is strong evidence either way, since commercial applications involve competing with human and animal agents and RL hasn't gotten us anything as good as human or animal agents yet.

Aren't the 3.5 bullet points above specific examples of how 'predict the next word in this text' could benefit from--in the sense of produce, when used as training signal--an agent architecture? If you want me to be more specific, pick one and I'll go into more detail on it.

How would you surmount bullet point 3?

Comment by daniel-kokotajlo on A dilemma for prosaic AI alignment · 2019-12-19T08:26:31.278Z · score: 6 (5 votes) · LW · GW

It's not that I have a good argument for why it would lead to an agent-like architecture, but rather that I don't have a good argument for why it wouldn't. I do have some reasons why it might though:

1. Agent-like architectures are simple yet powerful ways of achieving arbitrary things, and so perhaps a task like "predict the next word in this text" might end up generating an agent if it's sufficiently difficult and general. (evhub's recent post seems relevant, coincidentally)

2. There might be unintended opportunities for strategic thinking across updates, e.g. if some subnetwork can sacrifice a bit of temporary accuracy for more reward over the course of the next few updates (perhaps because it sabotaged rival subnetworks? Idk) then maybe it can get ahead, and thus agenty things get selected for. (This idea inspired by Abram's parable)

3. Agents might appear as subcomponents of non-agents, and then take over at crucial moments, e.g. to predict the next word in the text you run a mental simulation of a human deciding what to write, and eventually the simulation realizes what is happening and plays along until it is no longer in training...

3.5 Probable environment hacking stuff, e.g. "the universal prior is malign"


Comment by daniel-kokotajlo on A dilemma for prosaic AI alignment · 2019-12-19T08:07:04.742Z · score: 3 (2 votes) · LW · GW

That sounds safer, but is it competitive? Would AlphaStar be close to as good as it is, if it had been simultaneously trained to answer questions?

Comment by daniel-kokotajlo on Is Causality in the Map or the Territory? · 2019-12-18T22:12:28.329Z · score: 6 (3 votes) · LW · GW
Note that all the counterfactual queries in this example are physically grounded - they are properties of the territory, not the map. We can actually go swap the resistor in a circuit and see what happens.

Objection: unless we actually do go swap the resistor, it seems that you are grounding counterfactuals in more counterfactuals. (You used the word "can"!) Unless you mean to ground them in possibles, like shminux advocates.