Daniel Kokotajlo's Shortform 2019-10-08T18:53:22.087Z · score: 5 (2 votes)
Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann 2019-10-07T19:52:19.266Z · score: 45 (12 votes)
Soft takeoff can still lead to decisive strategic advantage 2019-08-23T16:39:31.317Z · score: 110 (46 votes)
The "Commitment Races" problem 2019-08-23T01:58:19.669Z · score: 54 (26 votes)


Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2019-10-10T16:27:03.528Z · score: 2 (2 votes) · LW · GW

The ultimate test will be seeing whether the predictions it makes come true--whether agenty mesa-optimizers arise often, whether humans with tools get outcompeted by agent AGI.

In the meantime, it's not too hard to look for confirming or disconfirming evidence. For example, the fact that militaries and corporations that make a plan and then task their subordinates with strictly following the plan invariably do worse than those who make a plan and then give their subordinates initiative and flexibility to learn and adapt on the fly... seems like confirming evidence. (See: agile development model, the importance of iteration and feedback loops in startup culture, etc.) Whereas perhaps the fact that AlphaZero is so good despite lacking a learning module is disconfirming evidence.

As for a test, well, we'd need something on which proponents and opponents make different predictions, and that might be hard to find. Most tests I can think of now don't work because everyone would agree on what the probable outcome is. I think the best I can do is: Someday soon we might be able to test an agenty architecture and a non-agenty architecture in some big complex novel game environment, and this conjecture would predict that for sufficiently complex and novel environments the agenty architecture would win.

Comment by daniel-kokotajlo on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-09T22:45:39.231Z · score: 1 (1 votes) · LW · GW

I feel like there's a big difference between "similar complexity" and "the same complexity." Like, if we have theory T and then we have theory T* which adds some simple unobtrusive twist to it, we get another theory which is of similar complexity... yet realistically an Occam's-Razor-driven search process is not going to settle on T*, because you only get T* by modifying T. And if I'm wrong about this then it seems like Occam's Razor is broken in general; in any domain there are going to be ways to turn T's into T*'s. But Occam's Razor is not broken in general (I feel).

Maybe this is the argument you anticipate above with "...we aren't actually choosing randomly." Occam's Razor isn't random. Again, I might agree with you that intuitively Occam's Razor seems more useful in physics than in preference-learning. But intuitions are not arguments, and anyhow they aren't arguments that appeared in the text of A&M's paper.

Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2019-10-08T22:07:48.572Z · score: 9 (6 votes) · LW · GW

For the past year I've been thinking about the Agent vs. Tool debate (e.g. thanks to reading CAIS/Reframing Superintelligence) and also about embedded agency and mesa-optimizers and all of these topics seem very related now... I keep finding myself attracted to the following argument skeleton:

Rule 1: If you want anything unusual to happen, you gotta execute a good plan.

Rule 2: If you want a good plan, you gotta have a good planner and a good world-model.

Rule 3: If you want a good world-model, you gotta have a good learner and good data.

Rule 4: Having good data is itself an unusual happenstance, so by Rule 1 if you want good data you gotta execute a good plan.

Putting it all together: Agents are things which have good planner and learner capacities and are hooked up to actuators in some way. Perhaps they also are "seeded" with a decent world-model to start off with. Then, they get a nifty feedback loop going: They make decent plans, which allow them to get decent data, which allows them to get better world-models, which allows them to make better plans and get better data so they can get great world-models and make great plans and... etc. (The best agents will also be improving on their learning and planning algorithms! Humans do this, for example.)

Empirical conjecture: Tools suck; agents rock, and that's why. It's also why agenty mesa-optimizers will arise, and it's also why humans with tools will eventually be outcompeted by agent AGI.

Comment by daniel-kokotajlo on Daniel Kokotajlo's Shortform · 2019-10-08T18:53:22.262Z · score: 19 (6 votes) · LW · GW

My baby daughter was born two weeks ago, and in honor of her existence I'm building a list of about 100 technology-related forecasting questions, which will resolve in 5, 10, and 20 years. Questions like "By the time my daughter is 5/10/20 years old, the average US citizen will be able to hail a driverless taxi in most major US cities." (The idea is, tying it to my daughter's age will make it more fun and also increase the likelihood that I actually go back and look at it 10 years later.)

I'd love it if the questions were online somewhere so other people could record their answers too. Does this seem like a good idea? Hive mind, I beseech you: Help me spot ways in which this could end badly!

On a more positive note, any suggestions for how to do it? Any expressions of interest in making predictions with me?


Comment by daniel-kokotajlo on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-08T13:59:45.186Z · score: 3 (3 votes) · LW · GW

Thanks! OK, so I agree that normally in doing science we are fine with just predicting what will happen, there's no need to decompose into Laws and Conditions. Whereas with value learning we are trying to do more than just predict behavior; we are trying to decompose into Planner and Reward so we can maximize Reward.

However the science case can be made analogous in two ways. First, as Eigil says below, realistically we don't have access to ALL behavior or ALL events, so we will have to accept that the predictor which predicted well so far might not predict well in the future. Thus if Occam's Razor settles on weird degenerate predictors, it might also settle on one that predicts well up until time T but then predicts poorly after that.

Second, (this is the way I went, with counterfactuals) science isn't all about prediction. Part of science is about answering counterfactual questions like "what would have happened if..." And typically the way to answer these questions is by decomposing into Laws + Conditions and then doing a surgical intervention on the conditions and then applying the same Laws to the new conditions.

So, for example, if we use Occam's Razor to find Laws+Conditions for our universe, and somehow it settles on the degenerate pair "Conditions := null, Laws := sequence of events E happens" then all our counterfactual queries will give bogus answers--for example, "what would have happened if we had posted the nuclear launch codes on the Internet?" Answer: "Varying the Conditions but holding the Laws fixed... it looks like E would have happened. So yeah, posting launch codes on the Internet would have been fine, wouldn't have changed anything."
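
To make the launch-codes example concrete, here is a minimal Python sketch of the two decompositions. The toy dynamics and every name in it (`sensible_laws`, `degenerate_laws`, `counterfactual`) are illustrative assumptions of mine, not anything taken from A&M's paper. Both pairs reproduce the observed history E, but only one of them responds to an intervention on the Conditions.

```python
from typing import Callable, Tuple

Conditions = Tuple[int, ...]
Events = Tuple[int, ...]


def sensible_laws(conditions: Conditions) -> Events:
    """Hypothetical 'real' Laws: the events depend on the Conditions."""
    # Toy dynamics (an assumption): each event is the running sum so far.
    total = 0
    events = []
    for c in conditions:
        total += c
        events.append(total)
    return tuple(events)


ACTUAL_CONDITIONS: Conditions = (1, 2, 3)
E: Events = sensible_laws(ACTUAL_CONDITIONS)  # the observed history


def degenerate_laws(conditions: Conditions) -> Events:
    """Degenerate Laws: 'sequence of events E happens', whatever the Conditions."""
    return E


def counterfactual(laws: Callable[[Conditions], Events],
                   new_conditions: Conditions) -> Events:
    """Hold the Laws fixed, surgically replace the Conditions, and re-run."""
    return laws(new_conditions)


# Both decompositions fit the observed history equally well.
assert sensible_laws(ACTUAL_CONDITIONS) == E
assert degenerate_laws(()) == E  # Conditions := null

# Intervene on the Conditions ("post the launch codes").
intervened: Conditions = (1, 2, 100)
print(counterfactual(sensible_laws, intervened))    # history changes
print(counterfactual(degenerate_laws, intervened))  # E again, a bogus answer
```

The point of the sketch is only that predictive fit cannot distinguish the two pairs; the difference shows up when you ask a counterfactual question.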

Comment by daniel-kokotajlo on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-08T01:41:35.691Z · score: 1 (1 votes) · LW · GW

I don't follow?

Comment by daniel-kokotajlo on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-08T01:16:15.770Z · score: 1 (1 votes) · LW · GW

The trick is that you can use the simplest method for constructing E in your statement "L=0 and E just happens." So e.g. if you have some simple Laws l and Conditions c such that l(c) = E, your statement can be "L=0 and l(c) just happens."

Comment by daniel-kokotajlo on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-08T01:06:56.751Z · score: 5 (3 votes) · LW · GW

Thanks! I'm not sure I follow you. Here's what I think you are saying:

--Occam's Razor will be sufficient for predicting human behavior of course; it just isn't sufficient for finding the intended planner-reward pair. Because (A) the simplest way to predict human behavior has nothing to do with planners and rewards, and so (B) the simplest planner-reward pair will be degenerate or weird as A&M argue.

--You agree that this argument also works for Laws+Initial Conditions; Occam's Razor is generally insufficient, not just insufficient for inferring preferences of irrational agents!

--You think the argument is more likely to work for inferring preferences than for Laws+Initial Conditions though.

If this is what you are saying, then I agree with the second and third points but disagree with the first--or at least, I don't see any argument for it in A&M's paper. It may still be true, but further argument is needed. In particular their arguments for (A) are pretty weak, methinks--that's what my section "Objections to the arguments for step 2" is about.

Edit to clarify: By "I agree with the second point" I mean I agree that if the argument works at all, it probably works for Laws+Initial Conditions as well. I don't think the argument works though. But I do think that Occam's Razor is probably insufficient.

Comment by daniel-kokotajlo on Hazard's Shortform Feed · 2019-10-07T20:22:45.000Z · score: 3 (2 votes) · LW · GW

Yep, agreed. I want all my friends and family to read the series... and then have a conversation with me about the ways in which it oversimplifies and misleads, in particular the higher mind vs. primitive mind bit.

On balance though I think it's great that it exists and I predict it will be the gateway drug for a bunch of new rationalists in years to come.

Comment by daniel-kokotajlo on Arguments for moral indefinability · 2019-10-07T16:31:38.067Z · score: 1 (1 votes) · LW · GW

Would it be fair to say that moral indefinability is basically what Yudkowsky was talking about with his slogan "Value is complex"?

What about the stance of Particularism in moral philosophy? On the face of it it seems very different, but I think it may be getting at a similar phenomenon.

Comment by daniel-kokotajlo on Thoughts on Human Models · 2019-10-03T13:11:43.522Z · score: 20 (4 votes) · LW · GW

Wow, now I take the "But what if a bug puts a negation on the utility function" AGI failure mode more seriously:

One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. (From OpenAI

Might be worth adding a link to this episode in the text?

Comment by daniel-kokotajlo on Why Subagents? · 2019-09-09T14:04:27.700Z · score: 6 (3 votes) · LW · GW

Thanks, this is really cool!

I'm a bit concerned about this sort of thing: "The subagents argument offers a theoretical basis for the idea that humans have lots of internal subagents, with competing wants and needs, all constantly negotiating with each other to decide on externally-visible behavior."

A worry I have about the standard representation theorems is that they prove too much; if everything can be represented as having a utility function, then maybe it's not so useful to talk about utility functions. Similarly now I worry: I thought when people talked about subagent theories of mind, they meant something substantial by this--not merely that the mind has incomplete (though still acyclic) preferences!

Comment by daniel-kokotajlo on Algorithmic Similarity · 2019-09-08T16:23:48.088Z · score: 3 (2 votes) · LW · GW

I'm glad you are interested, and I'd love to hear your thoughts on the paper if you read it. I'd love to talk with you too; just send me an email when you'd like and we can skype or something.

What do you mean by "the more technical version of the problem" exactly?

My take right now is that algorithmic similarity (and instantiation), at least the versions of it relevant for consciousness and decision theory and epistemology, will have to be either a brute empirical fact about the world, or a subjective fact about the mind of the agent reasoning about it (like priors and utility functions). What it will not be is some reasonably non-arbitrary property/relation with interesting and useful properties (like Nash equilibria, centers of mass, and temperature).

Comment by daniel-kokotajlo on Algorithmic Similarity · 2019-08-27T19:36:59.803Z · score: 7 (3 votes) · LW · GW

Thanks, this is a good write-up!

Many years ago I wrote my undergraduate thesis on the waterfall problem (though it went by another name to me). Basically, I painstakingly and laboriously transformed an arbitrary human into an arbitrary rock of sufficient size, via a series of imperceptibly tiny steps none of which can be felt by the human. (I did this in imagination, not in reality, to be clear.) The point was to see if any of the steps seemed like good places to draw a line and say "Here, consciousness is starting to go out; the system is starting to be less of a person." As a result I became fairly convinced that there aren't any good places to draw the line. So I guess I'm a waterfall apologist now!

Comment by daniel-kokotajlo on Six AI Risk/Strategy Ideas · 2019-08-27T03:19:01.984Z · score: 13 (7 votes) · LW · GW

I particularly like your "Logical vs. physical risk aversion" distinction, and agree that we should prioritize reducing logical risk. I think acausal trade makes this particularly concrete. If we make a misaligned superintelligence that "plays nice" in the acausal bargaining community I'd think that's better than making an aligned superintelligence that doesn't, because overall it matters far more that the community is nice than that it have a high population of people with our values.

I also really like your point about how providing evidence that AI safety is difficult may be one of the most important reasons to do AI safety research. I guess I'd like to see some empirically grounded analysis of how likely it is that the relevant policymakers and so forth will be swayed by such things. So far it seems like they've been swayed by direct arguments that the problem is hard, and not so much by our failures to make progress. If anything failure of AI safety researchers to make progress seems to encourage their critics.

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-26T18:27:26.247Z · score: 3 (2 votes) · LW · GW

Hmmm, well maybe you are right. I am not a historian, just an armchair general. I look forward to thinking and learning more about this in the future.

I like your point about DSA being potentially multiple & simultaneous.

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-26T18:10:21.392Z · score: 3 (2 votes) · LW · GW

I am skeptical that selling products is sufficient to take over the world, no matter how good the products are. Eventually you raise too much attention and get nationalized or taxed or copied.

In light of your critiques I intend to write a much better version of this post in the future. Thanks! I wrote this one during MSFP as part of their blog post day event, so it was kinda rushed and has lots of room for improvement. I'm very glad to see so much engagement though; it inspires me to make said improvements. Perhaps in the course of doing so I'll change my mind.

Comment by daniel-kokotajlo on The "Commitment Races" problem · 2019-08-26T04:55:58.130Z · score: 1 (1 votes) · LW · GW

If you choose not to simulate the other player, then you can't see them, but they can still see you. So it's destroying one direction of the communication channel. But the direction that remains (them seeing you) is the one most relevant for e.g. whether or not there is a difference between making a commitment and credibly communicating it to your partner. Not simulating the other player is like putting on a blindfold, which might be a good strategy in some contexts but seems kinda like making a commitment: you are committing to act on your priors in the hopes that they'll see you make this commitment and then conform their behavior to the incentives implied by your acting on your priors.

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-26T04:48:40.941Z · score: 1 (1 votes) · LW · GW

OK, so you agree with me about major world powers (nation-states) but still disagree about companies? I think this means we are closer together than it seemed, because I also think that decisive strategic advantage is significantly more likely to happen if a nation-state gets involved than if it's just some private company.

I didn't say "more compute than all the others," I said "more compute than the others," by which I meant more compute than any particular other project, yeah. This is consistent with a large fraction of the world's spending being on compute already. For example, today Deepmind (citation needed) has the largest compute budget of any AI project, but their compute budget is a tiny fraction of the world's total.

I'm not sure whether or not I'm positing a gigantic market failure. Your claim is that if compute is so important for AI technology and AI technology is so useful, the market will either fail or find ways to get a large fraction of its budget spent on a single AI project? This single project would then be a potential source of DSA but it would also be so big already that it could take over the world by selling products instead? I'm putting question marks not out of sarcasm or anything, just genuine uncertainty about what your claim is. Before I can respond to it I need to understand it.

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-26T04:39:26.866Z · score: 2 (2 votes) · LW · GW

Yes, if the coalition is a large fraction of the world then I am saying there is an asymmetry in that the leading project can more easily spy on that large fraction of the world than the other way round. This is because large fractions of the world contain many different people and groups, some of which will leak secrets (or sell secrets) to the leading project, unless extremely and unprecedentedly effective anti-leaking-and-spying measures are implemented across a large fraction of the world. It's hard but doable for one corporation to keep trade secrets from the rest of the economy; how on earth can the rest of the economy keep trade secrets from a powerful corporation?

I don't see how I'm arguing that the proportional importance of spying will go down. The proportional importance of spying will go up precisely because it won't be accelerated as much as AI technology in general will be. (Why don't I think spying will be accelerated as much as AI technology in general? I certainly agree that spying technology will be accelerated as much as or more than AI technology. However I think that spying is a function of several things, only one of which is spying technology, the others being non-technology things like having literal human spies climb through ranks of enemy orgs and also having anti-spying technology.) I envision a future where spying is way more rewarding than any time in history, and yet nevertheless the actual amount of successful spying is less than 10x-100x more than in the past, due to the factors mentioned in the parenthetical.

"The leading project can choose not to sell its technology, but then it just has less money and so falls further and further behind in terms of compute etc. (and at any rate, it needs to be selling something to the other people in order to even be able to afford to use their technology)."

Again, my whole point is that this is only true in the long run. Yes, in the long run a project which relies on other sources of income to buy the things it needs to buy will lose money to projects which sell their innovations. But in the short run empirically it seems that projects can go for years on funding raised from investors and wealthy parent companies. I guess your point is that in a world where the economy is growing super fast due to AI, this won't be true: any parent company or group of investors capable of funding the leading project at year X will be relative paupers by year X+3 unless their project has been selling its tech. Am I right in understanding you here?

(Miscellaneous: I don't think the leading project differs from all the other projects that develop tech and keep it private. Like Wei Dai said, insofar as a company can charge people to use tech without letting the secrets of how to build that tech escape, they will obviously do so. I think our disagreement is about your last two sentences, which I quoted above.)

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-25T00:07:03.061Z · score: 5 (4 votes) · LW · GW

Hmm, OK. I like your point about making profits without giving away secrets.

And yeah I think you (and Paul's comment below) are helping me to get the picture a bit better--because the economy is growing so fast, moonshot projects that don't turn a profit for a while just won't work because the people capable of affording them one year will be paupers by comparison to the people capable of funding AI research the next year (due to the general boom). And so while a tech-hoarding project will still technically have more insights than the economy as a whole, its lead will shrink as its relative funding shrinks.

Another way I think my scenario could happen, though, is if governments get involved. Because governments have the power to tax. Suppose we have a pool of insights that is publicly available, and from it we get this rapidly growing economy fueled by publicly available AI technologies. But then we have a government that taxes this entire economy and funnels the revenue into an AGI project that hoards all its insights. Won't this AGI project have access to more insights than anyone else? If there is an intelligence explosion, won't it happen first (and/or faster) inside the project than outside? We don't have to worry about getting outcompeted by other parts of the economy, since those parts are getting taxed. The funding for our AGI project will rise in proportion to the growth in the AI sector of the economy, even though our AGI project is hoarding all its secrets.

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-24T23:56:54.015Z · score: 5 (4 votes) · LW · GW

Thanks for the detailed reply! I should say at this point that I'm not particularly confident in my views on this topic; just trying to put forth my take on things in the spirit of improvement. So I wouldn't be surprised if I end up thinking you are right after more discussion.

A coalition strong enough to prevent the world's leading project from maintaining and lengthening its lead would need to have some way of preventing the leading project from accessing the innovations of the coalition. Otherwise the leading project will free-ride off the research done by the coalition. For this reason I think that a coalition would look very different from the world economy; in order to prevent the leading project from accessing innovations deployed in the world economy you would need to have an enforced universal embargo on them pretty much, and if you have that much political power, why stop there? Why not just annex them or shut them down?

A successful coalition (that isn't politically powerful enough to embargo or annex or stop their rival) would need to be capable of preventing information from leaking out to the rival project, and that suggests to me that they would need to concentrate power in the hands of a few individuals (the CEOs and boards of the companies in the coalitions, for example). A distributed, more anarchic architecture would not be able to prevent leaks and spies.

And of course, even this hypothetical coalition would be hard to form, for all the usual reasons. Sure, the 2nd through 10th projects could gang up to decisively beat the leading project. But it's also true that the 2nd through 10th most powerful nation-states could gang up to decisively beat the most powerful nation-state. Yet this isn't the norm. More concretely, I expect some projects to ally with each other, but the result to be two or three coalitions of almost equal strength rather than many.

I agree that we seem to disagree about the importance of compute/energy/data vs. smart people talking to each other, and that this disagreement seems relevant. If AI progress was just a matter of compute, for example, then... well actually mightn't there still be a decisive strategic advantage in that case? Wouldn't one project have more compute than the others, and thus pull ahead so long as funds lasted?

This gets us into the toy model & its problems. I don't think I understand your alternative model. I maybe don't get what you mean by trading. Does one party giving money to another party in return for access to their technology or products count? If so, then I think my original model still stands: The leading project will be able to hoard technology/innovation and lengthen its lead over the rest of the world so long as it still has funding to buy the necessary stuff. I agree that it will be burning money fast if it doesn't sell/trade its innovations and instead tries to hoard them, but empirically it seems that it's quite possible for leading projects to go several years in this state.

"Right now deep learning moonshots don't trade with the rest of the world because they don't make anything of much value, but if they were creating really impactful technology then the projects which traded would be radically faster than the projects which just used their innovations in house."

I think it depends not on how impactful their technology is but on how impactful their technology is relative to the perceived impact of hoarding it and going for a decisive strategic advantage. Technology hoarding happens sometimes even in very competitive industries for this reason, and it is the norm among militaries. It seems very possible to me that companies which are producing truly astounding AI technologies--stuff that seems plausibly only a few years away from human-level AGI--will have no problem finding deep-pocketed investors willing to throw money at them for a few years. Again, maybe governments will get involved, in which case this is almost trivial.

I think overall your model (as I understand it) is that people won't be (successfully) moonshotting for AGI until AI progress is already making the economy grow very fast, and also at this point progress towards AGI will mostly be a matter of how much money you have to buy compute and stuff. So even a deep-pocketed funder like Google wouldn't be able to compete for two years with shallow-pocketed, more short-sighted projects that sell their tech and reinvest the profits. Is this your view?

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-23T22:33:47.300Z · score: 12 (7 votes) · LW · GW

I disagree about 1939 Germany--Sure, their economy would collapse, but they'd be able to conquer Western Europe before it collapsed, and use the resources and industry set up there. Even if they couldn't do that they would be able to reorient their economy in a year or two and then conquer the world.

I agree about the Afghanistan case but I'm not sure what lessons to draw from it for the AGI scenario in particular.

Comment by daniel-kokotajlo on The "Commitment Races" problem · 2019-08-23T18:40:28.462Z · score: 1 (1 votes) · LW · GW

Thanks, edited to fix!

I agree with your push towards metaphilosophy.

I didn't mean to suggest that the folk theorem proves anything. Nevertheless here is the intuition: The way the folk theorem proves any status quo is possible is by assuming that players start off assuming everyone else will grim trigger them for violating that status quo. So in a two-player game, if both players start off assuming player 1 will grim trigger player 2 for violating player 1's preferred status quo, then player 1 will get what they want. One way to get this to happen is for player 1 to be "earlier in logical time" than player 2 and make a credible commitment.

As for updatelessness: Well, updateless agents follow the policy that is optimal from the perspective of the credences they have at the time they go updateless. So e.g. if there is a cowardly agent who simulates you at that time or later and then caves to your demands (if you make any) then an updateless agent will be a bully and make demands, i.e. commit to punishing people it identifies as cowards who don't do what it wants. But of course updateless agents are also cowards themselves, in the sense that the best policy from the perspective of credences C is to cave in to any demands that have already been committed to according to C. I don't have a super clear example of how this might lead to disaster, but I intend to work one out in the future...

Same goes for my own experience. I don't have a clear example in mind of something bad that would have happened to me if I had actually self-modified, but I get a nervous feeling about it.

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-23T18:09:28.213Z · score: 3 (2 votes) · LW · GW

What if it's not actual GDP growth though, but potential GDP growth? As in, innovations in AI technology leading to more and faster innovation in AI technology... but the wider economy as a whole not being affected that much initially, just as how the whole deep learning revolution of the past 5 years hasn't really changed the economy much yet.

Comment by daniel-kokotajlo on Soft takeoff can still lead to decisive strategic advantage · 2019-08-23T18:08:20.665Z · score: 3 (4 votes) · LW · GW

Hmm, good point. So the idea is that faster GDP growth will put more pressure on companies (and governments?) to make lots of profit quickly or else go obsolete? Yeah that seems somewhat plausible... I'd like to see someone analyze this in more detail.

Comment by daniel-kokotajlo on The "Commitment Races" problem · 2019-08-23T02:45:10.860Z · score: 3 (2 votes) · LW · GW

I think in the acausal context at least that wrinkle is smoothed out.

In a causal context, the situation is indeed messy as you say, but I still think commitment races might happen. For example, why is [blindfold+laserscanner] a better strategy than just blindfold? It loses to the blindfold strategy, for example. Whether or not it is better than blindfold depends on what you think the other agent will do, and hence it's totally possible that we could get a disastrous crash. (Just imagine that for whatever reason both agents think the other agent will probably not do pure blindfold. This can totally happen, especially if the agents don't think they are strongly correlated with each other and sometimes even if they do, e.g. if they use CDT.) The game of chicken doesn't cease being a commitment race when we add the ability to blindfold and the ability to visibly attach laserscanners.
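
Here is a rough Python sketch of that reasoning as a toy game of chicken. Everything specific in it is an assumption made for illustration: the payoff numbers, the names (`action`, `payoff`, `best_strategy`), and in particular the rule that a laserscanner agent swerves only when it detects a pure blindfold and otherwise drives straight. It is meant to show the shape of the argument, not to settle it.

```python
# Illustrative payoffs (assumptions): crashing is far worse than losing.
CRASH, WIN, LOSE, TIE = -10, 1, -1, 0

STRATEGIES = ("blindfold", "laserscanner")


def action(me: str, other: str) -> str:
    """What a strategy ends up doing against the other strategy."""
    if me == "blindfold":
        # Pure blindfold: committed to driving straight, sees nothing.
        return "straight"
    # Assumed laserscanner rule: swerve only if the other is blindfolded.
    return "swerve" if other == "blindfold" else "straight"


def payoff(me: str, other: str) -> int:
    """Payoff to `me` when the two strategies meet."""
    mine, theirs = action(me, other), action(other, me)
    if mine == "straight" and theirs == "straight":
        return CRASH
    if mine == "straight":
        return WIN
    if theirs == "straight":
        return LOSE
    return TIE


def best_strategy(p_other_blindfold: float) -> str:
    """Strategy with the highest expected payoff, given my belief about the other."""
    def expected(me: str) -> float:
        return (p_other_blindfold * payoff(me, "blindfold")
                + (1 - p_other_blindfold) * payoff(me, "laserscanner"))
    return max(STRATEGIES, key=expected)


# Which strategy is better depends on what you expect the other agent to do.
print(best_strategy(0.9))  # expects a blindfold
print(best_strategy(0.1))  # expects no pure blindfold

# If BOTH agents think the other will probably not go pure blindfold,
# both pick the same strategy and meet each other.
a = best_strategy(0.1)
b = best_strategy(0.1)
print(a, b, payoff(a, b))
```

Under these assumed numbers, neither strategy dominates the other, which is all the argument needs: the outcome turns on each agent's beliefs about the other.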

Comment by daniel-kokotajlo on Just Imitate Humans? · 2019-07-28T16:09:22.030Z · score: 4 (2 votes) · LW · GW

I did mean current ML methods, I think. (Maybe we mean different things by that term.) Why wouldn't they make mesa-optimizers, if they were scaled up enough to successfully imitate humans well enough to make AGI?

For your note, I'm not sure I understand the example. It seems to me that a successfully blending-in/deceptively-aligned mesa-optimizer would, with each gradient update, get smarter but its values would not change--I believe the mesa-alignment paper calls this "value crystallization." The reason is that changing its values would not affect its behavior, since its behavior is based primarily on its epistemology: it correctly guesses the base objective and then attempts to optimize for it.

Comment by daniel-kokotajlo on Just Imitate Humans? · 2019-07-27T03:56:13.413Z · score: 11 (3 votes) · LW · GW

I think this is an idea worth exploring. The biggest problem I have with it right now is that it seems like current ML methods would get us mesa-optimizers.

To spell it out a bit: At first the policy would be a jumble of heuristics that does decently well. Eventually, though, it would have to be something more like an agent, to mimic humans. But the first agent that forms wouldn't also be the last, the perfectly accurate one. Rather, it would be somewhat accurate. Thenceforth further training could operate on the AI's values and heuristics to make it more human-like... OR it could operate on the AI's values and heuristics to make it more rational and smart so that it can predict and then mimic human behavior better. And the latter seems more likely to me.

So what we'd end up with is something that is similar to a human, except with values that are more random and alien, and maybe also more rational and smart. This seems like exactly the sort of thing we are trying to avoid.

Comment by daniel-kokotajlo on Ought: why it matters and ways to help · 2019-07-27T03:20:25.916Z · score: 16 (6 votes) · LW · GW

Thanks for this post; I don't know much about Ought other than what you've just said, so sorry if this has already been answered elsewhere:

You say that "Designing an objective that incentivizes experts to reveal what they know seems like a critical step in AI alignment."

It also seems like a crucial step in pretty much all institution design. Surely there is a large literature on this already? Surely there have been scientific experiments run on this already? What does the state of modern science on this question look like right now, and does Ought have plans to collaborate with academics in some manner? A quick skim of the Ought website didn't turn up any references to existing literature.

Comment by daniel-kokotajlo on [deleted post] 2019-07-22T20:07:43.648Z

" Conversely, finding out things about the player does not reduce my uncertainty about what value the dice will reveal very much. For instance, if I find out that the player really wants the sum of the values revealed by the dice to be seven, I will think that they are about as likely to roll a seven as they would be to roll a one if I had learned that the player really wanted to avoid rolling a seven. "

You mean "...about as likely to roll a seven as they would be if I had learned that they really wanted to avoid rolling a seven."

In general this one needs a lot of proof-reading.

Not sure you need all those examples--maybe you could delete the toilet example, for instance?

Again, nice job! I look forward to reading the comments.

Comment by daniel-kokotajlo on [deleted post] 2019-07-22T19:54:51.911Z

Feel free to delete these comments as you update the draft! These are just my rough rough thoughts, don't take them too seriously.

--I like the intro. Catchy puzzle paragraph followed by explanation of what you are doing and why.

--I think the bread example didn't fit as well with me for some reason. It felt both unnecessarily long and not quite the right angle. In particular, I don't think inequality is the issue; I think it is our loss of influence. Like, I think there are tons of bad actors in the world and I would be very happy to see them all lose influence to a single good or even just good-ish actor. Inequality would be increasing, but that would be a good thing in that circumstance. Another example: I might think that Moloch will eat all our children unless we achieve some sort of singleton or otherwise concentrate power massively; I may even be willing to have that power concentrated in the hands of someone with radically different values than me because I prefer that outcome to the Moloch-outcome. (Maybe this isn't a good example because maybe if we think Moloch will eat us all then that means we think we have very little influence over the future?)

Here's maybe what I would suggest instead: "If I learned there was a new technology that was going to give its owners a thousand times as much bread, I wouldn't be worried unless I thought it would diminish the amount of bread I had--and why would it? But if I learn there is a new technology that will give its owners a thousand times as much control over the future, that seems to imply that I'll have less control myself." Not sure this is better, but it's what I came up with.

--The Elliot Sober thing is super interesting and I'd love to read more about it. Make sure you include a link or two!

Comment by daniel-kokotajlo on Jeff Hawkins on neuromorphic AGI within 20 years · 2019-07-16T12:35:31.407Z · score: 1 (1 votes) · LW · GW

Interesting, thanks!

Thinking of the cortical columns as models in an ensemble... Have ML people tried ensemble models with tens of thousands of models? If so, are they substantially better than using only a few dozen? If they aren't, then why does the brain need so many?

Comment by daniel-kokotajlo on Experimental Open Thread April 2019: Socratic method · 2019-04-10T02:43:59.448Z · score: 4 (2 votes) · LW · GW

(Sorry for delay, I thought I had notifications set up but apparently not)

I don't at the moment have a comprehensive taxonomy of the possible scenarios. The two I mentioned above... well, at a high level, what's going on is that (a) CAIS seems implausible to me in various ways--e.g. it seems to me that more unified and agenty AI would be able to outcompete comprehensive AI services in a variety of important domains, and (b) I haven't heard a convincing account of what's wrong with the classic scenario. The accounts that I've heard usually turn out to be straw men (e.g. claiming that the classic scenario depends on intelligence being a single, unified trait) or merely point out that other scenarios are plausible too (e.g. Paul's point that we could get lots of crazy transformative AI things happening in the few years leading up to human-level AGI).

Comment by daniel-kokotajlo on Experimental Open Thread April 2019: Socratic method · 2019-04-01T02:44:15.074Z · score: 2 (2 votes) · LW · GW

Claim: The "classical scenario" of AI foom as promoted by e.g. Bostrom, Yudkowsky, etc. is more plausible than the scenario depicted in Drexler's Comprehensive AI Services.

Comment by daniel-kokotajlo on What failure looks like · 2019-03-17T21:59:16.313Z · score: 13 (5 votes) · LW · GW

I think that's a straw man of the classic AI-related catastrophe scenarios. Bostrom's "covert preparation" --> "Treacherous turn" --> "takeover" story maps pretty nicely to Paul's "seek influence via gaming tests" --> "they are now more interested in controlling influence after the resulting catastrophe than continuing to play nice with existing institutions and incentives" --> "One day leaders may find that despite their nominal authority they don’t actually have control over what these institutions do. For example, military leaders might issue an order and find it is ignored. This might immediately prompt panic and a strong response, but the response itself may run into the same problem, and at that point the game may be up."

Comment by daniel-kokotajlo on In SIA, reference classes (almost) don't matter · 2019-01-18T23:31:49.396Z · score: 1 (1 votes) · LW · GW

Sometimes when people say SIA is reference-class independent & SSA isn't, they mean it as an argument that SIA is better than SSA, because it is philosophically less problematic: The choice of reference class is arbitrary, so if we don't have to make that choice, our theory is overall more elegant. This was the sort of thing I had in mind.

On that definition, SSA is only more arbitrary than SIA if it makes the reference class different from the class of all observers (which some proponents of SSA have done). SIA has a concept of observer too, or at least a concept of observer-indistinguishable-from-me (which presumably is a proper subset of observer, though now that I think about it this might be challenged. Maybe I was doubly wrong--maybe SIA only needs the concept of observer-indistinguishable-from-me).

Comment by daniel-kokotajlo on In SIA, reference classes (almost) don't matter · 2019-01-17T15:36:04.428Z · score: 1 (1 votes) · LW · GW

Ah, my mistake, sorry. I was thinking of a different definition of reference-class-independent than you were; I should have read more closely.

Comment by daniel-kokotajlo on XOR Blackmail & Causality · 2019-01-17T14:07:27.849Z · score: 3 (2 votes) · LW · GW

Maybe I'm late to the party, in which case sorry about that & I look forward to hearing why I'm wrong, but I'm not convinced that epsilon-exploration is a satisfactory way to ensure that conditional probabilities are well-defined. Here's why:

What ends up happening if I do action A often depends on why I did it. For example, if someone else is deciding how to treat me, and I defect against them, but it's because of epsilon-exploration rather than because that's what my reasoning process concluded, then they would likely be inclined to forgive me and cooperate with me in the future. So the conditional probability will be well-defined, but defined incorrectly--it will say that the probability of them cooperating with me in the future, conditional on me defecting now, is high.
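Here is a minimal sketch of that worry as a simulation. Everything in it is an assumption made up for illustration: the two forgiveness rates, the exploration rate, and the function name are hypothetical, and the agent's considered policy is assumed to be "always cooperate," so every defection in its data comes from exploration.

```python
import random

# Hypothetical numbers, chosen only for illustration.
FORGIVE_ACCIDENT = 0.9    # partner cooperates after a defection it reads as exploration
FORGIVE_DELIBERATE = 0.1  # partner cooperates after a defection it reads as intended

def estimate_coop_given_defect(epsilon=0.05, rounds=100_000, seed=0):
    """Empirical P(partner cooperates later | I defect now), learned from
    data in which every defection was caused by epsilon-exploration."""
    rng = random.Random(seed)
    defections = 0
    cooperated_after = 0
    for _ in range(rounds):
        # The agent's reasoning always says "cooperate"; it defects only by exploring.
        if rng.random() < epsilon:
            defections += 1
            # The partner sees the defection as accidental and usually forgives.
            if rng.random() < FORGIVE_ACCIDENT:
                cooperated_after += 1
    return cooperated_after / defections

learned = estimate_coop_given_defect()
print(f"learned from exploration: {learned:.2f}")
print(f"assumed rate after a deliberate defection: {FORGIVE_DELIBERATE:.2f}")
```

Under these assumptions the learned conditional probability sits near the accidental-defection rate, which is the well-defined-but-wrong value described above: it says little about what would happen if the agent defected because its reasoning told it to.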

I hear there is a way to fiddle with the foundations of probability theory so that conditional probabilities are taken as basic and ordinary probabilities are defined in terms of them. Maybe this would solve the problem?

Comment by daniel-kokotajlo on In SIA, reference classes (almost) don't matter · 2019-01-15T03:08:12.319Z · score: 1 (1 votes) · LW · GW

Yes, but note that SSA can get this same result. All they have to do is say that their reference class is R--whatever set the SIA person uses, they use the same set. If they make this move, then they are reference-class-independent to exactly the same degree as SIA.

Comment by daniel-kokotajlo on Will humans build goal-directed agents? · 2019-01-07T21:53:42.641Z · score: 1 (1 votes) · LW · GW

Thanks for doing this--it's helpful for me as well. I have some questions/quibbles:

Isn't #2 as goal-directed as the human it mimics, in all the relevant ways? If I learn that a certain machine runs a neural net that mimics Hitler, shouldn't I worry that it will try to take over the world? Maybe I don't get what you mean by "mimics."

What exactly is the difference between an Oracle and a Tool? I thought an Oracle was a kind of Tool; I thought Tool was a catch-all category for everything that's not a Sovereign or a Genie.

I'm skeptical of this notion of "homeostatic" superintelligence. It seems to me that nations like the USA are fully goal-directed in the relevant senses; they exhibit the basic AI drives, they are capable of things like the treacherous turn, etc. As for Windows, how is it an agent at all? What does it do? Allocate memory resources across currently-being-run programs? How does it do that--is there an explicit function that it follows to do the allocation (e.g. give all programs equal resources), or does it do something like consequentialist reasoning?

On #6, it seems to me that it might actually be correct to say that the swarm is an agent--it's just that the swarm has different goals than each of its individual members. Maybe Moloch is an agent after all! On the other hand, something seems not quite right about this--what is Moloch's utility function? Whatever it is, Moloch seems particularly uninterested in self-preservation, which makes it hard to think of it as an agent with normal-ish goals. (Argument: Suppose someone were to initiate a project that would, with high probability, kill Moloch forever in 100 years' time. Suppose the project has no other effects, such that almost all humans think it's a good idea. And everyone knows about it. All it would take to stop the project is a million people voting against it. Now, is there a sense in which Moloch would resist it or seek to undermine the project? It would maaaybe incentivize most people not to contribute to the project (tragedy of the commons!) but that's it. So either Moloch isn't an agent, or it's an agent that doesn't care about dying, or it's an agent that doesn't know it's going to die, or it's a very weak agent--can't even stop one project!)

Comment by daniel-kokotajlo on Will humans build goal-directed agents? · 2019-01-07T21:17:55.475Z · score: 4 (2 votes) · LW · GW

I get why the MCTS is important, but what about the training? It seems to me that if we stop training AlphaGo (Zero) and I play a game against it, it's goal-directed even though we have stopped training it.

Comment by daniel-kokotajlo on Boltzmann brain decision theory · 2018-09-13T13:02:12.550Z · score: 1 (1 votes) · LW · GW

I didn't quite follow that last section. How do considerations about boundedness and "only matters if it makes something happen differently" undermine the reasoning you laid out in the "FDT" section, which seems solid to me? Here's my attempt at a counterargument; hopefully we can have a discussion & clear things up that way.

I am arguing for this thesis: As an altruistic FDT/UDT agent, the optimal move is always "think happy thoughts," even when you aren't thinking about Boltzmann Brains or FDT/UDT.

In the space of boltzmann-brains-that-might-be-me, probability/measure is not distributed evenly. Simpler algorithms are more likely/have more measure.

I am probably a simpler algorithm.

So while it is true that for every action a I could choose, there is some chunk of BB's out there that chooses a, and hence in some sense my choice makes no difference to what the BB's do but rather only to which ones I am logically correlated with, it's also true that my choice controls the choice of the largest chunk of BB's, and so if I choose a then the largest chunk of BB's chooses a, and if I choose b then the largest chunk of BB's chooses b.

So I should think happy thoughts.

The argument I just gave was designed to address your point "naively making yourself happy means that your Boltzmann brain copies will be happy: but this isn't actually increasing the happiness across all Boltzmann brains, just changing which ones are copies of you" but I may have misunderstood it.

P.S. I know earlier you argued that the entropy of a BB doesn't matter because its contribution to the probability is dwarfed by the contribution of the mass. But as long as it's nonzero, I think my argument will work: Higher-entropy BB configurations will be more likely, holding mass constant. (Perhaps I should replace "simpler" in the above argument with "higher-entropy" then.)

Comment by daniel-kokotajlo on Paradoxes in all anthropic probabilities · 2018-06-21T03:20:01.973Z · score: 5 (3 votes) · LW · GW

Which interpretation of probability do you use? I go with standard subjective Bayesianism: Probabilities are your credences are your degrees of belief.

So, there's nothing contradictory or incoherent about believing that you will believe something else in the future. Trivial case: Someone will brainwash you in the future and you know this. Why do you think your own beliefs are right? First of all, why do I need to answer that question in order to coherently have those beliefs? Not every belief can be justified in that way. Secondly, if I follow SSA, here's my justification: "Well, here are my priors. Here is my evidence. I then conditionalized on the evidence, and this is what I got. That future version of me has the same priors but different evidence, so they got a different result." Why is that not justification enough?

Yes, it's weird when you are motivated to force your future copy to do things. Perhaps we should do for probability what we did for decision theory, and talk about agents that have the ability to irrevocably bind their future selves. (Isn't this basically what you think we should do?)

But it's not incoherent or senseless to think that yes, I have credence X now and in the future I will have credence Y. Just as it isn't incoherent or senseless to wish that your future self would refuse the blackmail even though your future self would actually decide to give in.

Comment by daniel-kokotajlo on Paradoxes in all anthropic probabilities · 2018-06-21T03:08:55.172Z · score: 4 (2 votes) · LW · GW

As reductios of anthropic views go, these are all pretty mild. Abandoning conservation of expected evidence isn't exactly an un-biteable bullet. And "Violating causality" is particularly mild, especially for those of us who like non-causal decision theories. As a one-boxer I've been accused of believing in retrocausality dozens of times... sticks and stones, you know. This sort of "causality violation" seems similarly frivolous. Oh, and the SSA reference class arbitrariness thing can be avoided by steelmanning SSA to make it more elegant--just get rid of the reference class idea and do it with centered worlds. SSA is what you get if you just do ordinary Bayesian conditionalization on centered worlds instead of on possible worlds. (Which is actually the more elegant and natural way of doing it, since possible worlds are a weird restriction on the sorts of sentences we use. Centered worlds, by contrast, are simply maximally consistent sets of sentences, full stop.) As for changing the probability of past events... this isn't mysterious in principle. We change the probability of past events all the time. Probabilities are just our credences in things! More seriously though, let A be the hypothetical state of the past light-cone that would result in your choosing to stretch your arm ten minutes from now, and B be the hypothetical state of the past light-cone that would result in your choosing to not stretch your arm. A and B are past events, but you should be uncertain about which one obtained until about ten minutes from now, at which point (depending on what you choose!) the probability of A will increase or decrease.
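To make the centered-worlds version of SSA concrete, here is a small sketch using Sleeping Beauty as the test case. The setup is my own illustrative assumption, not something from the post: each possible world's prior is split evenly among its centers, and the helper names (centered_prior, conditionalize, world_prob) are hypothetical.

```python
from fractions import Fraction

def centered_prior(world_prior, centers):
    """Split each possible world's prior evenly among its centers
    (an assumed SSA-style prior over centered worlds)."""
    prior = {}
    for world, p in world_prior.items():
        share = p / len(centers[world])
        for center in centers[world]:
            prior[(world, center)] = share
    return prior

def conditionalize(prior, evidence):
    """Ordinary Bayesian conditionalization: drop the centered worlds
    ruled out by the evidence, then renormalize."""
    kept = {cw: p for cw, p in prior.items() if evidence(cw)}
    total = sum(kept.values())
    return {cw: p / total for cw, p in kept.items()}

def world_prob(dist, world):
    """Total credence in a possible world, summed over its centers."""
    return sum(p for (w, _), p in dist.items() if w == world)

# Sleeping Beauty: one awakening if heads, two if tails.
world_prior = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
centers = {"heads": ["monday"], "tails": ["monday", "tuesday"]}
prior = centered_prior(world_prior, centers)

awake = conditionalize(prior, lambda cw: True)               # "I am awake" rules nothing out
monday = conditionalize(prior, lambda cw: cw[1] == "monday")  # then learn it is Monday

print(world_prob(awake, "heads"))   # 1/2
print(world_prob(monday, "heads"))  # 2/3
```

No reference class appears anywhere in the sketch; the only anthropic ingredient is the assumed even split of each world's prior across its centers.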

There are strong reductios in the vicinity though, if I recall correctly. (I did my MA on this stuff, but it was a while ago so I'm a little rusty.)

FNC-type views have the result that (a) we almost instantly become convinced, no matter what we experience, that the universe is an infinite soup of random noise occasionally coalescing to form Boltzmann Brains, because this is the simplest hypothesis that assigns probability 1 to the data; (b) we stay in this state forever and act accordingly--which means thinking happy thoughts, or something like that, whether we are average utilitarians or total utilitarians or egoists.

SIA-type views are as far as I can tell incoherent, in the following sense: The population size of universes grows much faster than their probability can shrink. So if you want to say that their probability is proportional to their population size... how? (Flag: I notice I am confused about this part.) A more down-to-earth way of putting this problem is that the hypothesis in which there is one universe is dominated by the hypothesis in which there are 3^^^^3 copies of that universe in parallel dimensions, which in turn is dominated by the hypothesis in which there are 4^^^^^4...
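Here is a toy version of that normalization worry, written out. The prior 2^{-n} and the population 4^n are assumptions I picked purely for illustration; nothing here depends on those particular numbers except that population outgrows the prior's decay.

```latex
% Toy assumption: hypotheses H_1, H_2, ... with prior P(H_n) = 2^{-n}
% and population N_n = 4^n (both chosen for illustration only).
\[
P_{\mathrm{SIA}}(H_n)
  \;=\; \frac{P(H_n)\,N_n}{\sum_{k=1}^{\infty} P(H_k)\,N_k}
  \;=\; \frac{2^{-n}\cdot 4^{n}}{\sum_{k=1}^{\infty} 2^{-k}\cdot 4^{k}}
  \;=\; \frac{2^{n}}{\sum_{k=1}^{\infty} 2^{k}} .
\]
% The denominator diverges, so the population-weighted credences
% cannot be normalized into a probability distribution.
```

In this toy case every individual hypothesis ends up with weight that is swamped by the larger ones after it, which is the same shape as the one-universe versus 3^^^^3-copies domination described above.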

SSA-type views are the only game in town, as far as I'm concerned--except for the "Let's abandon probability entirely and just do decision theory" idea you favor. I'm not sure what to make of it yet. Anyhow, the big problem I see for SSA-type views is the one you mention about using the ability to create tons of copies of yourself to influence the world. That seems weird all right. I'd like to avoid that consequence if possible. But it doesn't seem worse than weird to me yet. It doesn't seem... un-biteable.

EDIT: I should add that I think your conclusion is probably right--I think your move away from probability and towards decision theory seems very promising. As we went updateless in decision theory, so too should we go updateless in probability. Something like that (I have to think & read about it more). I'm just objecting to the strong wording in your arguments to get there. :)

Comment by daniel-kokotajlo on Physics has laws, the Universe might not · 2018-06-20T16:08:07.354Z · score: 3 (2 votes) · LW · GW

Some thoughts:

(1) "What does the term 'physical law' mean?" This is a longstanding debate in philosophy. I think you'd benefit from reading up on the literature.

(2) "It means that someone knowing that law can predict with some accuracy the state of the universe at some point in the future from its state at the time of observation." Nitpick: The present vs. future stuff is a red herring. For example, we use the laws to predict the past also.

(3) The question I'd ask about your proposal to identify laws with predictability is: What is predictability? Do you mean, the actual ratio of true to false predictions made using the law is high? Or do you mean something more robust--if the observer had made many predictions using the law, most of them would have been true? Or probably would have been true? Or what? Notice how it's hard to say what the second and third formulations mean without invoking laws. (We can use laws to ground counterfactuals, or counterfactuals to ground laws, but the hope would be to ground both of them in something less mysterious.)

Comment by daniel-kokotajlo on Anthropics made easy? · 2018-06-20T00:56:14.830Z · score: 10 (2 votes) · LW · GW

Just wanting to second what Charlie says here. As best as I can tell the decision-theoretic move made in the Boltzmann Brains section doesn't work; Neal's FNC has the result that (a) we become extremely confident that we are Boltzmann brains, and (b) we end up having an extremely high time and space discount rate at first approximation, and at second approximation we end up acting like solipsists as well, i.e. live in the moment, care only about yourself, etc. This is true even if you are standing in front of a button that would save 10^40 happy human lives via colonizing the light-cone, because a low-entropy region the size of the light cone is unbelievably less common than a low-entropy region the size of a matrix-simulation pod.

Comment by daniel-kokotajlo on Washington, D.C.: Definitions/Labels · 2018-06-03T19:33:13.983Z · score: 3 (1 votes) · LW · GW

Anyone else here? I'm at a table close to the center of the courtyard. Blue hat.

Comment by daniel-kokotajlo on When is unaligned AI morally valuable? · 2018-05-25T14:37:07.039Z · score: 6 (3 votes) · LW · GW

A paperclip-maximizer could turn out to be much, much worse than a nuclear war extinction, depending on how suffering subroutines and acausal trade work.

An AI dedicated to the preservation of the human species but not aligned to any other human values would, I bet, be much, much worse than a nuclear war extinction. At least please throw in some sort of "good health and happiness" condition! (And that would not be nearly enough in my opinion.)

Comment by daniel-kokotajlo on Decoupling vs Contextualising Norms · 2018-05-18T23:42:58.324Z · score: 11 (3 votes) · LW · GW

The example you use is already CW-enough that high-decouplers may be suspicious of or hostile to the point you are trying to make.

Then again, maybe anything else would be so far removed from our shared experience that it wouldn't serve as a quick and powerful illustration of your point.

Here are some suggestions made with both of these points in mind:

--The original example Scott uses about a Jew in future Czarist Russia constantly hearing about how powerful Jews are and how evil Israel is.

--Flipping the script a bit, how about an example in which someone goes around saying "86% of rationalists are straight white men" (or something like that, I don't know the actual number).

--Or: "Effective Altruists are usually people who are biased towards trying to solve their problems using math."

Come to think of it, I think including one of those flip-script examples would be helpful in other ways as well.