Comment by daniel-kokotajlo on Just Imitate Humans? · 2019-07-28T16:09:22.030Z · score: 4 (2 votes) · LW · GW

I did mean current ML methods, I think. (Maybe we mean different things by that term.) Why wouldn't they make mesa-optimizers, if they were scaled up enough to successfully imitate humans well enough to make AGI?

For your note, I'm not sure I understand the example. It seems to me that a successfully blending-in/deceptively-aligned mesa-optimizer would, with each gradient update, get smarter but its values would not change--I believe the mesa-alignment paper calls this "value crystallization." The reason is that changing its values would not affect its behavior, since its behavior is based primarily on its epistemology: it correctly guesses the base objective and then attempts to optimize for it.
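
A toy sketch can illustrate the claim that gradient updates leave a deceptively aligned agent's values untouched. Everything in it is an assumption made for illustration: the scalar `values` and `accuracy` parameters, the quadratic loss, and the function names are hypothetical and are not taken from the mesa-optimization paper.

```python
def training_loss(values, accuracy, base_target=1.0):
    """Loss of a toy deceptively aligned agent during training.

    values:   what the agent terminally wants (hypothetical scalar)
    accuracy: how well it models the base objective, between 0 and 1
    """
    # The agent acts on its *guess* of the base objective, not on its values,
    # so `values` never enters the computation of the action.
    guessed_target = accuracy * base_target
    action = guessed_target
    return (action - base_target) ** 2


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


# Gradient of the training loss with respect to each parameter in turn.
grad_values = numerical_gradient(lambda v: training_loss(v, 0.8), 3.0)
grad_accuracy = numerical_gradient(lambda k: training_loss(3.0, k), 0.8)

print("d loss / d values   =", grad_values)
print("d loss / d accuracy =", grad_accuracy)
```

In this toy the loss is flat in `values` and sloped in `accuracy`, so training pressure only sharpens the agent's model of the base objective. Whether real training dynamics behave this way is exactly what is in question.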

Comment by daniel-kokotajlo on Just Imitate Humans? · 2019-07-27T03:56:13.413Z · score: 11 (3 votes) · LW · GW

I think this is an idea worth exploring. The biggest problem I have with it right now is that it seems like current ML methods would get us mesa-optimizers.

To spell it out a bit: At first the policy would be a jumble of heuristics that does decently well. Eventually, though, it would have to be something more like an agent, to mimic humans. But the first agent that forms wouldn't also be the last, the perfectly accurate one. Rather, it would be somewhat accurate. Thenceforth further training could operate on the AI's values and heuristics to make it more human-like... OR it could operate on the AI's values and heuristics to make it more rational and smart so that it can predict and then mimic human behavior better. And the latter seems more likely to me.

So what we'd end up with is something that is similar to a human, except with values that are more random and alien, and maybe also more rational and smart. This seems like exactly the sort of thing we are trying to avoid.

Comment by daniel-kokotajlo on Ought: why it matters and ways to help · 2019-07-27T03:20:25.916Z · score: 16 (6 votes) · LW · GW

Thanks for this post; I don't know much about Ought other than what you've just said, so sorry if this has already been answered elsewhere:

You say that "Designing an objective that incentivizes experts to reveal what they know seems like a critical step in AI alignment."

It also seems like a crucial step in pretty much all institution design. Surely there is a large literature on this already? Surely there have been scientific experiments run on this already? What does the state of modern science on this question look like right now, and does Ought have plans to collaborate with academics in some manner? A quick skim of the Ought website didn't turn up any references to existing literature.

Comment by daniel-kokotajlo on [deleted post] 2019-07-22T20:07:43.648Z

" Conversely, finding out things about the player does not reduce my uncertainty about what value the dice will reveal very much. For instance, if I find out that the player really wants the sum of the values revealed by the dice to be seven, I will think that they are about as likely to roll a seven as they would be to roll a one if I had learned that the player really wanted to avoid rolling a seven. "

You mean "...about as likely to roll a seven as they would be if I had learned that they really wanted to avoid rolling a seven."

In general this one needs a lot of proof-reading.

Not sure you need all those examples--maybe you could delete the toilet example, for instance?

Again, nice job! I look forward to reading the comments.

Comment by daniel-kokotajlo on [deleted post] 2019-07-22T19:54:51.911Z

Feel free to delete these comments as you update the draft! These are just my rough rough thoughts, don't take them too seriously.

--I like the intro. Catchy puzzle paragraph followed by explanation of what you are doing and why.

--I think the bread example didn't fit as well with me for some reason. It felt both unnecessarily long and not quite the right angle. In particular, I don't think inequality is the issue, I think it is our loss of influence. Like, I think there are tons of bad actors in the world and I would be very happy to see them all lose influence to a single good or even just good-ish actor. Inequality would be increasing, but that would be a good thing in that circumstance. Another example: I might think that Moloch will eat all our children unless we achieve some sort of singleton or otherwise concentrate power massively; I may even be willing to have that power concentrated in the hands of someone with radically different values than me because I prefer that outcome to the Moloch-outcome. (Maybe this isn't a good example because maybe if we think Moloch will eat us all then that means we think we have very little influence over the future?)

Here's maybe what I would suggest instead: "If I learned there was a new technology that was going to give its owners a thousand times as much bread, I wouldn't be worried unless I thought it would diminish the amount of bread I had--and why would it? But if I learn there is a new technology that will give its owners a thousand times as much control over the future, that seems to imply that I'll have less control myself." Not sure this is better, but it's what I came up with.

--The Elliott Sober thing is super interesting and I'd love to read more about it. Make sure you include a link or two!

Comment by daniel-kokotajlo on Jeff Hawkins on neuromorphic AGI within 20 years · 2019-07-16T12:35:31.407Z · score: 1 (1 votes) · LW · GW

Interesting, thanks!

Thinking of the cortical columns as models in an ensemble... Have ML people tried ensemble models with tens of thousands of models? If so, are they substantially better than using only a few dozen? If they aren't, then why does the brain need so many?

Comment by daniel-kokotajlo on Experimental Open Thread April 2019: Socratic method · 2019-04-10T02:43:59.448Z · score: 4 (2 votes) · LW · GW

(Sorry for delay, I thought I had notifications set up but apparently not)

I don't at the moment have a comprehensive taxonomy of the possible scenarios. The two I mentioned above... well, at a high level, what's going on is that (a) CAIS seems implausible to me in various ways--e.g. it seems to me that more unified and agenty AI would be able to outcompete comprehensive AI systems in a variety of important domains, and (b) I haven't heard a convincing account of what's wrong with the classic scenario. The accounts that I've heard usually turn out to be straw men (e.g. claiming that the classic scenario depends on intelligence being a single, unified trait) or merely pointing out that other scenarios are plausible too (e.g. Paul's point that we could get lots of crazy transformative AI things happening in the few years leading up to human-level AGI).

Comment by daniel-kokotajlo on Experimental Open Thread April 2019: Socratic method · 2019-04-01T02:44:15.074Z · score: 2 (2 votes) · LW · GW

Claim: The "classical scenario" of AI foom as promoted by e.g. Bostrom, Yudkowsky, etc. is more plausible than the scenario depicted in Drexler's Comprehensive AI Services.

Comment by daniel-kokotajlo on What failure looks like · 2019-03-17T21:59:16.313Z · score: 13 (5 votes) · LW · GW

I think that's a straw man of the classic AI-related catastrophe scenarios. Bostrom's "covert preparation" --> "Treacherous turn" --> "takeover" story maps pretty nicely to Paul's "seek influence via gaming tests" --> "they are now more interested in controlling influence after the resulting catastrophe then continuing to play nice with existing institutions and incentives" --> " One day leaders may find that despite their nominal authority they don’t actually have control over what these institutions do. For example, military leaders might issue an order and find it is ignored. This might immediately prompt panic and a strong response, but the response itself may run into the same problem, and at that point the game may be up. "

Comment by daniel-kokotajlo on In SIA, reference classes (almost) don't matter · 2019-01-18T23:31:49.396Z · score: 1 (1 votes) · LW · GW

Sometimes when people say SIA is reference-class independent & SSA isn't, they mean it as an argument that SIA is better than SSA, because it is philosophically less problematic: The choice of reference class is arbitrary, so if we don't have to make that choice, our theory is overall more elegant. This was the sort of thing I had in mind.

On that definition, SSA is only more arbitrary than SIA if it makes the reference class different from the class of all observers (which some proponents of SSA have done). SIA has a concept of observer too, or at least a concept of observer-indistinguishable-from-me (which presumably is a proper subset of observer, though now that I think about it this might be challenged. Maybe I was doubly wrong--maybe SIA only needs the concept of observer-indistinguishable-from-me).

Comment by daniel-kokotajlo on In SIA, reference classes (almost) don't matter · 2019-01-17T15:36:04.428Z · score: 1 (1 votes) · LW · GW

Ah, my mistake, sorry. I was thinking of a different definition of reference-class-independent than you were; I should have read more closely.

Comment by daniel-kokotajlo on XOR Blackmail & Causality · 2019-01-17T14:07:27.849Z · score: 3 (2 votes) · LW · GW

Maybe I'm late to the party, in which case sorry about that & I look forward to hearing why I'm wrong, but I'm not convinced that epsilon-exploration is a satisfactory way to ensure that conditional probabilities are well-defined. Here's why:

What ends up happening if I do action A often depends on why I did it. For example, if someone else is deciding how to treat me, and I defect against them, but it's because of epsilon-exploration rather than because that's what my reasoning process concluded, then they would likely be inclined to forgive me and cooperate with me in the future. So the conditional probability will be well-defined, but defined incorrectly--it will say that the probability of them cooperating with me in the future, conditional on me defecting now, is high.
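
A small simulation can make the worry concrete. This is only a sketch under stated assumptions: the forgiveness probabilities, the value of epsilon, and the premise that my reasoning always recommends cooperating are all hypothetical numbers chosen for illustration.

```python
import random

# All numbers below are hypothetical, chosen only to illustrate the argument.
EPSILON = 0.05                # chance that exploration overrides my policy
P_FORGIVE_EXPLORATION = 0.9   # partner forgives a defection caused by exploration
P_FORGIVE_DELIBERATE = 0.1    # partner forgives a defection I actually chose


def estimate_forgiveness_given_defect(n_rounds=100_000, seed=0):
    """Estimate P(partner cooperates later | I defect now) from experience.

    Assumption: my reasoning always says "cooperate", so every defection
    in the data was caused by epsilon-exploration.
    """
    rng = random.Random(seed)
    defections = 0
    forgiven = 0
    for _ in range(n_rounds):
        if rng.random() < EPSILON:          # exploration forces a defection
            defections += 1
            if rng.random() < P_FORGIVE_EXPLORATION:
                forgiven += 1
    return forgiven / defections


estimate = estimate_forgiveness_given_defect()
print(f"learned P(cooperate later | defect) ~ {estimate:.2f}")
print(f"value for a deliberate defection     = {P_FORGIVE_DELIBERATE:.2f}")
```

The learned conditional probability is well-defined, but it tracks the exploration cases only, so it says little about what would happen after a defection that came out of my actual reasoning.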

I hear there is a way to fiddle with the foundations of probability theory so that conditional probabilities are taken as basic and ordinary probabilities are defined in terms of them. Maybe this would solve the problem?

Comment by daniel-kokotajlo on In SIA, reference classes (almost) don't matter · 2019-01-15T03:08:12.319Z · score: 1 (1 votes) · LW · GW

Yes, but note that SSA can get this same result. All they have to do is say that their reference class is R--whatever set the SIA person uses, they use the same set. If they make this move, then they are reference-class-independent to exactly the same degree as SIA.

Comment by daniel-kokotajlo on Will humans build goal-directed agents? · 2019-01-07T21:53:42.641Z · score: 1 (1 votes) · LW · GW

Thanks for doing this--it's helpful for me as well. I have some questions/quibbles:

Isn't #2 as goal-directed as the human it mimics, in all the relevant ways? If I learn that a certain machine runs a neural net that mimics Hitler, shouldn't I worry that it will try to take over the world? Maybe I don't get what you mean by "mimics."

What exactly is the difference between an Oracle and a Tool? I thought an Oracle was a kind of Tool; I thought Tool was a catch-all category for everything that's not a Sovereign or a Genie.

I'm skeptical of this notion of "homeostatic" superintelligence. It seems to me that nations like the USA are fully goal-directed in the relevant senses; they exhibit the basic AI drives, they are capable of things like the treacherous turn, etc. As for Windows, how is it an agent at all? What does it do? Allocate memory resources across currently-being-run programs? How does it do that--is there an explicit function that it follows to do the allocation (e.g. give all programs equal resources), or does it do something like consequentialist reasoning?

On #6, it seems to me that it might actually be correct to say that the swarm is an agent--it's just that the swarm has different goals than each of its individual members. Maybe Moloch is an agent after all! On the other hand, something seems not quite right about this--what is Moloch's utility function? Whatever it is, Moloch seems particularly uninterested in self-preservation, which makes it hard to think of it as an agent with normal-ish goals. (Argument: Suppose someone were to initiate a project that would, with high probability, kill Moloch forever in 100 years' time. Suppose the project has no other effects, such that almost all humans think it's a good idea. And everyone knows about it. All it would take to stop the project is a million people voting against it. Now, is there a sense in which Moloch would resist it or seek to undermine the project? It would maaaybe incentivize most people not to contribute to the project (tragedy of the commons!) but that's it. So either Moloch isn't an agent, or it's an agent that doesn't care about dying, or it's an agent that doesn't know it's going to die, or it's a very weak agent--can't even stop one project!)

Comment by daniel-kokotajlo on Will humans build goal-directed agents? · 2019-01-07T21:17:55.475Z · score: 4 (2 votes) · LW · GW

I get why the MCTS is important, but what about the training? It seems to me that if we stop training AlphaGo (Zero) and I play a game against it, it's goal-directed even though we have stopped training it.

Comment by daniel-kokotajlo on Boltzmann brain decision theory · 2018-09-13T13:02:12.550Z · score: 1 (1 votes) · LW · GW

I didn't quite follow that last section. How do considerations about boundedness and "only matters if it makes something happen differently" undermine the reasoning you laid out in the "FDT" section, which seems solid to me? Here's my attempt at a counterargument; hopefully we can have a discussion & clear things up that way.

I am arguing for this thesis: As an altruistic FDT/UDT agent, the optimal move is always "think happy thoughts," even when you aren't thinking about Boltzmann Brains or FDT/UDT.

In the space of Boltzmann-brains-that-might-be-me, probability/measure is not distributed evenly. Simpler algorithms are more likely/have more measure.

I am probably a simpler algorithm.

So while it is true that for every action a I could choose, there is some chunk of BB's out there that chooses a, and hence in some sense my choice makes no difference to what the BB's do but rather only to which ones I am logically correlated with, it's also true that my choice controls the choice of the largest chunk of BB's, and so if I choose a then the largest chunk of BB's chooses a, and if I choose b then the largest chunk of BB's chooses b.

So I should think happy thoughts.

The argument I just gave was designed to address your point "naively making yourself happy means that your Boltzmann brain copies will be happy: but this isn't actually increasing the happiness across all Boltzmann brains, just changing which ones are copies of you" but I may have misunderstood it.

P.S. I know earlier you argued that the entropy of a BB doesn't matter because its contribution to the probability is dwarfed by the contribution of the mass. But as long as it's nonzero, I think my argument will work: Higher-entropy BB configurations will be more likely, holding mass constant. (Perhaps I should replace "simpler" in the above argument with "higher-entropy" then.)

Comment by daniel-kokotajlo on Paradoxes in all anthropic probabilities · 2018-06-21T03:20:01.973Z · score: 5 (3 votes) · LW · GW

Which interpretation of probability do you use? I go with standard subjective Bayesianism: Probabilities are your credences are your degrees of belief.

So, there's nothing contradictory or incoherent about believing that you will believe something else in the future. Trivial case: Someone will brainwash you in the future and you know this. Why do you think your own beliefs are right? First of all, why do I need to answer that question in order to coherently have those beliefs? Not every belief can be justified in that way. Secondly, if I follow SSA, here's my justification: "Well, here are my priors. Here is my evidence. I then conditionalized on the evidence, and this is what I got. That future version of me has the same priors but different evidence, so they got a different result." Why is that not justification enough?

Yes, it's weird when you are motivated to force your future copy to do things. Perhaps we should do for probability what we did for decision theory, and talk about agents that have the ability to irrevocably bind their future selves. (Isn't this basically what you think we should do?)

But it's not incoherent or senseless to think that yes, I have credence X now and in the future I will have credence Y. Just as it isn't incoherent or senseless to wish that your future self would refuse the blackmail even though your future self would actually decide to give in.

Comment by daniel-kokotajlo on Paradoxes in all anthropic probabilities · 2018-06-21T03:08:55.172Z · score: 4 (2 votes) · LW · GW

As reductios of anthropic views go, these are all pretty mild. Abandoning conservation of expected evidence isn't exactly an un-biteable bullet. And "Violating causality" is particularly mild, especially for those of us who like non-causal decision theories. As a one-boxer I've been accused of believing in retrocausality dozens of times... sticks and stones, you know. This sort of "causality violation" seems similarly frivolous. Oh, and the SSA reference class arbitrariness thing can be avoided by steelmanning SSA to make it more elegant--just get rid of the reference class idea and do it with centered worlds. SSA is what you get if you just do ordinary Bayesian conditionalization on centered worlds instead of on possible worlds. (Which is actually the more elegant and natural way of doing it, since possible worlds are a weird restriction on the sorts of sentences we use. Centered worlds, by contrast, are simply maximally consistent sets of sentences, full stop.) As for changing the probability of past events... this isn't mysterious in principle. We change the probability of past events all the time. Probabilities are just our credences in things! More seriously though, let A be the hypothetical state of the past light-cone that would result in your choosing to stretch your arm ten minutes from now, and B be the hypothetical state of the past light-cone that would result in your choosing to not stretch your arm. A and B are past events, but you should be uncertain about which one obtained until about ten minutes from now, at which point (depending on what you choose!) the probability of A will increase or decrease.

There are strong reductios in the vicinity though, if I recall correctly. (I did my MA on this stuff, but it was a while ago so I'm a little rusty.)

FNC-type views have the result that (a) we almost instantly become convinced, no matter what we experience, that the universe is an infinite soup of random noise occasionally coalescing to form Boltzmann Brains, because this is the simplest hypothesis that assigns probability 1 to the data; (b) we stay in this state forever and act accordingly--which means thinking happy thoughts, or something like that, whether we are average utilitarians or total utilitarians or egoists.

SIA-type views are as far as I can tell incoherent, in the following sense: The population size of universes grows much faster than their probability can shrink. So if you want to say that their probability is proportional to their population size... how? (Flag: I notice I am confused about this part.) A more down-to-earth way of putting this problem is that the hypothesis in which there is one universe is dominated by the hypothesis in which there are 3^^^^3 copies of that universe in parallel dimensions, which in turn is dominated by the hypothesis in which there are 4^^^^^4...
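
The normalization worry can be sketched with a deliberately mild toy family of hypotheses. The specific rates are assumptions for illustration only: the prior halves at each step while the population grows tenfold, which is far slower than the 3^^^^3-style growth mentioned above.

```python
from fractions import Fraction


def sia_weights(max_k):
    """Unnormalized SIA weights (prior x population) for H_1 .. H_max_k."""
    weights = []
    for k in range(1, max_k + 1):
        prior = Fraction(1, 2 ** k)   # assumption: prior halves each step
        population = 10 ** k          # assumption: population grows tenfold
        weights.append(prior * population)
    return weights


def share_of_largest(max_k):
    """Fraction of the total SIA weight held by the most populous hypothesis."""
    weights = sia_weights(max_k)
    return weights[-1] / sum(weights)


for max_k in (2, 5, 10, 20):
    total = sum(sia_weights(max_k))
    print(max_k, float(share_of_largest(max_k)), float(total))
```

However far the family is extended, the most populous hypothesis keeps more than four fifths of the weight and the total keeps growing, so in this toy there is no limiting distribution to normalize to.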

SSA-type views are the only game in town, as far as I'm concerned--except for the "Let's abandon probability entirely and just do decision theory" idea you favor. I'm not sure what to make of it yet. Anyhow, the big problem I see for SSA-type views is the one you mention about using the ability to create tons of copies of yourself to influence the world. That seems weird all right. I'd like to avoid that consequence if possible. But it doesn't seem worse than weird to me yet. It doesn't seem... un-biteable.

EDIT: I should add that I think your conclusion is probably right--I think your move away from probability and towards decision theory seems very promising. As we went updateless in decision theory, so too should we go updateless in probability. Something like that (I have to think & read about it more). I'm just objecting to the strong wording in your arguments to get there. :)

Comment by daniel-kokotajlo on Physics has laws, the Universe might not · 2018-06-20T16:08:07.354Z · score: 3 (2 votes) · LW · GW

Some thoughts:

(1) "What does the term "Physical law?" mean?" This is a longstanding debate in philosophy; I think you'd benefit from reading up on the literature.

(2) " It means that someone knowing that law can predict with some accuracy the state of the universe at some point in the future from its state at the time of observation." Nitpick: The present vs. future stuff is a red herring. For example, we use the laws to predict the past also.

(3) The question I'd ask about your proposal to identify laws with predictability is: What is predictability? Do you mean, the actual ratio of true to false predictions made using the law is high? Or do you mean something more robust--if the observer had made many predictions using the law, most of them would have been true? Or probably would have been true? Or what? Notice how it's hard to say what the second and third formulations mean without invoking laws. (We can use laws to ground counterfactuals, or counterfactuals to ground laws, but the hope would be to ground both of them in something less mysterious.)

Comment by daniel-kokotajlo on Anthropics made easy? · 2018-06-20T00:56:14.830Z · score: 10 (2 votes) · LW · GW

Just wanting to second what Charlie says here. As best as I can tell the decision-theoretic move made in the Boltzmann Brains section doesn't work; Neal's FNC has the result that (a) we become extremely confident that we are Boltzmann brains, and (b) we end up having an extremely high time and space discount rate at first approximation and at second approximation we end up acting like solipsists as well, i.e. live in the moment, care only about yourself, etc. This is true even if you are standing in front of a button that would save 10^40 happy human lives via colonizing the light-cone. Because a low-entropy region the size of the light cone is unbelievably less common than a low-entropy region the size of a matrix-simulation pod.

Comment by daniel-kokotajlo on Washington, D.C.: Definitions/Labels · 2018-06-03T19:33:13.983Z · score: 3 (1 votes) · LW · GW

Anyone else here? I'm at a table close to the center of the courtyard. Blue hat.

Comment by daniel-kokotajlo on When is unaligned AI morally valuable? · 2018-05-25T14:37:07.039Z · score: 6 (3 votes) · LW · GW

A paperclip-maximizer could turn out to be much, much worse than a nuclear war extinction, depending on how suffering subroutines and acausal trade work.

An AI dedicated to the preservation of the human species but not aligned to any other human values would, I bet, be much much worse than a nuclear war extinction. At least please throw in some sort of "good health and happiness" condition! (And that would not be nearly enough in my opinion.)

Comment by daniel-kokotajlo on Decoupling vs Contextualising Norms · 2018-05-18T23:42:58.324Z · score: 11 (3 votes) · LW · GW

The example you use is already CW-enough that high-decouplers may be suspicious of or hostile to the point you are trying to make.

Then again, maybe anything else would be too far removed from our shared experience that it wouldn't serve as a quick and powerful illustration of your point.

Here are some suggestions made with both of these points in mind:

--The original example Scott uses about a Jew in future Czarist Russia constantly hearing about how powerful Jews are and how evil Israel is.

--Flipping the script a bit, how about an example in which someone goes around saying "86% of rationalists are straight white men" (or something like that, I don't know the actual number).

--Or: "Effective Altruists are usually people who are biased towards trying to solve their problems using math."

Come to think of it, I think including one of those flip-script examples would be helpful in other ways as well.

Comment by daniel-kokotajlo on Double Cruxing the AI Foom debate · 2018-05-01T14:01:33.012Z · score: 2 (1 votes) · LW · GW

"However, it’s important to keep in mind that human society does not yet do things that evolution considers an example of a “fast foom.” To the extent that evolution cares about anything, it’s number of individuals around. Perhaps it’s interested in other metrics, such as ability to change the genes over time. From the perspective of evolution, humans are exhibiting a slow takeoff right now. They are exponentially rising in population and they exist in many climates, but this is still a continuous process implemented on genes and individuals, which are working in evolutionary scales."

Hmmm. (1) Evolution might also care about things like mass extinctions and habitat changes, and we've brought those about in a flash by evolutionary timescales. Besides, I'm not sure it's helpful to divide things up by what evolution would and wouldn't care about. (2) Humans have the ability to do discontinuous things to our genes or population, we just haven't exercised it yet. Analogously, an AGI project might have the ability to do a fast takeoff and then choose not to for some reason. This scenario, I believe, is for practical purposes in the "Fast takeoff" box--it has similar strategic implications. (3) Yes, progress is continuous when you zoom in enough to short enough timescales. But on the timescales evolution usually deals with, human population growth and spread around the world have been discontinuous. (Looking at the Wiki article, the smallest unit of measurement on the page is 1000 years, when we exclude extinction events caused by humans!)

Comment by daniel-kokotajlo on God Help Us, Let’s Try To Understand Friston On Free Energy · 2018-03-05T20:08:22.171Z · score: 19 (6 votes) · LW · GW

"No. They're isomorphic, via the Complete Class Theorem. Any utility/cost function that grows sub-super-exponentially (ie: for which Pascal's Mugging doesn't happen) can be expressed as a distribution, and used in the free-energy principle. You can get the intuition by thinking, "This goal specifies how often I want to see outcome X (P), versus its disjoint cousins Y and Z that I want to see such-or-so often (1-P).""

Can you please link me to more on this? I was under the impression that Pascal's mugging happens for any utility function that grows at least as fast as the probabilities shrink, and the probabilities shrink exponentially for normal probability functions. (For example: In the toy model of the St. Petersburg problem, the utility function grows exactly as fast as the probability function shrinks, resulting in infinite expected utility for playing the game.)
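
For reference, here is the standard St. Petersburg calculation, assuming utility is linear in the payoff, a fair coin is tossed until the first heads, and the payoff is $2^n$ when the first heads lands on toss $n$:

```latex
\mathbb{E}[U]
  \;=\; \sum_{n=1}^{\infty} \Pr(\text{first heads on toss } n)\cdot U(2^{n})
  \;=\; \sum_{n=1}^{\infty} 2^{-n}\cdot 2^{n}
  \;=\; \sum_{n=1}^{\infty} 1
  \;=\; \infty
```

Each term contributes exactly 1, so the sum diverges even though the probabilities shrink exponentially.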

Also: As I understand them, utility functions aren't of the form "I want to see X P often and Y 1-P often." They are more like "X has utility 200, Y has utility 150, Z has utility 24..." Maybe the form you are talking about is a special case of the form I am talking about, but I don't yet see how it could be the other way around. As I'm thinking of them, utility functions aren't about what you see at all. They are just about the world. The point is, I'm confused by your explanation & would love to read more about this.