Whether or not to get insurance should have nothing to do with what makes one sleep – again, it is a mathematical decision with a correct answer.
Don't be overly naive consequentialist about this. "Nothing" is an overstatement.
Peace of mind can absolutely be one of the things you are purchasing with an insurance contract. If your Kelly calculation says that motorcycle insurance is worth $899 a month, and costs $900 a month, but you'll spend time worrying about not being insured if you don't buy it, and won't if you do, I fully expect that is worth more than $1 a month.
But do be actual consequentialist about it. If the value of the insurance is more like $10, but the cost is $900, I doubt peace of mind about this one thing is worth $890 a month.
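A rough sketch of the comparison above, under assumptions that are mine rather than the post's: a single possible loss, full coverage, log-wealth (Kelly) utility, and peace of mind treated as a flat dollar amount. The names `max_kelly_premium` and `worth_buying` are made up for illustration.

```python
import math


def max_kelly_premium(wealth: float, loss: float, p_loss: float) -> float:
    """Largest premium a log-wealth (Kelly) bettor would pay for full coverage.

    Assumes one possible loss of size `loss` with probability `p_loss`,
    and that the insurance removes that loss completely.
    """
    if not 0.0 <= p_loss <= 1.0:
        raise ValueError("p_loss must be a probability")
    if not 0.0 <= loss < wealth:
        raise ValueError("log utility needs 0 <= loss < wealth")
    # Expected log-wealth when uninsured.
    uninsured = p_loss * math.log(wealth - loss) + (1.0 - p_loss) * math.log(wealth)
    # Insured wealth is (wealth - premium) for certain; equate the logs and solve.
    return wealth - math.exp(uninsured)


def worth_buying(premium, wealth, loss, p_loss, peace_of_mind=0.0):
    """True if the premium is covered by the Kelly value plus a dollar value for peace of mind."""
    return premium <= max_kelly_premium(wealth, loss, p_loss) + peace_of_mind
```

With a small `peace_of_mind`, a premium slightly above the Kelly value still passes, while a premium far above it does not, which is the distinction being drawn between the $1 gap and the $890 gap.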
I have a habit of reading footnotes as soon as they are linked, and your footnote says that you won with queen odds before the call to guess what odds you'd win at, creating a minor spoiler.
Is it important that negentropy be the result of subtracting from the maximum entropy? It seemed a sensible choice, up until it introduced infinities and made every state's negentropy infinite. (Also, if you subtract from 0, then two identical states should have the same negentropy, even in different systems. Unsure if that's useful or harmful.)
Though perhaps that's important for noting that reducing an infinite system to a finite macrostate is an infinite reduction? I'm not sure if I understand how (or perhaps when?) that's more useful than having it be defined as subtracted from 0, such that finite macrostates have finite negentropy, and infinite macrostates have -infinite negentropy (showing that you really haven't reduced it at all, which, as far as I understand with infinities, you haven't, by definition).
Back in Reward is not the optimization target, I wrote a comment, which received a (small I guess) amount of disagreement.
I intended the important part of that comment to be the link to Adaptation-Executers, not Fitness-Maximizers. (And more precisely the concept named in that title, and less about things like superstimuli that are mentioned in the article) But the disagreement is making me wonder if I've misunderstood both of these posts more than I thought. Is there not actually much relation between those concepts?
There was, obviously, other content to the comment, and that could be the source of disagreement. But I only have that there was disagreement to go on, and I think it would be bad for my understanding of the issue to assume that's where the disagreement was, if it wasn't.
When I tried to answer why we don't trade with ants myself, communication was one of the first things (I can't remember what was actually first) I considered. But I worry it may be more analogous to AI than argued here.
We sort of can communicate with ants. We know to some degree what makes them tick, it's just we mostly use that communication to lie to them and tell them this poison is actually really tasty. The issue may be less that communication is impossible, and more that it's too costly to figure out, and so no one tries to become Antman even if they could cut their janitorial costs by a factor of 7.
The next thought I had was that, if I were to try to get ants to clean my room, I think it's likely that the easiest line towards that is not figuring out how to communicate, but breeding some ants with different behavior (e.g. search for small bits of food, instead of large bits. This seems harder than that sentence suggests, but still probably easier than learning to speak ant). I don't like what that would be analogous to in human-AI interactions.
I think it's possible that an AI could fruitfully trade with humans. While it lacks a body, posting an ad on Craigslist to get someone to move something heavy is probably easier than figuring out how to hijack a wifi-enabled crane or something.
But I don't know how quickly that changes. If the AI is trying to build a sci-fi gadget, it's possible that an instruction set to build it is long or complicated enough that a human has trouble following it accurately. The costs of writing intuitive instructions, and also designing the gadget such that idiot-proof construction is possible could be high enough that it's better to do it itself.
I interpret OP (though this is colored by the fact that I was thinking this before I read this) as saying Adaptation-Executers, not Fitness-Maximizers, but about ML. At which point you can open the reference category to all organisms.
Gradient descent isn't really different from what evolution does. It's just a bit faster, and takes a slightly more direct line. Importantly, it's not more capable of avoiding local maxima (per se, at least).
So, I want to note a few things. The original Eliezer post was intended to argue against this line of reasoning:
I occasionally run into people who say something like, "There's a theoretical limit on how much you can deduce about the outside world, given a finite amount of sensory data."
He didn't worry about compute, because that's not a barrier on the theoretical limit. And in his story, the entire human civilization had decades to work on this problem.
But you're right, in a practical world, compute is important.
I feel like you're trying to make this take as much compute as possible.
Since you talked about headers, I feel like I need to reiterate that, when we are talking to a neural network, we do not add the extra data. The goal is to communicate with the neural network, so we intentionally put it in easier to understand formats.
In the practical cases for this to come up (e.g. a nascent superintelligence figuring out physics faster than we expect), we probably will also be inputting data in an easy to understand format.
Similarly, I expect you don't need to check every possible esoteric format. The likelihood of the image using an encoding like 61 bits per pixel, with 2 for red, 54 for green and 5 for blue is just, very low, a priori. I do admit I'm not sure if only using "reasonable" formats would cut down the possibilities into the computable realm (obviously depends on definitions of reasonable, though part of me feels like you could (with a lot of work) actually have an objective likeliness score of various encodings). But certainly it's a lot harder to say that it isn't than just saying "f(x) = (63 pick x), grows very fast."
Though, since I don't have a good sense for whether "reasonable" ones would be a more computable number, I should update in your direction. (I tried to look into something sort of analogous, and the most common 200 passwords cover a little over 4% of all used passwords, which, isn't large enough for me to feel comfortable expecting that the most "likely" 1,000 formats would cover a significant quantity of the probability space, or anything.)
(Also potentially important: modern neural nets don't really receive things as a string of bits, but instead as a string of numbers, nicely split up into separate nodes. Yes, those numbers are made of bits, but they're floating point numbers, and the way neural nets interact with them is through all the floating point operations, so I don't think the neural net actually touches the bit representation of the number in any meaningful way.)
"you're jumping to the conclusion that you can reliably differentiate between..."
I think you absolutely can, and the idea was already described earlier.
You pay attention to regularities in the data. In most non-random images, pixels near to each other are similar. In an MxN image, the pixel below a[i] is a[i+M], whereas in an NxM image, it's a[i+N]. If, across the whole image, the difference between a[i] and a[i+M] is smaller than the difference between a[i] and a[i+N], it's more likely an MxN image. I expect you could find the resolution by searching all possible resolutions from 1x<length> to <length>x1, and finding which minimizes the average distance between "adjacent" pixels.
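A minimal sketch of that search, assuming a single-channel image flattened row by row into a list of numbers. The function names are hypothetical, and I skip the degenerate widths 1 and <length>, since a width of 1 would just measure horizontal neighbours, which are also similar in most images.

```python
def mean_vertical_difference(pixels, width):
    """Average |a[i] - a[i+width]| over every position that has a pixel 'below' it."""
    pairs = len(pixels) - width
    return sum(abs(pixels[i] - pixels[i + width]) for i in range(pairs)) / pairs


def guess_width(pixels):
    """Pick the row width whose vertically 'adjacent' pixels are most similar.

    Only widths that divide the length evenly are tried; widths 1 and
    len(pixels) are skipped as degenerate (an assumption of this sketch).
    """
    n = len(pixels)
    candidates = [w for w in range(2, n) if n % w == 0]
    if not candidates:
        raise ValueError("length has no non-trivial divisors to try")
    return min(candidates, key=lambda w: mean_vertical_difference(pixels, w))


# Toy image, 12 wide and 8 tall: brightness changes quickly along a row
# and slowly down a column.
toy_image = [10 * col + row for row in range(8) for col in range(12)]
print(guess_width(toy_image))
```

This only scores one kind of regularity. A real image would probably need several such scores combined, and multiples of the true width can score nearly as well on very smooth images.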
Similarly (though you'd likely do this first), you can tell the difference between RGB and RGBA. If you have (255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0), this is probably 4 red pixels in RGB, and not a fully opaque red pixel, followed by a fully transparent green pixel, followed by a fully transparent blue pixel in RGBA. It could be 2 pixels that are mostly red and slightly green in 16 bit RGB, though. Not sure how you could piece that out.
Aliens would probably do a different encoding. We don't know what the rods and cones in their eye-equivalents are, and maybe they respond to different colors. Maybe it's not Red Green Blue, but instead Purple Chartreuse Infrared. I'm not sure this matters. It just means your eggplants look red.
I think, even if it had 5 (or 6, or 20) channels, this regularity would be borne out, with the difference between value i and value i+5 being smaller than the difference between value i and value i+1, 2, 3, or 4.
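The channel-count version of the same idea can be sketched like this, assuming the data is a flat list of interleaved channel values. The names and the `max_channels` cutoff are hypothetical choices for illustration.

```python
def mean_stride_difference(values, stride):
    """Average |v[i] - v[i+stride]| over all positions where both exist."""
    pairs = len(values) - stride
    return sum(abs(values[i] - values[i + stride]) for i in range(pairs)) / pairs


def guess_channel_count(values, max_channels=8):
    """Pick the stride at which values are most similar to each other.

    If the true channel count is k, then v[i] and v[i+k] are the same
    channel of neighbouring pixels, so they should be close. Ties go to
    the smallest stride, because min() keeps the first minimum it sees.
    """
    candidates = range(1, min(max_channels, len(values) - 1) + 1)
    return min(candidates, key=lambda k: mean_stride_difference(values, k))


# The RGB-vs-RGBA example from above: four red pixels.
print(guess_channel_count([255, 0, 0] * 4, max_channels=4))
```

It does nothing for the 16-bit RGB ambiguity mentioned above. That would need some separate prior over bit depths.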
Now, there's still a lot that that doesn't get you yet. But given that there are ways to figure out those, I kinda think I should have decent expectations there's ways to figure out other things, too, even if I don't know them.
I do also think it's important to zoom out to the original point. Eliezer posed this as an idea about AGI. We currently sometimes feed images to our AIs, and when we do, we feed them as raw RGB data, not encoded, because we know that would make it harder for the AI to figure out. I think it would be very weird, if we were trying to train an AI, to send it compressed video, and much more likely that we do, in fact, send it raw RGB values frame by frame.
I will also say that the original claim (by Eliezer, not the top of this thread), was not physics from one frame, but physics from like, 3 frames, so you get motion, and acceleration. 4 frames gets you to third derivatives, which, in our world, don't matter that much. Having multiple frames also aids in ideas like the 3d -> 2d projection, since motion and occlusion are hints at that.
And I think the whole question is "does this image look reasonable", which you're right, is not a rigorous mathematical concept. But "'looks reasonable' is not a rigorous concept" doesn't get followed by "and therefore is impossible". Above are some of the mathematical descriptions of what 'reasonable' means in certain contexts. Rendering a 100x75 image as 75x100 will not "look reasonable". But it's not beyond thinking and math to determine what you mean by that.
"the addition of an unemployable worker causes ... the worker's Shapley values to drop to $208.33 (from $250)."
I would emphasize here that the "workers'" includes the unemployed one. It was not obvious to me, until about halfway through the next paragraph, and I think the next paragraph would read better with that in mind from the start.
I'd be interested to know why you think that.
I'd be further interested if you would endorse the statement that your proposed plan would fully bridge that gap.
And if you wouldn't, I'd ask if that helps illustrate the issue.
It seems odd to suggest that the AI wouldn't kill us because it needs our supply chain. If I had the choice between "Be shut down because I'm misaligned" (or "Be reprogrammed to be aligned" if not corrigible) and "Have to reconstruct the economy from the remnants of human civilization," I think I'm more likely to achieve my goals by trying to reconstruct the economy.
So if your argument was meant to say "We'll have time to do alignment while the AI is still reliant on the human supply chain," then I don't think it works. A functional AGI would rather destroy the supply chain and probably fail at its goals, than be realigned and definitely fail at its goals.
(Also, I feel this is mostly a minor thing, but I don't really understand your reference class on novel technologies. Why is the time measured from "proof of concept submarine" to "capable of sinking a warship"? Or from "theory of relativity" to "Atom Bomb being dropped"? Maybe that was just the data available, but why isn't it "Wright brothers start attempting heavier than air flight" to "Wright brothers do heavier than air flight"? Because when reading, my mind immediately wondered how much of the 36 year gap on mRNA vaccines was from "here's a cool idea" to "here's a use case", instead of "here's a cool idea" to "we can actually do that".)
Surely creating the full concrete details of the strategy is not much different from "putting forth as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections." I at least don't see why the same mechanism couldn't be used here (i.e. apply this definition iteration to the word "good", and then have the AI do that, and apply it to "bad" and have the AI avoid that). If you see it as a different thing, can you explain why?
Exactly. I notice you aren't who I replied to, so the canned response I had won't work. But perhaps you can see why most of his objections to my objections would apply to objections to that plan?
Let me ask you this. Why is "Have the AI do good things, and not do bad things" a bad plan?
I think you missed the point. I'd trust an aligned superintelligence to solve the objections. I would not trust a misaligned one. If we already have an aligned superintelligence, your plan is unnecessary. If we do not, your plan is unworkable. Thus, the problem.
If you still don't see that, I don't think I can make you see it. I'm sorry.
It seems simple and effective because you don't need to put weight on it. We're talking a superintelligence, though. Your definition will not hold when the weight of the world is on it.
And the fact that you're just reacting to my objections is the problem. My objections are not the ones that matter. The superintelligence's objections are. And it is, by definition, smarter than me. If your definition is not something like provably robust, then you won't know if it will hold to a superintelligent objection. And you won't be able to react fast enough to fix it in that case.
You can't bandaid a solution into working, because if a human can point out a flaw, you should expect a superintelligence to point out dozens, or hundreds, or thousands.
I don't know how else to get you to understand this central objection. Robustness is required. Provable robustness is, while not directly required, kinda the only way we can tell if something is actually robust.
I'm not sure this is being productive. I feel like I've said the same thing over and over again. But I've got one more try: Fine, you don't want to try to define "reason" in math. I get it, that's hard. But just try defining it in English.
If I tell the machine "I want to be happy." And it tries to determine my reason for that, what does it come up with? "I don't feel fulfilled in life"? Maybe that fits, but is it the reason, or do we have to go back more: "I have a dead end job"? Or even more "I don't have enough opportunities"?
Or does it go a completely different tack and say my reason is "My pleasure centers aren't being stimulated enough" or "I don't have enough endorphins."
Or, does it say the reason I said that was because my fingers pressed keys on a keyboard.
To me, as a human, all of these fit the definition of "reasons." And I expect they could all be true. But I expect some of them are not what you mean. And not even in the sense of some of them being a different definition for "reason." How would you try to divide what you mean and what you don't mean?
Then do that same thought process on all the other words.
No, they really don't. I'm not trying to be insulting. I'm just not sure how to express the base idea.
The issue isn't exactly that computers can't understand this, specifically. It's that no one understands what those words mean enough. Define reason. You'll notice that your definition contains other words. Define all of those words. You'll notice that those are made of words as well. Where does it bottom out? When have you actually, rigorously, objectively defined these things? Computers only understand that language, but the fact that a computer wouldn't understand your plan is just illustrative of the fact that it is not well defined. It just seems like it is, because you have a human brain that fills in all the gaps seamlessly. So seamlessly you don't even notice that there were gaps that need filling.
This is why there's an emphasis on thinking about the problem like a computer programmer. Misalignment thrives in those gaps, and if you gloss over them, they stay dangerous. The only way to be sure you're not glossing over them is to define things with something as rigorous as Math. English is not that rigorous.
In short, the difference between the two is Generality. A system that understands the concepts of computational resources and algorithms might do exactly that to improve its text prediction. Taking the G out of AGI could work, until the tasks get complex enough they require it.
Again, what is a "reason"? More concretely, what is the type of a "reason"? You can't program an AI in English, it needs to be programmed in code. And code doesn't know what "reason" means.
It's not exactly that your plan "fails" anywhere particularly. It's that it's not really a plan. CEV says "Do what humans would want if they were more the people they want to be." Cool, but not a plan. The question is "How?" Your answer to that is still underspecified. You can tell by the fact you said things like "the AI could just..." and didn't follow it with "add two numbers" or something simple (we use the word "primitive"), or by the fact you said "etc." in a place where it's not fully obvious what the rest actually would be. If you want to make this work, you need to ask "How?" to every single part of it, until all the instructions are binary math. Or at least something a python library implements.
The quickest I can think of is something like "What does this mean?" Throw this at every part of what you just said.
For example: "Hear humanity's pleas (intuitions+hopes+worries)" What is an intuition? What is a hope? What is a worry? How does it "hear"?
Do humans submit English text to it? Does it try to derive "hopes" from that? Is that an aligned process?
An AI needs to be programmed, so you have to think like a programmer. What is the input and output type of each of these (e.g. "Hear humanity's pleas" takes in text, and outputs... what? Hopes? What does a hope look like if you have to represent it to a computer?).
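To make the type question concrete, here is a sketch of what "thinking like a programmer" forces you to write down. Everything in it is a hypothetical placeholder of mine (`Hope`, `hear_pleas`, the single `description` field). The signature is easy to write, but there is nothing obvious to put in the body.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hope:
    # Hypothetical placeholder: which fields would make this a faithful
    # representation of a human hope is exactly the open question.
    description: str


def hear_pleas(raw_text: str) -> List[Hope]:
    """'Hear humanity's pleas': text in, hopes out. The body is the unsolved part."""
    raise NotImplementedError(
        "no rigorous mapping from English text to hopes is specified"
    )
```

Calling `hear_pleas("I want to be happy.")` raises `NotImplementedError`. Every step that cannot be filled in with primitives looks like this.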
I kinda expect that the steps from "Hear humanity's pleas" to "Develop moral theories" relies on some magic that lets the AI go from what you say to what you mean. Which is all well and good, but once you have that you can just tell it, in unedited English "figure out what humanity wants, and do that" and it will. Figuring out how to do that is the heart of alignment.
This is the sequence post on it: https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message, it's quite a fun read (to me), and should explain why something smart that thinks at transistor speeds should be able to figure things out.
For inventing nanotechnology, the given example is AlphaFold 2.
For killing everyone in the same instant with nanotechnology, Eliezer often references Nanosystems by Eric Drexler. I haven't read it, but I expect the insight is something like "Engineered nanomachines could do a lot more than those limited by designs that have a clear evolutionary path from chemicals that can form randomly in the primordial ooze of Earth."
For how a system could get that smart, the canonical idea is recursive self improvement (i.e. an AGI capable of learning AGI engineering could design better versions of itself, which could in turn design even better versions, etc., up to whatever limit). But more recent history in machine learning suggests you might be able to go from sub-human to overwhelmingly super-human just by giving it a few orders of magnitude more compute, without any design changes.
In addition to the mentions in the post about Facebook AI being rather hostile to the AI safety issue in general, convincing them and top people at OpenAI and Deepmind might still not be enough. You need to stop every company that can talk to some venture capitalists and convince them how profitable AGI could be. Hell, depending on how easy the solution ends up being, you might even have to prevent anyone with a 3080 and access to arXiv from putting something together in their home office.
This really is "uproot the entire AI research field" and not "tell Deepmind to cool it."
To start, it's possible to know facts with confidence, without all the relevant info. For example I can't fit all the multiplication tables into my head, and I haven't done the calculation, but I'm confident that 2143*1057 is greater than 2,000,000.
Second, the line of argument runs like this: Most (a supermajority) possible futures are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (explicitly shares human values) will be bad for humans.
I believe the necessary knowledge to be confident in each of these facts is not too big to fit in a human brain.
You may be referring to other things, which have similar paths to high confidence (e.g. "Why are you confident this alignment idea won't work." "I've poked holes in every alignment idea I've come across. At this point, Bayes tells me to expect new ideas not to work, so I need proof they will, not proof they won't."), but each path might be idea specific.
Question. Even after the invention of effective contraception, many humans continue to have children. This seems a reasonable approximation of something like "Evolution in humans partially survived." Is this somewhat analogous to "an [X] percent chance of killing less than a billion people", and if so, how has this observation changed your estimate of "disassembl[ing] literally everyone"? (i.e. from "roughly 1" to "I suppose less, but still roughly 1" or from "roughly 1" to "that's not relevant, still roughly 1"? Or something else.)
(To take a stab at it myself, I expect that, conditional on us not all dying, we should expect to actually fully overcome evolution in a short enough timescale that it would still be a blink in evolutionary time. Essentially this observation is saying something like "We won't get only an hour to align a dangerous AI before it kills us, we'll probably get two hours!", replaced with whatever timescale you expect for fooming.)
(No, I don't think that works. First, it's predicated on an assumption of what our future relationship with evolution will be like, which is uncertain. But second, those future states also need to be highly specific to be evolution-less. For example, a future where humans in the Matrix "build" babies still does evolution, just not through genes (does this count? is this a different thing?[1]), so it may not count. Similarly one where we change humans so contraception is "on" by default, and you have to make a conscious choice to have kids, would not count.)
(Given the footnote I just wrote, I think a better take is something like "Evolution is difficult to kill, in a similar way to how gravity is hard to kill. Humans die easier. The transformation of human evolution pre-contraception to human evolution post-contraception is, if not analogous to a replacement of humanity with an entity that is entirely not human, at least analogous to creating a future state humans-of-today would not want (that is, human evolutionary course post-contraception is not what evolution pre-contraception would've "chosen"). The fact that evolution survived at all is not particularly hope inducing.")
[1] Such a thing is still evolution in the mathematical sense (that a^x > (b+constant)^x after some x iff a>b), but it does seem like, in a sense, biological evolution would no longer "recognize" these humans. Analogous to an AGI replacing all humans with robots that do their jobs more efficiently. Maybe it's still "society" but still seems like humanity has been removed.
If you only kept promises when you want to, they wouldn't be promises. Does your current self really think that feeling lazy is a good reason to break the promise? I kinda expect toy-you would feel bad about breaking this promise, which, even if they do it, suggests they didn't think it was a good idea.
If the gym was currently on fire, you'd probably feel more justified breaking the promise. But the promise is still broken. What's the difference in those two breaks, except that current you thinks "the gym is on fire" is a good reason, and "I'm feeling lazy" is a bad reason? You could think about this as "what would your past self say if you gave this excuse?" Which could be useful, but can only be judged based on what your current self thinks.
Promises should be kept. It's not only a virtue, but useful for pre-commitment if you can keep your promises.
But, if you make a promise to someone, and later both of you decide it's a bad idea to keep the promise, you should be able to break it. If that someone is your past self, this negotiation is easy: If you think it's a good idea to break the promise, they would be convinced the same way you were. You've run that experiment.
So, you don't really have much obligation to your past self. If you want your future self to have obligation to you, you are asking them to disregard any new evidence they may encounter. Maybe you want that sometimes. But that feels like it should be a rare thing.
On a society level, this argument might not work, though. Societal values might change because people who held the old value died. We can't necessarily say "they would be convinced the same way we were." I don't know what causes societal values to change, or what the implications therefore are.
Space pirates can profit by buying shares in the prediction market that pay money if Ceres shifts to a pro-Earth stance and then invading Ceres.
Has this line got a typo, or am I misunderstanding? Don't the pirates profit by buying pro-Mars shares, then invading to make Ceres pro-Mars (because Ceres is already pro-Earth)?
Mars bought pro-Earth to make pro-Mars more profitable, in the hope that pirates would buy pro-Mars and then invade.
I doubt my ability to be entertaining, but perhaps I can be informative. The need for mathematical formulation is because, due to Goodhart's law, imperfect proxies break down. Mathematics is a tool which is rigorous enough to get us from "that sounds like a pretty good definition" (like "zero correlation" in the radio signals example), to "I've proven this is the definition" (like "zero mutual information").
The proof can get you from "I really hope this works" to "As long as this system satisfies the proof's assumptions, this will work", because the proof states its assumptions clearly, while "this has worked previously" could, and likely does, rely on a great number of unspecified commonalities previous instances had.
It gets precise and pedantic because it turns out that the things we often want to define for this endeavor are based on other things. "Mutual information" isn't a useful formulation without a formulation for "information". Similarly, in trying to define morality, it's difficult to define what an agent should do in the world (or even what it means for an agent to do things in the world), without ideas of agency and doing, and the world. Every undefined term you use brings you further from a formulation you could actually use to create a proof.
In all, mathematical formulation isn't the goal, it's the prerequisite. "Zero correlation" was mathematically formalized, but that was not sufficient.
I understand the point of your dialog, but I also feel like I could model someone saying "This Alignment Researcher is really being pedantic and getting caught in the weeds." (especially someone who wasn't sure why these questions should collapse into world models and correspondence.)
(After all, the Philosopher's question probably didn't depend on actual apples, and was just using an apple as a stand-in for something with positive utility. So, to sidestep the entire question of world models and correspondence, the inputs of the utility functions could easily be "apples", where an apple is an object with 1 property, "owner": Alice prefers apple.owner == "alice" (def utility(a): return int(a.owner == "alice")), and Bob prefers apple.owner == "bob".)
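Spelled out as runnable code, the toy version in that parenthetical might look like the following. The `Apple` class and the two function names are just my illustration of it, not anything from the dialog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Apple:
    """An 'apple' with exactly one property: who owns it."""
    owner: str


def alice_utility(apple: Apple) -> int:
    # Alice gets utility 1 when she owns the apple, 0 otherwise.
    return int(apple.owner == "alice")


def bob_utility(apple: Apple) -> int:
    # Bob's preference is the mirror image of Alice's.
    return int(apple.owner == "bob")
```

Here the utility functions take the apple object directly, so there is no world model to correspond to. That is why the apple framing lets the question be dodged.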
I suspect you did this because the half formed question about apples was easier to come up with than a fully formed question that would necessarily require engagement with world models, and I'm not even sure that's the wrong choice. But this was the impression I got reading it.
In my post I wrote:
Am I correct after reading this that this post is heavily related to embedded agency? I may have misunderstood the general attitudes, but I thought of "future states" as "future to now" not "future to my action." It seems like you couldn't possibly create a thing that works on the last one, unless you intend it to set everything in motion and then terminate. In the embedded agency sequence, they point out that embedded agents don't have well defined i/o channels. One way is that "action" is not a well defined term, and is often not atomic.
It also sounds like you're trying to suggest that we should be judging trajectories, not states? I just want to note that this is, as far as I can tell, the plan: https://www.lesswrong.com/posts/K4aGvLnHvYgX9pZHS/the-fun-theory-sequence
Life's utility function is over 4D trajectories, not just 3D outcomes.
From the synopsis of High Challenge
instead of making a corrigible AGI.
I'm not sure I interpret corrigibility as exactly the same as "preferring the humans remain in control" (I see you suggest this yourself in Objection 1, I wrote this before I reread that, but I'm going to leave it as is) and if you programmed that preference into a non-corrigible AI, it would still seize the future into states where the humans have to remain in control. Better than doom, but not ideal if we can avoid it with actual corrigibility.
But I think I miscommunicated, because, besides the above, I agree with everything else in those two paragraphs.
See discussion under “Objection 1” in my post.
I think I maintain that this feels like it doesn't solve much. Much of the discussion in the Yudkowsky conversations was that there's a concern on how to point powerful systems in any direction. Your response to objection 1 admits you don't claim this solves that, but that's most of the problem. If we do solve the problem of how to point a system at some abstract concept, why would we choose "the humans remain in control" and not "pursue humanity's CEV"? Do you expect "the humans remain in control" (or the combination of concepts you propose as an alternative) to be easier to define? Easier enough to define that it's worth pursuing even if we might choose the wrong combination of concepts? Do you expect something else?
I'm a little confused what it hopes to accomplish. I mean, to start I'm a little confused by your example of "preferences not about future states" (i.e. 'the pizza shop employee is running around frantically, and I am laughing' is a future state).
But to me, I'm not sure what the mixing of "paperclips" vs "humans remain in control" accomplishes. On the one hand, I think if you can specify "humans remain in control" safely, you've solved the alignment problem already. On another, I wouldn't want that to seize the future: There are potentially much better futures where humans are not in control, but still alive/free/whatever. (e.g. the Sophotechs in the Golden Oecumene are very much in control). On a third, I would definitely, a lot, very much, prefer a 3 star 'paperclips' and 5 star 'humans in control' to a 5 star 'paperclips' and a 3 star 'humans in control', even though both would average 4 stars?
I don't know if there's much counterargument beyond "no, if you're building an ML system that helps you think longer about anything important, you already need to have solved the hard problem of searching through plan-space for actually helpful plans."
This is definitely a problem, but I would further say that human amplification isn't a solution, because humans aren't aligned.
I don't really have a good idea of what human values are, even in an abstract English definition sense, but I'm pretty confident that "human values" are not, and are not easily transformable from, "a human's values."
Though maybe that's just most of the reason why you'd have to have your amplifier already aligned, and not a separate problem itself.
Maybe I misunderstand your use of robust, but this still seems to me to be breadth. If an optimum is broader, samples are more likely to fall within it. I took broad to mean "has a lot of (hyper)volume in the optimization space", and robust to mean "stable over time/perturbation". I still contend that those optimization processes are unaware of time, or any environmental variation, and can only select for it in so far as it is expressed as breadth.
The example I have in my head is that if you had an environment, and committed to changing some aspect of it after some period of time, evolution or SGD would optimize the same as if you had committed to a different change. Which change you do would affect the robustness of the environmental optima, but the state of the environment alone determines their breadth. The processes cannot optimize based on your committed change before it happens, so they cannot optimize for robustness.
Given what you said about random samples, I think you might be working under definitions along the lines of "robust optima are ones that work in a range of environments, so you can be put in a variety of random circumstances, and still have them work" and (at this point I struggled a bit to figure out what a "broad" optimum would be that's different, and this is what I came up with?) "broad optima are those that you can do approximately and still get a significant chunk of the benefit." I feel like these can still be unified into one thing, because I think approximate strategies in fixed environments are similar to fixed strategies in approximate environments? Like moving a little to the left is similar to the environment being a little to the right?
"Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are [broad and robust]"
I understand why you think that broad is true. But I'm not sure I get robust. In fact, robust seems to make intuitive dis-sense to me. Your examples are gradient descent and evolution, neither of which has memory, so how would they be able to know how "robust" an optimum is? Part of me thinks that the idea comes from how, if a system optimized for a non-robust optimum, it wouldn't internally be doing anything different, but we probably would say it failed to optimize, so it looks like successful optimizers optimize for robust optima. Plus that broad optima are more likely to be robust. I'm not sure, but I do notice my confusion about the inclusion of "robust". My current intuition is kinda like "Broadness and robustness of optima are very coupled. But, given that, optimization for robust optima only happens insofar as it is really optimization for broad optima. Optimization for robust but not broad optima does not happen, and optimization for statically broad but more robust optima does not happen better."
Can I try to parse out what you're saying about stacked sigmoids? Because it seems weird to me. Like, in that view, it still seems like showing a trendline is some evidence that it's not "interesting". I feel like this because I expect the asymptote of the AlphaGo sigmoid to be independent of MCTS bots, so surely you should see some trends where AlphaGo (or equivalent) was invented first, and jumped the trendline up really fast. So not seeing jumps should indicate that it is more a gradual progression, because otherwise, if they were independent, about half the time the more powerful technique should come first.
The "what counter argument can I come up with" part of me says, tho, that how quickly the sigmoid grows likely depends on lots of external factors (like compute available or something). So instead of sometimes seeing a sigmoid that grows twice as fast as the previous ones, you should expect one that's not just twice as tall, but twice as wide, too. And if you have that case, you should expect the "AlphaGo was invented first" sigmoid to be under the MCTS bots sigmoid for some parts of the graph, where it then reaches the same asymptote as AlphaGo in the mainline. So, if we're in the world where AlphaGo is invented first, you can make gains by inventing MCTS bots, which will also set the trendline. And so, seeing a jump would be less "AlphaGo was invented first" and more "MCTS bots were never invented during the long time when they would've outcompeted AlphaGo version -1"
Does that seem accurate, or am I still missing something?
So, I'm not sure if I'm further down the ladder and misunderstanding Richard, but I found this line of reasoning objectionable (maybe not the right word):
"Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals."
My initial (perhaps uncharitable) response is something like "Yeah, you could build a safe system that just prints out plans that no one reads or executes, but that just sounds like a complicated way to waste paper. And if something is going to execute them, then what difference is it whether that's humans or the system itself?"
This, with various mentions of manipulating humans, seems to me like it would most easily arise from an imagined scenario of AI "turning" on us. Like that we'd accidentally build a Paperclip Maximizer, and it would manipulate people by saying things like "Performing [action X which will actually lead to the world being turned into paperclips] will end all human suffering, you should definitely do it." And that this could be avoided by using an Oracle AI that just will tell us "If you perform action X, it will turn the world into paperclips." And then we can just say "oh, that's dumb, let's not do that."
And I think that this misunderstands alignment. An Oracle that tells you only effective and correct plans for achieving your goals, and doesn't attempt to manipulate you into achieving its own goals, because it doesn't have its own goals besides providing you with effective and correct plans, is still super dangerous. Because you'll ask it for a plan to get a really nice lemon poppy seed muffin, and it will spit out a plan, and when you execute the plan, your grandma will die. Not because the system was trying to kill your grandma, but because that was the most efficient way to get a muffin, and you didn't specify that you wanted your grandma to be alive.
(And you won't know the plan will kill your grandma, because if you understood the plan and all its consequences, it wouldn't be superintelligent)
Alignment isn't about guarding against an AI that has cross purposes to you. It's about building something that understands that when you ask for a muffin, you want your grandma to still be alive, without you having to say that (because there's a lot of things you forgot to specify, and it needs to avoid all of them). And so even an Oracle thing that just gives you plans is dangerous unless it knows those plans need to avoid all the things you forgot to specify. This was what I got out of the Outcome Pump story, and so maybe I'm just saying things everyone already knows...
"Since my expectations sometimes conflict with my subsequent experiences, I need different names for the thingies that determine my experimental predictions and the thingy that determines my experimental results. I call the former thingies 'beliefs', and the latter thingy 'reality'."
I think this is a fine response to Mr. Carrico, but not to the post-modernists. They can still fall back to something like "Why are you drawing a line between 'predictions' and 'results'? Both are simply things in your head, and since you can't directly observe reality, your 'results' are really just your predictions of the results based off of the adulterated model in your head! You're still just asserting your belief is better."
The tack I came up with in the meditation was that the "everything is a belief" might be a bit falsely dichotomous. I mean, it would seem odd, given that everything is a belief, to say that Anne telling you the marble is in the basket is just as good evidence as actually checking the basket yourself. It would imply weird things like, once you check and find it in the box, you should be only 50% sure of where the marble is, because Anne's statement is weighed equally.
(And though it's difficult to put my mind in this state, I can think of this as not in service of determining reality, but instead as trying to inform my belief that, after I reach into the box, I will believe that I am holding a marble.)
Once you concede that different beliefs can weigh as different evidence, you can use Bayesian ideas to reconcile things. Something like "nothing is 'true' in the sense of deserving 100% credence assigned to it (saying something is true really does just mean that you really really believe it, or, more charitably, that belief has informed your future beliefs better than before you believed it), but you can take actions to become more 'accurate' in the sense of anticipating your future beliefs better. While they're both guesses (you could be hallucinating, or something), your guess before checking is likely to be worse, more diluted, filtered through more layers from direct reality, than your guess after checking."
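To make the weighting concrete, here is a minimal sketch of an odds-form Bayes update. The likelihood ratios (Anne's testimony at 4:1, a direct look at 99:1) are made-up numbers for illustration only, and `posterior_box` is a hypothetical helper name.

```python
from fractions import Fraction

def posterior_box(prior_box, likelihood_ratios):
    """Update P(marble is in the box) using odds-form Bayes.

    Each likelihood ratio is P(evidence | box) / P(evidence | basket).
    """
    odds = prior_box / (1 - prior_box)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)

prior = Fraction(1, 2)

# Assumed weights: Anne saying "basket" favors basket 4:1,
# while seeing the marble in the box favors box 99:1.
unequal = posterior_box(prior, [Fraction(1, 4), Fraction(99, 1)])

# If both pieces of evidence were weighed equally (4:1 each way), they cancel.
equal = posterior_box(prior, [Fraction(1, 4), Fraction(4, 1)])

print(unequal)  # 99/103
print(equal)    # 1/2
```

With equal weights you land back at 50%, which is the odd conclusion described above; with unequal weights, checking for yourself dominates without ever reaching 100%.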
I may be off the mark if the post-modernist claim is that reality doesn't exist, not just that no one's beliefs about it can be said to be better than anyone else's.
In a Newcombless problem, where you can either have $1,000 or refuse it and have $1,000,000, you could argue that the rational choice is to take the $1,000,000, and then go back for the $1,000 when people's backs were turned, but it would seem to go against the nature of the problem.
In much the same way, if Omega is a perfect predictor, there is no possible world where you receive $1,000,000 and still end up going back for the second. Either Rachel wouldn't have objected, or the argument would've taken more than 5 minutes, and the boxes disappear, or something.
I'm not sure how Omega factors in the boxes' contents in this "delayed decision" version. Like, let's say Irene will, absent external forces, one box, and Rachel, if Irene receives $1,000,000, will threaten Irene sufficiently to take the second box, and will do nothing if Irene receives nothing. (Also they're automatons, and these are descriptions of their source code, and so no other unstated factors are able to be taken into account)
Omega simulates reality A, with the box full, sees that Irene will 2 box after threat by Rachel.
Omega simulates reality B, with the box empty, and sees that Irene will 1 box.
Omega, the perfect predictor, cannot make a consistent prediction, and, like the unstoppable force meeting the immovable object, vanishes in a puff of logic.
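The two simulations above can be sketched as a search for a self-consistent prediction. This is only a toy encoding of the automatons as I described them; the function names are hypothetical.

```python
def irene_final_choice(box_full):
    """Irene one-boxes by default; Rachel threatens only if the box was full."""
    rachel_threatens = box_full
    return "two-box" if rachel_threatens else "one-box"

def omega_fills_box(prediction):
    """Omega fills the box only when it predicts one-boxing."""
    return prediction == "one-box"

def consistent_predictions():
    """Return every prediction that matches what Irene then actually does."""
    return [
        p for p in ("one-box", "two-box")
        if irene_final_choice(omega_fills_box(p)) == p
    ]

print(consistent_predictions())  # []
```

Neither prediction survives its own consequences, which is the "puff of logic" case.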
I think, if you want to aim at this sort of thing, the better formulation is to just claim that Omega is 90% accurate. Then there's no (immediate) logical contradiction in receiving the $1,000,000 and going back for the second box. And the payoffs should still be correct.
1 box: .9*1,000,000 + .1*0 = 900,000
2 box: .9*1,000 + .1*1,001,000 = 101,000
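The same arithmetic as a small sketch, using exact fractions to avoid floating-point noise. The function and parameter names are just for illustration.

```python
from fractions import Fraction

def expected_payoffs(accuracy=Fraction(9, 10), small=1_000, big=1_000_000):
    """Expected dollars for one-boxing and two-boxing against an imperfect Omega."""
    miss = 1 - accuracy
    # One-box: Omega was right (box full) or wrong (box empty).
    one_box = accuracy * big + miss * 0
    # Two-box: Omega was right (only the small box pays) or wrong (both pay).
    two_box = accuracy * small + miss * (big + small)
    return one_box, two_box

one_box, two_box = expected_payoffs()
print(one_box, two_box)  # 900000 101000
```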
I expect that this formulation runs afoul of what was discussed in this post around the Smoking Lesions problem, where repeated trials may let you change things you shouldn't be able to (in their example, if you chose to smoke every time, and the correlation between smoking and lesions held, then you could change the base rate of the lesions).
That is, I expect that if you ran repeated simulations, to try things out, then strategies like "I will one box, and iff it is full, then I will go back for the second box" will make it so Omega is incapable of predicting at the proposed 90% rate.
I think all of these things might be related to the problem of embedded agency, and people being confused (even if they don't put it in these terms) that they have an atomic free will that can think about things without affecting or being affected by the world. I'm having trouble resolving this confusion myself, because I can't figure out what Omega's prediction looks like instead of vanishing in a puff of logic. It may just be that statements like "I will turn the lever on if, and only if, I expect the lever to be off at the end" are a nonsense decision criterion. But the problem as stated doesn't seem like it should be impossible, so... I am confused.
In much the same way, estimates of value and calculations based on the number of permutations of atoms shouldn't be mixed together. There being a googolplex possible states in no way implies that any of them have a value over 3 (or any other number). It does not, by itself, imply that any particular state is better than any other. Let alone that any particular state should have value proportional to the total number of states possible.
Restricting yourself to atoms within 8000 light years, instead of the galaxy, just compounds the problem as well, but you noted that yourself. The size of the galaxy wasn't actually a relevant number, just a (maybe) useful comparison. It's like when people say that chess has more possible board states than there are atoms in the observable universe times the number of seconds since the Big Bang. It's not that there's any specifically useful interaction between atoms and seconds and chess, it's just to recognize the scale of the problem.
I still think the argument holds in this case, because even computer software isn't atom-less. It needs to be stored, or run, or something somewhere.
I don't doubt that you could drastically reduce the number of atoms required for many products today. For example, you could in future get a chip in your brain that makes typing without a keyboard possible. That chip is smaller than a keyboard, so represents lots of atoms saved. You could go further, and have that chip be an entire futuristic computer suite: by reading and writing your brain inputs and outputs directly, it could replace the keyboard, mouse, monitors, speakers, and entire desktop, plus some extra stuff, like also acting as a VR Headset, or video game console, or whatever. Let's say you manage to squeeze all that into a single atom. Cool. That's not enough. For this growth to go on for those ~8000 years, you'd need to have that single-atom brain chip be as valuable as everything on Earth today. Along with every other atom in the galaxy.
I think at some point, unless the hottest thing in the economy becomes editing humans to value specific atoms arbitrary amounts (which sounds bad, even if it would work), you can't get infinite value out of things. I'm not even sure human minds have the capability of valuing things infinitely. I think even with today's economy, you'd start to hit some asymptotes (i.e. if one person had everything in the world, I'm not sure what they'd do with it all. I'm also not sure they'd actually value it any more than if they just had 90% of everything, except maybe the value on saying "I have it all", which wouldn't be represented in our future economy)
And still, the path to value per atom has to come from somewhere, and in general it's going to be making stuff more useful, or smaller, but there's only so useful a single atom can be, and there's only so small a useful thing can be. (I imagine some math on the number of ways you could arrange a set of particles, multiplied by the number of ways a particular arrangement could be used, as an estimate. But a quick guess says that neither of those values is infinite, and I expect that number to be dominated by ways of arranging particles, not by number of uses, considering that even software on a computer is actually different arrangements of the electrons.)
So I guess that's the heart of it to me: there's certainly a lot more value we can squeeze out of things, but if it's not literally infinite, it will run out at some point, and that ~8000 year estimate is looking pretty close to whatever the limit is, if it's not already over it.
I think this is a useful post, but I don't think the water thing helped in understanding:
"In the Twin Earth, XYZ is "water" and H2O is not; in our Earth, H2O is "water" and XYZ is not."
This isn't an answer, this is the question. The question is "does the function, curried with Earth, return true for XYZ, && does the function, curried with Twin Earth, return true for H2O?"
Now, this is a silly philosophy question about the "true meaning" of water, and the real answer should be something like "If it's useful, then yes, otherwise, no." But I don't think this is a misunderstanding of 2-place functions. At least, thinking about it as a 2-place function that takes a world as an argument doesn't help dissolve the question.
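Here is a minimal sketch of what "a 2-place function curried with a world" looks like. The lookup table is a hypothetical stand-in that simply hard-codes the quoted answer, which is sort of the point: whatever you put in the table is the thing the question was asking about.

```python
from functools import partial

# Hypothetical table: it encodes the quoted claim directly, it does not derive it.
WATER_BY_WORLD = {"Earth": "H2O", "Twin Earth": "XYZ"}

def is_water(world, substance):
    """2-place predicate: does `substance` count as "water" in `world`?"""
    return WATER_BY_WORLD[world] == substance

# Currying on the world argument gives two 1-place predicates.
is_water_on_earth = partial(is_water, "Earth")
is_water_on_twin_earth = partial(is_water, "Twin Earth")

print(is_water_on_earth("XYZ"))       # False
print(is_water_on_twin_earth("H2O"))  # False
```

The currying works fine mechanically, but all the philosophical content lives in how `WATER_BY_WORLD` gets filled in.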
I was thinking about applying the currying to topic, instead of world, (e.g. "heavy water" returns true for an isWater("in the ocean"), but not for an isWater("has molar mass ~18")), but this felt like a motivated attempt to apply the concept, when the answer is just [How an Algorithm Feels from the Inside](https://www.lesswrong.com/posts/yA4gF5KrboK2m2Xu7/how-an-algorithm-feels-from-inside).
Either way, the Sexiness example is better.
I feel like I might be being a little coy stating this, but I feel like "heterogeneous preferences" may not be as inadequate as it seems. At least, if you allow that those heterogeneous preferences are not only innate like taste preference for apples over oranges.
If I have a comparative advantage in making apples, I'm going to have a lot of apples, and value the marginal apple less than the marginal orange. I don't think this is a different kind of "preference" than liking the taste of oranges better: Both base out in me preferring an orange to an apple. And so we engage in trade for exactly that reason. In fact, I predict we stop trading once I value the marginal apple more than the marginal orange (or you, vice versa), regardless of the state of my comparative advantage. (That is, in a causal sense, the comparative advantage is covered by my marginal value assignments. My comparative advantage may inform my value assignments, but once you know those, you don't need to know if I still have a comparative advantage or not for the question "will I trade apples for oranges?")
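A toy sketch of that prediction, assuming identical tastes and a made-up diminishing marginal value of 1/n for the n-th unit held. It only models my side and assumes the trading partner is a willing mirror image; all names and numbers are hypothetical.

```python
from fractions import Fraction

def marginal_gain(count):
    """Assumed value of acquiring one more unit when already holding `count`."""
    return Fraction(1, count + 1)

def marginal_loss(count):
    """Assumed value of giving up the last unit when holding `count`."""
    return Fraction(1, count)

def trade_until_no_gain(my_apples=10, my_oranges=0):
    """Swap one apple for one orange while the orange gained beats the apple lost."""
    trades = 0
    while my_apples > 0 and marginal_gain(my_oranges) > marginal_loss(my_apples):
        my_apples -= 1
        my_oranges += 1
        trades += 1
    return trades, my_apples, my_oranges

print(trade_until_no_gain())  # (5, 5, 5)
```

Trading stops purely because of the marginal values, and nothing in the loop needs to know whether I still have a comparative advantage in apples.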
Comparative advantage is why trade is useful, but I don't know if it's really accurate to say that heterogeneous preferences are an "inadequate answer to why we trade."
I'm glad to hear that the question of what hypotheses produce actionable behavior is on people's minds.
I modeled Murphy as an actual agent, because I figured a hypothesis like "A cloaked superintelligence is operating in the area that will react to your decision to do X by doing Y" is always on the table, and is basically a template for allowing Murphy to perform arbitrary action Y.
I feel like I didn't quite grasp what you meant by "a constraint on Murphy is picked according to this probability distribution/prior, then Murphy chooses from the available options of the hypothesis they picked"
But based on your explanation after, it sounds like you essentially ignore hypotheses that don't constrain Murphy, because they act as an expected utility drop on all states, so it just means you're comparing -1,000,000 and -999,999, instead of 0 and 1. For example, there's a whole host of hypotheses of the form "A cloaked superintelligence converts all local usable energy into a hellscape if you do X", and since that's a possibility for every X, no action X is graded lower than the others by its existence.
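As a sanity check on the "-1,000,000 and -999,999 instead of 0 and 1" point, here is a minimal sketch showing that a uniform penalty leaves the ranking of actions unchanged. The action names and utilities are placeholders.

```python
def best_action(utilities):
    """Return the action with the highest utility."""
    return max(utilities, key=utilities.get)

base = {"do_X": 0, "do_Y": 1}

# A hypothesis that hurts every action equally just shifts all utilities.
penalty = 1_000_000
shifted = {action: u - penalty for action, u in base.items()}

print(shifted)                                   # {'do_X': -1000000, 'do_Y': -999999}
print(best_action(base), best_action(shifted))   # do_Y do_Y
```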
That example is what got me thinking, in the first place, though. Such hypotheses don't lower everything equally, because, given other Laws of Physics, the superintelligence would need energy to hell-ify things. So arbitrarily consuming energy would reduce how bad the outcomes could be if a perfectly misaligned superintelligence was operating in the area. And, given that I am positing it as a perfectly misaligned superintelligence, we should both expect it to exist in the environment Murphy chooses (what could be worse?) and expect any reduction of its actions to be as positive of changes as a perfectly aligned superintelligence's actions could be, since preventing a maximally detrimental action should match, in terms of Utility, enabling a maximally beneficial action. Therefore, entropy-bombs.
Thinking about it more, assuming I'm not still making a mistake, this might just be a broader problem, not specific to this in any way. Aren't I basically positing Pascal's Mugging?
Anyway, thank you for replying. It helped.
I'm still confused. My biology knowledge is probably lacking, so maybe that's why, but I had a similar thought to dkirmani after reading this: "Why are children born young?" Given that sperm cells are active cells (which should give transposons opportunity to divide), why do they not produce children with larger transposon counts? I would expect whatever sperm divide from to have the same accumulation of transposons that causes problems in the divisions of stem cells.
Unless piRNA and siRNA are 100% effective at their jobs, or something is explicitly removing transposons in sperm/eggs better than in the rest of the body, surely there should be at least a small amount of accumulation of transposons across generations. Is this something we see?
I vaguely remember that women are born with all the egg cells they'll have, so, if that's true, then maybe that offers a partial explanation (only half the child genome should be as infected with transposons?). I'm not sure it holds water, because since egg cells are still alive, even if they aren't dividing more, they should present opportunities for transposons to multiply.
Another possible explanation I thought of was that, in order to be as close to 100% as possible, piRNA and siRNA work more than normal in the gonads, which does hurt the efficacy of sperm, but because you only need 1 to work, that's ok. Still, unless it is actually 100%, there should be that generational accumulation.
This isn't even just about transposons. It feels like any theory of aging would have to contend with why sperm and eggs aren't old when they make a child, so I'm not sure what I'm missing.
A little late to the party, but
I'm confused about the minimax strategy.
The first thing I was confused about was what sorts of rules could constrain Murphy, based on my actions. For example, in a bit-string environment, the rule "every other bit is a 0" constrains Murphy (he can't reply with "111..."), but not based on my actions. It doesn't matter what bits I flip, Murphy can always just reply with the environment that is maximally bad, as long as it has 0s in every other bit. Another example would be if you have the rule "environment must be a valid chess board," then you can make whatever moves you want, and Murphy can just return the environment with the rule "if you make that move, then the next board state is you in checkmate", after all, you being in checkmate is a valid chessboard, and therefore meets the only rule you know. And you can't know what other rules Murphy plays by. You can't really run minimax on that, then, because all of Murphy's moves look like "set the state to the worst allowable state."
So, what kind of rules actually constrain Murphy based on my actions? My first take was "rules involving time," for instance if you have the rule "only one bit can be flipped per timestep" then you can constrain Murphy. If you flip a bit, then within the next timestep, you've eliminated some possibilities (they would require flipping that bit back and doing something else), so you can have a meaningful minimax on which action to take.
This didn't feel like the whole story though, so I had a talk with my friend about it, and eventually, we generalized it to "rules that consume resources." An example would be, if you have the rule "for every bit you flip, you must also flip one of the first 4 bits from a 1 to a 0", then we can constrain Murphy. If I flip any bit, that leaves 1 less bit for Murphy to use to mess with me.
But then the minimax strategy started looking worrying to me. If the only rules that you can use to constrain Murphy are ones that use resources, then wouldn't a minimax strategy have some positive preference for destroying resources in order to prevent Murphy from using them? It seems like a good way to minimize Murphy's best outcomes.
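Here is a toy version of that worry, under loud assumptions: the first 4 bits act as a shared budget, every flip by anyone spends one budget bit, each flip Murphy makes costs me 1 utility, and my own flips are otherwise worthless. None of this is from the original post; it is just the smallest model I could think of.

```python
BUDGET_BITS = 4  # the "first 4 bits" that pay for every flip

def murphy_worst_case(remaining_budget, damage_per_flip=1):
    """Murphy spends every remaining budget bit on a harmful flip."""
    return -damage_per_flip * remaining_budget

def minimax_choice(budget=BUDGET_BITS):
    """Pick how many (otherwise pointless) flips I make to maximize the worst case."""
    options = {k: murphy_worst_case(budget - k) for k in range(budget + 1)}
    best = max(options, key=options.get)
    return best, options[best]

print(minimax_choice())  # (4, 0)
```

In this model the minimax answer is to burn the whole budget on useless flips, purely so Murphy has nothing left to spend.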