Comments
Thanks for coming. :)
I am confused by your confusion. Your basic question is "what is the source of the adversarial selection?" The answer is "the system itself" (or in some cases, the training/search procedure that produces the system satisfying your specification). In your linked comment, you say "There's no malicious ghost trying to exploit weaknesses in our alignment techniques." I think you've basically hit on the crux there. The "adversarially robust" frame is essentially saying you should think about the problem in exactly this way.
I think Eliezer has conceded that Stuart Russell puts the point best. It goes something like: "If you have an optimization process in which you forget to specify every variable that you care about, then unspecified variables are likely to be set to extreme values." I would tack on that due to the fragility of human value, it's much easier to set such a variable to an extremely bad value than an extremely good one.
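To make that concrete, here's a toy sketch (the factory setup and every number in it are my own invention, not anything from your comment) of an optimizer driving an unspecified variable to an extreme because we only wrote down part of what we care about:

```python
from scipy.optimize import linprog

# Toy factory: split 40 hours between "clean" and "dirty" production.
# Dirty production makes twice as many widgets per hour, but pollutes.
# The objective we hand to the optimizer only mentions widgets; pollution
# is the variable we "forgot" to specify.
c = [-1, -2]              # linprog minimizes, so negate widgets/hour (clean, dirty)
A_ub = [[1, 1]]           # total hours constraint
b_ub = [40]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 40), (0, 40)])
clean_hours, dirty_hours = res.x
print(clean_hours, dirty_hours)   # ~0 and ~40: the unmentioned variable (pollution)
                                  # gets driven to its extreme
```

Nothing in the solver "hates" clean air; the dirty hours are just the simplest way to satisfy the goal as given.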
Basically, however the goal of the system is specified or represented, you should ask yourself if there's some way to satisfy that goal in a way that doesn't actually do what you want. Because if there is, and it's simpler than what you actually wanted, then that's what will happen instead. (Side note: the system won't literally do something just because you hate it. But the same is true for other Goodhart examples. Companies in the Soviet Union didn't game the targets because they hated the government, but because it was the simplest way to satisfy the goal as given.)
"If the system is trying/wants to break its safety properties, then it's not safe/you've already made a massive mistake somewhere else." I mean, yes, definitely. Eliezer makes this point a lot in some Arbital articles, saying stuff like "If the system is spending computation searching for things to harm you or thwart your safety protocols, then you are doing the fundamentally wrong thing with your computation and you should do something else instead." The question is how to do so.
Also from your linked comment: "Cybersecurity requires adversarial robustness, intent alignment does not." Okay, but if you come up with some scheme to achieve intent alignment, you should naturally ask "Is there a way to game this scheme and not actually do what I intended?" Take this Arbital article on the problem of fully-updated deference. Moral uncertainty has been proposed as a solution to intent alignment. If the system is uncertain as to your true goals, then it will hopefully be deferential. But the article lays out a way the system might game the proposal. If the agent can maximize its meta-utility function over what it thinks we might value, and still not do what we want, then clearly this proposal is insufficient.
If you propose an intent alignment scheme such that when we ask "Is there any way the system could satisfy this scheme and still be trying to harm us?", the answer is "No", then congrats, you've solved the adversarial robustness problem! That seems to me to be the goal and the point of this way of thinking.
I mean, fair enough, but I can't weigh it up against every other opportunity available to you on your behalf. I did try to compare it to learning other languages. I'll toss into the post that I also think it's comparatively easy to learn.
FWIW I genuinely think ASL is easy to learn with the videos I linked above. Overall I think sign is more worthwhile to learn than most other languages, but yes, not some overwhelming necessity. Just very personally enriching and neat. :)
Thanks for the feedback!
It's entirely just a neat thing. I think most people should consider learning to sign, and the idea of it becoming a rationalist "thing" just sounded fun to me. I did try to make that clear, but apologies if it wasn't. And as I said, sorry this is kind of off topic, it's just been a thing bouncing around in my head.
Honestly I found ASL easier to learn than, say, the limited Spanish I tried to learn in high school. Maybe because it doesn't conflict with the current way you communicate. Just from watching the ASL 1 - 4 lectures I linked to, I was surprisingly able to manage once dropped in a one-on-one conversation with a deaf person.
It would definitely be good to learn with a buddy. My wife hasn't explicitly learned it yet, but she's picked up some from me. Israel is a tough choice, I'm not sure what the learning resources are like for it.
...and now I am also feeling like I really should have realized this as well.
I agree that there isn’t an “obvious” set of assumptions for the latter question that yields a unique answer. And granted I didn’t really dig into why entropy is a good measure, but I do think it ultimately yields the unique best guess given the information you have. The fact that it’s not obvious is rather the point! The question has a best answer, even if you don’t know what it is or how to give it.
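As a concrete sketch of what "best guess given the information you have" can look like (the die and the known mean of 4.5 are a Jaynes-style setup I'm inventing for illustration, not anything from your comment):

```python
import numpy as np
from scipy.optimize import minimize

# Maximum-entropy distribution for a six-sided die whose long-run mean is
# known to be 4.5 and nothing else: the most "spread out" distribution
# consistent with that single piece of information.
faces = np.arange(1, 7)

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},         # normalization
    {"type": "eq", "fun": lambda p: float(p @ faces) - 4.5},  # known mean
]
p0 = np.full(6, 1 / 6)
res = minimize(neg_entropy, p0, bounds=[(0, 1)] * 6, constraints=constraints)
print(np.round(res.x, 3))  # tilted toward high faces, but no more than the mean forces
```

Once the constraints are written down, "what should I believe?" becomes an ordinary optimization problem with a unique answer.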
In any real-life inference problem, nobody is going to tell you: "Here is the exact probability space with a precise, known probability for each outcome." (I literally don't know what such a thing would mean anyway). Is all inference thereby undefined? Like Einstein said, "As far as laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality". If you can't actually fulfill the axioms in real life, what's the point?
If you still want to make inferences anyway, I think you're going to have to adopt the Bayesian view. A probability distribution is never handed to us, and must always be extracted from what we know. And how you update your probabilities in response to new evidence also depends on what you know. If you can formalize exactly how, then you have a totally well-defined mathematical problem, hooray!
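For instance, here is a minimal sketch of one such formalized update (the 1% prior and the two likelihoods are numbers I made up for illustration):

```python
# Bayes' rule for a single binary hypothesis and one piece of evidence.
prior = 0.01                     # P(H): what we knew before
p_e_given_h = 0.90               # P(E | H)
p_e_given_not_h = 0.09           # P(E | not H)

posterior = (p_e_given_h * prior) / (
    p_e_given_h * prior + p_e_given_not_h * (1 - prior)
)
print(round(posterior, 3))       # ~0.092: the evidence matters, but so does the prior
```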
My point, then, is that we feel a problem isn't well-defined exactly when we don't know how to convert what we know into clear mathematics. (I'm really not trying to play a semantics game. This was an attempt to dissolve the concept of "well-defined" for probability questions.) But you can see a bit of a paradox when adding more information makes the mathematical problem harder, even though this shouldn't make the problem any less "well-defined".
Would one be allowed to make multiple submissions distilling different posts? I don't know if I would necessarily want to do that, but I'm at least curious about the ruling.
Cool, makes sense. I was planning on making various inquiries along these lines starting in a few weeks, so I may reach out to you then. Would there be a best way to do that?
But even in the case of still being in school, would one need the background of having proved non-trivial original theorems? All this sounds exactly like the research agenda I'm interested in. I have a BS in math and am working on an MS in computer science. I have a good math background, but not at that level yet. Should I consider applying or no?
I said nothing about an arbitrary utility function (nor proof for that matter). I was saying that applying utility theory to a specific set of terminal values seems to basically get you an idealized version of utilitarianism, which is what I thought the standard moral theory was around here.
None of what you have linked so far has particularly conveyed any new information to me, so I think I just flatly disagree with you. As that link says, the "utility" in utilitarianism just means some metric or metrics of "good". People disagree about what exactly should go into "good" here, but godshatter refers to all the terminal values humans have, so that seems like a perfectly fine candidate for what the "utility" in utilitarianism ought to be. The classic notion of "higher pleasures" in utilitarianism lends credence to this fitting into the classical framework; it is not a new idea that utilitarianism can include multiple terminal values with relative weighting.
Under utilitarianism, we are then supposed to maximize this utility. That is, maximize the satisfaction of the various terminal goals we are taking as good, aggregated into a single metric. And separately, there happens to be this elegant idea called "utility theory", which tells us that if you have various preferences you are trying to maximize, there is a uniquely rational way to do that, which involves giving them relative weights and aggregating into a single metric... You seriously think there's no connection here? I honestly thought all this was obvious.
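To put a toy sketch on the connection I mean (the particular values, weights, and actions below are all invented for illustration):

```python
# Treat a few "godshatter" terminal values as dimensions, weight them, and
# rank actions by the single aggregated score -- decision-theoretic utility
# standing in for the utilitarian's "good".
weights = {"pleasure": 0.5, "freedom": 0.3, "beauty": 0.2}

actions = {
    "build_park":   {"pleasure": 6, "freedom": 2, "beauty": 9},
    "build_casino": {"pleasure": 8, "freedom": 5, "beauty": 1},
}

def utility(outcome):
    return sum(weights[k] * outcome[k] for k in weights)

best = max(actions, key=lambda name: utility(actions[name]))
print({name: utility(o) for name, o in actions.items()}, "->", best)
```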
In that last link, they say "Now, it is sometimes claimed that one may use decision-theoretic utility as one possible implementation of the utilitarian’s 'utility'" then go on to say why this is wrong, but I don't find it to be a knockdown argument; that is basically what I believe and I think I stand by it. Like, if you plug "aggregate human well-being along all relevant dimensions" into the utility of utility theory, I don't see how you don't get exactly utilitarianism out of that, or at least one version of it?
EDIT: Please also see in the above post under "You should never try to reason using expected utilities again. It is an art not meant for you. Stick to intuitive feelings henceforth." It seems to me that Eliezer goes on to consistently treat the "expected utilities" of utility theory as synonymous with the "utilities" of utilitarianism and the "consequences" of consequentialism. Do you agree that he's doing this? If so, I assume you think he's wrong for doing it? Eliezer tends to call himself a utilitarian. Do you agree that he is one, or is he something else? What would you call "using expected utility theory to make moral decisions, taking the terminal value to be human well-being"?
I meant to convey a utility function with certain human values as terminal values, such as pleasure, freedom, beauty, etc.; godshatter was a stand-in.
If the idea of a utility function has literally nothing to do with moral utilitarianism, even around here, I would question why Eliezer references expected utility calculations in the above when discussing moral questions. I would also point to “intuitions behind utilitarianism” as pointing at connections between the two. Or “shut up and multiply”? Need I go on?
I know classical utilitarianism is not exactly the same, but even in what you linked, it talks about maximizing the total sum of human happiness and sacrificing some goods for others, measured under a single metric “utility”. That sounds an awful lot like a utility function trading off human terminal values? I don’t see how what I’m pointing at isn’t just a straightforward idealization of classical utilitarianism.
Hm, I worry I might be a confused LWer. I definitely agree that "having a utility function" and "being a utilitarian" are not identical concepts, but they're highly related, no? Would you agree that, to a first-approximation, being a utilitarian means having a utility function with the evolutionary godshatter as terminal values? Even this is not identical to the original philosophical meaning I suppose, but it seems highly similar, and it is what I thought people around here meant.
I'm curious about what continued role you do expect yourself to have. I think you could still have a lot of value in helping train up new researchers at MIRI. I've read you saying you've developed a lot of sophisticated ideas about cognition that are hard to communicate, but which I imagine could be transmitted more easily within MIRI. If we need a continuing group of sane people to be on the lookout for positive miracles, would you still take a relatively active role in passing on your wisdom to new MIRI researchers? I would genuinely imagine that being in more direct mind-to-mind contact with you would be useful, so I hope you don't become a hermit.
What do you think about ironically hiring Terry Tao?
Eliezer, do you have any advice for someone wanting to enter this research space at (from your perspective) the eleventh hour? I’ve just finished a BS in math and am starting a PhD in CS, but I still don’t feel like I have the technical skills to grapple with these issues, and probably won’t for a few years. What are the most plausible routes for someone like me to make a difference in alignment, if any?
I think I might actually be happy to take e.g. the Bellman equation, a fundamental equation in RL, as a basic expression of consistent utilities and thereby claim value iteration, Q-learning, and deep Q-learning all as predictions/applications of utility theory. Certainly this seems fair if you claim applications of the central limit theorem for probability theory.
To expand a bit, the Bellman equation only expresses a certain consistency condition among utilities: the expected utility of a state must equal its immediate utility plus the best expected utility among the possible next states I could choose. Start with some random utilities assigned to states, gradually update them to be consistent, and you get optimal behavior. Huge parts of RL are centered around this equation, including e.g. DeepMind using DQNs to crack Atari games.
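As a concrete sketch of "start with some random utilities and update them to be consistent" (the five-state gridworld, the reward, and the discount factor are all invented for illustration):

```python
import numpy as np

# Value iteration on a toy 1D gridworld: five states in a row, reward only in
# the rightmost state. Each sweep re-applies the Bellman consistency condition
# until the values (and hence the behavior) stop changing.
n_states, gamma = 5, 0.9
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
V = np.random.rand(n_states)                 # arbitrary starting "utilities"

def step(s, a):                              # a = -1 (left) or +1 (right)
    return min(max(s + a, 0), n_states - 1)

for _ in range(200):
    V = np.array([
        rewards[s] + gamma * max(V[step(s, a)] for a in (-1, +1))
        for s in range(n_states)
    ])

print(np.round(V, 2))                                                          # utilities rise toward the reward
print([max((-1, +1), key=lambda a: V[step(s, a)]) for s in range(n_states)])   # greedy policy: always move right
```

The same consistency condition, with a learned function approximator standing in for the table, is what a DQN is being trained to satisfy.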
I understand Eliezer's frustration in answering this question. The response to "What predictions/applications does utility theory have?" in regards to intelligent behavior is, essentially, "Everything and nothing."
Hey all, organizer here. I don't know if you'll automatically get notified of this message, but I don't have emails for everyone. I just wanted to give some parking info. You can get a daily parking pass here: https://parking.ucf.edu/permits/visitor-permits/. You can get a virtual pass and they'll check by plate. It's $3. I'd recommend parking in Garage A or I. Hope to see everyone there!
From what I understand, it's difficult enough to get an abortion as it is. Clinics are rather rare, insurance doesn't always cover it, there may be mandatory waiting periods and counseling, etc. I don't think it would be impossible to still get one, but the added inconvenience is not trivial. At minimum, a big increase in travel time and probable insurance complications. But if someone here knows more than me, I'd very much like to hear it.
I'd like to note that Texas is passing strong restrictions on abortion. They've passed a "heartbeat bill" banning abortions after six weeks, and it seems likely that they'll pass a trigger bill outlawing abortion almost entirely, contingent on the Supreme Court overturning Roe v Wade.
I'm not a Supreme Court expert, but I know people who are sincerely worried about Roe v Wade being undone. This would be a pretty big deal breaker for my fiancée (and by extension myself). From what I read, the Supreme Court will make a Roe v Wade ruling in the middle of 2022.
Does this factor into your considerations? I feel like this would be a pretty big deal for the rationalist community at large.
I think you need to fix the days listed on the application form, they say August 17th - 20th.
I don't have a wonderful example of this insta-feedback (which definitely sounds ideal for learning), but I've gotten annoyed lately with any math book that doesn't have exercises. Some of the books on MIRI's Research Guide list are like this, and it really boggles my mind how anyone could learn math from a book when they don't have anything to practice with. So I'm getting more selective.
Even some books with exercises are super hard, and really don't have any kind of walkthrough process. AI: A Modern Approach is a critically acclaimed textbook, but it has little to no build-up in the difficulty of its exercises, and little to help you if you get lost. Right now I'm reading How to Prove It, which is *super* good. The whole book is one big walkthrough on mathematical proof, down to how to organize your scratch work. It has tons of exercises of varying difficulty, with some answers and hints. It's much better feedback, and is helping me a lot more, although the material is comparatively simple.
Well, I tend to throw them onto my general to-read list, so I'm not entirely sure. A few I remember are Gödel, Escher, Bach, Judgment Under Uncertainty: Heuristics and Biases, Influence: Science and Practice, The End of Time, QED: The Strange Theory of Light and Matter, The Feynman Lectures, Surely You're Joking, Mr. Feynman!, Probability Theory: The Logic of Science, Probabilistic Reasoning in Intelligent Systems, and The Player of Games. There's a longer list here, but it's marked as outdated.
Sounds awesome! A meatspace group would be great, I'm sure. One of my issues with self-study is having nobody to go to when I have questions or don't understand something. Having an empirical goal can also tell you if you've succeeded or failed in your attempt to learn the art.
I definitely agree that there's a bigger issue, but I think this could be a good small-scale test. Can we apply our own individual rationality to pick up skills relevant to us and distinguish between good and bad practices? Are we able to coordinate as a community to distinguish between good and bad science? Rationality should in theory be able to work on big problems, but we're never going to be able to craft the perfect art without being able to test it on smaller problems first and hone the skill.
So yeah. I think a guide putting together good resources and also including practical advice in the posts and comments could be useful. Something like this could be the start of answering Eliezer's questions "how do we test which schools of rationality work" and "how do we fight akrasia". That second question might be easier once we've seen the skills work in practice. Maybe I should make a guide first to get the ball rolling, but I'm not sure I know a topic in-depth enough to craft one just yet.