## Posts

The Artificial Intentional Stance 2019-07-27T07:00:47.710Z · score: 14 (5 votes)
Some Comments on Stuart Armstrong's "Research Agenda v0.9" 2019-07-08T19:03:37.038Z · score: 22 (7 votes)
Training human models is an unsolved problem 2019-05-10T07:17:26.916Z · score: 16 (6 votes)
Value learning for moral essentialists 2019-05-06T09:05:45.727Z · score: 13 (5 votes)
Humans aren't agents - what then for value learning? 2019-03-15T22:01:38.839Z · score: 20 (6 votes)
How to get value learning and reference wrong 2019-02-26T20:22:43.155Z · score: 40 (10 votes)
Philosophy as low-energy approximation 2019-02-05T19:34:18.617Z · score: 40 (21 votes)
Can few-shot learning teach AI right from wrong? 2018-07-20T07:45:01.827Z · score: 16 (5 votes)
Boltzmann Brains and Within-model vs. Between-models Probability 2018-07-14T09:52:41.107Z · score: 19 (7 votes)
Is this what FAI outreach success looks like? 2018-03-09T13:12:10.667Z · score: 53 (13 votes)
Book Review: Consciousness Explained 2018-03-06T03:32:58.835Z · score: 101 (27 votes)
A useful level distinction 2018-02-24T06:39:47.558Z · score: 26 (6 votes)
Explanations: Ignorance vs. Confusion 2018-01-16T10:44:18.345Z · score: 18 (9 votes)
Empirical philosophy and inversions 2017-12-29T12:12:57.678Z · score: 8 (3 votes)
Dan Dennett on Stances 2017-12-27T08:15:53.124Z · score: 8 (4 votes)
Philosophy of Numbers (part 2) 2017-12-19T13:57:19.155Z · score: 11 (5 votes)
Philosophy of Numbers (part 1) 2017-12-02T18:20:30.297Z · score: 25 (9 votes)
Limited agents need approximate induction 2015-04-24T21:22:26.000Z · score: 1 (1 votes)

Comment by charlie-steiner on A Critique of Functional Decision Theory · 2019-09-15T07:07:46.868Z · score: 2 (1 votes) · LW · GW

There's an interesting relationship with mathematizing of decision problems here, which I think is reflective of normal philosophy practice.

For example, in the Smoking Lesion problem, and in similar cases where you consider an agent to have "urges" or "dispositions" etc., it's important to note that these are pre-mathematical descriptions of something we'd like our decision theory to handle, and that trying to apply them directly to a mathematical theory is to commit a sort of type error.

Specifically, a decision-making procedure that "has a disposition to smoke" is not FDT. It is some other decision theory that has the capability to operate in uncertainty about its own dispositions.

I think it's totally reasonable to say that we want to research decision theories that are capable of this, because this epistemic state of not being quite sure of your own mind is something humans have to deal with all the time. But one cannot start with a mathematically specified decision theory like proof-based UDT or causal-graph-based CDT and then ask what it would do "if it had the smoking lesion." It's a question that seems intuitively reasonable but, when made precise, is nonsense.

I think what this feels like to philosophers is giving the verbal concepts primacy over the math. (With positive associations to "concepts" and negative associations to "math" implied). But what it leads to in practice is people saying "but what about the tickle defense?" or "but what about different formulations of CDT" as if they were talking about different facets of unified concepts (the things that are supposed to have primacy), when these facets have totally distinct mathematizations.

At some point, if you know that a tree falling in the forest makes the air vibrate but doesn't lead to auditory experiences, it's time to stop worrying about whether it makes a sound.

So obviously I'm on the pro-math side (as is LW orthodoxy), and I think most philosophers are on the pro-concepts side (I'd say "pro-essences," but that's a bit too on the nose). But, importantly, if we agree that this descriptive difference exists, then we can at least work to bridge it by being clear about whether we're using the math perspective or the concept perspective. Then we can keep different mathematizations strictly separate when using the math perspective, but work to amalgamate them when talking about concepts.

Comment by charlie-steiner on SSC Meetups Everywhere: Champaign-Urbana, IL · 2019-09-14T19:52:47.446Z · score: 4 (2 votes) · LW · GW

I'll probably show up and bring snacks. Specifically, snacks that I won't mind eating if no-one else is there :P

Comment by charlie-steiner on Theory of Ideal Agents, or of Existing Agents? · 2019-09-14T03:02:43.276Z · score: 2 (1 votes) · LW · GW

Yeah. As someone more on the description side, I would say the problems are even more different than you make out, because the description problem isn't just "find a model that fits humans." It's figuring out ways to model the entire world in terms of the human-perceived environment, and using the right level of description in the right context.

Comment by charlie-steiner on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-11T18:00:33.451Z · score: 4 (2 votes) · LW · GW

The Ernie Davis interview was pretty interesting as a really good delve into what people are thinking when they don't see AI alignment work as important.

• The disagreement on how impactful superintelligent AI would be seems important, but not critically important. As long as you agree the impact of AIs that make plans about the real world will be "big enough," you're probably on board with wanting them to make plans that are aligned with human values.
• The "common sense" disagreement definitely seems more central. The argument goes something like "Any AI that actually makes good plans has to have common sense about the world, it's common sense that killing is wrong, the AI won't kill people."
• Put like this, there's a bit of a package deal fallacy, where common sense is treated as a package deal even though "fire is hot" and "killing is bad" are easy to logically separate.
• But we can steelman this by talking about learning methods - if we have a learning method that can learn all the common-sense facts like "fire is hot," wouldn't it be easy to also use it to learn "killing is bad"? Well, maybe not, because of the is/ought distinction. If the AI represents "is" statements with a world model, and then rates actions in the world using an "ought" model, then it's possible for a method to do really well at learning "is"es without being good at learning "ought"s.
• Thinking in terms of learning methods also opens up a second question - is it really necessary for an AI to have human-like common sense? If you just throw a bunch of compute at the problem, could you get an AI that takes clever actions in the world without ever learning the specific fact "fire is hot"? How likely is this possibility?

Comment by charlie-steiner on Algorithmic Similarity · 2019-08-23T18:10:54.460Z · score: 3 (2 votes) · LW · GW

Hey, this is well written.

Out of curiosity, how do you feel about Rice's Theorem?

Comment by charlie-steiner on A basic probability question · 2019-08-23T08:50:45.097Z · score: 3 (2 votes) · LW · GW

To elaborate, A->B is an operation with a truth table:

```
A    B    A->B
T    T    T
T    F    F
F    T    T
F    F    T
```

The only thing that falsifies A->B is if A is true but B is false. This is different from how we usually think about implication, because it's not like there's any requirement that you can deduce B from A. It's just a truth table.

But it is relevant to probability, because if A->B, then you're not allowed to assign high probability to A but low probability to B.

EDIT: Anyhow I think that paragraph is a really quick and dirty way of phrasing the incompatibility of logical uncertainty with normal probability. The issue is that in normal probability, logical steps are things that are allowed to happen inside the parentheses of the P() function. No matter how complicated the proof of φ, as long as the proof follows logically from premises, you can't doubt φ more than you doubt the premises, because the P() function thinks that P(premises) and P(logical equivalent of premises according to Boolean algebra) are "the same thing."
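
To make that constraint concrete, here's a minimal sketch (toy numbers of my own, not from the original discussion). A joint distribution over (A, B) assigns probability to each row of the truth table, and learning "A -> B" just means ruling out the one falsifying row, A true and B false:

```python
# Probabilities for the four truth-table rows: TT, TF, FT, FF.
def marginals(p_tt, p_tf, p_ft, p_ff):
    """Return (P(A), P(B)) from the four row probabilities."""
    return p_tt + p_tf, p_tt + p_ft

# With P(A and not B) = 0, P(A) can never exceed P(B),
# since P(A) = P(TT) while P(B) = P(TT) + P(FT):
p_a, p_b = marginals(p_tt=0.7, p_tf=0.0, p_ft=0.1, p_ff=0.2)
assert p_a <= p_b  # high probability on A forces at-least-as-high on B
```

Any assignment with the TF row zeroed out satisfies the same inequality, which is exactly the "not allowed to assign high probability to A but low probability to B" constraint.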

Comment by charlie-steiner on Vague Thoughts and Questions about Agent Structures · 2019-08-23T07:03:33.564Z · score: 3 (2 votes) · LW · GW

Ah, MIRI summer fellows! Maybe that's why there's so many posts today.

I think that if there's a dichotomy, it's "abstract/ideal agents" vs. "physical 'agents'".

Physical agents, like humans, don't have to be anything like agent clusters - there doesn't have to be any ideal agent hiding inside them. Instead, thinking about them as agents is a descriptive step taken by us, the people modeling them. The key philosophical technology is the intentional stance.

(Yeah, I do feel like "read about the intentional stance" is this year's "read the sequences")

On to the meat of the post - agents are already very general, especially if you allow preferences over world-histories, at which point they become really general. Maybe it makes more sense to think of these things as languages in which some things are simple and others are complicated? At which point I think you have a straightforward distance function between languages (how surprising one language is, on average, to another), but no sense of equivalency aside from identical rankings.

Comment by charlie-steiner on Does Agent-like Behavior Imply Agent-like Architecture? · 2019-08-23T04:34:35.930Z · score: 5 (3 votes) · LW · GW

Consider the Sphex wasp, doing the same thing in response to the same stimulus. Would you say that this is not an agent, or would you say that it is part of an agent, and that extended agent did search in a "world model" instantiated in the parts of the world inhabited by ancestral wasps?

At this point, if you allow "world model" to be literally anything that has mutual information with other macroscopic parts of the world, and "search" to be any process that gives you information about outcomes, then yes, I think you can guarantee that, probabilistically, getting a specific outcome requires information about that outcome (no free lunch), which implies "search" on a "world model." As for goals, we can just ignore the apparent goals of the Sphex wasp and define a "real" agent (evolution) to have a goal defined by whatever informative process was at work (survival).

Comment by Charlie Steiner on [deleted post] 2019-08-21T00:57:14.001Z

Well, maybe I didn't do a good job understanding your question :)

Decision procedures that don't return an answer, or that fail to halt, for some of the "possible" histories, seem like a pretty broad category. Ditto for decision procedures that always have an answer.

But I guess a lot of those decision procedures are boring or dumb. So maybe you were thinking about a question like "for sufficiently 'good' decision theories, do they all end up specifying responses for all counter-logical histories, or do they leave free parameters?"

Am I on the right track?

Comment by charlie-steiner on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-20T07:39:24.259Z · score: 4 (2 votes) · LW · GW

Sure. On the one hand, xkcd. On the other hand, if it works for you, that's great and absolutely useful progress.

I'm a little worried about direct applicability to RL because the model is still not fully naturalized - actions that affect goals are neatly labeled and separated rather than being a messy subset of actions that affect the world. I guess this is another one of those cases where I think the "right" answer is "sophisticated common sense," but an ad-hoc mostly-answer would still be useful conceptual progress.

Comment by charlie-steiner on Self-Supervised Learning and AGI Safety · 2019-08-19T19:28:24.917Z · score: 3 (2 votes) · LW · GW

The search thing is a little subtle. It's not that search or optimization is automatically dangerous - it's that I think the danger is that search can turn up adversarial examples / surprising solutions.

I mentioned how I think the particular kind of idiot-proofness that natural language processing might have is "won't tell an idiot a plan to blow up the world if they ask for something else." Well, I think that as soon as the AI is doing a deep search through outcomes to figure out how to make Alzheimer's go away, you lose a lot of that protection and I think the AI is back in the category of Oracles that might tell an idiot a plan to blow up the world.

Going beyond human knowledge

You make some good points about even a text-only AI having optimization pressure to surpass humans. But for the example "GPT-3" system, even if it in some sense "understood" the cure for Alzheimer's, it still wouldn't tell you the cure for Alzheimer's in response to a prompt, because it's trying to find the continuation of the prompt with highest probability in the training distribution.

The point isn't about text vs. video. The point is about the limitations of trying to learn the training distribution.

To the extent that understanding the world will help the AI learn the training distribution, in the limit of super-duper-intelligent AI it will understand more and more about the world. But it will filter that all through the intent to learn the training distribution. For example, if human text isn't trustworthy on a certain topic, it will learn to not be trustworthy on that topic either.

Comment by charlie-steiner on Distance Functions are Hard · 2019-08-16T05:02:43.998Z · score: 2 (1 votes) · LW · GW

Sure. In the case of Lincoln, I would say the problem is solved by models even as clean as Pearl-ian causal networks. But in math, there's no principled causal-network model of theorems to support counterfactual reasoning via the causal calculus.

Of course, I more or less just think that we have an unprincipled causality-like view of math that we take when we think about mathematical counterfactuals, but it's not clear that this is any help to MIRI understanding proof-based AI.

Comment by charlie-steiner on Distance Functions are Hard · 2019-08-15T17:42:02.632Z · score: 2 (1 votes) · LW · GW

I feel like this is practically a frequentist/bayesian disagreement :D It seems "obvious" to me that "If Lincoln were not assassinated, he would not have been impeached" can be about the real Lincoln as much as me saying "Lincoln had a beard" is, because both are statements made using my model of the world about this thing I label Lincoln. No reference class necessary.

Comment by charlie-steiner on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-14T18:09:13.773Z · score: 24 (10 votes) · LW · GW

Honestly? I feel like this same set of problems gets re-solved a lot. I'm worried that it's a sign of ill health for the field.

I think we understand certain technical aspects of corrigibility (indifference and CIRL), but have hit a brick wall in certain other aspects (things that require sophisticated "common sense" about AIs or humans to implement, philosophical problems about how to get an AI to solve philosophical problems). I think this is part of what leads to re-treading old ground when new people (or a person wanting to apply a new tool) try to work on AI safety.

On the other hand, I'm not sure if we've exhausted Concrete Problems yet. Yes, the answer is often "just have sophisticated common sense," but I think the value is in exploring the problems and generating elegant solutions so that we can deepen our understanding of value functions and agent behavior (like TurnTrout's work on low-impact agents). In fact, Tom's a co-author on a very good toy problems paper, many of which require a similar sort of one-off solution that still might advance our technical understanding of agents.

Comment by charlie-steiner on Categorial preferences and utility functions · 2019-08-10T18:00:53.697Z · score: 5 (3 votes) · LW · GW

I think the most "native" representation of utility functions is actually as a function from ordered triples of outcomes to real numbers. Rather than having an arbitrary (affine symmetry breaking) scale for strength of preference, set the scale of a preference by comparing to a third possible outcome.

The function is the "how much better?" function. Given possible outcomes A, B, and X, how many times better is A (relative to X) than B (relative to X)?

If A is chocolate cake, and B is ice cream, and X is going hungry, maybe the chocolate cake preference is 1.25 times stronger, so the function Betterness(chocolate cake, ice cream, going hungry) = 1.25.

This is the sort of preference that you would elicit from a gamble (at least from a rational agent, not necessarily from a human). If I am indifferent between ice cream with certainty and a gamble with probability 0.8 of chocolate cake and 0.2 of going hungry, that gives you the betterness value above.
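
Here's a sketch of how the "how much better?" function relates to an ordinary utility function, and how the gamble recovers it (the particular utility numbers are mine, chosen to reproduce the cake example; only the ratio matters):

```python
def betterness(u_a, u_b, u_x):
    """How many times better is A (relative to X) than B (relative to X)?"""
    return (u_a - u_x) / (u_b - u_x)

# Utilities chosen to reproduce the example above.
u_hungry, u_ice_cream, u_cake = 0.0, 0.8, 1.0
ratio = betterness(u_cake, u_ice_cream, u_hungry)
assert abs(ratio - 1.25) < 1e-9

# Eliciting it from the gamble: indifference between ice cream for sure
# and chocolate cake with probability p (going hungry otherwise) means
# u_ice_cream = p * u_cake + (1 - p) * u_hungry, so betterness = 1 / p.
p = 0.8
assert abs(1 / p - ratio) < 1e-9
```

Note that `betterness` is unchanged by any positive affine rescaling of the utilities, which is the sense in which the triple representation has no arbitrary scale to break.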

Anyhow, interesting post, I'm just idly commenting.

Comment by charlie-steiner on Self-Supervised Learning and AGI Safety · 2019-08-10T06:41:42.439Z · score: 6 (4 votes) · LW · GW

This is definitely an interesting topic, and I'll eventually write a related post, but here are my thoughts at the moment.

1 - I agree that using natural language prompts with systems trained on natural language makes for a much easier time getting common-sense answers - a particular sort of idiot-proofing that prevents the hypothetical idiot from having the AI tell them how to blow up the world. You use the example of "How would we be likely to cure Alzheimer's?" - but for a well-trained natural language Oracle, you could even ask "How should we cure Alzheimer's?"

If it was an outcome pump with no particular knowledge of humans, it would give you a plan that would set off our nuclear arsenals. A superintelligent search process with an impact penalty would tell you how to engineer a very unobtrusive virus. A perfect world model with no special knowledge of humans would tell you a series of configurations of quantum fields. These are all bad answers.

What you want the Oracle to tell you is the sort of plan that might practically be carried out, or some other useful information, that leads to an Alzheimer's cure in the normal way that people mean when talking about diseases and research and curing things. Any model that does a good job predicting human natural language will take this sort of thing for granted in more or less the way you want it to.

2 - But here's the problem with curing Alzheimer's: it's hard. If you train GPT-3 on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer's, it won't tell you a cure, it will tell you what humans have said about curing Alzheimer's.

If you train a simultaneous model (like a neural net or a big transformer or something) of human words plus sensor data of the surrounding environment (like how an image-captioning AI can be thought of as having a simultaneous model of words and pictures), and figure out how to control the amount of detail of the verbal output, you might be able to prompt the AI with text about an Alzheimer's cure, have it model a physical environment that it expects those words to take place in, and then translate that back into text describing the predicted environment in detail. But it still wouldn't tell you a cure. It would just tell you a plausible story about a situation related to the prompt, based on its training data. Rather than a logical Oracle, this image-captioning-esque scheme would be an intuitive Oracle, telling you things that make sense based on associations already present within the training set.

What am I driving at here, by pointing out that curing Alzheimer's is hard? It's that the designs above are missing something, and what they're missing is search.

I'm not saying that getting a neural net to directly output your cure for Alzheimer's is impossible. But it seems like it requires there to already be a "cure for Alzheimer's" dimension in your learned model. The more realistic way to find the cure for Alzheimer's, if you don't already know it, is going to involve lots of logical steps one after another, slowly moving through a logical space, narrowing down the possibilities more and more, and eventually finding something that fits the bill. In other words, solving a search problem.

So if your AI can tell you how to cure Alzheimer's, I think either it's explicitly doing a search for how to cure Alzheimer's (or worlds that match your verbal prompt the best, or whatever), or it has some internal state that implicitly performs a search.

And once you realize you're imagining an AI that's doing search, maybe you should feel a little less confident in the idiot-proofness I talked about in section 1. Maybe you should be concerned that this search process might turn up the equivalent of adversarial examples in your representation.

3 - Whenever I see a proposal for an Oracle, I tend to try to jump to the end - can you use this Oracle to immediately construct a friendly AI? If not, why not?

A perfect Oracle would, of course, immediately give you FAI. You'd just ask it "what's the code for a friendly AI?", and it would tell you, and you would run it.

Can you do the same thing with this self-supervised Oracle you're talking about? Well, there might be some problems.

One problem is the search issue I just talked about - outputting functioning code with a specific purpose is a very search-y sort of thing to do, and not a very big-ol'-neural-net thing to do, even moreso than outputting a cure for Alzheimer's. So maybe you don't fully trust the output of this search, or maybe there's no search and your AI is just incapable of doing the task.

But I think this is a bit of a distraction, because the basic question is whether you trust this Oracle with simple questions about morality. If you think the AI is just regurgitating an average answer to trolley problems or whatever, should you trust it when you ask for the FAI's code?

There's an interesting case to be made for "yes, actually," here, but I think most people will be a little wary. And this points to a more general problem with definitions - any time you care about getting a definition having some particularly nice properties beyond what's most predictive of the training data, maybe you can't trust this AI.

Comment by charlie-steiner on Why Gradients Vanish and Explode · 2019-08-09T21:45:56.273Z · score: 3 (2 votes) · LW · GW

That proof of the instability of RNNs is very nice.

The version of the vanishing gradient problem I learned is simply that if you're updating weights proportional to the gradient, and your average weight somehow ends up as 0.98, then as you increase the number of layers your gradient, and therefore your update size, will shrink kind of like (0.98)^n, which is not the behavior you want.
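
A toy version of that (0.98)^n behavior (numbers are mine, purely illustrative): in a deep chain where each layer contributes a factor of about 0.98 to the backpropagated gradient, the update size for the earliest layers shrinks geometrically with depth.

```python
# Chain rule through n_layers identical layers: the gradient reaching the
# first layer is the product of the per-layer factors, here w ** (n - 1).
w, n_layers = 0.98, 200
grad_scale = w ** (n_layers - 1)
assert grad_scale < 0.02  # early-layer updates are tens of times too small
```

The mirror-image case, an average factor slightly above 1, gives the exploding-gradient version of the same geometric story.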

Comment by Charlie Steiner on [deleted post] 2019-08-06T11:09:19.423Z

One sufficient condition for always defining actions is when a decision theory can give decisions as a function of the state of the world. For example, CDT evaluates outcomes in a way purely dependent on the world's state. A more complicated way of doing this is if your decision theory takes in a model of the world and outputs a policy, which tells you what to do in each state of the world.

Comment by charlie-steiner on Very different, very adequate outcomes · 2019-08-03T04:57:14.364Z · score: 2 (1 votes) · LW · GW

And of course you can go further and have different utility functions that all have similarly valid claims to be the "right" one, because they're all similarly good generalizations of our behavior into a consistent function on a much larger domain.

Comment by charlie-steiner on Just Imitate Humans? · 2019-07-29T13:25:54.722Z · score: 5 (2 votes) · LW · GW

Yeah I agree that this might secretly be the same as a question about uploads.

If you're only trying to copy human behavior in a coarse-grained way, you immediately run into a huge generalization problem because your human-imitation is going to have to make plans where it can copy itself, think faster as it adds more computing power, can't get a hug, etc, and this is all outside of the domain it was trained on.

So if people aren't being very specific about human imitations, I kind of assume they're really talking and thinking about basically-uploads (i.e. imitations that generalize to this novel context by having a model of human cognition that attempts to be realistic, not merely predictive).

Comment by charlie-steiner on What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) · 2019-07-29T09:38:53.480Z · score: 4 (2 votes) · LW · GW

Could you expand on why you think that information / entropy doesn't match what you mean by "amount of optimization done"?

E.g. suppose you're training a neural network via gradient descent. If you start with weights drawn from some broad distribution, after training they will end up in some narrower distribution. This seems like a good metric of "amount of optimization done to the neural net."

I think there are two categories of reasons why you might not be satisfied - false positive and false negative. False positives would be "I don't think much optimization has been done, but the distribution got a lot narrower," and false negatives would be "I think more optimization is happening, but the distribution isn't getting any narrower." Did you have a specific instance of one of these cases in mind?
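
Here's the narrowing-distribution metric made concrete (the setup and numbers are mine, a deliberately crude stand-in for actual training): treat "amount of optimization done" as the drop in entropy of the weight distribution, measured in bits.

```python
import math, random

random.seed(0)
# Weights start broadly distributed; "training" narrows them. Scaling by
# 0.1 is a stand-in for whatever narrowing gradient descent actually does.
before = [random.gauss(0.0, 1.0) for _ in range(10_000)]
after = [x * 0.1 for x in before]

def std(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

# For (near-)Gaussian weights, the differential entropy drops by log2 of
# the ratio of standard deviations - here log2(10), about 3.3 bits/weight.
bits_per_weight = math.log2(std(before) / std(after))
assert abs(bits_per_weight - math.log2(10)) < 1e-6
```

Whether this number tracks your intuitive "amount of optimization" in the false-positive and false-negative cases above is exactly the question.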

Comment by charlie-steiner on The Self-Unaware AI Oracle · 2019-07-22T22:11:11.370Z · score: 3 (2 votes) · LW · GW

Here's a more general way of thinking about what you're saying that I find useful: It's not that self-awareness is the issue per se, it's that you're avoiding building an agent - by a specific technical definition of "agent."

Agents, in the sense I think is most useful when thinking about AI, are things that choose actions based on the predicted consequences of those actions.

On some suitably abstract level of description, agents must have available actions, they must have some model of the world that includes a free parameter for different actions, and they must have a criterion for choosing actions that's a function of what the model predicts will happen when it takes those actions. Agents are what is dangerous, because they steer the future based on their criterion.

What you describe in this post is an AI that has actions (outputting text to a text channel), and has a model of the world. But maybe, you say, we can make it not an agent, and therefore a lot less dangerous, by making it so that there is no free parameter in the model for the agent to try out different actions. and instead of choosing its action based on consequences, it will just try to describe what its model predicts.

Thinking about it in terms of agents like this explains why "knowing that it's running on a specific computer" has the causal powers that it does - it's a functional sort of "knowing" that involves having your model of the world impacted by your available actions in a specific way. Simply putting "I am running on this specific computer" into its memory wouldn't make it an agent - and if it chooses what text to output based on predicted consequences, it's an agent whether or not it has "I am running on this specific computer" in its memory.

So, could this work? Yes. It would require a lot of hard, hard work on the input/output side, especially if you want reliable natural language interaction with a model of the entire world, and you still have to worry about the inner optimizer problem, particularly e.g. if you're training it in a way that creates an incentive for self-fulfilling prophecy or some other implicit goal.

The basic reason I'm pessimistic about the general approach of figuring out how to build safe non-agents is that agents are really useful. If your AI design requires a big powerful model of the entire world, that means that someone is going to build an agent using that big powerful model very soon after. Maybe this tool gives you some breathing room by helping suppress competitors, or maybe it makes it easier to figure out how to build safe agents. But it seems more likely to me that we'll get a good outcome by just directly figuring out how to build safe agents.

Comment by charlie-steiner on Some Comments on Stuart Armstrong's "Research Agenda v0.9" · 2019-07-16T19:12:44.242Z · score: 2 (1 votes) · LW · GW

> "I don't trust humans to be a trusted source when it comes to what an AI should do with the future lightcone."
>
> First, let's acknowledge that this is a new objection you are raising which we haven't discussed yet, eh? I'm tempted to say "moving the goalposts", but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)

Sure :) I've said similar things elsewhere, but I suppose one must sometimes talk to people who haven't read one's every word :P

We're being pretty vague in describing the human-AI interaction here, but I agree that one reason why the AI shouldn't just do what it would predict humans would tell it to do (or, if below some threshold of certainty, ask a human) is that humans are not immune to distributional shift.

There are also systematic factors, like preserving your self-image, that sometimes make humans say really dumb things about far-off situations because of more immediate concerns.

Lastly, figuring out what the AI should do with its resources is really hard, and figuring out which to call "better" between two complicated choices can be hard too, and humans will sometimes do badly at it. Worst case, the humans appear to answer hard questions with certainty, or conversely the questions the AI is most uncertain about slowly devolve into giving humans hard questions and treating their answers as strong information.

I think the AI should actively take this stuff into account rather than trying to stay in some context where it can unshakeably trust humans. And by "take this into account," I'm pretty sure that means model the human and treat preferences as objects in the model.

Skipping over the intervening stuff I agree with, here's that Eliezer quote:

> Eliezer Yudkowsky wrote: "If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back."
>
> Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
>
> If you disagree, it seems that we have no hope of success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can't be trusted to build it right. So we might as well just give up now.

I think Upload Paul Christiano would just go on to work on the alignment problem, which might be useful but is definitely passing the buck.

Though I'm not sure. Maybe Upload Paul Christiano would be capable of taking over the world and handling existential threats before swiftly solving the alignment problem. Then it doesn't really matter if it's passing the buck or not.

But my original thought wasn't about uploads (though that's definitely a reasonable way to interpret my sentence), it was about copying human decision-making behavior in the same sense that an image classifier copies human image-classifying behavior.

Though maybe you went in the right direction anyhow, and if all you had was supervised learning the right thing to do is to try to copy the decision-making of a single person (not an upload, a sideload). What was that Greg Egan book - Zendegi?

> so far, it hasn't really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven't really needed to develop special methods to solve this specific type of problem. (Correct me if I'm wrong.)

There are some cases where the AI specifically has a model of the human, and I'd call those "special methods." Not just IRL, the entire problem of imitation learning often uses specific methods to model humans, like "value iteration networks." This is the sort of development I'm thinking of that helps AI do a better job at generalizing human values - I'm not sure if you meant things at a lower level, like using a different gradient descent optimization algorithm.

Comment by charlie-steiner on Some Comments on Stuart Armstrong's "Research Agenda v0.9" · 2019-07-15T01:05:31.307Z · score: 2 (1 votes) · LW · GW

Ah, but I don't trust humans to be a trusted source when it comes to what an AI should do with the future lightcone. I expect you'd run into something like Scott talks about in The Tails Coming Apart As Metaphor For Life, where humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.

As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is "learning and respecting human preferences", object recognition is "human preferences about how to categorize images", and sentiment analysis is "human preferences about how to categorize sentences".

I somewhat agree, but you could equally well call them "learning human behavior at categorizing images," "learning human behavior at categorizing sentences," etc. I don't think that's enough. If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.

So this is two separate problems: one, I think humans can't reliably tell an AI what they value through a text channel, even with prompting, and two, I think that mimicking human behavior, even human behavior on moral questions, is insufficient to deal with the possibilities of the future.

I've never heard anyone in machine learning divide the field into cases where we're trying to generalize about human values and cases where we aren't. It seems like the same set of algorithms, tricks, etc. work either way.

It also sounds silly to say that one can divide the field into cases where you're doing model-based reinforcement learning, and cases where you aren't. The point isn't the division, it's that model-based reinforcement learning is solving a specific type of problem.

Let me take another go at the distinction: Suppose you have a big training set of human answers to moral questions. There are several different things you could mean by "generalize well" in this case, which correspond to solving different problems.

The first kind of "generalize well" is where the task is to predict moral answers drawn from the same distribution as the training set. This is what most of the field is doing right now for Ian Goodfellow's examples of categorizing images or categorizing sentences. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing the test set.

Another sort of "generalize well" might be inferring a larger "real world" distribution even when the training set is limited. For example, if you're given labeled data mapping handwritten numbers 0-20 to binary outputs, can you give the correct binary output for 21? How about 33? In our moral questions example, this would be like predicting answers to moral questions spawned by novel situations not seen in training. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing examples later drawn from the real world.

Let's stop here for a moment and point out that if we want generalization in the second sense, algorithmic advances in the first sense might be useful, but they aren't sufficient. For the classifier to output the binary for 33, it probably has to be deliberately designed to learn flexible representations, and probably get fed some additional information (e.g. by transfer learning). When the training distribution and the "real world" distribution are different, you're solving a different problem than when they're the same.

A third sort of "generalize well" is to learn superhumanly skilled answers even if the training data is flawed or limited. Think of an agent that learns to play Atari games at a superhuman level, from human demonstrations. This generalization task often involves filling in a complex model of the human "expert," along with learning about the environment - for current examples, the model of the human is usually hand-written. The better we get at generalizing in this way, the more the AI's answers will be like "what we meant" (either by some metric we kept hidden from the AI, or in some vague intuitive sense) even if they diverge from what humans would answer.

(I'm sure there are more tasks that fall under the umbrella of "generalization," but you'll have to suggest them yourself :) )

So while I'd say that value learning involves generalization, I think that generalization can mean a lot of different tasks - a rising tide of type 1 generalization (which is the mathematically simple kind) won't lift all boats.

Comment by charlie-steiner on Some Comments on Stuart Armstrong's "Research Agenda v0.9" · 2019-07-14T00:07:18.517Z · score: 2 (1 votes) · LW · GW

Yes, I agree that generalization is important. But I think it's a bit too reductive to think of generalization ability as purely a function of the algorithm.

For example, an image-recognition algorithm trained with dropout generalizes better, because dropout acts like an extra goal telling the training process to search for category boundaries that are smooth in a certain sense. And the reason we expect that to work is because we know that the category boundaries we're looking for are in fact usually smooth in that sense.

So it's not like dropout is a magic algorithm that violates a no-free-lunch theorem and extracts generalization power from nowhere. The power that it has comes from our knowledge about the world that we have encoded into it.

(And there is a no free lunch theorem here. How to generalize beyond the training data is not uniquely encoded in the training data, every bit of information in the generalization process has to come from your model and training procedure.)
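To be concrete about what dropout mechanically does (a minimal NumPy sketch of "inverted dropout," not any particular framework's implementation): the regularizing power comes from forcing the network to predict well under many random sub-networks, which is exactly the sort of smoothness assumption we've encoded.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p_drop=0.5, train=True):
    """Inverted dropout: zero a random subset of activations during training,
    rescaling the survivors so the expected activation is unchanged."""
    if not train:
        return x  # test time: deterministic, full network
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

x = np.ones(100_000)
y = dropout_forward(x, p_drop=0.5)
print((y == 0).mean())  # about half the units are dropped
print(y.mean())         # but the mean activation stays near 1.0
```

The noise itself is the "extra goal": nothing here knows anything about category boundaries, so any smoothness benefit comes from our prior belief that solutions robust to this noise generalize better.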

For value learning, we want the AI to have a very specific sort of generalization skill when it comes to humans. It has to not only predict human actions, it has to make a very particular sort of generalization ("human values"), and single out part of that generalization to make plans with. The information to pick out one particular generalization rather than another has to come from humans doing hard, complicated work, even if it gets encoded into the algorithm.

Comment by charlie-steiner on Please give your links speaking names! · 2019-07-12T05:54:06.506Z · score: 7 (3 votes) · LW · GW

I'm guilty, I'll try to do better :)

Comment by charlie-steiner on Some Comments on Stuart Armstrong's "Research Agenda v0.9" · 2019-07-12T05:33:35.150Z · score: 2 (1 votes) · LW · GW
I don't understand why you're so confident. It doesn't seem to me that my values are divorced from biology (I want my body to stay healthy) or population statistics (I want a large population of people living happy lives).

When I say your preference is "more abstract than biology," I'm not saying you're not allowed to care about your body, I'm saying something about what kind of language you're speaking when you talk about the world. When you say you want to stay healthy, you use a fairly high-level abstraction ("healthy"), you don't specify which cell organelles should be doing what, or even the general state of all your organs.

This choice of level of abstraction matters for generalization. At our current level of technology, an abstract "healthy" and an organ-level description might have the same outcomes, but at higher levels of technology, maybe someone who preferred to be healthy would be fine becoming a cyborg, while someone who wanted to preserve some lower-level description of their body would be against it.

"Once it starts encoding the world differently than we do, it won't have the generalization properties we want - we'd be caught cheating, as it were."
Are you sure?

I think the right post to link here is this one by Kaj Sotala. I'm not totally sure - there may be some way to "cheat" in practice - but my default view is definitely that if the AI carves the world up along different boundaries than we do, it won't generalize in the same way we would, given the same patterns.

Nice find on the Bostrom quote btw.

I think your claim proves too much. Different human brains have different encodings, and yet we are still able to learn the values of other humans (for example, when visiting a foreign country) reasonably well when we make an honest effort.

I would bite this bullet, and say that when humans are doing generalization of values into novel situations (like trolley problems, or utopian visions), they can end up at very different places even if they agree on all of the everyday cases.

If you succeed at learning the values of a foreigner, so well that you can generalize those values to new domains, I'd suspect that the simplest way for you to do it involves learning about what concepts they're using well enough to do the right steps in reasoning. If you just saw a snippet of their behavior and couldn't talk to them about their values, you'd probably do a lot worse - and I think that's the position many current value learning schemes place AI in.

Comment by charlie-steiner on Some Comments on Stuart Armstrong's "Research Agenda v0.9" · 2019-07-12T05:07:44.006Z · score: 2 (1 votes) · LW · GW

I'd definitely be interested in your thoughts about preferences when you get them into a shareable shape.

In some sense, what humans "really" have is just atoms moving around, all talk of mental states and so on is some level of convenient approximation. So when you say you want to talk about a different sort of approximation from Stuart, my immediate thing I'm curious about is "how can you make your way of talking about humans convenient for getting an AI to behave well?"

Comment by charlie-steiner on IRL in General Environments · 2019-07-12T04:06:24.205Z · score: 3 (2 votes) · LW · GW

A good starting point. I'm reminded of an old Kaj Sotala post (which then later provided inspiration for me writing a sort of similar post) about trying to ensure that the AI has human-like concepts. If the AI's concepts are inhuman, then it will generalize in an inhuman way, so that something like teaching a policy though demonstrations might not work.

But of course having human-like concepts is tricky and beyond the scope of vanilla IRL.

Comment by charlie-steiner on Learning biases and rewards simultaneously · 2019-07-08T22:18:23.344Z · score: 3 (2 votes) · LW · GW

I like this example of "works in practice but not in theory." Would you associate "ambitious value learning vs. adequate value learning" with "works in theory vs. doesn't work in theory but works in practice"?

One way that "almost rational" is much closer to optimal than "almost anti-anti-rational" is ye olde dot product, but a more accurate description of this case would involve dividing up the model space into basins of attraction. Different training procedures will divide up the space in different ways - this is actually sort of the reverse of a monte carlo simulation where one of the properties you might look for is ergodicity (eventually visiting all points in the space).
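The dot-product point can be made concrete by treating reward functions as vectors (a toy NumPy sketch; the dimensions and noise scale are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def cosine(a, b):
    """Cosine similarity between two reward vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

true_reward = rng.normal(size=1000)
noise = 0.1 * rng.normal(size=1000)

almost_rational = true_reward + noise  # small error in the learned reward
almost_anti = -true_reward + noise     # sign-flipped ("anti-rational") reward, same error

print(cosine(true_reward, almost_rational))  # close to +1
print(cosine(true_reward, almost_anti))      # close to -1
```

The basins-of-attraction picture is about training dynamics rather than raw geometry, of course; the dot product only captures the crude "how wrong is wrong" part.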

Comment by charlie-steiner on Modeling AGI Safety Frameworks with Causal Influence Diagrams · 2019-06-30T17:35:02.741Z · score: 2 (1 votes) · LW · GW

All good points.

The paper you linked was interesting - the graphical model is part of an AI design that actually models other agents using that graph. That might be useful if you're coding a simple game-playing agent, but I think you'd agree that you're using CIDs in a more communicative / metaphorical way?

Comment by charlie-steiner on GreaterWrong Arbital Viewer · 2019-06-28T08:36:32.257Z · score: 6 (3 votes) · LW · GW

Neat! I finally found the Solomonoff induction AI dialogue that someone made me curious about in the previous thread.

Comment by charlie-steiner on Modeling AGI Safety Frameworks with Causal Influence Diagrams · 2019-06-24T06:06:41.494Z · score: 11 (7 votes) · LW · GW

The reason I don't personally find these kinds of representation super useful is because each of those boxes is a quite complicated function, and what's in the boxes usually involves many more bits worth of information about an AI system than how the boxes are connected. And sometimes one makes different choices in how to chop an AI's operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDA are very different choppings-up of the algorithms).

I actually have a draft sitting around of how one might represent value learning schemes with a hierarchical diagram of information flow. Eventually I decided that the idea made lots of sense for a few paradigm cases and was more trouble than it was worth for everything else. When you need to carefully refer to the text description to understand a diagram, that's a sign that maybe you should use the text description.

This isn't to say I think one should never see anything like this. Different ways of presenting the same information (like diagrams) can help drive home a particularly important point. But I am skeptical that there's a one-size-fits-all solution, and instead think that diagram usage should be tailored to the particular point it's intended to make.

Comment by charlie-steiner on Is your uncertainty resolvable? · 2019-06-21T14:27:36.853Z · score: 5 (3 votes) · LW · GW

I think I must be the odd one out here in terms of comfort using probabilities close to 1 and 0. Because 90% and 99% are not "near certainty" to me.

How sure are you that the English guy who you've been told helped invent calculus and did stuff with gravity and optics was called "Isaac Newton"? We're talking about probabilities like 99.99999% here. (Conditioning on no dumb gotchas from human communication, e.g. me using a unicode character from a different language and claiming it's no longer the same, which has suddenly become much more salient to you and me both. An "internal" probability, if you will.)

Maybe it would help to think of this as about 20 bits of information past 50%? Every bit of information you can specify about something means you are assigning a more extreme probability distribution about that thing. The probability of the answer being "Isaac Newton" has a very tiny prior for any given question, and only rises to 50% after lots of bits of information. And if you could get to 50%, it's not strange that you could have quite a few more bits left over, before eventually running into the limits set by the reliability of your own brain.
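The "bits past 50%" framing is just log-odds: each bit of evidence doubles the odds ratio. A quick illustration in plain Python:

```python
import math

def bits_past_even(p):
    """Bits of evidence past 50%: log2 of the odds ratio p : (1 - p)."""
    return math.log2(p / (1 - p))

def prob_from_bits(bits):
    """Inverse: the probability implied by a given number of bits past even odds."""
    odds = 2.0 ** bits
    return odds / (1.0 + odds)

print(round(bits_past_even(0.9999999), 1))  # → 23.3, i.e. about 23 bits past even odds
print(prob_from_bits(20))                   # 20 bits past 50% is roughly 0.999999
```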

So when you say some plans require near certainty, I'm not sure if you mean what I mean but chose smaller probabilities, or if you mean some somewhat different point about social norms about when numbers are big/small enough that we are allowed to stop/start worrying about them. Or maybe you mean a third thing about legibility and communicability that is correlated with probability but not identical?

Comment by charlie-steiner on Is "physical nondeterminism" a meaningful concept? · 2019-06-19T21:14:38.880Z · score: 4 (2 votes) · LW · GW
"At the most basic level of description, things—the quantum fields or branes or whatever—just do what they do. They don’t do it nondeterministically, but they also don’t do it deterministically"
How do you know? If that claim isn't based on a model, what is it based on?

I'm happy to reply that the message of my comment as a whole applies to this part - my claim about "what things do at a basic level of description" is a meta-model claim about what you can say about things at different levels of description.

It's human nature to interpret this as a claim that things have some essence of "just doing what they do" that is a competitor to the essences of determinism and nondeterminism, but there is no such essence for the same reasons I'm already talking about in the comment. Maybe I could have worded it more carefully to prevent this reading, but I figure that would sacrifice more clarity than it gained.

The point is not about some "basic nature of things," the point is about some "basic level of description." We might imagine someone saying "I know there are some deterministic models of atoms and some nondeterministic models, but are the atoms really deterministic or not?" Where this "really" seems to mean some atheoretic direct understanding of the nature of atoms. My point, in short, is that atheoretic understanding is fruitless ("It's just one damn thing after another") and the instinct that says it's desirable is misleading.

"much better explanation of wetness would be in terms of surface tension and intermolecular forces and so on"
Why? Because they are real properties?

Because they're part of a detailed model of the world that helps tell a "functional and causal story" about the phenomenon. If I was going to badmouth one set of essences just to prop up another, I would have said so :P

You can explain away some properties in terms of others, but there is going to be some residue

My point is that this residue is never going to be the "Real Properties," they're just going to be the same theory-laden properties as always.

What makes a theory of everything a theory of everything is not that it provides a final answer for which properties are the real properties that atoms have in some atheoretic direct way. It's that it provides a useful framework in which we can understand all (literally all) sorts of stuff.

Comment by charlie-steiner on Research Agenda v0.9: Synthesising a human's preferences into a utility function · 2019-06-19T20:47:38.946Z · score: 4 (2 votes) · LW · GW

Now that I think about it, it's a pretty big PR problem if I have to start every explanation of my value learning scheme with "humans don't have actual preferences so the AI is just going to try to learn something adequate." Maybe I should figure out a system of jargon such that I can say, in jargon, that the AI is learning peoples' actual preferences, and it will correspond to what laypeople actually want from value learning.

I'm not sure whether such jargon would make actual technical thinking harder, though.

Comment by charlie-steiner on Research Agenda v0.9: Synthesising a human's preferences into a utility function · 2019-06-18T11:01:25.935Z · score: 9 (5 votes) · LW · GW

This is really similar to some stuff I've been thinking about, so I'll be writing up a longer comment with more compare/contrast later.

But one thing really stood out to me - I think one can go farther in grappling with and taking advantage of where the synthesized preferences "live." They don't live inside the human, they live in the AI's model of the human. Humans aren't idealized agents, they're clusters of atoms, which means they don't have preferences except after the sort of coarse-graining procedure you describe, and this coarse-graining procedure lives with a particular model of the human - it's not inherent in the atoms.

This means that once you've specified a value learning procedure and human model, there is no residual "actual preferences" the AI can check itself against. The challenge was never to access our "actual preferences," it was always to make a best effort to model humans as they want to be modeled. This is deeply counterintuitive ("What do you mean, the AI isn't going to learn what humans' actual preferences are?!"), but also liberating and motivating.

Comment by charlie-steiner on Is "physical nondeterminism" a meaningful concept? · 2019-06-16T21:31:16.055Z · score: 6 (3 votes) · LW · GW

Interesting question! My answer is basically a long warning about essentialism: this question might seem like it's stepping down from the realm of human models to the realm of actual things, to ask about the essence of those things. But I think better answers are going to come from stepping up from the realm of models to the realm of meta-models, to ask about the properties of models.

At the most basic level of description, things - the quantum fields or branes or whatever - just do what they do. They don't do it nondeterministically, but they also don't do it deterministically! Without recourse to human models, all us humans can say is that the things just do what they do - models are the things that make talk about categories possible in the first place.

Any answer of the sort "no, things can always be rendered into a deterministic form by treating 'random' results as fixed constants" or "yes, there are perfectly valid classes of models that include nondeterminism" is going to be an answer about models, within some meta-level framework. And that's fine!

This can seem unsatisfying because it goes against our essentialist instinct - that the properties in our models should reflect the real properties that things have. If water is considered wet, it's because water has the basic property or essence of wetness (so the instinct goes).

Note that this doesn't explain any of the mechanics or physics of wetness. If you could look inside someone's head as they were performing this essentialist maneuver, they would start with a model ("water is wet"), then they would notice their model ("I model water as wet"), then they would justify themselves to themselves, in a sort of reassuring pat on the back ("I model water as wet because water is really, specially wet").

I think that this line of self-reassuring reasoning is flawed, and a much better explanation of wetness would be in terms of surface tension and intermolecular forces and so on - illuminating the functional and causal story behind our model of the world, rather than believing you've explained wetness in terms of "real wetness". Also see the story about Bleggs and Rubes.

Long story short, any good explanation for why we should or shouldn't have nondeterminism in a model is either going to be about how to choose good models, or it's going to be a causal and functional story that doesn't preserve nondeterminism (or determinism) as an essence.

I think there's an interesting question about physics in whether or not (and how) we should include nondeterminism as an option in fundamental theories. But first I just wanted to warn that the question "models aside, are things really nondeterministic" is not going to have an interesting answer.

Comment by charlie-steiner on What kind of thing is logic in an ontological sense? · 2019-06-13T01:17:08.530Z · score: 6 (3 votes) · LW · GW

https://www.lesswrong.com/posts/mHNzpX38HkZQrdYn3/philosophy-of-numbers-part-2

Basically, we have a mental model of logic the same way we have a mental model of geography. It's useful to say that logical facts have referents for the same internal reason it's useful to say that geographical facts have referents. But if you looked at a human from outside, the causal story behind logical facts vs. geographical facts would be different.

Comment by charlie-steiner on Logic, Buddhism, and the Dialetheia · 2019-06-10T20:38:53.913Z · score: 10 (5 votes) · LW · GW

As a practicing quantum mechanic, I'd warn you against the claim that dialetheism is used in quantum computers. Qubits are sometimes described as taking "both states at the same time," but that's not precisely what's going on, and people who actually work on quantum computers use a more precise understanding that doesn't involve interpreting intermediate qubits as truth values.

There are also two people I wanted to see in your post: Russell and Gödel - mathematicians rather than philosophers. Russell's type theory was one of the main attempts to eliminate paradoxes in mathematical logic. Gödel showed how that doesn't quite work, but also showed how things become a lot clearer if you consider provability as well as truth value.

Comment by charlie-steiner on Learning magic · 2019-06-10T16:21:07.831Z · score: 19 (6 votes) · LW · GW
On Wednesdays at the Princeton Graduate College, various people would come in to give talks. The speakers were often interesting, and in the discussions after the talks we used to have a lot of fun. For instance, one guy in our school was very strongly anti-Catholic, so he passed out questions in advance for people to ask a religious speaker, and we gave the speaker a hard time.
Another time somebody gave a talk about poetry. He talked about the structure of the poem and the emotions that come with it; he divided everything up into certain kinds of classes. In the discussion that came afterwards, he said, “Isn’t that the same as in mathematics, Dr. Eisenhart?”
Dr. Eisenhart was the dean of the graduate school and a great professor of mathematics. He was also very clever. He said, “I’d like to know what Dick Feynman thinks about it in reference to theoretical physics.” He was always putting me on in this kind of situation.
I got up and said, “Yes, it’s very closely related. In theoretical physics, the analog of the word is the mathematical formula, the analog of the structure of the poem is the interrelationship of the theoretical bling-bling with the so-and so”—and I went through the whole thing, making a perfect analogy. The speaker’s eyes were beaming with happiness.
Then I said, “It seems to me that no matter what you say about poetry, I could find a way of making up an analog with any subject, just as I did for theoretical physics. I don’t consider such analogs meaningful.”
In the great big dining hall with stained-glass windows, where we always ate, in our steadily deteriorating academic gowns, Dean Eisenhart would begin each dinner by saying grace in Latin. After dinner he would often get up and make some announcements. One night Dr. Eisenhart got up and said, “Two weeks from now, a professor of psychology is coming to give a talk about hypnosis. Now, this professor thought it would be much better if we had a real demonstration of hypnosis instead of just talking about it. Therefore he would like some people to volunteer to be hypnotized.”
I get all excited: There’s no question but that I’ve got to find out about hypnosis. This is going to be terrific!
Dean Eisenhart went on to say that it would be good if three or four people would volunteer so that the hypnotist could try them out first to see which ones would be able to be hypnotized, so he’d like to urge very much that we apply for this. (He’s wasting all this time, for God’s sake!)
Eisenhart was down at one end of the hall, and I was way down at the other end, in the back. There were hundreds of guys there. I knew that everybody was going to want to do this, and I was terrified that he wouldn’t see me because I was so far back. I just had to get in on this demonstration!
Finally Eisenhart said, “And so I would like to ask if there are going to be any volunteers …”
I raised my hand and shot out of my seat, screaming as loud as I could, to make sure that he would hear me: “MEEEEEEEEEEE!”
He heard me all right, because there wasn’t another soul. My voice reverberated throughout the hall—it was very embarrassing. Eisenhart’s immediate reaction was, “Yes, of course, I knew you would volunteer, Mr. Feynman, but I was wondering if there would be anybody else.”
Finally a few other guys volunteered, and a week before the demonstration the man came to practice on us, to see if any of us would be good for hypnosis. I knew about the phenomenon, but I didn’t know what it was like to be hypnotized.
He started to work on me and soon I got into a position where he said, “You can’t open your eyes.”
I said to myself, “I bet I could open my eyes, but I don’t want to disturb the situation: Let’s see how much further it goes.” It was an interesting situation: You’re only slightly fogged out, and although you’ve lost a little bit, you’re pretty sure you could open your eyes. But of course, you’re not opening your eyes, so in a sense you can’t do it.
He went through a lot of stuff and decided that I was pretty good.
When the real demonstration came he had us walk on stage, and he hypnotized us in front of the whole Princeton Graduate College. This time the effect was stronger; I guess I had learned how to become hypnotized. The hypnotist made various demonstrations, having me do things that I couldn’t normally do, and at the end he said that after I came out of hypnosis, instead of returning to my seat directly, which was the natural way to go, I would walk all the way around the room and go to my seat from the back.
All through the demonstration I was vaguely aware of what was going on, and cooperating with the things the hypnotist said, but this time I decided, “Damn it, enough is enough! I’m gonna go straight to my seat.”
When it was time to get up and go off the stage, I started to walk straight to my seat. But then an annoying feeling came over me: I felt so uncomfortable that I couldn’t continue. I walked all the way around the hall.
I was hypnotized in another situation some time later by a woman. While I was hypnotized she said, “I’m going to light a match, blow it out, and immediately touch the back of your hand with it. You will feel no pain.”
I thought, “Baloney!” She took a match, lit it, blew it out, and touched it to the back of my hand. It felt slightly warm. My eyes were closed throughout all of this, but I was thinking, “That’s easy. She lit one match, but touched a different match to my hand. There’s nothin’ to that; it’s a fake!”
When I came out of the hypnosis and looked at the back of my hand, I got the biggest surprise: There was a burn on the back of my hand. Soon a blister grew, and it never hurt at all, even when it broke.
So I found hypnosis to be a very interesting experience. All the time you’re saying to yourself, “I could do that, but I won’t”—which is just another way of saying that you can’t.
• Surely You're Joking, Mr. Feynman!
Comment by charlie-steiner on References that treat human values as units of selection? · 2019-06-10T16:04:18.234Z · score: 3 (2 votes) · LW · GW
You can’t talk sensibly about what values are right, or what we ‘should’ build into intelligent agents.

I agree that in our usual use of the word, it doesn't make sense to talk about what (terminal) values are right.

But you agree that (within a certain level of abstraction and implied context) you can talk as if you should take certain actions? Like "you should try this dessert" is a sensible English sentence. So what about actions that impact intelligent agents?

Like, suppose there was a pill you could take that would make you want to kill your family. Should you take it? No, probably not. But now we've just expressed a preference about the values of an intelligent agent (yourself).

Modifying yourself to want bad things is wrong in the same sense that the bad things are wrong in the first place: they are wrong with respect to your current values, which are a thing we model you as having within a certain level of abstraction.

Comment by charlie-steiner on Major Update on Cost Disease · 2019-06-07T12:37:57.967Z · score: 4 (2 votes) · LW · GW

We have to separate colleges from public K-12 education. Colleges are the place where you hear about increasing numbers of non-teaching staff. K-12 actually has fewer administrators per student than 20 years ago (in most places).

Comment by charlie-steiner on How is Solomonoff induction calculated in practice? · 2019-06-05T11:53:27.708Z · score: 9 (3 votes) · LW · GW

Minimum message length fitting uses an approximation of K-complexity and gets used sometimes when people want to fit weird curves in a sort of principled way. But "real" Solomonoff induction is about feeding literally all of your sensory data into the algorithm to get predictions for the future, not just fitting curves.

So I guess I'd say that it's possible to approximate K-complexity and use that in your prior for curve fitting, and people sometimes do that. But that's not necessarily going to be your best estimate, because your best estimate is going to take into account all of the data you've already seen, which becomes impossible very quickly (even if you just want a controlled approximation).
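A toy illustration of the kind of approximation described above (my own construction, using a BIC-style two-part code rather than true MML): pick between a constant model and a linear model by total description length, where a model pays for its parameters but earns credit for shrinking the residuals.

```python
import math

def rss_constant(ys):
    """Residual sum of squares for the best constant model (the mean)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_line(xs, ys):
    """Residual sum of squares for the closed-form least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def mdl(rss, n_params, n):
    # Two-part code length (in nats): parameter cost + data-given-model cost.
    return 0.5 * n_params * math.log(n) + 0.5 * n * math.log(rss / n)

xs = list(range(20))
# Linear trend plus small deterministic "noise" so residuals are nonzero.
ys = [2.0 * x + 1.0 + (0.1 if x % 2 else -0.1) for x in xs]

n = len(xs)
mdl_const = mdl(rss_constant(ys), 1, n)
mdl_linear = mdl(rss_line(xs, ys), 2, n)
# The linear model pays for one extra parameter but wins big on residuals,
# so it gets the shorter total description.
```

True MML would encode the parameters themselves to an optimal finite precision; the penalized-likelihood form above is a common stand-in, and either way it's a far cry from feeding all of your sensory data into a universal prior.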

Comment by charlie-steiner on How is Solomonoff induction calculated in practice? · 2019-06-05T11:44:54.768Z · score: 3 (2 votes) · LW · GW

Sure. The OP might more accurately have asked "How is the Solomonoff prior calculated?"

Comment by charlie-steiner on Selection vs Control · 2019-06-03T06:29:56.753Z · score: 5 (3 votes) · LW · GW

Yeah I think this is definitely a "stance" thing.

Take the use of natural selection and humans as examples of optimization and mesa-optimization - the entire concept of "natural selection" is a human-convenient way of describing a pattern in the universe. It's approximately an optimizer, but in order to get rid of that "approximately" you have to reintroduce epicycles until your model is as complicated as a model of the world again. Humans aren't optimizers either; that's just a human-convenient way of describing humans.

More abstractly, the entire process of recognizing a mesa-optimizer - something that models the world and makes plans - is an act of stance-taking. Or Quinean radical translation or whatever. If a cat-recognizing neural net learns an attention mechanism that models the world of cats and makes plans, it's not going to come with little labels on the neurons saying "these are my input-output interfaces, this is my model of the world, this is my planning algorithm." It's going to be some inscrutable little bit of linear algebra with suspiciously competent behavior.

Not only could this competent behavior be explained either by optimization or some variety of "rote behavior," but the neurons don't care about these boundaries and can occupy a continuum of possibilities between any two central examples. And worst of all, the same neurons might have multiple different useful ways of thinking about them, some of which are in terms of elements like "goals" and "search," and others are in terms of the elements of rote behavior.

In light of this, the problem of mesa-optimizers is not "when will this bright line be crossed?" but "when will this simple model of the AI's behavior be predictably useful?" Even though I think the first instinct is the opposite.

Comment by charlie-steiner on Uncertainty versus fuzziness versus extrapolation desiderata · 2019-06-01T09:21:11.282Z · score: 5 (3 votes) · LW · GW

Nice post. I suspect you'll still have to keep emphasizing that fuzziness can't play the role of uncertainty in a human-modeling scheme (like CIRL), and is instead a way of resolving human behavior into a utility function framework. Assuming I read you correctly.

I think that there are some unspoken commitments that the framework of fuzziness makes for how to handle extrapolating irrational human behavior. If you represent fuzziness as a weighting over utility functions that gets aggregated linearly (i.e. into another utility function), this is useful for the AI making decisions but can't be the same thing that you're using to model human behavior, because humans are going to take actions that shouldn't be modeled as utility maximization.

To bridge this gap from human behavior to utility function, what I'm interpreting you as implying is that you should represent human behavior in terms of a patchwork of utility functions. In the post you talk about frequencies in a simulation, where small perturbations might lead a human to care about the total or about the average. Rather than the AI creating a context-dependent model of the human, we've somehow taught it (this part might be non-obvious) that these small perturbations don't matter, and should be "fuzzed over" to get a utility function that's a weighted combination of the ones exhibited by the human.

But we could also imagine unrolling this as a frequency over time, where an irrational human sometimes takes the action that's best for the total and other times takes the action that's best for the average. Should a fuzzy-values AI represent this as the human acting according to different utility functions at different times, and then fuzzing over those utility functions to decide what is best?
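To make the "weighting over utility functions that gets aggregated linearly" concrete, here's a toy sketch (my construction, not from the post): two candidate utility functions - total welfare and average welfare - mixed by a fuzziness weight into a single utility the AI maximizes. The outcomes and weights are hypothetical.

```python
# Hypothetical outcomes: each action yields a population of individual welfares.
outcomes = {
    "A": [4, 4, 4],  # larger population, lower individual welfare
    "B": [7, 3],     # smaller population, higher average welfare
}

def u_total(pop):
    return sum(pop)

def u_average(pop):
    return sum(pop) / len(pop)

def fuzzy_choice(w_total):
    # Linear mix of the two utilities. Note this silently assumes the two
    # are expressed on comparable scales, which is itself a normalization
    # choice the fuzziness framework has to make somewhere.
    def u(pop):
        return w_total * u_total(pop) + (1 - w_total) * u_average(pop)
    return max(outcomes, key=lambda a: u(outcomes[a]))

# A mostly-totalist mix favors the big population; a mostly-averagist mix flips it.
```

The resulting mixture is fine for the AI's decisions, but - per the point above - it can't double as a behavioral model of a human who switches between the two utility functions depending on context.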

Comment by charlie-steiner on A shift in arguments for AI risk · 2019-05-29T09:35:58.254Z · score: 7 (3 votes) · LW · GW

An alternate framing could be about changing group boundaries rather than changing demographics in an isolated group.

There were surely people in 2010 who thought that the main risk from AI was it being used by bad people. The difference might not be that these people have popped into existence or only recently started talking - it's that they're inside the fence more than before.

And of course, reality is always complicated. One of the concerns in the "early LW" genre is value stability and self-trust under self-modification, which has nothing to do with sudden growth. And one of the "recent" genre concerns is arms races, which are predicated on people expecting sudden capability growth to give them a first mover advantage.

Comment by charlie-steiner on Why the empirical results of the Traveller’s Dilemma deviate strongly away from the Nash Equilibrium and seems to be close to the social optimum? · 2019-05-25T04:41:23.097Z · score: 2 (1 votes) · LW · GW

I would guess that people don't actually compute the Nash equilibrium or expect other people to.

Instead, they use the same heuristic reasoning methods that they evolved to learn, and which have served them well in social situations their entire life, and expect other people to do the same.

I think we should expect these heuristics to be close to rational (not for the utilities of humans, but for the fitness of genes) in the ancestral environment. But there's no particular reason to think they're going to be rational by any standard in games chosen specifically because the Nash equilibrium is counterintuitive to humans.
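For reference, here's why the Nash equilibrium is so counterintuitive in this game: a brute-force best-response iteration (a sketch using the standard parameters - claims from 2 to 100, bonus/penalty of 2) unravels all the way from the social optimum down to the minimum claim.

```python
R = 2             # bonus for the lower claimant, penalty for the higher
LOW, HIGH = 2, 100

def payoff(mine, theirs):
    """Traveler's Dilemma payoff for my claim against the other's claim."""
    if mine < theirs:
        return mine + R
    if mine > theirs:
        return theirs - R
    return mine

def best_response(theirs):
    # Best claim against a fixed opponent claim, found by brute force.
    return max(range(LOW, HIGH + 1), key=lambda c: payoff(c, theirs))

# Start both travelers at the socially optimal claim and iterate best responses:
# undercutting by 1 always pays, so the claims ratchet downward step by step.
claim = HIGH
while best_response(claim) != claim:
    claim = best_response(claim)
# The only fixed point - the Nash equilibrium - is the minimum claim of 2.
```

Each step of this ladder is individually "rational," but no human heuristic shaped by real social situations walks 98 rungs down it, which is exactly the sense in which the game is selected to make the equilibrium counterintuitive.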

Comment by charlie-steiner on Free will as an appearance to others · 2019-05-23T22:20:50.385Z · score: 4 (2 votes) · LW · GW

If I may toot my own horn: https://www.lesswrong.com/posts/yex7E6oHXYL93Evq6/book-review-consciousness-explained

I'll admit I'm not totally sure what Said Achmiz means by his comparison, though :)