Hierarchical planning: context agents 2020-12-19T11:24:09.064Z
Modeling humans: what's the point? 2020-11-10T01:30:31.627Z
What to do with imitation humans, other than asking them what the right thing to do is? 2020-09-27T21:51:36.650Z
Charlie Steiner's Shortform 2020-08-04T06:28:11.553Z
Constraints from naturalized ethics. 2020-07-25T14:54:51.783Z
Meta-preferences are weird 2020-07-16T23:03:40.226Z
Down with Solomonoff Induction, up with the Presumptuous Philosopher 2020-06-12T09:44:29.114Z
The Presumptuous Philosopher, self-locating information, and Solomonoff induction 2020-05-31T16:35:48.837Z
Life as metaphor for everything else. 2020-04-05T07:21:11.303Z
Meta-preferences two ways: generator vs. patch 2020-04-01T00:51:49.086Z
Gricean communication and meta-preferences 2020-02-10T05:05:30.079Z
Impossible moral problems and moral authority 2019-11-18T09:28:28.766Z
What's the dream for giving natural language commands to AI? 2019-10-08T13:42:38.928Z
The AI is the model 2019-10-04T08:11:49.429Z
Can we make peace with moral indeterminacy? 2019-10-03T12:56:44.192Z
The Artificial Intentional Stance 2019-07-27T07:00:47.710Z
Some Comments on Stuart Armstrong's "Research Agenda v0.9" 2019-07-08T19:03:37.038Z
Training human models is an unsolved problem 2019-05-10T07:17:26.916Z
Value learning for moral essentialists 2019-05-06T09:05:45.727Z
Humans aren't agents - what then for value learning? 2019-03-15T22:01:38.839Z
How to get value learning and reference wrong 2019-02-26T20:22:43.155Z
Philosophy as low-energy approximation 2019-02-05T19:34:18.617Z
Can few-shot learning teach AI right from wrong? 2018-07-20T07:45:01.827Z
Boltzmann Brains and Within-model vs. Between-models Probability 2018-07-14T09:52:41.107Z
Is this what FAI outreach success looks like? 2018-03-09T13:12:10.667Z
Book Review: Consciousness Explained 2018-03-06T03:32:58.835Z
A useful level distinction 2018-02-24T06:39:47.558Z
Explanations: Ignorance vs. Confusion 2018-01-16T10:44:18.345Z
Empirical philosophy and inversions 2017-12-29T12:12:57.678Z
Dan Dennett on Stances 2017-12-27T08:15:53.124Z
Philosophy of Numbers (part 2) 2017-12-19T13:57:19.155Z
Philosophy of Numbers (part 1) 2017-12-02T18:20:30.297Z
Limited agents need approximate induction 2015-04-24T21:22:26.000Z


Comment by charlie-steiner on Hierarchical planning: context agents · 2021-01-23T18:26:13.543Z · LW · GW

Oh wait, are you the first author on this paper? I didn't make the connection until I got around to reading your recent post.

So when you talk about moving to a hierarchical human model, how practical do you think it is to also move to a higher-dimensional space of possible human-models, rather than using a few hand-crafted goals? This necessitates some loss function or prior probability over models, and I'm not sure how many orders of magnitude more computationally expensive it makes everything.

Comment by charlie-steiner on The Problem of the Criterion · 2021-01-22T14:35:24.065Z · LW · GW

This article and comment have used the word "true" so much that I've had that thing happen where when a word gets used too much it sort of loses all meaning and becomes a weird sequence of symbols or sounds. True. True true true. Truetrue true truetruetruetrue.

Comment by charlie-steiner on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-22T03:16:13.620Z · LW · GW

My first idea is, you take your common sense AI, and rather than saying "build me a spaceship, but, like, use common sense," you can tell it "do the right thing, but, like, use common sense." (Obviously with "saying" and "tell" in invisible finger quotes.) Bam, Type-1 FAI.

Of course, whether this will go wrong or not depends on the specifics. I'm reminded of Adam Shimi et al's recent post that mentioned "Ideal Accomplishment" (how close to an explicit goal a system eventually gets) and "Efficiency" (how fast it gets there). If you have a general purpose "common sensical optimizer" that optimizes any goal but, like, does it in a common sense way, then before you turn it on you'd better know whether it's affecting ideal accomplishment, or just efficiency.

That is to say, if I tell it to make me the best spaceship it can or something similarly stupid, will the AI "know that the goal is stupid" and only make a normal spaceship before stopping? Or will it eventually turn the galaxy into spaceships, just taking common-sense actions along the way? The truly idiot-proof common sensical optimizer changes its final destination so that it does what we "obviously" meant, not what we actually said. The flaws in this process seem to determine whether it's trustworthy enough to tell to "do the right thing," or trustworthy enough to tell to do anything at all.

Comment by charlie-steiner on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-22T02:48:54.131Z · LW · GW

I'm a lot less excited about the literature of the world's philosophy than I am about the living students of it.

Of course, there are some choices in designing an AI that are ethical choices, for which there's no standard by which one culture's choice is better than another's. In this case, incorporating diverse perspectives is "merely" a fair way to choose how to steer the future - a thing to do because we want to, not because it solves some technical problem.

But there are also philosophical problems faced in the construction of AI that are technical problems, and I think the philosophy literature is just not going to contain a solution to these problems, because they require highly specific solutions that you're not going to think of if you're not even aware of the problem. You bring up ontological shifts, and I think the Madhyamaka Buddhist sutra you quote is a typical example - it's interesting as a human, especially with the creativity in interpretation afforded to us by hindsight, but the criteria for "interesting as a human" are so much fewer and more lenient than what's necessary to design a goal system that responds capably to ontological shifts.

The Anglo-American tradition of philosophy is in no way superior to Buddhist philosophy on this score. What is really necessary is "bespoke" philosophy oriented to the problems at hand in AI alignment. This philosophy is going to superficially sound more like analytic philosophy than, say, continental philosophy or vedic philosophy, just because of what we need it to do, but that doesn't mean it can't benefit from a diversity of viewpoints and mental toolboxes.

Comment by charlie-steiner on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-21T22:03:19.196Z · LW · GW

Children learn to follow common sense, despite not having (explicit) meta-ethical and meta-normative beliefs at all.

Children also learn right from wrong - I'd be interested in where you draw the line between "An AI that learns common sense" and "An AI that learns right from wrong." (You say this argument doesn't apply in the case of human values, but it seems like you mean only explicit human values, not implicit ones.)

My suspicion, which is interesting to me so I'll explain it even if you're going to tell me that I'm off base, is that you're thinking that part of common sense is to avoid uncertain or extreme situations (e.g. reshaping the galaxy with nanotechnology), and that common sense is generally safe and trustworthy for an AI to follow, in a way that doesn't carry over to "knowing right from wrong." An AI that has learned right from wrong to the same extent that humans learn it might make dangerous moral mistakes.

But when I think about it like that, it actually makes me less trusting of learned common sense. After all, one of the most universally acknowledged things about common sense is that it's uncommon among humans! Merely doing common sense as well as humans seems like a recipe for making a horrible mistake because it seemed like the right thing at the time - this opens the door to the same old alignment problems (like self-reflection and meta-preferences [or should that be meta-common-sense]).


P.S. I'm not sure I quite agree with this particular setting of normativity. The reason is the possibility of "subjective objectivity", where you can make what you mean by "Quality Y" arbitrarily precise and formal if given long enough to split hairs. Thus equipped, you can turn "Does this have quality Y?" into an objective question by checking against the (sufficiently) formal, precise definition.

The point is that the aliens are going to be able to evaluate this formal definition just as well as you. They just don't care about it. Even if you both call something "Quality Y," that doesn't avail you much if you're using that word to mean very different things. (Obligatory old Eliezer post)

Anyhow, I'd bet that xuan is not saying that it is impossible to build an AI with common sense - they're saying that building an AI with common sense is in the same epistemological category as building an AI that knows right from wrong.

Comment by charlie-steiner on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-21T20:51:42.853Z · LW · GW

Huh! How on earth did you come across this abstract?

Comment by charlie-steiner on A Democratic Currency · 2021-01-19T11:10:51.582Z · LW · GW

Pegging to a commodity is not different from convincing people to sell things in exchange for the currency, though. If I start a new currency pegged to gold, what I'm doing is saying "I promise to sell you gold for this currency." This promise is never of infinite strength, and so it can be evaluated with the same logic as the promise my supermarket makes to sell me milk in exchange for dollars.

If the promises of the supermarket are strong, there is no need for you to promise that you'll sell me gold for dollars.

So, how could I get my supermarket to accept UBI-coin, in addition to dollars? Well, what if I promised to pay them some dollars to do this as a promotional thing, and also promised that I would be going to other local businesses and giving them the same deal, so that the decision-makers of the supermarket would be able to spend their UBI-coin on local goods and services? If people took this deal, this would allow for UBI-coin to function as a currency without a need for me to make a "central" promise that I would sell gold for UBI-coin.

Comment by charlie-steiner on A Democratic Currency · 2021-01-19T08:47:32.071Z · LW · GW

Currency derives its "intrinsic value" from being able to buy things you want (and its currency-like properties from lots of people feeling the same way). If I dropped you off on a deserted island with a ton of gold, it would be practically useless to you because there would be nobody to exchange it with.

So I guess your question becomes "How would you convince people to buy and sell goods for UBI-coin?"

Comment by charlie-steiner on Literature Review on Goal-Directedness · 2021-01-18T16:40:27.011Z · LW · GW

The little quizzes were highly effective in getting me to actually read the post :)

I think depending on what position you take, there are differences in how much one thinks there's "room for a lot of work in this sphere." The more you treat goal-directedness as important because it's a useful category in our map for predicting certain systems, the less important it is to be precise about it. On the other hand, if you want to treat goal-directedness in a human-independent way or otherwise care about it "for its own sake" for some reason, then it's a different story.

Comment by charlie-steiner on Why I'm excited about Debate · 2021-01-17T16:19:16.236Z · LW · GW

Good question. There's a big roadblock to your idea as stated, which is that asking something to define "alignment" is a moral question. But suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?"

I don't know - I think it still depends on how bad humans are as judges of arguments - we've made the domain more objective, but maybe there's some policy of argumentation that still wins by what we would consider cheating. I can imagine being convinced that it would work by seeing Debates play out with superhuman litigators, but since that's a very high bar, maybe I should apply more creativity to my expectations.

Comment by charlie-steiner on Why I'm excited about Debate · 2021-01-17T08:24:47.634Z · LW · GW

I think the Go example really gets to the heart of why I think Debate doesn't cut it.

The reason Go is hard is that it has a large game tree despite simple rules. When we treat an AI game as information about the value of a state of the Go board, we know exactly what the rules are and how the game should be scored, the superhuman work the AIs are doing is in searching this game tree that's too big for us. The adversarial gameplay provides a check that the search through the game tree is actually finding high-scoring policies.
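To make the "adversarial check" concrete, here's a minimal sketch (my own toy illustration, not from the post) of exhaustive game-tree search: a position's value assumes the opponent punishes any overreach, which is what keeps the search honest.

```python
def minimax(node, maximizing=True):
    """Exhaustive game-tree search: leaves are scores, internal nodes are
    lists of children; players alternate between maximizing and minimizing."""
    if not isinstance(node, list):
        return node
    child_values = [minimax(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)

# Toy tree (made-up scores): the maximizer picks a branch, the minimizer replies.
tree = [[3, 12], [2, 8], [14, 1]]
print(minimax(tree))  # 3: the tempting 14-leaf branch gets held to 1 by the opponent
```

The maximizer can't claim the 14 leaf, because the adversary's reply caps that branch at 1 - the adversarial structure is what certifies that the returned value is actually achievable.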

What does this framework need to apply to moral arguments? That humans "know the rules" of argumentation, that we can recognize good arguments when we see them, and that what we really need help with is searching the game tree of arguments to find high-scoring policies of argumentation.

This immediately should sound a little off. If humans have any exploits (or phrased differently, if there are places where our meta-preferences and our behavior conflict), then this search process will try to find them. We can imagine trying to patch humans (e.g. giving them computer assistants), but this patching process has to already be the process of bringing human behavior in line with human meta-preferences! It's the patching process that's doing all the alignment work, reducing the Debate part to a fancy search for high-approval actions.

No, the dream of Debate is that it's a game where human meta-preferences and behavior are already aligned. For all places where they diverge, the dream is that there's some argument that will point this out and permanently fix it, and that this inconsistency-resolution process does not itself violate too many of our meta-preferences. That Debate is fair like Go is fair - each move is incremental, you can't place a Go stone that changes the layout of the board to make it impossible for your opponent to win.

Comment by charlie-steiner on Transparency and AGI safety · 2021-01-15T07:36:52.318Z · LW · GW

Dropout makes interpretation easier because it disincentivizes complicated features where you can only understand the function of the parts in terms of their high-order correlations with other parts. This is because if a feature relies on such correlations, it will be fragile to some of the pieces being dropped out.

Anti-dropout promotes consolidation of similar features into one, but it also incentivizes that one feature to be maximally complicated and fragile.

Re: first idea. Yeah, something like that. Basically just an attempt at formalization of "functionally similar neurons," so that when you go to drop out a neuron, you actually drop out all functionally similar ones.

Comment by charlie-steiner on Transparency and AGI safety · 2021-01-12T11:38:46.615Z · LW · GW

Re: non-agenty AGI. The typical problem is that there are incentives for individual actors to build AI systems that pursue goals in the world. So even if you postulate non-agenty AGI, you then have to further figure out why nobody has asked the Oracle AI "What's the code to an AI that will make me rich?" or asked it for the motor output of a robot given various sense data and natural-language goals, then used that output to control a robot.

Comment by charlie-steiner on Transparency and AGI safety · 2021-01-12T11:34:31.977Z · LW · GW


I'm reminded a bit of the reason why Sudoku and quantum computing are difficult: the possibilities you have to track are not purely local, they can be a nonlocal combination of different things. General NNs seem like they'd be at least NP to interpret.

But this is what dropout is useful for, penalizing reliance on correlations. So maybe if you're having trouble interpreting something you can just crank up the dropout parameters. On the other hand, dropout also promotes redundancy, which might make interpretation confusing - perhaps there's something similar to dropout that's even better for interpretability.

Edit for unfiltered ideas:

You could automatically sample an image, find neurons excited, sample neurons, sample images based on how much they excite that neuron, etc, until you end up with a sampled pool of similar images and similar neurons. Then you drop out all similar neurons.

You could try anti-dropout: punishing the NN for redundancy and rewarding it for fragility/specificity. However, to avoid the incentive to create fine tuned activation/inhibition pairs, you only use positive activations for this step.

Comment by charlie-steiner on Are UFOs just drones? · 2021-01-12T04:41:47.294Z · LW · GW

Hm, so roughly speaking, how would you break down the probabilities of some different explanations, given a generic UFO sighting? E.g. just a shadow or reflection, natural object in the sky, man-made stationary object, human-piloted airplane, drone, actually aliens? Is there some common sub-type of UFO sighting that you think has low probability of all non-drone explanations, even accounting for all the faults of human memory and character?

Comment by charlie-steiner on Johannes Kepler, Sun Worshipper · 2021-01-09T07:50:21.494Z · LW · GW

\[T]/  ☀️

Comment by charlie-steiner on Are UFOs just drones? · 2021-01-09T02:56:46.789Z · LW · GW

Do UFOs need a single explanation? Just because we have one label on our map doesn't mean the underlying territory is unified.

Comment by charlie-steiner on Moral intuitions are surprisingly variable · 2021-01-08T07:45:16.247Z · LW · GW

I really need to get around to writing more anti-moral-uncertainty posts :P

What it functionally is here is aggregating different peoples' preferences via linear combination. And this is fine! But it's not totally unobjectionable - some humans may in fact object to it (not just to the particular weights, which are of course subjective). So moral uncertainty isn't a solution to meta-moral disagreement, it's a framework you can use only after you've already resolved it to your own satisfaction and decided that you want to aggregate linearly.
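For concreteness, the aggregation in question is just this (a minimal sketch; the weights are where the subjectivity lives):

```python
def aggregate(utilities, weights):
    """Linear aggregation: group utility of an outcome is a weighted
    sum of individual utilities."""
    def group_utility(outcome):
        return sum(w * u(outcome) for u, w in zip(utilities, weights))
    return group_utility

# Two people with exactly opposed preferences over outcomes "a" and "b".
alice = {"a": 1.0, "b": 0.0}.get
bob = {"a": 0.0, "b": 1.0}.get
u = aggregate([alice, bob], weights=[0.7, 0.3])
print(u("a"), u("b"))  # 0.7 0.3 -- whoever chose the weights chose the outcome
```

Nothing in the formalism tells you the weights, or that summing linearly was the right move in the first place - that's the part that has to be settled before "moral uncertainty" machinery can run.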

Comment by charlie-steiner on Moral intuitions are surprisingly variable · 2021-01-08T07:13:32.722Z · LW · GW

This is definitely an interesting topic (I recently listened to an interview with Joseph Henrich, author of The WEIRDest People in the World). It's a serious problem if you're trying to find a unique True Morality and ground it in human intuitions, but if we've managed to move past that, then there's still an interesting point here; the philosophical problem just gets turned inwards.

Of course different humans are allowed to prefer different things and even have different preferences about preference aggregation - so if you're trying to build some decisionmaking procedure that aggregates human preferences, rather than being able to delegate the choice of how to do that to some unique True Meta-Morality, you have to do the philosophical work of figuring out what you want out of this aggregation process.

Comment by charlie-steiner on What's up with the Targeting Aging with Metformin (TAME) trial? · 2021-01-08T06:53:23.998Z · LW · GW

Huh, that's concerning.

Comment by charlie-steiner on Hierarchical planning: context agents · 2021-01-07T13:05:42.439Z · LW · GW

Sorry for being slow :) No, I haven't read anything of Bratman's. Should I? The synopsis looks like it might have some interesting ideas but I'm worried he could get bogged down in what human planning "really is" rather than what models are useful.

I'd totally be happy to chat either here or in PMs. Full Bayesian reasoning seems tricky if the environment is complicated enough to make hierarchical planning attractive - or do you mean optimizing a model for posterior probability (the prior being something like MML?) by local search?

I think one interesting question there is if it can learn human foibles. For example, suppose we're playing a racing game and I want to win the race, but fail because my driving skills are bad. How diverse a dataset about me do you need to actually be able to infer that a) I am capable of conceptualizing how good my performance is, b) I wanted it to be good, and c) It wasn't good, from a hierarchical perspective, because of the lower-level planning faculties I have. I think maybe you could actually learn this only from racing game data (no need to make an AGI that can ask me about my goals and do top-down inference), so long as you had diverse enough driving data to make the "bottom-up" generalization that my low-level driving skill can be modeled as bad almost no matter the higher-level goal, and therefore it's simplest to explain me not winning a race by taking the bad driving I display elsewhere as a given and asking what simple higher-level goal fits on top.
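Here's a minimal sketch of that inference (the hypothesis names and every likelihood number are mine, purely for illustration): once the diverse data pins down "drives badly no matter what," the best joint explanation of losing is wanting to win with bad low-level skills.

```python
from itertools import product

# Joint hypotheses: what does the player want, and how skilled are they?
goals, skills = ["win", "lose"], ["good", "bad"]
prior = {h: 0.25 for h in product(goals, skills)}

def likelihood(goal, skill):
    # Toy numbers (mine): skill caps how well intentions become outcomes.
    p_win = (0.9 if skill == "good" else 0.2) if goal == "win" else 0.05
    lost_this_race = 1 - p_win
    drove_badly_elsewhere = 0.9 if skill == "bad" else 0.1    # the diverse dataset
    kept_retrying_after_crashes = 0.9 if goal == "win" else 0.2
    return lost_this_race * drove_badly_elsewhere * kept_retrying_after_crashes

posterior = {h: prior[h] * likelihood(*h) for h in prior}
z = sum(posterior.values())
posterior = {h: p / z for h, p in posterior.items()}
best = max(posterior, key=posterior.get)
print(best)  # ('win', 'bad'): wanted to win, but drives badly
```

Without the cross-context driving data, "wanted to lose" explains the loss just as well - it's the bottom-up generalization about skill that breaks the tie.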

Comment by charlie-steiner on 2021 New Year Optimization Puzzles · 2021-01-02T02:54:49.052Z · LW · GW

FYI, I made my code a little more optimal and ran it overnight - I didn't find any scores smaller than what I already knew about, after searching through everything with only up to 1 square root and 2 factorials, never using any numbers larger than 16!.

Comment by charlie-steiner on What Are Some Alternative Approaches to Understanding Agency/Intelligence? · 2020-12-30T08:10:25.601Z · LW · GW

It seems a bit arrogant to just say "what I've been working on," but on the other hand, the things I've been working on have obviously often been my best ideas!

Right now I'm still thinking about how to allow for value specification in hierarchical models. There are two flanks to this problem: the problem of alien concepts and the problem of human underdetermination.

The problem of alien concepts is relatively well-understood: we want the AI to generalize in a human-like way, which runs into trouble if there are "alien concepts" that predict the training data well but are unsafe to try to maximize. Solving this problem looks like skillful modeling of an environment that includes humans, progress in interpretability, and better learning from human feedback.

The problem of human underdetermination is a bit less appreciated: human behavior underdetermines a utility function, in the sense that you could fit many utility functions to human behavior, all equally well. But there's simultaneously a problem of human inconsistency with intuitive desiderata. Solving this problem looks like finding ways to model humans that strike a decent balance between our incompatible desiderata, or ways to encode and insert our desiderata to avoid "no free lunch" problems in general models of environments that contain humans. Whereas a lot of good progress has been made on the problem of alien concepts using fairly normal ML methods, I think the problem of human underdetermination requires a combination of philosophy, mathematical foundations, and empirical ML research.
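A tiny illustration of the underdetermination (my own toy example): several reward functions, including the classic degenerate "indifferent" one, are all perfectly consistent with the same observed choice.

```python
actions = ["work", "rest"]
observed_choice = "work"  # all the behavioral data we have

# Candidate reward functions (toy numbers), each fitting the data equally well:
candidates = {
    "values_work": {"work": 1.0, "rest": 0.0},
    "mild_preference": {"work": 0.6, "rest": 0.5},
    "indifferent": {"work": 0.0, "rest": 0.0},  # rationalizes any behavior at all
}

for name, R in candidates.items():
    # The choice is consistent with R iff it attains the maximum reward.
    consistent = R[observed_choice] == max(R[a] for a in actions)
    print(name, consistent)  # True for every candidate
```

Behavior alone can't separate these; whatever extra assumptions you use to break the tie are exactly the desiderata that have to come from somewhere other than the data.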

Comment by charlie-steiner on AXRP Episode 1 - Adversarial Policies with Adam Gleave · 2020-12-30T07:17:58.880Z · LW · GW

Off to a good start!

Comment by charlie-steiner on Vanessa Kosoy's Shortform · 2020-12-29T08:54:42.780Z · LW · GW

What does it mean to have an agent in the information-state?

Nevermind, I think I was just looking at it with the wrong class of reward function in mind.

Comment by charlie-steiner on Vanessa Kosoy's Shortform · 2020-12-27T21:52:13.888Z · LW · GW

Ah, okay, I see what you mean. Like how preferences are divisible into "selfish" and "worldly" components, where the selfish component is what's impacted by a future simulation of you that is about to have good things happen to it.

(edit: The reward function in AMDPs can either be analogous to "worldly" and just sum the reward calculated at individual timesteps, or analogous to "selfish" and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)

I brought up the histories->states thing because I didn't understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?

In fact, to me the example is still somewhat murky, because you're talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions - so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).
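A toy deterministic sketch of that lottery example (the dynamics and numbers are mine): tracking expected populations shows the fraction of agents in the winning state outrunning the non-anthropic single-lineage prediction, simply because winners multiply.

```python
def population_fraction_winning(p_win, T):
    """Expected populations: losing agents gamble each step; winning agents copy."""
    losers, winners = 1.0, 0.0
    for _ in range(T):
        losers, winners = losers * (1 - p_win), 2 * winners + losers * p_win
    return winners / (winners + losers)

def single_lineage_prob(p_win, T):
    """Non-anthropic Bayesian prediction for one agent that never copies."""
    return 1 - (1 - p_win) ** T

p, T = 0.5, 3
print(population_fraction_winning(p, T))  # ~0.955
print(single_lineage_prob(p, T))          # 0.875
```

Sampling a random agent at time T is biased toward winners relative to following a single lineage - which is the sense in which the copying policy inflates the subjective probability of the gamble paying off.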

Comment by charlie-steiner on Vanessa Kosoy's Shortform · 2020-12-27T01:40:41.218Z · LW · GW

Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states  with the memory states , etc. The action takes a robot in  to memory state , and a robot in  to one robot in  and another in .

(Skip this paragraph unless the specifics of what's going on aren't obvious: given a transition distribution $P(s'^* | s, \pi)$ ($P$ being the distribution over sets of states $s'^*$ given starting state $s$ and policy $\pi$), we can define the memory transition distribution $P_{mem}(h'^* | h, \pi)$ given policy $\pi$ and starting "memory state" $h \in S^*$ (note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of $h$ into the transition distribution as the current state. Then for each $s'^*$ in the domain, for each element of $s'^*$ we concatenate that element onto the end of $h$ and collect these $h'$ into a set $h'^*$, which is assigned the same probability $P(s'^*)$.)

So now at time t=2, if you sample a robot, the probability that its state begins with  is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it's just that we've turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.

I'm still confused about what you mean by "Bayesian hypothesis" though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

Comment by charlie-steiner on Vanessa Kosoy's Shortform · 2020-12-26T08:50:16.822Z · LW · GW

Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a "memory MDP" that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.
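In the non-anthropic case the unrolling is just bookkeeping; here's a minimal sketch with toy numbers (handling sets of successor states, as in a real AMDP, would need an extra layer on top of this):

```python
def unroll_step(P, dist):
    """One step of Bayesian tracking in the unrolled 'memory MDP':
    a state is a whole history, and a transition appends the new state."""
    new_dist = {}
    for history, prob in dist.items():
        for s2, p in P[history[-1]].items():
            h2 = history + (s2,)
            new_dist[h2] = new_dist.get(h2, 0.0) + prob * p
    return new_dist

# A toy two-state chain (my numbers): A can stay put or move to absorbing B.
P = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
dist = {("A",): 1.0}
for _ in range(2):
    dist = unroll_step(P, dist)
print(dist)  # {('A','A','A'): 0.25, ('A','A','B'): 0.25, ('A','B','B'): 0.5}
```

Questions about an agent's history become ordinary questions about which (history-)state it occupies, at the cost of a state space that grows with time.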

Comment by charlie-steiner on Open & Welcome Thread - December 2020 · 2020-12-22T05:59:43.888Z · LW · GW

First I want to make sure we're splitting off the personal from the aesthetic here. By "the aesthetic," I mean the moral value from a truly outside perspective - like asking the question "if I got to design the universe, which way would I rather it be?" You don't anticipate being this person, you just like people from an aesthetic standpoint and want your universe to have some. For this type of preference, you can prefer the universe to be however you'd like (:P) including larger vs. smaller computers.

Second is the personal question. If the person being simulated is me, what would I prefer? I resolved these questions to my own satisfaction in Anthropic Selfish Preferences as a Modification of TDT, but I'm not sure how helpful that post actually is for sharing insight.

Comment by charlie-steiner on Open & Welcome Thread - December 2020 · 2020-12-22T05:42:58.946Z · LW · GW

I think the main complaint about "signalling" is when it's a lie. E.g. if there's some product that claims to be sophisticated, but is in fact not a reliable signal of sophistication (being usable without sophistication at all). Then people might feel affronted by people who propagate the advertising claims because of honesty-based aesthetics. I'm happy to call this an important difference from non-lie signalling, and also from other aesthetic preferences.

Oh, and there's wasteful signalling, can't forget about that either.

Comment by charlie-steiner on Is there a community aligned with the idea of creating species of AGI systems for them to become our successors? · 2020-12-21T08:10:33.460Z · LW · GW


"Free AI" is still something that humans would choose to build - you can't just heap a bunch of silicon into a pile and tell it "do what you want!" (Unless what silicon really wants is to sit very still.) So it's a bit of a weird category, and I think most "regulars" in the field don't think in terms of it.

However, I think your question can be fixed by asking about whether there's work about treating the AIs as moral ends in themselves, rather than as means to helping humans. Many philosophers adjacent to AI have written vague things about this, but I'm not sure of anything that's both good and non-vague.

Part of the issue is that this runs into an important question in meta-ethics: is it morally mandatory that we create more happy people, until the universe is as full as physically possible? And if not, where do you draw the line? The answer to this question is that our preferences about population ethics are a mixture of game theory and aesthetic preference - where by "aesthetic preference" I mean that if you find the idea of a galaxy-spanning civilization aesthetically pleasing, you don't need to justify this in terms of deep moral rules, that can just be how you'd prefer the universe to be.

So basically, I think you're asking "Is there some community of people thinking about AI that all find a future inherited by AIs aesthetically pleasing?"

And after all this: no, sorry, I don't know of such a community.

Comment by charlie-steiner on Hierarchical planning: context agents · 2020-12-19T23:37:07.447Z · LW · GW

Yeah, I agree, it seems both more human-like and more powerful to have a dynamical system where models are activating other models based on something like the "lock and key" matching of neural attention. But for alignment purposes, it seems to me that we need to not only optimize models for usefulness or similarity to actual human thought, but also for how similar they are to how humans think of human thought - when we imagine an AI with the goal of doing good, we want it to have decision-making that matches our understanding of "doing good." The model in this post isn't as neat and clean as utility maximization, but a lot of the overly-neat features have to do with making it more convenient to talk about it having a fixed, human-comprehensible goal.

Re: creativity, I see how you'd get that from what I wrote but I think that's only half right. The model laid out in this post is perfectly capable of designing new solutions to problems - it just tends to do it by making a deliberate choice to take a "design a new solution" action. Another source of creativity is finding surprising solutions to difficult search problems, which is perfectly possible in complicated contexts.

Another source of creativity is compositionality, which you can have in this formalism by attributing it to the transition function putting you into a composed context. Can you learn this while trying to mimic humans? I'm not sure, but it seems possible.

We might also attribute a deficit in creativity to the fact that the reward functions are only valid in-context, and aren't designed to generalize to new states, even if there were really apt ways of thinking about the world that involved novel contexts or adding new states to existing contexts. And maybe this is the important part, because I think this is a key feature, not at all a bug.

Comment by charlie-steiner on Why are delicious biscuits obscure? · 2020-12-08T07:13:38.316Z · LW · GW

There are bakeries all over the place that sell good cookies (I'd guess about 1 bakery with good cookies per 30,000 people). But your supermarket probably doesn't stock them because of high expense and a shelf life of only a few days (all that golden syrup is hygroscopic, which I suspect would be the primary culprit, though maybe fragility or the fact that real butter spoils would be more important).

Comment by charlie-steiner on D&D.Sci · 2020-12-06T21:22:19.907Z · LW · GW

There are 3 pairs of duplicate stats in the dataset. (1011, 4696) - different results. (3460, 5146) - different results. (4399, 5963) - same result. This reassures me that there was some random element and not some Zendo-like ruleset, unless we're in the really complicated case where people's outcome was a deterministic function of their surrounding graduating class.
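A quick sketch of that check (the stat values below are placeholders I made up; only the row IDs and the same/different pattern come from the dataset):

```python
from collections import defaultdict

# Group rows by their stat tuple and compare outcomes within each group.
# Stat values here are stand-ins; only the row IDs and the observed
# same/different pattern match the actual dataset.
rows = {
    1011: ((12, 14, 9), "pass"),
    4696: ((12, 14, 9), "fail"),
    3460: ((8, 16, 10), "pass"),
    5146: ((8, 16, 10), "fail"),
    4399: ((15, 11, 13), "pass"),
    5963: ((15, 11, 13), "pass"),
}

by_stats = defaultdict(list)
for row_id, (stats, result) in rows.items():
    by_stats[stats].append((row_id, result))

for stats, members in by_stats.items():
    if len(members) > 1:
        same = len({result for _, result in members}) == 1
        print(sorted(i for i, _ in members),
              "same result" if same else "different results")
```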

Comment by charlie-steiner on SETI Predictions · 2020-12-02T08:45:01.627Z · LW · GW

The twist at the end though. I had to go back and re-think my answers :P

Comment by charlie-steiner on Latent Variables and Model Mis-Specification · 2020-11-19T02:29:40.910Z · LW · GW

Just ended up reading your paper (well, a decent chunk of it), so thanks for the pointer :) 

Comment by charlie-steiner on The ethics of AI for the Routledge Encyclopedia of Philosophy · 2020-11-19T00:08:36.366Z · LW · GW

Congrats! Here are my totally un-researched first thoughts:

Pre-1950 History: Speculation about artificial intelligence (if not necessarily in the modern sense) dates back extremely far. Brazen head, Frankenstein, the mechanical turk, R.U.R. robots. Basically all of this treats the artificial intelligence as essentially a human, though the brazen head mythology is maybe more related to deals with djinni or devils, but basically all of it (together with more modern science fiction) can be lumped into a pile labeled "exposes and shapes human intuition, but not very serious".

Automating the work of logic was of interest to logicians such as Hilbert, Peirce, and Frege, so there might be interesting discussions related to that. Obviously you'll also mention modern electronic computing, Turing, Good, Von Neumann, Wiener, etc.

There might actually be a decent amount already written about the ethics of AI for central planning of economies. Thinking about Wiener, and also about the references of the recent book Red Plenty, about Kantorovich and halfhearted soviet attempts to use optimization algorithms on the economy.

The most important modern exercise of AI ethics is not trolley problems for self-driving cars (despite the press), but interpretability. If the law says you can't discriminate based on race, for instance, and you want to use AI to make predictions about someone's insurance or education or what have you, that AI had better not only not discriminate, it has to interpretably not discriminate in a way that will stand up in court if necessary. Simultaneously, there's the famous story of Target sending out advertisements for baby paraphernalia, and Facebook targets ads to you that, if they don't use your phone's microphone, make good enough predictions to make people suspect they do. So the central present issue of AI ethics is the control of information - both its practicality in places we've already legislated that we want certain information not to be acted on, and examining the consequences in places where there's free rein.

And obviously autonomous weapons, the race to detect AI-faked images, the ethics of automation of large numbers of jobs, and then finally maybe we can get to the ethical issues raised by superhuman AI.

Comment by charlie-steiner on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2020-11-18T23:28:37.070Z · LW · GW

Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.

If you don't agree with me on this, why didn't you reply when I spent about six months just writing posts that were all variations of this idea? Here's Scott Alexander making the basic point.

It's like... is there a True rational approximation of pi? Well, 22/7 is pretty good, but 355/113 is more precise, if harder to remember. And just 3 is really easy to remember, but not as precise. And of course there's the arbitrarily large "approximation" that is 3.141592... Depending on what you need to use it for, you might have different preferences about the tradeoff between simplicity and precision. There is no True rational approximation of pi. True Human Values are similar, except instead of one tradeoff that you can make it's approximately one bajillion.
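To make the tradeoff concrete, here's a tiny sketch comparing those three approximations (errors shrink from roughly 1e-1 down to roughly 3e-7 as the fractions get harder to remember):

```python
import math
from fractions import Fraction

# Simplicity (roughly, the size of the denominator) vs. precision for
# rational approximations of pi: more precision costs more complexity,
# and no single point on the tradeoff is the "True" one.
for approx in (Fraction(3), Fraction(22, 7), Fraction(355, 113)):
    error = abs(float(approx) - math.pi)
    print(f"{str(approx):>8}: error {error:.1e}")
```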

  • we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense

I have no idea why this would be tied to non-Cartesian-ness.

If a Cartesian agent was talking about their values, they could just be like "you know, those things that are specified as my values in the logic-stuff my mind is made out of." (Though this assumes some level of introspective access / genre savviness that needn't be assumed, so if you don't want to assume this then we can just say I was mistaken.) When a human talks about their values they can't take that shortcut, and instead have to specify values as a function of how they affect their behavior. This introduces the dependency on how we're breaking down the world into categories like "human behavior."

  • Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans [...] and satisfy the modeled values.

How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong.

Well, if there were unique values, we could say "maximize the unique values." Since there aren't, we can't. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we're going to have to do with that wrong-seeming.

Comment by charlie-steiner on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2020-11-18T18:39:15.558Z · LW · GW

I think that one of the problems in this post is actually easier in the real world than in the toy model.

In the toy model the AI has to succeed by maximizing the agent's True Values, which the agent is assumed to have as a unique function over its model of the world. This is a very tricky problem, especially when, as you point out, we might allow the agent's model of reality to be wrong in places.

But in the real world, humans don't have a unique set of True Values or even a unique model of the world - we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense.

Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans (and we might have desiderata about how it does that modeling and what the end results should contain), and satisfy the modeled values. And in some ways this is actually a bit reassuring, because I'm pretty sure that it's possible to get better final results on this problem than on learning the toy model agent's True Values - maybe not in the simplest case, but as you add things like lack of introspection, distributional shift, meta-preferences like identifying some behavior as "bias," etc.

Comment by charlie-steiner on Learning Normativity: A Research Agenda · 2020-11-13T00:56:16.238Z · LW · GW

I'm pretty on board with this research agenda, but I'm curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.

And on the assumption that you have no idea what I'm referring to, here's the link to my post.

There are a couple different directions to go from here. One way is to try to collapse the recursion. Find a single agent-shaped model of humans that is (or approximates) a fixed point of this model-ratification process (and also hopefully stays close to real humans by some metric), and use the preferences of that. This is what I see as the endgame of the imitation / bootstrapping research.

Another way might be to imitate communication, and find a way to use recursive models such that we can stop the recursion early without much loss in effectiveness. In communication, the innermost layer of the model can be quite simplistic, and then the next is more complicated by virtue of taking advantage of the first, and so on. At each layer you can do some amount of abstracting away of the details of previous layers, so by the time you're at layer 4 maybe it doesn't matter that layer 1 was just a crude facsimile of human behavior.

Thinking specifically about this UTAA monad thing, I think it's a really clever way to think about what levers we have access to in the fixed-point picture. (If I was going to point out one thing it's lacking, it's that it's a little hazy on whether you're supposed to model V as now having meta-values about the state of the entire recursive tree of UTAAs, or whether your Q function is now supposed to learn about meta-preferences from some outside data source.) But it retains the things I'm worried about from this fixed-point picture, which is basically that I'm not sure it buys us much of anything if the starting point isn't benign in a quite strong sense.

Comment by charlie-steiner on Communication Prior as Alignment Strategy · 2020-11-13T00:27:40.446Z · LW · GW

Yeah, this is basically CIRL, when the human-model is smart enough to do Gricean communication. The important open problems left over after starting with CIRL are basically "how do you make sure that your model of communicating humans infers the right things about human preferences?", both due to very obvious problems like human irrationality, and also due to weirder stuff like the human intuition that we can't put complete confidence in any single model.

Comment by charlie-steiner on [AN #125]: Neural network scaling laws across multiple modalities · 2020-11-11T23:54:41.120Z · LW · GW

Cool papers, Flo!

Comment by charlie-steiner on Building AGI Using Language Models · 2020-11-09T22:50:28.594Z · LW · GW

Sure. It might also be worth mentioning multimodal uses of the transformer algorithm, or the use of verbal feedback as a reward signal to train reinforcement learning agents.

As for whether this is a fire alarm, this reminds me of the old joke: "In theory, there's no difference between theory and practice. But in practice, there is."

You sort of suggest that in theory, this would lead to an AGI, but that in practice, it wouldn't work. Well in theory, if it fails in practice that means you didn't use a good enough theory :)

Comment by charlie-steiner on Ethics in Many Worlds · 2020-11-07T18:01:28.879Z · LW · GW

If we're cosmopolitan, we might expect that the wavefunction of the universe at the current time contains more than just us. In fact, the most plausible state is that it has some amount (albeit usually tiny) of every possible state already.

And so there is no good sense in which time evolution of the universe produces "more" of me. It doesn't produce new states with me in them, because those states already exist, there's just probably not much quantum measure in them. And it doesn't produce new quantum measure out of thin air - it only distributes what I already have.
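A toy version of that last point, assuming the simplest possible two-branch picture: a unitary step redistributes amplitude between branches, but the total measure is conserved, not created.

```python
import math

# Two "branches" with complex amplitudes; the total quantum measure is
# the sum of squared magnitudes. A unitary update (here a plain rotation)
# moves measure between branches without creating or destroying any.
state = [complex(3 / 5), complex(4 / 5)]  # total measure = 1
theta = 0.7
u = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

new_state = [u[0][0] * state[0] + u[0][1] * state[1],
             u[1][0] * state[0] + u[1][1] * state[1]]

measure = lambda s: sum(abs(a) ** 2 for a in s)
print(measure(state), measure(new_state))  # both ~1.0, up to rounding
```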

Comment by charlie-steiner on Draft papers for REALab and Decoupled Approval on tampering · 2020-11-06T19:39:42.772Z · LW · GW

Very interesting. Naturalizing feedback (as opposed to directly accessing True Reward) seems like it could lead to a lot of desirable emergent behaviors, though I'm somewhat nervous about reliance on a handwritten model of what reliable feedback is.

Comment by charlie-steiner on Defining capability and alignment in gradient descent · 2020-11-06T09:32:24.418Z · LW · GW

Interesting post. Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher, and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely that we expect them to be worse in the short term (a "strong alignment" question).
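A minimal sketch of the kind of annealing I mean (the schedules and the acceptance rule here are illustrative choices, not anything canonical):

```python
import math
import random

def anneal(loss, x0, steps=20_000, seed=0):
    """Random jumps: bigger when loss is high, rarer/smaller over time."""
    rng = random.Random(seed)
    x = best = x0
    for t in range(1, steps + 1):
        temperature = t ** -0.5                # jump rate decays over time
        scale = temperature * (1.0 + loss(x))  # jump further at high loss
        candidate = x + rng.gauss(0.0, scale)
        delta = loss(candidate) - loss(x)
        # Keep improvements; sometimes keep regressions, less often when cold.
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            x = candidate
        if loss(x) < loss(best):
            best = x
    return best

# A bumpy 1-D loss: plain gradient descent can stall in a local dip,
# but random jumps can escape.
bumpy = lambda x: (x - 2.0) ** 2 + 0.3 * math.sin(10.0 * x)
print(anneal(bumpy, x0=-5.0))
```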

Comment by charlie-steiner on Sleeping Julia: Empirical support for thirder argument in the Sleeping Beauty Problem · 2020-11-03T13:45:42.912Z · LW · GW

We can distinguish two cases - one case where there is some physical difference out there that you could find if you looked for it, and another case (e.g. trying to put probability distributions over what the universe looks like outside our lightcone) where you have different theories that really don't have any empirical consequence.

In the first case, I don't think it's begging the question at all to say that you should have some probability distribution over those future empirical results, because the best part of probability distributions is how they capture what we expect about future empirical results. This should not stop working just because there might be someone next door who has the same memories as me. And absolutely this is about epistemics. We can phrase the Sleeping Beauty problem entirely in terms of ordinary empirical questions about the outside world - if you can give me a probability distribution over what time it is and just call it ordinary belief, Sleeping Beauty can give me a probability distribution over what day it is and just call it ordinary belief. You can use the same reasoning process.

In the case where there is no empirical difference, then yes, I think it's ultimately about Solomonoff induction, which is significantly more subjective (allowing a choice of "programming language" that can change what you think is likely, with no empirical evidence to ever change your mind). But again this isn't about practical consequences. If we're in a simulation (I'm somewhat doubtful on the ancestor simulation premise, myself), I don't think the right answer is "somehow fool ourselves into thinking we're not in a simulation so we can take good actions." I'd rather correctly guess whether I'm in a simulation and then take good actions anyhow.

Comment by charlie-steiner on Sleeping Julia: Empirical support for thirder argument in the Sleeping Beauty Problem · 2020-11-03T06:20:39.985Z · LW · GW

I agree on the first bit, but on the second, anthropic uncertainty can actually be resolved into regular uncertainty just fine - just treat yourself as "known," and the rest of the universe as unknown. When I'm not sure what time it is, I look at a clock - we could call this "locating myself," but we could equally well call it "locating the universe." Similarly, we can cash out Sleeping Beauty's anthropic uncertainty into empirical questions about the universe - e.g. what will I see if I look at the calendar? Or in the version with copies, is there a copy of me next door that I can go talk to?

These are perfectly good empirical questions, and all the standard reasons for why probabilities are good things to have (Cox's theorem in particular here) apply.
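This framing can be simulated directly, as a sketch (counting per awakening rather than per experiment is exactly what yields the thirder answer):

```python
import random

# Run many experiments: heads -> one awakening (Monday);
# tails -> two awakenings (Monday and Tuesday). Then ask ordinary
# empirical questions from the perspective of a random awakening.
rng = random.Random(0)
awakenings = []
for _ in range(100_000):
    coin = rng.choice(["heads", "tails"])
    days = ["Mon"] if coin == "heads" else ["Mon", "Tue"]
    awakenings.extend((coin, day) for day in days)

p_heads = sum(c == "heads" for c, _ in awakenings) / len(awakenings)
p_tue = sum(d == "Tue" for _, d in awakenings) / len(awakenings)
print(f"P(heads) per awakening ~ {p_heads:.2f}")  # ~ 1/3
print(f"P(it's Tuesday)        ~ {p_tue:.2f}")    # ~ 1/3
```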

Comment by charlie-steiner on Additive Operations on Cartesian Frames · 2020-10-30T19:42:27.507Z · LW · GW

Typo in the definition of product: `b \cdot e` should be `b \star e`.

Comment by charlie-steiner on Security Mindset and Takeoff Speeds · 2020-10-29T03:48:25.768Z · LW · GW

I also agree that direct jumps in capability due to research insight are rare. But in part I think that's just because things get tried at small scale first, and so there's always going to be some scaling-up period where the new insight gets fed more and more resources, eventually outpacing the old state of the art. From a coarse-grained perspective GPT-2 relative to your favorite LSTM model from 2018 is the "jump in capability" due to research insight, it just got there in a not-so-discontinuous way.

Maybe you're optimistic that in the future, everyone will eventually be doing safety checks of their social media recommender algorithms or whatever during training. But even if some company is partway through scaling up the hot new algorithm and (rather than training to completion) they trip the alarm that was searching for undesirable real-world behavior because of learned agent-like reasoning, what then? The assumption that progress will be slow relative to adaptation already seems to be out the window.

This is basically the punctuated equilibria theory of software evolution :P