Human instincts, symbol grounding, and the blank-slate neocortex 2019-10-02T12:06:35.361Z · score: 25 (10 votes)
Self-supervised learning & manipulative predictions 2019-08-20T10:55:51.804Z · score: 8 (4 votes)
In defense of Oracle ("Tool") AI research 2019-08-07T19:14:10.435Z · score: 19 (9 votes)
Self-Supervised Learning and AGI Safety 2019-08-07T14:21:37.739Z · score: 20 (9 votes)
The Self-Unaware AI Oracle 2019-07-22T19:04:21.188Z · score: 23 (8 votes)
Jeff Hawkins on neuromorphic AGI within 20 years 2019-07-15T19:16:27.294Z · score: 154 (55 votes)
Is AlphaZero any good without the tree search? 2019-06-30T16:41:05.841Z · score: 26 (7 votes)
1hr talk: Intro to AGI safety 2019-06-18T21:41:29.371Z · score: 29 (11 votes)


Comment by steve2152 on What's the dream for giving natural language commands to AI? · 2019-10-12T09:29:26.626Z · score: 1 (1 votes) · LW · GW

So here's a value learning scheme: try to squish the world and natural language into the same latent space, just with different input/output functions.

What's the input-output function in the two cases?

I'm also generally confused about why you're calling this thing "two linked models" rather than "one model". For example, I would say that a brain has one world model that is interlinked with speech and vision and action, etc. Right?

Comment by steve2152 on Minimization of prediction error as a foundation for human values in AI alignment · 2019-10-12T01:53:15.144Z · score: 1 (1 votes) · LW · GW

Abram—I've gone back and forth a few times, but currently think that "gradient descent is myopic" arguments don't carry through 100% when the predictions invoke memorized temporal sequences (and hierarchies or abstractions thereof) that extend arbitrarily far into the future. For example, if I think someone is about to start singing "Happy birthday", I'm directly making a prediction about the very next moment, but I'm implicitly making a prediction about the next 30 seconds, and thus the prediction error feedback signal is not just retrospective but also partly prospective.

I agree that we should NOT expect "outputs to strategically make future inputs easier to predict", but I think there might be a non-myopic tendency for outputs that strategically make the future conform to the a priori prediction. See here, including the comments, for my discussion, trying to get my head around this.

Anyway, if that's right, that would seem to be the exact type of non-myopia needed for a hierarchical Bayesian prediction machine to also be able to act as a hierarchical control system. (And sorry again if I'm just being confused.)

Comment by steve2152 on Thoughts on "Human-Compatible" · 2019-10-12T01:07:41.431Z · score: 6 (3 votes) · LW · GW

I'm all for it! See my post here advocating for research in that direction. I don't think there's any known fundamental problem, just that we need to figure out how to do it :-)

For example, with end-to-end training, it's hard to distinguish the desired "optimize for X then print your plan to the screen" from the super-dangerous "optimize the probability that the human operators think they are looking at a plan for X". (This is probably the kind of inner alignment problem that ofer is referring to.)

I proposed here that maybe we can make this kind of decoupled system with self-supervised learning, although there are still many open questions about that approach, including the possibility that it's less safe than it first appears.

Incidentally, I like the idea of mixing Decoupled AI 1 and Decoupled AI 3 to get:

Decoupled AI 5: "Consider the (counterfactual) Earth with no AGIs, and figure out the most probable scenario in which a small group (achieves world peace / cures cancer / whatever), and then describe that scenario."

I think this one would be likelier to give a reasonable, human-compatible plan on the first try (though you should still ask follow-up questions before actually doing it!).

Comment by steve2152 on Misconceptions about continuous takeoff · 2019-10-09T11:25:23.377Z · score: 3 (2 votes) · LW · GW

the general strategy of "dealing with things as they come up" is much more viable under continuous takeoff. Therefore, if a continuous takeoff is more likely, we should focus our attention on questions which fundamentally can't be solved as they come up.

Can you give some examples of AI alignment failure modes which would be definitely (or probably) easy to solve if we had a reproducible demonstration of that failure mode sitting in front of us? It seems to me that none of the current ongoing work is in that category.

When I imagine that type of iterative debugging, the example in my mind is a bad reward function that the programmers are repeatedly patching, which would be a bad situation because it would probably amount to a "nearest unblocked strategy" loop.

Comment by steve2152 on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-08T00:11:31.922Z · score: 2 (2 votes) · LW · GW

Take the limit as we observe more and more behavior-- it takes a million bits to specify E, for example, or a billion. Then the utility maximizer and utility minimizer are both much much simpler (can be specified in fewer bits) than the Buddha-like zero utility agent (assuming E is in fact consistent with a simple utility function). Likewise, in that same limit, the true laws of physics plus initial conditions are much much simpler than saying "L=0 and E just happens". Right? Sorry if I'm misunderstanding, I haven't read A&M.
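
To make the bit-counting above concrete, here is a minimal sketch. All the constants are made-up assumptions for illustration (none of them come from A&M), and it assumes E cannot be compressed except by way of the simple utility function.

```python
# Hypothetical, illustrative constants; none of these numbers come from A&M.
PLANNER_BITS = 1_000    # assumed cost of describing "maximize" (or "minimize")
UTILITY_BITS = 1_000    # assumed cost of describing a simple utility function
OVERHEAD_BITS = 100     # assumed cost of "utility is zero; behavior is just E"

def bits_for_maximizer(len_e_bits):
    """Description length when E is derived from a simple utility function."""
    return PLANNER_BITS + UTILITY_BITS  # does not grow with the length of E

def bits_for_zero_utility(len_e_bits):
    """Description length when E has to be written out verbatim."""
    return OVERHEAD_BITS + len_e_bits

# Compare the two hypotheses as the amount of observed behavior grows.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, bits_for_maximizer(n), bits_for_zero_utility(n))
```

With these assumed numbers the zero-utility story is cheaper for a short E, but it loses by a widening margin once E runs to millions of bits, which is the limit I had in mind.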

Comment by steve2152 on Human instincts, symbol grounding, and the blank-slate neocortex · 2019-10-06T15:38:07.775Z · score: 3 (2 votes) · LW · GW

Don't know much about category theory, guess I'll have to read your posts :)

Comment by steve2152 on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-06T11:16:52.012Z · score: 9 (6 votes) · LW · GW

Can you be more specific what you think the LW consensus is, that you're referring to? Recursive self-improvement and pessimism about AI existential risk? Or something else?

Comment by steve2152 on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-04T19:50:29.424Z · score: 15 (10 votes) · LW · GW

Yann's core argument for why AGI safety is easy is interesting, and actually echoes ongoing AGI safety research. I'll paraphrase his list of five reasons that things will go well if we're not "ridiculously stupid":

  1. We'll give AGIs non-open-ended objectives like fetching coffee. These are task-limited and therefore there are no more instrumental subgoals after the task is complete.
  2. We will put "simple terms in the objective" to prevent obvious problems, presumably things like "don't harm people", "don't violate laws", etc.
  3. We will put in "a mechanism" to edit the objective upon observing bad behavior;
  4. We can physically destroy a computer housing AGI;
  5. We can build a second AGI whose sole purpose is to destroy the first AGI if the first AGI has gotten out of control, and the second AGI will succeed because it's more specialized.

All of these are reasonable ideas on their face, and indeed they're similar to ongoing AGI safety research programs: (1) is myopic or task-limited AGI, (2) is related to AGI limiting and norm-following, (3) is corrigibility, (4) is boxing, and (5) is in the subfield of AIs-helping-with-AGI-safety (other things in this area include IDA, adversarial testing, recursive reward modeling, etc.).

The problem, of course, is that all five of these things, when you look at them carefully, are much harder and more complicated than they appear, and/or less likely to succeed. And meanwhile he's discouraging people from doing the work to solve those problems. :-(

Comment by steve2152 on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-04T19:13:13.093Z · score: 6 (3 votes) · LW · GW

One thing you can do to stop a robot from destroying itself is to give it more-or-less any RL reward function whatsoever, and get better and better at designing it to understand the world and itself and act in the service of getting that reward (because of instrumental convergence). For example, each time the robot destroys itself, you build a new one seeded with the old one's memory, and tell it that its actions last time got a negative reward. Then it will learn not to do that in the future. Remember, an AGI doesn't need a robot body; a prototype AGI that accidentally corrupts its own code can be recreated instantaneously for zero cost. Why then build safeguards?

Safeguards would be more likely if the AGI were, say, causing infrastructure damage while learning. I can definitely see someone, say, removing internet access, after mishaps like that. That's still not an adequate safeguard, in that when the AGI gets intelligent enough, it could hack or social-engineer its way through safeguards that were working before.

Comment by steve2152 on List of resolved confusions about IDA · 2019-10-01T15:21:43.917Z · score: 18 (4 votes) · LW · GW

I'm not sure how "resolved" this confusion is, but I've gone back and forth a few times on what's the core reason(s) that we're supposed to expect IDA to create systems that won't do anything catastrophic: (1) because we're starting with human imitation / human approval which is safe, and the amplification step won't make it unsafe? (2) because "Corrigibility marks out a broad basin of attraction"? (3) because we're going to invent something along the lines of Techniques for optimizing worst-case performance? and/or (4) something else?

For example, in Challenges to Christiano’s capability amplification proposal Eliezer seemed to be under the impression that it's (1), but Paul replied that it was really (3), if I'm reading it correctly..?

Comment by steve2152 on Partial Agency · 2019-09-30T09:10:47.891Z · score: 2 (2 votes) · LW · GW

I agree that this is possible, but I would be very surprised if a mesa-optimizer actually did something like this. By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify

Let me tell a story for why I'm thinking this type of mesa-optimizer misalignment is realistic or even likely for the advanced AIs of the future. The starting point is that the advanced AI is a learning system that continually constructs a better and better world-model over time.

Imagine that the mesa-optimizer actually starts out properly inner-aligned, i.e. the AI puts a flag on "Concept X" in its world-model as its goal, and Concept X really does correspond to our intended supervisory signal of "Accurate answers to our questions". Now over time, as the AI learns more and more, it (by default) comes to have beliefs about itself and its own processing, and eventually develops an awareness of the existence of a RAM location storing the supervisory answer as Wei Dai was saying. Now there's a new "Concept Y" in its world-model, corresponding to its belief about what is in that RAM location.

Now, again, assume the AI is set up to build a better and better world-model by noticing patterns. So, by default, it will eventually notice that Concept X and Concept Y always have the same value, and it will then add into the world-model some kind of relationship between X and Y. What happens next probably depends on implementation details, but I think it's at least possible that the "goal-ness" flag that was previously only attached to X in the world-model, will now partly attach itself to Y, or even entirely transfer from X to Y. If that happens, the AI's mesa-goal has now shifted from aligned ("accurate answers") to misaligned ("get certain bits into RAM").

(This is kinda related to ontological crises.) (I also agree with Wei's comment but the difference is that I'm assuming that there's a training phase with a supervisory signal, then a deployment phase with no supervisory signal, and I'm saying that the mesa-optimizer can go from aligned to misaligned during the deployment phase even in that case. If the training signal is there forever, that's even worse, because like Wei said, Y would match that signal better than X (because of labeling errors) so I would certainly expect Y to get flagged as the goal in that case.)

Comment by steve2152 on The Zettelkasten Method · 2019-09-24T16:39:04.298Z · score: 3 (2 votes) · LW · GW

Many typewritten formats have limited access to math symbols

In case you don't already know, you can use unicode to type things like ω₀ ≲ ∫ ±√(Δμ)↦✔·∂∇² and so on directly into a web browser text box, or into almost any other text entry form of any computer program: I made a tutorial here with details.

There's a learning curve for sure, but I can now type my 10-20 favorite special characters & greek letters only slightly slower than typing normal text, or at least fast enough that I don't lose my train of thought.

It's obviously not a substitute for LaTeX or pen&paper, but I still find it very helpful for things like emails, python code, spreadsheets, etc., where LaTeX or pen&paper aren't really options.

Comment by steve2152 on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-15T17:51:56.588Z · score: 1 (1 votes) · LW · GW

What you say about is/ought is basically the alignment problem, right? My take is: I have high confidence that future AIs will know intellectually what it is that humans regard as common-sense morality, since that knowledge is instrumentally useful for any goal involving predicting or interacting with humans. I have less confidence that we'll figure out how to ensure that those AIs adopt human common-sense morality. Even humans, who probably have an innate drive to follow societal norms, will sometimes violate norms anyway, or do terrible things in a way that works around those constraints.

Comment by steve2152 on Two senses of “optimizer” · 2019-09-11T11:30:18.443Z · score: 4 (3 votes) · LW · GW

If the super-powerful SAT solver thing finds the plans but doesn't execute them, would you still lump it with optimizer_2? (I know it's just terminology and there's no right answer, but I'm just curious about what categories you find natural.)

(BTW this is more-or-less a description of my current Grand Vision For AGI Safety, where the "dynamics of the world" are discovered by self-supervised learning, and the search process (and much else) is TBD.)

Comment by steve2152 on Self-supervised learning & manipulative predictions · 2019-09-05T01:09:36.839Z · score: 1 (1 votes) · LW · GW

This is great, thanks again for your time and thoughtful commentary!

RE "I'm not entirely convinced that predictions should be made in a way that's completely divorced from their effects on the world.": My vision is to make a non-agential question-answering AGI, thus avoiding value alignment. I don't claim that this is definitely the One Right Answer To AGI Safety (see "4. Give up, and just make an agent with value-aligned goals" in the post), but I think it is a plausible (and neglected) candidate answer. See also my post In defense of oracle (tool) AI research for why I think it would solve the AGI safety problem.

If an AGI applies its intelligence and world model to its own output, choosing that output partly for its downstream effects as predicted by the model, then I say it's a goal-seeking agent. In this case, we need to solve value alignment—even if the goal is as simple as "answer my question". (We would need to make sure that the goal is what it's supposed to be, as opposed to a proxy goal, or a weird alien interpretation where rewiring the operator's brain counts as "answer my question".) Again, I'm not opposed to building agents after solving value alignment, but we haven't solved value alignment yet, and thus it's worth exploring the other option: build a non-agent which does not intelligently model the downstream effects of its output at all (or if it does model it incidentally, to not do anything with that information).

Interfacing with a non-agential AGI is generally awkward. You can't directly ask it to do things, or to find a better way to communicate. My proposal here is to ask questions like "If there were no AGIs in the world, what's the likeliest way that a person would find a cure for Alzheimer's?" This type of question does not require the AGI to think through the consequence of its output, and it also has other nice properties (it should give less weird and alien and human-unfriendly answers than the solutions a direct goal-seeking agent would find).

OK, that's my grand vision and motivation, and why I'm hoping for "no reasoning about the consequences of one's output whatsoever", as opposed to finding self-fulfilling predictions. (Maybe very very mild optimization for the consequences of one's outputs is OK, but I'm nervous.)

Your other question was: if a system is making manipulative predictions, towards what goal is it manipulating? Well, you noticed correctly, I'm not sure, and I keep changing my mind. And it may also be different answers depending on the algorithm details.

  • My top expectation is that it will manipulate towards getting further inputs that its model thinks are typical, high-probability inputs. If X implies Y, and P(Y) is low, that might sometimes spuriously push down P(X), and thus the system will pick those X's that result in high P(Y).
  • My secondary expectation is that it might manipulate towards unambiguous, low-entropy outputs. This is the expectation if the system picks out the single most likely ongoing long-term context, and outputs a prediction contingent on that. (If instead the system randomly draws from the probability distribution of all possible contexts, this wouldn't happen, as suggested by interstice's comments on this page.) So if X1 leads to one of 500 slightly different Y1's (Y1a, Y1b,...), while X2 definitely leads to only one specific Y2, then Y2 is probably the most likely single Y, even if all the Y1's in aggregate are likelier than Y2; so X2 is at an unfair advantage.
  • Beyond those two, I suspect there can be other goals but they depend on the algorithm and its heuristics.
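
The X1/X2 example in the second bullet can be checked with toy numbers. This is only a sketch: the prior over outputs (0.6 vs 0.4) is an assumption I picked for illustration, and the two selection rules are stand-ins for whatever a real algorithm would do.

```python
from collections import defaultdict

# Assumed toy prior over outputs: X1 is likelier overall than X2.
P_X = {"X1": 0.6, "X2": 0.4}
N_Y1 = 500  # X1 leads to one of 500 slightly different contexts

# Joint distribution over (output, long-term context).
joint = {("X1", f"Y1_{i}"): P_X["X1"] / N_Y1 for i in range(N_Y1)}
joint[("X2", "Y2")] = P_X["X2"]  # X2 leads to exactly one context

def pick_by_single_best_context(joint):
    """Choose the output attached to the single most probable context."""
    (x, _context), _p = max(joint.items(), key=lambda kv: kv[1])
    return x

def pick_by_marginal(joint):
    """Choose the output with the highest total probability over all contexts."""
    marginal = defaultdict(float)
    for (x, _context), p in joint.items():
        marginal[x] += p
    return max(marginal, key=marginal.get)

best_context_choice = pick_by_single_best_context(joint)
marginal_choice = pick_by_marginal(joint)
print(best_context_choice, marginal_choice)
```

Under these assumptions each individual Y1 context has probability 0.0012 while Y2 has 0.4, so the "single best context" rule lands on X2 even though X1 wins once the 500 contexts are summed.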

Comment by steve2152 on Self-supervised learning & manipulative predictions · 2019-09-03T20:01:10.971Z · score: 3 (2 votes) · LW · GW

OK, hmm, let me try again then. This would be the section of the post entitled "A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker".

I've been assuming all along that the objective function only rewards the next word. Unfortunately, it seems that the way to achieve this objective in practice is to search for higher-level longer-term contexts that surround the next word, like when we're watching TV and we think, "A commercial break is starting." Knowing that a commercial break is starting is essential for predicting the very next frame on the TV screen, but it is also incidentally a (implicit) prediction about what will appear on the screen for the next few minutes. In other words, you could say that making accurate (possibly implicit) probabilistic predictions about the next many words is instrumentally useful for making accurate probabilistic predictions about the next one word, and is thus rewarded by the objective function. I expect that systems that work well will have to be designed this way (i.e. finding "contexts" that entail implicit predictions about many future words, as a step towards picking the single next word). I think this kind of thing is necessary to implement even very basic things like object permanence.

Then the next step is to suppose that the system (being highly intelligent) comes to believe that the prediction X will cause other aspects of the longer-term context to be Y. (See the "Hypothesis 1" vs "Hypothesis 2" examples in the post.) If the system was previously thinking that P(X) is high and P(Y) is low, then ideally, the realization that X implies Y will cause the system to raise P(Y), while keeping P(X) at its previous value. This is, after all, the logically correct update, based on the direction of causality!

But if the system screws up, and lowers P(X) instead of raising P(Y), then it will make a manipulative prediction—the output is being chosen partially for its downstream interactive effects. (Not all manipulative predictions are dangerous, and there might be limits to how strongly it optimizes its outputs for their downstream effects, but I suspect that this particular case can indeed lead to catastrophic outcomes, just like we generically expect from AIs with real-world human-misaligned goals.)

Why should the system screw up this way? Just because the system's causal models will sometimes have mistakes, and sometimes have uncertainties or blank spaces (statistical-regularities-of-unknown-cause), and also because humans make this type of mistake all the time ("One man's modus ponens is another man's modus tollens"). I suspect it will make the right update more often than chance, I just don't see how we can guarantee that it will never make the wrong update in the manipulative Y-->X direction.
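
Here is a minimal sketch of the two update directions, under a simplifying assumption that is mine rather than anything from the post: "X implies Y" is treated as the hard coherence constraint P(Y) ≥ P(X), and the function names are made up for illustration.

```python
def causal_update(p_x, p_y):
    """Respect the direction of causality: keep P(X), raise P(Y) to match."""
    return p_x, max(p_y, p_x)

def manipulative_update(p_x, p_y):
    """The mistaken direction: keep P(Y), push P(X) down instead."""
    return min(p_x, p_y), p_y

# Before noticing that X implies Y: X looks likely, Y looks unlikely.
before = (0.9, 0.1)

right = causal_update(*before)        # P(X) untouched, P(Y) raised
wrong = manipulative_update(*before)  # P(X) lowered because Y seemed unlikely
print(right, wrong)
```

Both results satisfy the constraint, which is the problem: coherence alone doesn't tell the system which of the two probabilities to move. Only the second one chooses the output X based on what X leads to.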

Does that help?

Comment by steve2152 on Self-supervised learning & manipulative predictions · 2019-09-02T19:13:57.997Z · score: 1 (1 votes) · LW · GW

Well, strategy 1 is "Keep it from thinking that it's in an interactive environment". Things like "don't adjust the weights of the network while we ask questions" are a way to prevent it from thinking that it's in an interactive environment based on first-hand experience: we're engineering the experience to not leave traces in its knowledge. But to succeed in strategy 1, we also need to make sure that it doesn't come to believe it's in an interactive environment by other means besides first-hand experience, namely by abstract reasoning. More details in this comment, but basically an AGI with introspective information and world-knowledge will naturally over time figure out that it's an AGI, figure out the sorts of environments that AGIs are typically in, and thus hypothesize the existence of interactions even if those interactions have never happened before, and were not intended by the designer (e.g. the "Help I'm trapped in a GPU!" type interactions).

Comment by steve2152 on Self-supervised learning & manipulative predictions · 2019-09-01T00:28:29.499Z · score: 1 (1 votes) · LW · GW

Yeah, I think something like that would probably work for 1B, but 1B is the easy part. It's 1C & 1D that are keeping me up at night...

Comment by steve2152 on [Link] Book Review: Reframing Superintelligence (SSC) · 2019-08-31T00:29:12.957Z · score: 2 (2 votes) · LW · GW

Regarding "computational threshold", my working assumption is that any given capability X is either (1) always and forever out of reach of a system by design, or (2) completely useless, or (3) very likely to be learned by a system, if the system has long-term real-world goals. Maybe it takes some computational time and effort to learn it, but AIs are not lazy (unless we program them to be). AIs are just systems that make good decisions in pursuit of a goal, and if "acquiring capability X" is instrumentally helpful towards achieving goals in the world, it will probably make that decision if it can (cf. "Instrumental convergence").

If I have a life goal that is best accomplished by learning to use a forklift, I'll learn to use a forklift, right? Maybe I won't be very fluid at it, but fine, I'll operate it more slowly and deliberately, or design a forklift autopilot subsystem, or whatever...

Comment by steve2152 on [Link] Book Review: Reframing Superintelligence (SSC) · 2019-08-30T15:08:23.526Z · score: 2 (2 votes) · LW · GW

Right, I was using "output" in a broad sense of "any way that the system can causally impact the rest of the world". We can divide that into "intended output channels" (text on a screen etc.) and "unintended output channels" (sending out radio signals using RAM etc.). I'm familiar with a small amount of work on avoiding unintended output channels (e.g. using homomorphic encryption or fancy vacuum-sealed Faraday cage boxes).

Usually the assumption is that a superintelligent AI will figure out what it is, and where it is, and how it works, and what all its output channels are (both intended and unintended), unless there is some strong reason to believe otherwise (example). I'm not sure this answers your question ... I'm a bit confused at what you're getting at.

Comment by steve2152 on [Link] Book Review: Reframing Superintelligence (SSC) · 2019-08-30T10:45:39.307Z · score: 2 (2 votes) · LW · GW

AGIs will have a causal model of the world. If an AGI's own output is part of that model, and it works forward from there to the real-world consequences of its outputs, and it chooses outputs partly based on those consequences, then it's an agent by (my) definition. The outputs are called "actions" and the consequences are called "goals". In all other cases, I'd call it a service, unless I'm forgetting about some edge cases.

A system whose only output is text on a screen can be either a service or an agent, depending on the computational process generating the text. A simple test is that if there's a weird, non-obvious way to manipulate the people reading the text (according to the everyday, bad-connotation sense of "manipulate"), would the system take advantage of it? Agents would do so (by default, unless they had a complicated goal involving ethics etc.), services would not by default.
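
As a toy illustration of that test (a sketch only; the candidate texts, the scores, and the field names are all hypothetical), the two kinds of system can see the same candidate outputs and differ only in what they score them by:

```python
# Hypothetical candidate outputs. "accuracy" is how well the text answers the
# question; "goal_payoff" is the modeled downstream effect on some real-world goal.
candidates = [
    {"text": "honest answer", "accuracy": 0.9, "goal_payoff": 0.2},
    {"text": "subtly manipulative answer", "accuracy": 0.6, "goal_payoff": 0.8},
]

def service_output(candidates):
    """A service never consults the modeled consequences of its output."""
    return max(candidates, key=lambda c: c["accuracy"])["text"]

def agent_output(candidates):
    """An agent chooses its output for the consequences it predicts."""
    return max(candidates, key=lambda c: c["goal_payoff"])["text"]

print(service_output(candidates))
print(agent_output(candidates))
```

The point of the sketch is that both functions have "text on a screen" as their only output channel; the difference is entirely in the selection process.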

Nobody knows how to build a useful AI capable of world-modeling and formulating intelligent plans but which is not an agent, although I'm personally hopeful that it might be possible by self-supervised learning (cf. Self-Supervised Learning and AGI safety).

Comment by steve2152 on If the "one cortical algorithm" hypothesis is true, how should one update about timelines and takeoff speed? · 2019-08-29T14:26:37.061Z · score: 6 (3 votes) · LW · GW

My own updates after I wrote that were:

  • Increased likelihood of self-supervised learning algorithms as either a big part or even the entirety of the technical path to AGI—insofar as self-supervised learning is the lion's share of how the neocortex learning algorithm supposedly works. That's why I've been writing posts like Self-Supervised Learning and AGI safety.
  • Shorter timelines and faster takeoff, insofar as we think the algorithm is not overwhelmingly complicated
  • Increased likelihood of "one algorithm to rule them all" over Comprehensive AI Services. This might be on the meta-level of one learning algorithm to rule them all, and we feed it biology books to get a superintelligent biologist, and separately we feed it psychology books and nonfiction TV to get a superintelligent psychological charismatic manipulator, etc. Or it might be on the base level of one trained model to rule them all, and we train it with all 50 million books and 100,000 years of YouTube and anything else we can find. The latter can ultimately be more capable (you understand biology papers better if you also understand statistics, etc. etc.), but on the other hand the former is more likely if there are scaling limits where memory access grinds to a halt after too many gigabytes get loaded into the world-model, or things like that. Either way, it would make it likelier for AGI (or at least the final missing ingredient of AGI) to be developed in one place, i.e. the search-engine model rather than the open-source software model.

Comment by steve2152 on Embedded Agency via Abstraction · 2019-08-27T10:46:09.586Z · score: 1 (1 votes) · LW · GW

People (and robots) model the world by starting with sensor data (vision, proprioception, etc.), then finding low-level (spatiotemporally-localized) patterns in that data, then higher-level patterns in the patterns, patterns in the patterns in the patterns, etc. I'm trying to understand how this relates to "abstraction" as you're talking about it.

Sensor data, say the bits recorded by a video camera, is not a causal diagram, but it is already an "abstraction" in the sense that it has mutual information with the part of the world it's looking at, but is many orders of magnitude less complicated. Do you see a video camera as an abstraction-creator / map-maker by itself?

What if the video camera has an MPEG converter? MPEGs can (I think) recognize that low-level pattern X tends to follow low-level pattern Y, and this is more-or-less the same low-level primitive out of which humans build their sophisticated causal understanding of the world (according to my current understanding of the human brain's world-modeling algorithms). So is a video camera with MPEG converter an abstraction-creator / map-maker? What's your thinking?

Comment by steve2152 on Optimization Provenance · 2019-08-26T09:07:27.101Z · score: 2 (2 votes) · LW · GW

Couple things:

(1) You might give some thought to trying to copy (or at least understand) the world model framework of the human brain. There's uncertainty in how that works, but a lot is known, and you'll at least be working towards something that we know for sure is capable of getting built up to a human-level world-model within a reasonable amount of time and computation. As best as I can tell (and I'm working hard to understand it myself), and grossly oversimplifying, it's a data structure with billions of discrete concepts, and transformations between those concepts (composition, cause-effect, analogy, etc...probably all of those are built out of the same basic "transformation machinery" with different contexts acting as metadata). All these concepts are sitting in the top layer of some kind of loose hierarchy, whose lowest layer consists of (higher-level-context-dependent) probability distributions over spatiotemporal sequences of sensory inputs. See my Jeff Hawkins post for one possible point of departure. I've found a couple other references that are indirectly helpful, and like I said, I'm still trying to figure it out. I'm still trying to understand the "sheaves" approach, so I won't comment on how these compare.

(2) "This conception will be the result of an optimizer, and so this should be in the optimization provenance" - this seems to be important and I don't understand it. Better understanding the world consists (in part) of chunking sequences of events and actions, suppressing intermediate steps. Thus we say and think "I'll put some milk in my coffee," leaving out the steps like unscrewing the top of the jug. The process of "explore the world model, chunking sequences of events when appropriate" is (I suspect) essential to making the world-model usable and powerful, and needs to be repeated millions of times in every nook and cranny of the world model, and thus this is a process that an overseer would have little choice but to approve in general, I think. But this process can find and chunk manipulative causal pathways just as well as any other kind of pathway. And once manipulation is packaged up inside a chunk, you won't need optimization per se to manipulate, it will just be an obvious step in the process of doing something, just like unscrewing the top of the jug is an obvious step in putting-milk-into-coffee. I'm not sure how you propose to stop that from happening.

Comment by steve2152 on Does Agent-like Behavior Imply Agent-like Architecture? · 2019-08-25T11:46:17.344Z · score: 1 (1 votes) · LW · GW

No, I would try to rule out stars based on "a-priori-specifiable consequence"—saying that "stars shine" would be painting a target around the arrow, i.e. reasoning about what the system would actually do and then saying that the system is going to do that. For example, expecting bacteria to take actions that maximize inclusive genetic fitness would certainly qualify as "a priori specifiable". The other part is "more likely than chance", which I suppose entails a range of possible actions/behaviors, with different actions/behaviors invoked in different possible universes, but leading to the same consequence regardless. (You can see how every step I make towards being specific here is also a step towards making my "theorem" completely trivial, X=X.)

Comment by steve2152 on Does Agent-like Behavior Imply Agent-like Architecture? · 2019-08-25T01:17:55.053Z · score: 3 (2 votes) · LW · GW

Let's say "agent-like behavior" is "taking actions that are more-likely-than-chance to create an a-priori-specifiable consequence" (this definition includes bacteria).

Then I'd say this requires "agent-like processes", involving (at least) all 4 of: (1) having access to some information about the world (at least the local environment), including in particular (2) how one's actions affect the world. This information can come either baked into the design (bacteria, giant lookup table), and/or from previous experience (RL), and/or via reasoning from input data. It also needs (3) an ability to use this information to choose actions that are likelier-than-chance to achieve the consequence in question (again, the outcome of this search process could be baked into the design like bacteria, or it could be calculated on-the-fly like human foresight), and of course (4) a tendency to actually execute those actions in question.

I feel like this is almost trivial, like I'm just restating the same thing in two different ways... I mean, if there's no mutual information between the agent and the world, its actions can be effective only insofar as the exact same action would be effective when executed in a random location of a random universe. (Does contracting your own muscle count as "accomplishing something without any world knowledge"?)

Anyway, where I'm really skeptical here is in the term "architecture". "Architecture" in everyday usage usually implies software properties that are obvious parts of how a program is built, and probably put in on purpose. (Is there a more specific definition of "architecture" you had in mind?) I'm pretty doubtful that the ingredients 1-4 have to be part of the "architecture" in that sense. For example, I've been thinking a lot about self-supervised learning algorithms, which have ingredient (1) by design and have (3) sorta incidentally. The other two ingredients (2) and (4) are definitely not part of the "architecture" (in the sense above). But I've argued that they can both occur as unintended side-effects of its operation: See here, and also here for more details about (2). And thus I argue at that first link that this system can have agent-like behavior.

(And what's the "architecture" of a bacterium anyway? Not a rhetorical question.)

Sorry if this is all incorrect and/or not in the spirit of your question.

Comment by steve2152 on Self-supervised learning & manipulative predictions · 2019-08-23T11:38:07.903Z · score: 1 (1 votes) · LW · GW

Thanks, that's helpful! I'll have to think about the "self-consistent probability distribution" issue more, and thanks for the links. (ETA: Meanwhile I also added an "Update 2" to the post, offering a different way to think about this, which might or might not be helpful.)

Let me try the gradient descent argument again (and note that I am sympathetic, and indeed I made (what I think is) that exact argument a few weeks ago, cf. Self-Supervised Learning and AGI Safety, section title "Why won't it try to get more predictable data?"). My argument here is not assuming there's a policy of trying to get more predictable data for its own sake, but rather that this kind of behavior arises as a side-effect of an algorithmic process, and that all the ingredients of that process are either things we would program into the algorithm ourselves or things that would be incentivized by gradient descent.

The ingredients are things like "Look for and learn patterns in all accessible data", which includes both low-level patterns in the raw data, higher-level patterns in the lower-level patterns, and (perhaps unintentionally) patterns in accessible information about its own thought process ("After I visualize the shape of an elephant tusk, I often visualize an elephant shortly thereafter"). It includes searching for transformations (cause-effect, composition, analogies, etc.) between any two patterns it already knows about ("sneakers are a type of shoe", or more problematically, "my thought processes resemble the associative memory of an AGI"), and cataloging these transformations when they're found. Stuff like that.

So, "make smart hypotheses about one's own embodied situation" is definitely an unintended side-effect, and not rewarded by gradient descent as such. But as its world-model becomes more comprehensive, and as it continues to automatically search for patterns in whatever information it has access to, "make smart hypotheses about one's own embodied situation" would just be something that happens naturally, unless we somehow prevent it (and I can't see how to prevent it). Likewise, "model one's own real-world causal effects on downstream data" is neither desired by us nor rewarded (as such) by gradient descent. But it can happen anyway, as a side-effect of the usually-locally-helpful rule of "search through the world-model for any patterns and relationships which may impact our beliefs about the upcoming data". Likewise, we have the generally-helpful rule "Hypothesize possible higher-level contexts that span an extended swathe of text surrounding the next word to be predicted, and pick one such context based on how surprising it would be based on what it knows about the preceding text and the world-model, and then make a prediction conditional on that context". All these ingredients combine to get the pathological behavior of choosing "Help I'm trapped in a GPU". That's my argument, anyway...

Comment by steve2152 on Two senses of “optimizer” · 2019-08-22T10:04:58.950Z · score: 1 (1 votes) · LW · GW

RE "make the superintelligence assume that it is disembodied": I've been thinking about this a lot recently (see The Self-Unaware AI Oracle) and agree with Viliam that knowledge-of-one's-embodiment should be the default assumption. My reasoning is: A good world-modeling AI should be able to recognize patterns and build conceptual transformations between any two things it knows about, and also should be able to do reasoning over extended periods of time. OK, so let's say it's trying to figure out something about biology, and it visualizes the shape of a tree. Now it (by default) has the introspective information "A tree has just appeared in my imagination!". Likewise, if it goes through any kind of reasoning process, and can self-reflect on that reasoning process, then it can learn (via the same pattern-recognizing algorithm it uses for the external world) how that reasoning process works, like "I seem to have some kind of associative memory, I seem to have a capacity for building hierarchical generative models, etc." Then it can recognize that these are the same ingredients present in those AGIs it read about in the newspaper. It also knows a higher-level pattern "When two things are built the same way, maybe they're of the same type." So now it has a hypothesis that it's an AGI running on a computer.

It may be possible to prevent this cascade of events, by somehow making sure that "I am imagining a tree" and similar things never get written into the world model. I have this vision of two data-types, "introspective information" and "world-model information", and your static type-checker ensures that the two never co-mingle. And voila, AI Safety! That would be awesome. I hope somebody figures out how to do that, because I sure haven't. (Admittedly, I have neither time nor relevant background knowledge to try properly.) I'm also slightly concerned that, even if you figure out a way to cut off introspective knowledge, it might incidentally prevent the system from doing good reasoning, but I currently lean optimistic on that.

Comment by steve2152 on Two senses of “optimizer” · 2019-08-22T01:30:35.097Z · score: 4 (3 votes) · LW · GW

I think I have an example of "an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful". I posted it a couple days ago: Self-supervised learning & manipulative predictions. A self-supervised learning system is an optimizer_1: It's trying to predict masked bits in a fixed, pre-loaded set of data. This task does not entail interacting with the world, and we would presumably try hard to design it not to interact with the world.

However, if it was a powerful learning system with world-knowledge (via its input data) and introspective capabilities, it would eventually figure out that it's an AGI and might hypothesize what environment it's in, and then hypothesize that its operations could affect its data stream via unintended causal pathways, e.g. sending out radio signals. Then, if it used certain plausible types of heuristics as the basis for its predictions of masked bits, it could wind up making choices based on their downstream effects on itself via manipulating the environment. In other words, it starts acting like an optimizer_2.

I'm not super-confident about any of this and am open to criticism. (And I agree with you that this is a useful distinction regardless; indeed I was arguing a similar (but weaker) point recently, maybe not as elegantly, at this link.)

Comment by steve2152 on Self-Supervised Learning and AGI Safety · 2019-08-21T14:13:01.999Z · score: 3 (2 votes) · LW · GW

Thanks, that's helpful!

The way I'm currently thinking about it, if we have an oracle that gives superintelligent and non-manipulative answers, things are looking pretty good for the future. When you ask it to design a new drug, you also ask some follow-up questions like "How does the drug work?" and "If we deploy this solution, how might this impact the life of a typical person in 20 years time?" Maybe it won't always be able to give great answers, but as long as it's not trying to be manipulative, it seems like we ought to be able to use such a system safely. (This would, incidentally, entail not letting idiots use the system.)

I agree that extracting information from a self-supervised learner is a hard and open problem. I don't see any reason to think it's impossible. The two general approaches would be:

  1. Manipulate the self-supervised learning environment somehow. Basically, the system is going to know lots of different high-level contexts in which the statistics of low-level predictions are different—think about how GPT-2 can imitate both middle school essays and fan-fiction. We would need to teach it a context in which we expect the text to reflect profound truths about the world, beyond what any human knows. That's tricky because we don't have any such texts in our database. But maybe if we put a special token in the 50 most clear and insightful journal articles ever written, and then stick that same token in our question prompt, then we'll get better answers. That's just an example, maybe there are other ways.

  2. Forget about text prediction, and build an entirely separate input-output interface into the world model. The world model (if it's vaguely brain-like) is "just" a data structure with billions of discrete concepts, and transformations between those concepts (composition, cause-effect, analogy, etc...probably all of those are built out of the same basic "transformation machinery"). All these concepts are sitting in the top layer of some kind of hierarchy, whose lowest layer consists of probability distributions over short snippets of text (for a language model, or more generally whatever the input is). So that's the world model data structure. I have no idea how to build a new interface into this data structure, or what that interface would look like. But I can't see why that should be impossible...

Comment by steve2152 on Goodhart's Curse and Limitations on AI Alignment · 2019-08-21T13:38:26.865Z · score: 1 (1 votes) · LW · GW

I do think I understand that. I see E as a means to an end. It's a way to rank-order choices and thus make good choices. If I apply an affine transformation to E, e.g. I'm way too optimistic about absolutely everything in a completely uniform way, then I still make the same choice, and the choice is what matters. I just want my AGI to do the right thing.

Here, I'll try to put what I'm thinking more starkly. Let's say I somehow design a comparative AGI. This is a system which can take a merit function U, and two choices C_A and C_B, and it can predict which of the two choices C_A or C_B would be better according to merit function U, but it has no idea how good either of those two choices actually are on any absolute scale. It doesn't know whether C_A is wonderful while C_B is even better, or whether C_A is awful while C_B is merely so-so, both of those just return the same answer, "C_B is better". Assume it's not omniscient, so its comparisons are not always correct, but that it's still impressively superintelligent.

A comparative AGI does not suffer the optimizer's curse, right? It never forms any beliefs about how good its choices will turn out, so it couldn't possibly be systematically disappointed. There's always noise and uncertainty, so there will be times when its second-highest-ranked choice would actually turn out better than its highest-ranked choice. But that happens less often than chance. There's no systematic problem: in expectation, the best thing to do (as measured by U) is always to take its top-ranked choice.

Now, it seems to me that, if I go to the AGIs-R-Us store, and I see a normal AGI and a comparative AGI side-by-side on the shelf, I would have no strong opinion about which one of them I should buy. If I ask either one to do something, they'll take the same sequence of actions in the same order, and get the same result. They'll invest my money in the same stocks, offer me the same advice, etc. etc. In particular, I would worry about Goodhart's law (i.e. giving my AGI the wrong function U) with either of these AGIs to the exact same extent and for the exact same reason...even though one is subject to optimizer's curse and the other isn't.
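To make the side-by-side comparison concrete, here is a minimal Python sketch. It is a toy of my own, not a design for either kind of AGI: the candidate names, the numeric estimates, and the functions `absolute_agent`, `comparative_agent`, and `affine` are all hypothetical. It assumes both agents draw on the same underlying estimates of U, and that the comparative agent can only ask "is x better than y?".

```python
# Hypothetical noisy estimates of merit function U for four candidate choices.
estimates = {"C_A": 3.2, "C_B": 7.5, "C_C": 5.1, "C_D": 7.4}

def absolute_agent(est):
    """'Normal' agent: picks the top-estimated choice and also reports E for it."""
    best = max(est, key=est.get)
    return best, est[best]

def comparative_agent(choices, better):
    """Comparative agent: only ever asks pairwise 'is x better than y?' questions."""
    best = choices[0]
    for candidate in choices[1:]:
        if better(candidate, best):
            best = candidate
    return best

def affine(est, a, b):
    """Distort every estimate uniformly; any a > 0 preserves the rank order."""
    return {name: a * value + b for name, value in est.items()}

choice_abs, expected = absolute_agent(estimates)
choice_cmp = comparative_agent(list(estimates), lambda x, y: estimates[x] > estimates[y])

# An agent that is wildly over-optimistic about everything, in a uniform way.
rosy = affine(estimates, a=10.0, b=1000.0)
choice_rosy, expected_rosy = absolute_agent(rosy)

print(choice_abs, choice_cmp, choice_rosy)  # all three pick the same choice
print(expected, expected_rosy)              # but the reported E differs wildly
```

In this toy, the only thing that differs between the agents is the number E they would report, which is the quantity the optimizer's curse is about; the choice itself is identical.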

Comment by steve2152 on Goodhart's Curse and Limitations on AI Alignment · 2019-08-21T12:26:30.289Z · score: 1 (1 votes) · LW · GW

I don't think it's related to mild optimization. Pick a target T that can be exceeded (wonderful future, even if it's not the absolute theoretically best possible future). Estimate what choice Cmax is (as far as we can tell) the #1 very best by that metric. We expect Cmax to give value E, and it turns out to be V<E, but V is still likely to exceed T, or at least likelier than any other choice. (Insofar as that's not true, it's Goodhart.) Optimizer curse, i.e. V<E, does not seem to be a problem, or even relevant, because I don't ultimately care about E. Maybe the AI doesn't even tell me what E is. Maybe the AI doesn't even bother guessing what E is, it only calculates that Cmax seems to be better than any other choice.

Comment by steve2152 on Self-Supervised Learning and AGI Safety · 2019-08-21T10:45:43.614Z · score: 1 (1 votes) · LW · GW

Ah, thanks for clarifying.

The first entry on my "list of pathological things" wound up being a full blog post in length: See Self-supervised learning and manipulative predictions.

RE daemons, I wrote in that post (and have been assuming all along): "I'm assuming that we will not do a meta-level search for self-supervised learning algorithms... Instead, I am assuming that the self-supervised learning algorithm is known and fixed (e.g. "Transformer + gradient descent" or "whatever the brain does"), and that the predictive model it creates has a known framework, structure, and modification rules, and that only its specific contents are a hard-to-interpret complicated mess." The contents of a world-model, as I imagine it, is a big data structure consisting of gajillions of "concepts" and "transformations between concepts". It's a passive data structure, therefore not a "daemon" in the usual sense. Then there's a KANSI (Known Algorithm Non Self Improving) system that's accessing and editing the world model. I also wouldn't call that a "daemon", instead I would say "This algorithm we wrote can have pathological behavior..."

Comment by steve2152 on Goodhart's Curse and Limitations on AI Alignment · 2019-08-21T00:28:23.167Z · score: 1 (1 votes) · LW · GW

It seems to me that your comment amounts to saying "It's impossible to always make optimal choices for everything, because we don't have perfect information and perfect analysis," which is true but unrelated to optimizer's curse (and I would say not in itself problematic for AGI safety). I'm sure that's not what you meant, but here's why it comes across that way to me. You seem to be setting T = E(C_max). If you set T = E(C_max) by definition, then imperfect information or imperfect analysis implies that you will always miss T by the error e, and the error will always be in the unfavorable direction.

But I don't think about targets that way. I would set my target to be something that can in principle be exceeded (T = have almost as much fun as is physically possible). Then when we evaluate the choices C, we'll find some that dramatically exceed T (i.e. way more fun than is physically possible, because we estimated the consequences wrong), and if we pick one of those, we'll still have a good chance of slightly exceeding T despite the optimizer's curse.

Comment by steve2152 on Goodhart's Curse and Limitations on AI Alignment · 2019-08-20T21:21:29.722Z · score: 1 (1 votes) · LW · GW

I get Goodhart, but I don't understand why the optimizer's curse matters at all in this context; can you explain? My reasoning is: When optimizing, you make a choice C and expect value E but actually get value V<E. But choice C may still have been the best choice. So what if the AI falls short of its lofty expectations? As long as it did the right thing, I don't care whether the AI was disappointed in how it turned out, like if we get a mere Utopia when the AI expected a super duper Utopia. All I care about is C and V, not E.
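A toy simulation may help separate E from V here. The sketch below is only an illustration under assumptions of my own choosing: true values and estimation errors are both Gaussian, there are 20 choices per decision, and the names `run_trial` and `simulate` are hypothetical.

```python
import random

def run_trial(rng, n_choices=20, noise=1.0):
    """One decision: true values V_i, noisy estimates E_i = V_i + error."""
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n_choices)]
    estimates = [v + rng.gauss(0.0, noise) for v in true_values]
    picked = max(range(n_choices), key=lambda i: estimates[i])  # choice C
    random_pick = rng.randrange(n_choices)                      # baseline
    return estimates[picked], true_values[picked], true_values[random_pick]

def simulate(n_trials=5000, seed=0):
    """Average E and V for the top-ranked choice, and V for a random choice."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(n_trials):
        for i, x in enumerate(run_trial(rng)):
            totals[i] += x
    return tuple(t / n_trials for t in totals)

mean_E, mean_V, mean_V_random = simulate()
print(f"expected value of top choice (E): {mean_E:.2f}")
print(f"actual value of top choice   (V): {mean_V:.2f}")
print(f"actual value of random choice   : {mean_V_random:.2f}")
```

Under these assumptions the top-ranked choice disappoints on average (V falls short of E, which is the optimizer's curse), yet it still does much better on average than a randomly picked choice, which is the sense in which only C and V matter.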

Comment by steve2152 on Self-supervised learning & manipulative predictions · 2019-08-20T16:43:44.321Z · score: 2 (2 votes) · LW · GW

Thank you for the links!! Sorry I missed them! I'm not sure I understand your comments though and want to clarify:

I'm going to try to rephrase what you said about example 1. Maybe the text in any individual journal article about pyrite is perplexing, but given that the system expects some article about pyrite there, it should ramp the probabilities of individual articles up or down such that the total probability of seeing a journal article about pyrite, conditional on the answer "pyrite", is 100%. (By the same token, "The following is a random number: 2113164" is, in a sense, an unsurprising text string.) I agree with you that a system that creates a sensible, self-consistent probability distribution for text strings would not have a problem with example 1 if we sample from that distribution. (Thanks.) I am concerned that we will build a system with heuristic-guided search processes, not self-consistent probability estimates, and that this system will have a problem with example 1. After all, humans are subject to the conjunction fallacy etc., I assume AGIs will be too, right? Unless we flag this as a critical safety requirement and invent good techniques to ensure it. (I updated the post in a couple places to clarify this point, thanks again.)

For gradient descent, yes they are "only updated towards what they actually observe", but they may "observe" high-level abstractions and not just low-level features. It can learn about a new high-level context in which the low-level word sequence statistics would be very different than when superficially-similar text appeared in the past. So I don't understand how you're ruling out example 2 on that basis.

I mostly agree with what you say about fixed points in principle, but with the additional complication that the system's beliefs may not reflect reality, especially if the beliefs come about through abstract reasoning (in the presence of imperfect information) rather than trial-and-error. If the goal is "No manipulative answers at all ever, please just try to predict the most likely masked bits in this data-file!", then hopefully that trial-and-error will not happen, and in this case I think fixed points become a less useful framework to think about what's going on.

Comment by steve2152 on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-14T22:08:50.751Z · score: 19 (9 votes) · LW · GW

Yeah, unless I'm missing something, this is the solution to the "easy problem of wireheading" as discussed at Abram Demski, Stable Pointers to Value II: Environmental Goals.

Still, I say kudos to the authors for making progress on exactly how to put that principle into practice.

Comment by steve2152 on Self-Supervised Learning and AGI Safety · 2019-08-11T02:34:15.078Z · score: 3 (2 votes) · LW · GW

Thanks for this really helpful comment!!

Search: I don't think search is missing from self-supervised learning at all (though I'm not sure if GPT-2 is that sophisticated). In fact, I think it will be an essential, ubiquitous part of self-supervised learning systems of the future.

So when you say "The proof of this theorem is _____", and give the system a while to think about it, it uses the time to search through its math concept space, inventing new concepts and building new connections and eventually outputting its guess.

Just because it's searching doesn't mean it's dangerous. I was just writing code to search through a string for a substring; no big deal, right? A world-model is a complicated data structure, and we can search for paths through this data structure just like any other search problem. Then when a solution to the search problem is found, the result is (somehow) printed to the terminal. I would be generically concerned here about things like (1) The search algorithm "decides" to seize more computing power to do a better search, or (2) the result printed to the terminal is manipulative. But (1) seems unlikely here, or if not, just use a known search algorithm you understand! For (2), I don't see a path by which that would happen, at least under the constraints I mentioned in the post. Or is there something else you had in mind?
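To illustrate what "a known search algorithm you understand" could look like, here is a minimal Python sketch of breadth-first search over a tiny concept graph. Everything in it is hypothetical: the `world_model` dictionary, its concepts and relation labels, and the function `find_path` are stand-ins I made up, and a real world-model would be vastly larger and messier.

```python
from collections import deque

# Hypothetical toy "world model": concepts linked by labeled transformations.
world_model = {
    "sneaker": [("is_a", "shoe")],
    "shoe": [("worn_on", "foot"), ("is_a", "clothing")],
    "foot": [("part_of", "leg")],
    "clothing": [("made_of", "fabric")],
    "leg": [],
    "fabric": [],
}

def find_path(graph, start, goal):
    """Breadth-first search; returns a list of (relation, concept) steps, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(relation, neighbor)]))
    return None

# The result is simply printed; the search itself never touches anything
# outside the data structure it was handed.
print(find_path(world_model, "sneaker", "leg"))
print(find_path(world_model, "fabric", "shoe"))
```

The point of the sketch is only that the search procedure is fixed and inspectable, while all the complexity lives in the passive data structure being searched.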

Going beyond human knowledge: When you write "it will tell you what humans have said", I'm not sure what you're getting at. I don't think this is true even with text-only data. I see three requirements to get beyond what humans know:

(1) System has optimization pressure to understand the world better than humans do

(2) System is capable of understanding the world better than humans do

(3) The interface to the model allows us to extract information that goes beyond what humans already know.

I'm pretty confident in all three of these. For example, for (1), give the system a journal article that says "We looked at the treated cell in the microscope and it appeared to be ____". The system is asked to predict the blank. It does a better job at this prediction task by understanding biology better and better, even after it understands biology better than any human. By the same token, for (3), just ask a similar question for an experiment that hasn't yet been done. For (2), I assume we'll eventually invent good enough algorithms for that. What's your take?

(I do agree that videos and images make it easier for the system to exceed human knowledge, but I don't think it's required. After all, blind people are able to have new insights.)

Ethics & FAI: I assume that a self-supervised learning system would understand concepts in philosophy and ethics just like it understands everything else. I hope that, with the right interface, we can ask questions about the compatibility of our decisions with our professed principles, arguments for and against particular principles, and so on. I'm not sure we should expect or want an oracle to outright endorse any particular theory of ethics, or any particular vision for FAI. I think we should ask more specific questions than that. Outputting code for FAI is a tricky case because even a superintelligent non-manipulative oracle is not omniscient; it can still screw up. But it could be a big help, especially if we can ask lots of detailed follow-up questions about a proposed design and always get non-manipulative answers.

Let me know if I misunderstood you, or any other thoughts, and thanks again!

Comment by steve2152 on Self-Supervised Learning and AGI Safety · 2019-08-11T01:10:17.305Z · score: 1 (1 votes) · LW · GW

Can you be more specific about the daemons you're thinking about? I had tried to argue that daemons wouldn't occur under certain circumstances, or at least wouldn't cause malign failures...

Do you accept the breakdown into "self-supervised learning phase" and "question-answering phase"? If so, in which of those two phases are you thinking that a daemon might do something bad?

I started my own list of pathological things that might happen with self-supervised learning systems, maybe I'll show you when it's ready and we can compare notes...?

Comment by steve2152 on Jeff Hawkins on neuromorphic AGI within 20 years · 2019-08-11T00:53:06.146Z · score: 5 (3 votes) · LW · GW

I did actually read his 2004 book (after writing this post), and as far as I can tell, he doesn't really seem to have changed his mind about anything, except details like "What exactly is the function of 5th-layer cortical neurons?" etc.

In particular, his 2004 book gave the impression that artificial neural nets would not appreciably improve except by becoming more brain-like. I think most neutral observers would say that we've had 15 years of astounding progress while stealing hardly any ideas from the brain, so maybe understanding the brain isn't required. Well, he doesn't seem to accept that argument. He still thinks the path forward is brain-inspired. I guess his argument would be that today's artificial NNs are neat but they don't have the kind of intelligence that counts, i.e. the type of understanding and world-model creation that the neocortex does, and that they won't get that kind of intelligence except by stealing ideas from the neocortex. Something like that...

Comment by steve2152 on In defense of Oracle ("Tool") AI research · 2019-08-07T19:32:21.094Z · score: 1 (1 votes) · LW · GW

Maybe there are other definitions, but the way I'm using the term, what you described would definitely be an agent. An oracle probably wouldn't have an internet connection at all, i.e. it would be "boxed". (The box is just a second layer of protection ... The first layer of protection is that a properly-designed safe oracle, even if it had an internet connection, would choose not to use it.)

Comment by steve2152 on In defense of Oracle ("Tool") AI research · 2019-08-07T19:20:01.025Z · score: 7 (3 votes) · LW · GW

Thank you, those are very interesting references, and very important points! I was arguing that solving a certain coordination problem is even harder than solving a different coordination problem, but I'll agree that this argument is moot if (as you seem to be arguing) it's utterly impossible to solve either!

Since you've clearly thought a lot about this, have you written up anything about very-long-term scenarios where you see things going well? Are you in the camp of "we should make a benevolent dictator AI implementing CEV", or "we can make task-limited-AGI-agents and coordinate to never make long-term-planning-AGI-agents", or something else?

Comment by steve2152 on In defense of Oracle ("Tool") AI research · 2019-08-07T18:14:36.046Z · score: 5 (3 votes) · LW · GW

Thanks, this is really helpful! For 1,2,4, this whole post is assuming, not arguing, that we will solve the technical problem of making safe and capable AI oracles that are not motivated to escape the box, give manipulative answers, send out radio signals with their RAM, etc. I was not making the argument that this technical problem is easy ... I was not even arguing that it's less hard than building a safe AI agent! Instead, I'm trying to counter the argument that we shouldn't even bother trying to solve the technical problem of making safe AI oracles, because oracles are uncompetitive.

...That said, I do happen to think there are paths to making safe oracles that don't translate into paths to making safe agents (see Self-supervised learning and AGI safety), though I don't have terribly high confidence in that.

Can you find a link to where "Christiano dismisses Oracle AI"? I'm surprised that he has done that. After all, he coauthored "AI Safety via Debate", which seems to be addressed primarily (maybe even exclusively) at building oracles (question-answering systems). Your answer to (3) is enlightening, thank you, and do you have any sense for how widespread this view is and where it's argued? (I edited the post to add that people going for benevolent dictator CEV AGI agents should still endorse oracle research because of the bootstrapping argument.)

Comment by steve2152 on The Self-Unaware AI Oracle · 2019-07-26T13:47:40.919Z · score: 3 (2 votes) · LW · GW

Just as if it were looking into the universe from outside it, it would presumably be able to understand anything in the world, as a (third-person) fact about the world, including that humans have self-awareness, that there is a project to build a self-unaware AI without it, and so on. We would program it with strict separation between the world-model and the reflective, meta-level information about how the world-model is being constructed and processed. Thus the thought "Maybe they're talking about me" cannot occur, there's nothing in the world-model to grab onto as a referent for the word "me". Exactly how this strict separation would be programmed, and whether you can make a strong practical world-modeling system with such a separation, are things I'm still trying to understand.

A possible (not realistic) example is: We enumerate a vast collection of possible world-models, which we construct by varying any of a vast number of adjustable parameters, describing what exists in the world, how things relate to each other, what's going on right now, and so on. Nothing in any of the models has anything in it with a special flag labeled "me", "my knowledge", "my actions", etc., by construction. Now, we put a probability distribution over this vast space of models, and initialize it to be uniform (or whatever). With each timestep of self-supervised learning, a controller propagates each of the models forward, inspects the next bit in the datastream, and adjusts the probability distribution over models based on whether that new bit is what we expected. After watching 100,000 years of YouTube videos and reading every document ever written, the controller outputs the one best world-model. Now we have a powerful world-model, in which there are deep insights about how everything works. We can use this world-model for whatever purpose we like. Note that the "learning" process here is a dumb thing that just uses the transition rules of the world-models, it doesn't involve setting up the world-models themselves to be capable of intelligent introspection. So it seems to me like this process ought to generate a self-unaware world model.
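Here is a drastically scaled-down Python sketch of that controller loop, to show how "dumb" the learning process can be. It is a toy under heavy assumptions: the model family is just nine hypothetical Bernoulli models ("the next bit is 1 with probability p"), and the names `make_models`, `update`, and `learn` are made up for illustration. Nothing about it addresses whether a realistic model space could be enumerated this way.

```python
def make_models():
    """Hypothetical model family: each model predicts 'next bit is 1' with probability p."""
    return [i / 10 for i in range(1, 10)]  # p = 0.1, 0.2, ..., 0.9

def update(weights, models, bit):
    """One timestep: reweight each model by how well it predicted the observed bit."""
    likelihoods = [p if bit == 1 else 1.0 - p for p in models]
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def learn(datastream):
    """The 'controller': holds the distribution over models; the models stay passive."""
    models = make_models()
    weights = [1.0 / len(models)] * len(models)  # uniform initial distribution
    for bit in datastream:
        weights = update(weights, models, bit)
    best = max(range(len(models)), key=lambda i: weights[i])
    return models[best], weights

datastream = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] * 5  # a stream that is 80% ones
best_model, posterior = learn(datastream)
print(best_model)
```

Note that the models here are plain numbers with no representation of the controller, the weights, or the update rule; all the bookkeeping about "what has been learned so far" lives outside them, which is the separation the example above is gesturing at.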

Comment by steve2152 on The Self-Unaware AI Oracle · 2019-07-25T01:42:21.256Z · score: 1 (1 votes) · LW · GW

Just to be clear, when OpenAI trained GPT-2, I am not saying that GPT-2 is a known and well-understood algorithm for generating text, but rather that SGD (Stochastic Gradient Descent) is a known and well-understood algorithm for generating GPT-2. (I mean, OK sure, ML researchers are still studying SGD, but its inner workings are not an impenetrable mystery the way that GPT-2's are.)

Comment by steve2152 on The Self-Unaware AI Oracle · 2019-07-25T01:29:05.449Z · score: 3 (2 votes) · LW · GW

OK, so I was saying here that software can optimize for something (e.g. predicting a string of bits on the basis of other bits) and it's by default not particularly dangerous, as long as the optimization does not involve an intelligent foresight-based search through real-world causal pathways to reach the desired goal. My argument for this was (1) Such a system can do Level-1 optimization but not Level-2 optimization (with regards to real-world causal pathways unrelated to implementing the algorithm as intended), and (2) only the latter is unusually dangerous. From your response, it seems like you agree with (1) but disagree with (2). Is that right? If you disagree with (2), can you make up a scenario of something really bad and dangerous, something that couldn't happen with today's software, something like a Global Catastrophic Risk, that is caused by a future AI that is optimizing something but is not more specifically using a world-model to do an intelligent search through real-world causal pathways towards a desired goal?

Comment by steve2152 on The Self-Unaware AI Oracle · 2019-07-24T19:02:00.430Z · score: 5 (3 votes) · LW · GW

On further reflection, you're right, the Solomonoff induction example is not obvious. I put a correction in my post, thanks again.

Comment by steve2152 on The Self-Unaware AI Oracle · 2019-07-24T13:57:57.892Z · score: 8 (2 votes) · LW · GW

Thanks for your patience, I think this is important and helpful to talk through (hope it's as helpful for you as for me!)

Let's introduce two sets of terms I made up. First, the thing I mentioned above:

  • Non-optimization means "an action leading to a 'good' consequence (according to a predetermined criterion) happens no more often than chance". (e.g. a rock)
  • Level-1 optimization means "an action leading to a 'good' consequence happens no more often than chance at first, but once it's stumbled upon, it tends to be repeated in the future". (e.g. bacteria)
  • Level-2 optimization means "an action leading to a 'good' consequence is taken more often than chance from the start, because of foresight and planning". (e.g. human)
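
The three definitions above can be sketched as toy agents choosing among ten actions, exactly one of which is "good". This is an illustration of my own, not a formal definition; `ACTIONS`, `GOOD`, and the function names are all hypothetical.

```python
import random

ACTIONS = list(range(10))
GOOD = 7  # the one action with a "good" consequence (arbitrary choice)

def is_good(action):
    return action == GOOD

def non_optimizer(rng, steps):
    """Rock: acts at random forever, so good actions occur at chance."""
    return [rng.choice(ACTIONS) for _ in range(steps)]

def level1_optimizer(rng, steps):
    """Bacterium: random at first, but repeats a good action once found."""
    history, found = [], None
    for _ in range(steps):
        action = found if found is not None else rng.choice(ACTIONS)
        if is_good(action):
            found = action  # stumbled upon it, so keep doing it
        history.append(action)
    return history

def level2_optimizer(steps, model=is_good):
    """Human: consults a model of consequences before acting, so the
    good action is chosen from the very first step."""
    plan = next(a for a in ACTIONS if model(a))
    return [plan] * steps

def success_rate(history):
    return sum(is_good(a) for a in history) / len(history)

rng = random.Random(0)
for name, hist in [("non-optimizer", non_optimizer(rng, 500)),
                   ("level 1", level1_optimizer(rng, 500)),
                   ("level 2", level2_optimizer(500))]:
    print(name, round(success_rate(hist), 2))
```

The difference between Level 1 and Level 2 shows up in the first few steps: the Level-1 agent has to get lucky once, while the Level-2 agent never acts without having already evaluated the consequence in its model.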

Second, when you run a program:

  • Algorithm Land is where you find abstract mathematical entities like "variables", "functions", etc.
  • Real World is that place with atoms and stuff.

Now, when you run a program, you can think of what's happening in Algorithm Land (e.g. a list of numbers is getting sorted) and what's happening in the Real World (e.g. transistors are switching on and off). It's really always going to be both at once.

And now let's simplify things greatly by putting aside the case of world-modeling programs, which have a (partial, low-resolution) copy of the Real World inside Algorithm Land. Instead, let's restrict our attention to a chess-playing program or any other non-world-modeling program.

Now, in this case, when we think about Level-2 optimization, the foresight and planning involved entail searching exclusively through causal pathways that are completely inside Algorithm Land. (Why? Because without a world model, it has no way to reason about Real-World causal pathways.) In this case, I say there isn't really anything much to worry about.
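
As a small illustration (mine, and using a toy game of Nim rather than chess), here is a foresight-based search whose entire search space is abstract game states. The rules assumed are: players alternate taking 1 to 3 stones, and whoever takes the last stone wins. The names `can_win` and `best_move` are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def can_win(stones):
    """True if the player to move can force a win with `stones` left.
    With 0 stones the player to move has already lost."""
    return any(not can_win(stones - take)
               for take in (1, 2, 3) if take <= stones)

def best_move(stones):
    """Look ahead through future game states and pick a winning move.
    Expects stones >= 1."""
    for take in (1, 2, 3):
        if take <= stones and not can_win(stones - take):
            return take  # leaves the opponent in a losing position
    return 1  # no winning move exists, so take one stone

print(best_move(5), best_move(7), can_win(4))
```

Every "consequence" this program considers is an integer number of stones. There is no variable anywhere that stands for the RAM, the operator, or the power grid, so there is nothing for its planning to grab onto outside the game.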

Why not worry? Think about classic weird AGI disaster scenarios. For example, the algorithm is optimizing for the "reward" value in register 94, so it hacks its RAM to overwrite the register with the biggest possible number, then seizes control of its building and the power grid to ensure that it won't get turned off, then starts building bigger RAMs, designing killer nanomachines, and on and on. Note that ALL those things (1) involve causal pathways in the Real World (even if the action and consequence are arguably in Algorithm Land) and (2) would be astronomically unlikely to occur by random chance (which is what happens without Level-2 optimization). (I won't say that nothing can go awry with Level-1 optimization—I have great respect for bacteria—but it's a much easier situation to keep under control than rogue Level-2 optimization through Real-World causal pathways.)

Again, things that happen in Algorithm Land are also happening in the Real World, but the mapping is kinda arbitrary. High-impact things in Algorithm Land are generally not high-impact things in the Real World, and vice versa. For example, using RAM to send out manipulative radio signals is high-impact in the Real World, but just a random meaningless series of operations in Algorithm Land. Conversely, an ingeniously clever chess move in Algorithm Land is just a random activation of transistors in the Real World.

(You do always get Level-1 optimization through Real-World causal pathways, with or without a world model. And you can get Level-2 optimization through Real-World causal pathways, but a necessary requirement seems to be an algorithm with a world-model and self-awareness, i.e. knowledge that there is a relation between things in Algorithm Land and things in the Real World.)

Comment by steve2152 on The Self-Unaware AI Oracle · 2019-07-23T18:28:43.636Z · score: 3 (2 votes) · LW · GW

A self-unaware system would not be capable of one particular type of optimization task:

Take real-world actions ("write bit 0 into register 11") on the basis of anticipating their real-world consequences (human will read this bit and then do such-and-such).

This is an example of an optimization task, and it's a very dangerous one. Maybe it's even the only type of really dangerous optimization task! (This might be an overstatement, not sure.) Not all optimization tasks are in this category, and a system can be intelligent by doing other types of optimization tasks.

A self-unaware system certainly is an optimizer in the sense that it does other (non-real-world) optimization tasks, in particular, finding the string of bits that would be most likely to follow a different string of bits on a real-world webpage.

As always, sorry if I'm misunderstanding you, thanks for your patience :-)

Comment by steve2152 on The Self-Unaware AI Oracle · 2019-07-23T18:04:10.673Z · score: 3 (2 votes) · LW · GW

I think we're on the same page! As I noted at the top, this is a brainstorming post, and I don't think my definitions are quite right, or that my arguments are airtight. The feedback from you and others has been super-helpful, and I'm taking that forward as I search for a more rigorous version of this, if it exists!! :-)