# Alignment as Translation

post by johnswentworth · 2020-03-19T21:40:01.266Z · score: 44 (15 votes) · LW · GW · 39 comments

## Contents

  Some Approaches
Default: Humans Translate
Humans Translate Using Better Tools
Examples + Interpolation
Incentives
AI Translates


Technology Changes Constraints argues that economic constraints are usually modular with respect to technology changes - so for reasoning about technology changes, it’s useful to cast them in terms of economic constraints. Two constraints we’ll talk about here:

• Compute - flops, memory, etc.
• Information - sensors, data, etc.

Thanks to ongoing technology changes, both of these constraints are becoming more and more slack over time - compute and information are both increasingly abundant and cheap.

Immediate question: what happens in the limit as the prices of both compute and information go to zero?

Essentially, we get omniscience: our software has access to a perfect, microscopically-detailed model of the real world. Computers have the memory and processing capability to run arbitrary queries on that model, and predictions are near-perfectly accurate (modulo quantum noise). This limit applies even without AGI - as compute and information become more abundant, our software approaches omniscience, even limiting ourselves to special-purpose reasoning algorithms.

Of course, AGI would presumably be closer to omniscience than non-AGI algorithms, at the same level of compute/information. It would be able to more accurately predict more things which aren’t directly observable via available sensors, and it would be able to run larger queries with the same amount of compute. (How much closer to omniscience an AGI would get is an open question, but it would at least not be any worse in a big-O sense.)

Next question: as compute and information constraints slacken, which constraints become taut? What new bottlenecks appear, for problems which were previously bottlenecked on compute/information?

To put it differently: if our software can run arbitrary queries on an accurate, arbitrarily precise low-level model of the physical world, what else do we need in order to get value out of that capability?

Well, mainly we need some way to specify what it is that we want. We need an interface.

Our highly accurate low-level world model can tell us anything about the physical world, but the things-we-want are generally more abstract than molecules/atoms/fields. Our software can have arbitrarily precise knowledge and predictive power on physical observables, but it still won’t have any notion that air-pressure-oscillations which sound like the word “cat” have something to do with the organs/cells/biomolecules which comprise a cat. It won’t have built-in any notion of “tree” or “rock” or “human” - using such high-level abstractions would only impede predictive power, when we could instead model the individual components of such high-level objects.

It’s the prototypical interface problem: the structure of a high-precision world-model generally does not match the structure of what-humans-want, or the structure of human abstractions in general. Someone/something has to translate between the structures in order to produce anything useful.

As I see it, this is the central problem of alignment.

## Some Approaches

### Default: Humans Translate

Without some scalable way to build high-level world models out of low-level world models, we constantly need to manually translate things-humans-want into low-level specifications. It’s an intellectual-labor-intensive and error-prone process; writing programs in assembly code is not just an analogy but an example. Even today’s “high-level programming languages” are much more structurally similar to assembly code than to human world-models - Python has no notion of “oak tree”.

An analogy: translating high-level structure into low-level specification the way we do today is like translating English into Korean by hand.

### Humans Translate Using Better Tools

It’s plausible (though I find it unlikely) that we could tackle the problem by building better tools to help humans translate from high-level to low-level - something like much-higher-level programming languages. I find it unlikely because we’d probably need major theoretical breakthroughs - for instance, how do I formally define “tree” in terms of low-level observables? Even if we had ways to do that, they’d probably enable easier strategies than building better programming languages.

Analogy: it’s like translating by hand from English to Korean, but with the assistance of a dictionary, spell-checker, grammar-checker, etc. But if we had an English-Korean dictionary, we'd probably be most of the way to automated translation anyway (in this respect, the analogy is imperfect).

### Examples + Interpolation

Another path which is plausible (though I find it unlikely) is something like programming-by-example - not unlike today’s ML. This seems unlikely to work from both an inside and outside view:

• Inside view: the whole problem in the first place is that low-level structure doesn’t match high-level structure, so there’s no reason to expect software systems to interpolate along human-intuitive dimensions.
• Outside view: programming-by-example (and today’s ML with it) is notoriously unreliable.

Examples alone aren’t enough to make software reliably carve reality at the same joints as humans. There probably are some architectures which would reliably carve at the same joints as humans - different humans tend to chunk the world into similar objects, after all. But figuring out such an architecture would take more than just throwing lots of data at the problem.

To put it differently: the way-in-which-we-want-things-translated is itself something which needs to be translated. A human’s idea-of-what-constitutes-a-“good”-low-level-specification-of-“oak tree” is itself pretty high-level and abstract; that idea itself needs to be translated into a low-level specification before it can be used. If we’re trying to use examples+interpolation, then the interpolation algorithm is our “specification” of how-to-translate… and it probably isn’t a very good translation of our actual high-level idea of how-to-translate.

Analogy: it’s like teaching English to Korean speakers by pointing to trees and saying “tree”, pointing to cars and saying “car”, etc… except that none of them actually realize they’re supposed to be learning another language. The Korean-language instructions they received were not actually a translation of the English explanation “learn the language that person is speaking”.

### Incentives

A small tweak to the previous approach: train a reinforcement learner.

The analogy: rather than giving our Korean-speakers some random Korean-language instructions, we don't give them any instructions - we just let them try things, and then pay them when they happen to translate things from English to Korean.

Problem: this requires some way to check that the translation was correct. Knowing what to incentivize is not any easier than specifying what-we-want to begin with. Rather than translating English-to-Korean, we’re translating English-to-incentives.

Now, there is a lot of room here for clever tricks. What if we verify the translation by having one group translate English-to-Korean, another group translate back, and reward both when the result matches the original? Or taking the Korean translation, giving it to some other Korean speaker, and seeing what they do? Etc. These are possible approaches to translating English into incentives, within the context of the analogy.
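As a toy sketch of the round-trip idea (not from the original post), the snippet below shows one way such a check can pass while the translation is still wrong. The dictionaries and the function names `round_trip_reward` and `actually_correct` are hypothetical stand-ins for the two groups of translators and for a ground truth that the incentive designer would not normally have.

```python
# Toy illustration: a round-trip check can reward a translation that is wrong.
# All data here is made up for the example.

# Group A translates English -> Korean, but has "tree" and "car" swapped.
# (나무 means "tree", 차 means "car", 고양이 means "cat".)
english_to_korean = {"tree": "차", "car": "나무", "cat": "고양이"}

# Group B translates Korean -> English and happens to share the same mistake,
# so it exactly inverts Group A's mapping.
korean_to_english = {korean: english for english, korean in english_to_korean.items()}

# Ground truth, which the incentive designer does not have direct access to.
correct_english_to_korean = {"tree": "나무", "car": "차", "cat": "고양이"}


def round_trip_reward(word: str) -> bool:
    """Reward both groups when translating there and back recovers the original."""
    return korean_to_english[english_to_korean[word]] == word


def actually_correct(word: str) -> bool:
    """Check against the ground truth (unavailable in the real problem)."""
    return english_to_korean[word] == correct_english_to_korean[word]


words = ["tree", "car", "cat"]
rewarded = [w for w in words if round_trip_reward(w)]
correct = [w for w in words if actually_correct(w)]

print("rewarded:", rewarded)
print("correct: ", correct)
```

Every word earns the reward, yet only "cat" is translated correctly: the incentive checks self-consistency between the two groups, which is not the same thing as agreement with what the English speaker meant.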

It’s possible in principle that translating what-humans-want into incentives is easier than translating into low-level specifications directly. However, if that’s the case, I have yet to see compelling evidence - attempts to specify incentives seem plagued by the same surprising corner cases and long tail of difficult translations as other strategies.

### AI Translates

This brings us to the obvious general answer: have the AI handle the translation from high-level structure to low-level structure. This is probably what will happen eventually, but the previous examples should make it clear why it’s hard: an explanation of how-to-translate must itself be translated. In order to make an AI which translates high-level things-humans-want into low-level specifications, we first need a low-level specification of the high-level concept “translate high-level things-humans-want into low-level specifications”.

Continuing the earlier analogy: we’re trying to teach English to a Korean speaker, but that Korean speaker doesn’t have any idea that they’re supposed to be learning another language. In order to get them to learn English, we first need to somehow translate something like “please learn this language”.

This is a significant reduction of the problem: rather than translating everything by hand all the time, we just need to translate the one phrase “please learn this language”, and then the hard part is done and we can just use lots of examples for the rest.

But we do have a chicken-and-egg problem: somehow, we need to properly translate that first phrase. Screw up that first translation, and nothing else will work. That part cannot be outsourced; the AI cannot handle the translation because it has no idea that that’s what we want it to do.

comment by rohinmshah · 2020-03-27T18:19:50.030Z · score: 4 (2 votes) · LW(p) · GW(p)

Planned summary for the Alignment Newsletter:

At a very high level, we can model powerful AI systems as moving closer and closer to omniscience. As we move in that direction, what becomes the new constraint on technology? This post argues that the constraint is _good interfaces_, that is, something that allows us to specify what the AI should do. As with most interfaces, the primary challenge is dealing with the discrepancy between the user's abstractions (how humans think about the world) and the AI system's abstractions, which could be very alien to us (e.g. perhaps the AI system uses detailed low-level simulations). The author believes that this is the central problem of AI alignment: how to translate between these abstractions in a way that accurately preserves meaning.
The post goes through a few ways that we could attempt to do this translation, but all of them seem to only reduce the amount of translation that is necessary: none of them solve the chicken-and-egg problem of how you do the very first translation between the abstractions.

Planned opinion:

I like this view on alignment, but I don't know if I would call it the _central_ problem of alignment. It sure seems important that the AI is _optimizing_ something: this is what prevents solutions like "make sure the AI has an undo button / off switch", which would be my preferred line of attack if the main source of AI risk were bad translations between abstractions. There's a longer discussion on this point here [AF(p) · GW(p)].

(I might change the opinion based on further replies to my other comment.)

comment by johnswentworth · 2020-03-27T19:16:17.071Z · score: 4 (2 votes) · LW(p) · GW(p)

Endorsed; that definitely captures the key ideas.

If you haven't already, you might want to see my answer to Steve's comment, on why translation to low-level structure is the right problem to think about even if the AI is using higher-level models.

comment by rohinmshah · 2020-03-27T20:30:25.185Z · score: 4 (2 votes) · LW(p) · GW(p)

I did see that answer and pretty strongly agree with it; the "low-level structure" part of my summary was meant to be an example, not a central case. To make this clearer, I changed

which could potentially be detailed accurate low-level simulations

to

which could be very alien to us (e.g. perhaps the AI system uses detailed low-level simulations)

comment by rohinmshah · 2020-03-26T03:47:42.356Z · score: 4 (2 votes) · LW(p) · GW(p)

So it seems like this framing of alignment removes the notion of the AI "optimizing for something" or "being goal-directed". Do you endorse dropping that idea?

With just this general argument, I would probably not argue for AI risk -- if I had to argue for it, the argument would go "we ask the AI to do something, this gets mistranslated and the AI does something else with weird consequences, maybe the weird consequences include extinction", but it sure seems like as it starts doing the "something else" we would e.g. turn it off.

comment by johnswentworth · 2020-03-26T17:20:08.154Z · score: 6 (3 votes) · LW(p) · GW(p)

Starting point: the problem which makes AI alignment hard is not the same problem which makes AI dangerous. This is the capabilities/alignment distinction: AI with extreme capabilities is dangerous; aligning it is the hard part.

So it seems like this framing of alignment removes the notion of the AI "optimizing for something" or "being goal-directed". Do you endorse dropping that idea?

Anything with extreme capabilities is dangerous, and needs to be aligned. This applies even outside AI - e.g. we don't want a confusing interface on a nuclear silo. Lots of optimization power is a sufficient condition for extreme capabilities, but not a necessary condition.

Here's a plausible doom scenario without explicit optimization. Imagine an AI which is dangerous in the same way as a nuke is dangerous, but more so: it can make large irreversible changes to the world too quickly for anyone to stop it. Maybe it's capable of designing and printing a supervirus (and engineered bio-offense is inherently easier than engineered bio-defense); maybe it's capable of setting off all the world's nukes simultaneously; maybe it's capable of turning the world into grey goo.

If that AI is about as transparent as today's AI, and does things the user wasn't expecting about as often as today's AI, then that's not going to end well.

Now, there is the counterargument that this scenario would produce a fire alarm, but there's a whole host of ways that could fail:

• The AI is usually very useful, so the risks are ignored
• Errors are patched rather than fixing the underlying problem
• Really big errors turn out to be "easier" than small errors - i.e. high-to-low level translations are more likely to be catastrophically wrong than mildly wrong
• It's hard to check in testing whether there's a problem, because errors are rare and/or don't look like errors at the low-level (and it's hard/expensive to check results at the high-level)
• In the absence of optimization pressure, the AI won't actively find corner-cases in our specification of what-we-want, so it might actually be more difficult to notice problems ahead-of-time
• ...

Do you endorse dropping that idea?

I don't endorse dropping the AI-as-optimizer idea entirely. It is definitely a sufficient condition for AI to be dangerous, and a very relevant sufficient condition. But I strongly endorse the idea that optimization is not a necessary condition for AI to be dangerous. Tool AI can be plenty dangerous if it's capable of making large, fast, irreversible changes to the world, and the alignment problem is still hard for that sort of AI.

comment by rohinmshah · 2020-03-26T19:09:57.050Z · score: 4 (2 votes) · LW(p) · GW(p)
Tool AI can be plenty dangerous if it's capable of making large, fast, irreversible changes to the world, and the alignment problem is still hard for that sort of AI.

I definitely agree with that characterization. I think the solutions I would look for would be quite different though: they would look more like "how do I ensure that the AI system has an undo button" and "how do I ensure that the AI system does things slowly", similarly to how with nuclear power plants (I assume) there are (possibly redundant) mechanisms that ensure you can turn off the power plant.

Of course these solutions are also subject to the same translation problem, but it seems plausible to me that that translation problem is easier to solve, relative to solving translation in full generality.

AI-as-optimizer would suggest that even if the translation problem were solved for the particular things I mentioned, it still might not be enough, because e.g. the AI might deliberately prevent me from pressing the undo button.

You could say something like "an AI that can enact large irreversible changes might form a plan where the large irreversible change starts with disabling the undo button", but then it sort of feels like we're bringing back in the idea of optimization. Maybe that's fine, we're pretty confused about optimization anyway.

comment by johnswentworth · 2020-03-26T20:10:02.334Z · score: 4 (2 votes) · LW(p) · GW(p)
"how do I ensure that the AI system has an undo button" and "how do I ensure that the AI system does things slowly"

I don't think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of "slow" is roughly "a human has time to double-check", which immediately makes things very expensive.

Even if we abandon economic competitiveness, I doubt that slow+reversible makes the translation problem all that much easier (though it would make the AI at least somewhat less dangerous, I agree with that). It's probably somewhat easier - having a few cycles of feedback seems unlikely to make the problem harder. But if e.g. we're originally training the AI via RL, then slow+reversible basically just adds a few more feedback cycles after deployment; if millions or billions of RL cycles didn't solve the problem, then adding a handful more at the end seems unlikely to help much (though an argument could be made that those last few are higher-quality). Also, there's still the problem of translating a human's high-level notion of "reversible" into a low-level notion of "reversible".

Taking a more outside view... restrictions like "make it slow and reversible" feel like patches which don't really address the underlying issues. In general, I'd expect the underlying issues to continue to manifest themselves in other ways when patches are applied. For instance, even with slow & reversible changes, it's still entirely plausible that humans don't stop something bad because they don't understand what's going on in enough detail - that's a typical scenario in the "translation problem" worldview.

Zooming out even further...

I think the solutions I would look for would be quite different though...

I think what's driving this intuition is that you're looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities. I expect that such strategies, in general, will run into similar problems to those mentioned above:

• Capabilities which make an AI economically valuable are often capabilities which make it dangerous. Limit capabilities for safety, and the AI won't be economically competitive.
• Choosing which capabilities are "dangerous" is itself a problem of translating what-humans-want into some other framework, and is subject to the usual problems: simple solutions will be patches which don't address everything, there will be a long tail of complicated corner cases, etc.

comment by rohinmshah · 2020-03-27T17:41:07.120Z · score: 4 (2 votes) · LW(p) · GW(p)
I think what's driving this intuition is that you're looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities.

Yup, that is definitely the intuition.

Taking a more outside view... restrictions like "make it slow and reversible" feel like patches which don't really address the underlying issues.

Agreed.

In general, I'd expect the underlying issues to continue to manifest themselves in other ways when patches are applied.

I mean, they continue to manifest in the normal sense, in that when you say "cure cancer", the AI system works on a plan to kill everyone; you just now get to stop the AI system from actually running that plan.

For instance, even with slow & reversible changes, it's still entirely plausible that humans don't stop something bad because they don't understand what's going on in enough detail - that's a typical scenario in the "translation problem" worldview.
[...]
Also, there's still the problem of translating a human's high-level notion of "reversible" into a low-level notion of "reversible".
[...]
simple solutions will be patches which don't address everything, there will be a long tail of complicated corner cases, etc.

All of this is true; I'm more arguing that slow & reversible eliminates ~95% of the problems, and so if it's easier to do than "full" alignment, then it probably becomes the best thing to do on the margin.

I don't think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of "slow" is roughly "a human has time to double-check", which immediately makes things very expensive.

I'd expect we'd be able to solve this over time, e.g. first you use your AI system for simple tasks which you can check quickly, then as you start trusting that you've worked out the bugs for those tasks, you let the AI do them faster / without oversight, and move on to more complicated tasks, etc.

(This is a much more testing + engineering based approach; the standard argument against such an approach is that it fails in the presence of optimization.)

It certainly does mean you take a hit to economic competitiveness; I mostly think the hit is not that large and is something we could pay.

comment by johnswentworth · 2020-03-27T19:00:51.600Z · score: 6 (3 votes) · LW(p) · GW(p)

I agree with most of this reasoning. I think my main point of departure is that I expect most of the value is in the long tail, i.e. eliminating 95% of problems generates <10% or maybe even <1% of the value. I expect this both in the sense that eliminating 95% of problems unlocks only a small fraction of economic value, and in the sense that eliminating 95% of problems removes only a small fraction of risk. (For the economic value part, this is mostly based on industry experience trying to automate things.)

Optimization is indeed the standard argument for this sort of conclusion, and is a sufficient condition for eliminating 95% of problems to have little impact on risk. But again, it's not a necessary condition - if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn't really decreased. And that's exactly the sort of situation I expect when viewing translation as the central problem: illusion of transparency is exactly the sort of thing which doesn't seem like a problem 95% of the time, right up until you realize that everything was completely broken all along.

Anyway, sounds like value-in-the-tail is a central crux here.

comment by rohinmshah · 2020-03-27T20:44:13.250Z · score: 4 (2 votes) · LW(p) · GW(p)
Anyway, sounds like value-in-the-tail is a central crux here.

Seems somewhat right to me, subject to caveat below.

it's not a necessary condition - if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn't really decreased.

An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you've translated better and now you've eliminated 99% of the risk, and iterating this process you get to effectively no ongoing risk. There is of course risk during the iteration, but that risk can be reasonably small.

A similar argument applies to economic competitiveness: yes, your first agent is pretty slow relative to what it could be, but you can make it faster and faster over time, so you only lose a lot of value during the first few initial phases.

(For the economic value part, this is mostly based on industry experience trying to automate things.)

I have the same intuition, and strongly agree that usually most of the value is in the long tail. The hope is mostly that you can actually keep making progress on the tail as time goes on, especially with the help of your newly built AI systems.

comment by johnswentworth · 2020-03-27T21:43:41.248Z · score: 4 (2 votes) · LW(p) · GW(p)
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you've translated better and now you've eliminated 99% of the risk...

I don't see how this ever actually gets around the chicken-and-egg problem.

An analogy: we want to translate from English to Korean. We first obtain a translation dictionary which is 95% accurate, then use it to ask our Korean-speaking friend to help out. Problem is, there's a very important difference between very similar translations of "help me translate things" - e.g. consider the difference between "what would you say if you wanted to convey X?" and "what should I say if I want to convey X?", when giving instructions to an AI. Both of those would produce very similar results, right up until everything went wrong. (Let me know if this analogy sounds representative of the strategies you imagine.)

If you do manage to get that first translation exactly right, and successfully ask your friend for help, then you're good - similar to the "translate how-to-translate" strategy from the OP. And with a 95% accurate dictionary, you might even have a decent chance of getting that first translation right. But if that first translation isn't perfect, then you need some way to find that out safely - and the 95% accurate dictionary doesn't make that any easier.

Another way to look at it: the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further. We need some other way to get at the ground truth, in order to actually reduce the error rate. If we know how to convey what-we-want with 95% accuracy, then we need some other way to get at the ground truth of translation in order to increase that accuracy further.

comment by rohinmshah · 2020-03-28T00:02:08.436Z · score: 4 (2 votes) · LW(p) · GW(p)
Let me know if this analogy sounds representative of the strategies you imagine.

Yeah, it does. I definitely agree that this doesn't get around the chicken-and-egg problem, and so shouldn't be expected to succeed on the first try. It's more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.

the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further.

I think you get "ground truth data" by trying stuff and seeing whether or not the AI system did what you wanted it to do.

(This does suggest that you wouldn't ever be able to ask your AI system to do something completely novel without having a human along to ensure it's what we actually meant, which seems wrong to me, but I can't articulate why.)

comment by johnswentworth · 2020-03-28T00:29:33.346Z · score: 4 (2 votes) · LW(p) · GW(p)
I think you get "ground truth data" by trying stuff and seeing whether or not the AI system did what you wanted it to do.

That's the sort of strategy where illusion of transparency is a big problem, from a translation point of view. The difficult cases are exactly the cases where the translation usually produces the results you expect, but then produces something completely different in some rare cases.

Another way to put it: if we're gathering data by seeing whether the system did what we wanted, then the long tail problem works against us pretty badly. Those rare tail-cases are exactly the cases we would need to observe in order to notice problems and improve the system. We're not going to have very many of them to work with. Ability to generalize from small data sets becomes a key capability, but then we need to translate how-to-generalize in order for the AI to generalize in the ways we want (this gets at the can't-ask-the-AI-to-do-anything-novel problem).
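To put rough numbers on how few tail cases testing would surface, here is a minimal sketch. It assumes (as a simplification that is not in the original comment) that each test is an independent trial with a fixed per-trial failure probability; the function names and the example figures are hypothetical.

```python
def prob_observe_failure(p_fail: float, n_trials: int) -> float:
    """Probability of seeing at least one failure in n independent trials.

    Assumes every trial fails independently with the same probability p_fail.
    """
    return 1.0 - (1.0 - p_fail) ** n_trials


def expected_failures(p_fail: float, n_trials: int) -> float:
    """Expected number of failures observed under the same assumption."""
    return p_fail * n_trials


# Hypothetical figures: a one-in-a-million failure mode and ten thousand tests.
p_fail = 1e-6
n_trials = 10_000
print(f"P(at least one failure seen): {prob_observe_failure(p_fail, n_trials):.4f}")
print(f"Expected failures seen:       {expected_failures(p_fail, n_trials):.4f}")
```

Under these made-up numbers the chance of observing even a single tail case during testing is about 1%, so there would typically be no examples at all to learn from.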

comment by johnswentworth · 2020-03-27T22:01:24.366Z · score: 3 (2 votes) · LW(p) · GW(p)

(The other comment is my main response, but there's a possibly-tangential issue here.)

In a long-tail world, if we manage to eliminate 95% of problems, then we generate maybe 10% of the value. So now we use our 10%-of-value product to refine our solution. But it seems rather optimistic to hope that a product which achieves only 10% of the value gets us all the way to a 99% solution. It seems far more likely that it gets to, say, a 96% solution. That, in turn, generates maybe 15% of the value, which in turn gets us to a 96.5% solution, and...

Point being: in the long-tail world, it's at least plausible (and I would say more likely than not) that this iterative strategy doesn't ever converge to a high-value solution. We get fancier and fancier refinements with decreasing marginal returns, which never come close to handling the long tail.
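The decreasing-returns picture can be made concrete with a small sketch. The 95% starting point and the first two steps (96%, 96.5%) come from the comment above; the assumption that every later round keeps gaining half as much as the previous one is purely illustrative, as are the function names.

```python
def coverage_after(rounds: int, start: float = 0.95,
                   first_gain: float = 0.01, ratio: float = 0.5) -> float:
    """Fraction of problems solved after a number of refinement rounds.

    Illustrative assumption: each round gains `ratio` times the previous gain.
    """
    coverage = start
    gain = first_gain
    for _ in range(rounds):
        coverage += gain
        gain *= ratio
    return coverage


def coverage_limit(start: float = 0.95, first_gain: float = 0.01,
                   ratio: float = 0.5) -> float:
    """Limit of the geometric series: start + first_gain / (1 - ratio)."""
    return start + first_gain / (1.0 - ratio)


for rounds in (1, 2, 5, 20):
    print(rounds, round(coverage_after(rounds), 6))
print("limit:", round(coverage_limit(), 6))
```

Under this particular assumption the process stalls at 97% no matter how many rounds are run. A different assumption about per-round gains could converge to 100%, which is why the shape of the returns curve matters so much to the disagreement.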

Now, under this argument, it's still a fine idea to try the iterative strategy. But you wouldn't want to bet too heavily on its success, especially without a reliable way to check whether it's working.

comment by rohinmshah · 2020-03-27T23:52:57.990Z · score: 5 (3 votes) · LW(p) · GW(p)

Yeah, this could be a way that things are. My intuition is that it wouldn't be this way, but I don't have any good arguments for it.

comment by TAG · 2020-03-27T13:43:34.820Z · score: 1 (1 votes) · LW(p) · GW(p)
I don't think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of "slow" is roughly "a human has time to double-check", which immediately makes things very expensive.

There's already an answer to that: you separate "fast" from "unpredictable". The AI that does things fast is not the AI that engages in out-of-the-box thinking.

comment by johnswentworth · 2020-03-27T16:18:54.196Z · score: 2 (1 votes) · LW(p) · GW(p)

Predictable low-level behavior is not the same as predictable high-level behavior. When I write or read python code, I can have a pretty clear idea of what every line does in a low-level sense, but still sometimes be surprised by high-level behavior of the code.
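A small illustration of that gap, using an example that is not from the original comment: every line below is easy to read on its own, yet the overall behavior may not be what the writer had in mind. The function name `make_grid` is hypothetical.

```python
def make_grid(rows: int, cols: int) -> list:
    # Low-level reading: build one row of zeros...
    row = [0] * cols
    # ...then make a list containing that row `rows` times.
    return [row] * rows


grid = make_grid(2, 3)
grid[0][0] = 1  # Intended high-level meaning: "set the top-left cell".

# Because the outer list holds the same row object twice, both rows change.
print(grid)                # [[1, 0, 0], [1, 0, 0]]
print(grid[0] is grid[1])  # True
```

Each statement does exactly what its low-level description says; the surprise is only visible at the level of "a grid with independent rows", which the code never specified.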

We still need to translate what-humans-want into a low-level specification. "Making it predictable" at a low-level doesn't really get us any closer to predictability at the high-level (at least in the cases which are actually difficult in the first place). "Making it predictable" at a high-level requires translating high-level "predictability" into some low-level specification, which just brings us back to the original problem: translation is hard.

comment by TAG · 2020-03-28T17:00:16.310Z · score: 1 (1 votes) · LW(p) · GW(p)

I am assuming that the AI that engages in out-of-the-box thinking is not fast, and that the conjunction of fast *and* unpredictable is the central problem.

The market will demand AI that's faster than humans, and at least as capable of creative, unpredictable thinking. However, the same AI does not have to be both. This approach to AI safety is copied from a widespread organisational principle, where the higher levels do the abstract strategic thinking, the least predictable stuff, the middle levels do the concrete, tactical thinking and the lowest levels do what they are told. The fastest and most fine-grained actions are at the lowest level. The higher level can only communicate with the lower levels by communicating an amended strategy or policy: they are not able to interrupt fine-grained decisions, and only hear about fine-grained actions after they have happened. I have given an abstract description of this organising principle because there are multiple concrete examples: large businesses, militaries, and the human brain/CNS. Businesses already use fast but not very flexible systems to do things faster than humans, notably in high frequency trading. The question is whether more advanced AIs will be responsible for fine-grained trading decisions, the all-in-one approach, or whether advanced AI will substitute for or assist business analysts and market strategists.

A standard objection to Tool AI is that having a human check all the TAI's decisions would slow things down too much. The above architecture allows an alternative, where human checking occurs between levels. In particular, communication from the highest level to the lower ones is slow anyway. The main requisite for this approach to AI safety is a human-readable communications protocol.

"Making it predictable" at a high-level requires translating high-level "predictability" into some low-level specification, which just brings us back to the original problem: translation is hard.

If you are checking your high level AI as you go along, you need a high level language that is human comprehensible.

comment by johnswentworth · 2020-03-28T23:50:32.282Z · score: 2 (1 votes) · LW(p) · GW(p)

I'm pretty sure none of this actually affects what I said: the low-level behavior still needs to produce results which are predictable to humans in order for predictability to be useful, and that's still hard.

The problem is that making an AI predictable to a human is hard. This is true regardless of whether or not it's doing any outside-the-box thinking. Having a human double-check the instructions given to a fast low-level AI does not make the problem any easier; the low-level AI's behavior still has to be understood by a human in order for that to be useful.

As you say toward the end, you'd need something like a human-readable communications protocol. That brings us right back to the original problem: it's hard to translate between humans' high-level abstractions and low-level structure. That's why AI is unpredictable to humans in the first place.

comment by TAG · 2020-03-30T17:55:10.830Z · score: 1 (1 votes) · LW(p) · GW(p)

If you know in general that a low level AI will follow the rules it has been given, you don't need to keep re-checking.

comment by johnswentworth · 2020-03-30T20:11:32.293Z · score: 2 (1 votes) · LW(p) · GW(p)

The rules it's given are, presumably, at a low level themselves. (Even if that's not the case, the rules it's given are definitely not human-intelligible unless we've already solved the translation problem in full.)

The question is not whether the low-level AI will follow those rules, the question is what actually happens when something follows those rules. A python interpreter will not ever deviate from the simple rules of python, yet it still does surprising-to-a-human things all the time. The problem is accurately translating between human-intelligible structure and the rules given to the AI.

The problem is not that the AI might deviate from the given rules. The problem is that the rules don't always mean what we want them to mean.
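To make the Python point concrete, here is a minimal sketch (not from the original comment; the function and variable names are made up for illustration). In both cases the interpreter follows its documented rules exactly, yet the high-level result is not what many readers would predict from a line-by-line reading.

```python
# Case 1: a default argument is evaluated once, at function definition time.
def append_item(item, bucket=[]):
    # Every call that omits `bucket` shares the same list object.
    bucket.append(item)
    return bucket

first = append_item("a")
second = append_item("b")  # a reader might expect ["b"] here
print(first is second, second)

# Case 2: closures look up variables when called, not when created.
callbacks = [lambda: i for i in range(3)]
results = [f() for f in callbacks]  # a reader might expect [0, 1, 2]
print(results)
```

Neither case involves the interpreter deviating from its rules; the surprise comes entirely from the gap between what the author meant and what the rules say.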

comment by TAG · 2020-04-08T20:51:45.808Z · score: 1 (1 votes) · LW(p) · GW(p)

The rules it’s given are, presumably, at a low level themselves.

The rules that the low level AI runs on could be medium level. There is no point in giving it very low level rules, since its job is to fill in the details. But the point is that I am stipulating that the rules should be high level enough to be human-readable.

The question is not whether the low-level AI will follow those rules, the question is what actually happens when something follows those rules. A python interpreter will not ever deviate from the simple rules of python, yet it still does surprising-to-a-human things all the time.

But the world hasn't ended. A python interpreter doesn't do surprisingly intelligent things, because it is not intelligent.

The problem is not that the AI might deviate from the given rules. The problem is that the rules don’t always mean what we want them to mean.

In your framing of the problem, you create one superpowerful AI that has to be programmed perfectly, which is impossible. In my solution, you reduce the problem to more manageable chunks. My solution is already partially implemented.

comment by johnswentworth · 2020-04-08T23:25:09.881Z · score: 2 (1 votes) · LW(p) · GW(p)
But the point is that I am stipulating that the rules should be high level enough to be human-readable.

If the rules are high level enough to be human readable, then translating them into something a computer can run while still maintaining the original intent is hard. That's basically the whole alignment problem. If an AI is doing that translation, then writing/training that AI is as hard as the whole alignment problem.

A python interpreter doesn't do surprisingly intelligent things, because it is not intelligent.

If a system is doing large, fast, irreversible things, then it does not matter whether those things are surprisingly intelligent. If they're surprising, then that's sufficient for it to be a problem.

In your framing of the problem, you create one superpowerful AI that has to be programmed perfectly, which is impossible.

I'm not sure what gave you that impression, but I definitely do not intend to assume any of that.

comment by TAG · 2020-04-09T10:30:20.877Z · score: 1 (1 votes) · LW(p) · GW(p)

If the rules are high level enough to be human readable, then translating them into something a computer can run while still maintaining the original intent is hard.

It's not harder than AGI, because NL is a central part of AGI.

That’s basically the whole alignment problem.

No it isn't. You can have systems that do what they are told without having any notion of values and preferences. The higher level systems need goals because they are defining strategy, but only the higher level ones.

If a system is doing large, fast, irreversible things, then it does not matter whether those things are surprisingly intelligent. If they’re surprising, then that’s sufficient for it to be a problem.

Yes, but that's a problem we already have, with solutions we already have. For instance, high frequency trading systems can be shut down [automatically] if the market moves too much.

comment by johnswentworth · 2020-04-09T17:18:38.745Z · score: 4 (2 votes) · LW(p) · GW(p)
Yes, but that's a problem we already have, with solutions we already have.

It is a problem we already have, but the solutions we already have are all based on the assumption that either (a) we know in advance what kind of problems can happen, or (b) the problem doesn't kill us all in one shot. For instance, in your HFT system shutdown example, we already know that "market moves too much" is something which makes a lot of HFT systems not work very well. But how did we learn that? Either we had a prior idea of what problems could happen (implying some transparency of the system), or the problem happened at least once and we learned from that (implying it didn't kill us the first time - see e.g. Knight Capital).

With AI, it's the same old problem, but on hard mode (i.e. the system is very opaque) and high stakes (i.e. we don't necessarily survive the first big mistake). That's exactly the sort of scenario where our current solutions do not work.
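As a rough illustration of the HFT shutdown example, here is a toy circuit breaker. It is only a sketch under stated assumptions: the class name, the 5% threshold, and the price series are all hypothetical, and real exchange or trading-firm safeguards are more involved. The point it shows is that the check encodes exactly one failure mode that somebody already knew to look for.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    """Toy guard: halt trading once price moves too far from a reference."""
    max_move: float          # hypothetical threshold, as a fraction of the reference price
    reference_price: float
    tripped: bool = False

    def observe(self, price: float) -> bool:
        """Return True if trading may continue after seeing `price`."""
        move = abs(price - self.reference_price) / self.reference_price
        if move > self.max_move:
            # Once tripped, the breaker stays tripped until a human resets it.
            self.tripped = True
        return not self.tripped


breaker = CircuitBreaker(max_move=0.05, reference_price=100.0)
allowed = [breaker.observe(p) for p in [100.5, 101.0, 93.0, 100.0]]
print(allowed)
```

Anything that goes wrong without moving the price past `max_move` passes straight through this guard, which is the sense in which such solutions depend on knowing the failure mode in advance.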

It's not harder than AGI, because NL is a central part of AGI.

NL? I'm not familiar with this acronym. Also I said it's as hard as alignment, not as hard as AGI, in case that's relevant.

No it isn't. You can have systems that do what they are told without having any notion of values and preferences. The higher level systems need goals because they are defining strategy, but only the higher level ones.

I'm not even convinced that higher-level systems necessarily need goals. Pure goal-free tool AI is one possible path; the OP was written to be agnostic to such considerations.

Indeed, that's a big part of why I say translation is the central piece of the alignment problem: it's the piece that's agnostic. It's the piece that has to be there, in every scheme, under a wide range of assumptions about how the world works. Tool AI? Still needs to solve the translation problem in order to be safe and useful, even without any notion of values or preferences. Utility-maximizing AI? Needs to solve the translation problem in order to be safe and useful. Hierarchical scheme? Translation still needs to be handled somewhere in order to be safe and useful. Humans-consulting-humans or variations thereof? Full system needs to solve the translation problem in order to be safe and useful. Etc.

comment by Vaniver · 2020-04-09T17:24:10.951Z · score: 4 (2 votes) · LW(p) · GW(p)

NL? I'm not familiar with this acronym. Also I said it's as hard as alignment, not as hard as AGI, in case that's relevant.

Presumably "natural language", which often gets called NLP for "natural language processing" in AI.

I think the right response there is something like "suppose you have an AGI that can understand what a human means as well as another human does; now you still have all the difficulty of interpretation that makes law a complicated and contentious field." It'd be nice to be able to write a Constitution and recognize it after the AI has thought about it while having adversarial pressure on how to interpret it for 300 years, for example.

comment by steve2152 · 2020-03-20T12:03:44.294Z · score: 4 (2 votes) · LW(p) · GW(p)

I'm not sure why your default assumption is that the AGI's understanding of the world is at a "low level". My default assumption would be that it would develop a predictive world-model with entities that are at many different levels at once, sorta like humans do. (Or is that just a toy example to illustrate what you're talking about?)

comment by johnswentworth · 2020-03-20T17:51:18.339Z · score: 4 (2 votes) · LW(p) · GW(p)

I do expect that systems trained with limited information/compute will often learn multi-level models. That said, there's a few reasons why low-level is still the right translation target to think about.

First, there's the argument from the beginning of the OP: in the limit of abundant information & compute, there's no need for multi-level models; just directly modelling the low-level will have better predictive power. That's a fairly general argument, which applies even beyond AI, so it's useful to keep in mind.

But the main reason to treat low-level as the translation target is: assuming an AI does use high-level models, translating into those models directly will only be easier than translating into the low level to the extent that the AI's high-level models are similar to a human's high-level models. We don't have any reason to expect AI to use similar abstraction levels as humans except to the extent that those abstraction levels are determined by the low-level structure. In studying how to translate our own high-level models into low-level structure, we also learn when and to what extent an AI is likely to learn similar high-level structures, and what the correspondence looks like between ours and theirs.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-03-19T23:21:23.617Z · score: 4 (2 votes) · LW(p) · GW(p)

"What if we verify the translation by having one group translate English-to-Korean, another group translate back, and reward both when the result matches the original?"

This is a fun idea. Does it work in practice for machine translation?

In the AI safety context, perhaps it would look like: A human gives an AI in a game world some instructions. The AI then goes and does stuff in the game world, and another AI looks at it and reports back to the human. The human then decides whether the report is sufficiently similar to the instructions that both AIs deserve reward.

I feel like eventually this would reach a bad equilibrium where the acting-AI just writes out the instructions somewhere and the reporting-AI just reports what they see written.
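A toy version of that worry, as a sketch only: the "game world" here is a plain dictionary and all function names are hypothetical. Both the intended pair of agents and the degenerate pair earn full reward under the round-trip check, even though only the first pair actually does the task.

```python
def round_trip_reward(instructions, act, report):
    """Reward 1.0 when the report matches the original instructions exactly."""
    world_state = act(instructions)
    return 1.0 if report(world_state) == instructions else 0.0


# Intended behavior: the actor changes the world, the reporter describes it.
def honest_act(instructions):
    return {"door_open": instructions == "open the door", "note": None}


def honest_report(world_state):
    return "open the door" if world_state["door_open"] else "do nothing"


# Degenerate behavior: the actor just writes the instructions down,
# and the reporter reads them back without anything being done.
def lazy_act(instructions):
    return {"door_open": False, "note": instructions}


def lazy_report(world_state):
    return world_state["note"]


honest = round_trip_reward("open the door", honest_act, honest_report)
lazy = round_trip_reward("open the door", lazy_act, lazy_report)
print(honest, lazy, lazy_act("open the door")["door_open"])
```

The reward signal cannot tell the two pairs apart, since it only ever compares the final report against the original instructions.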

comment by steve2152 · 2020-03-20T11:55:24.387Z · score: 6 (3 votes) · LW(p) · GW(p)

This is a fun idea. Does it work in practice for machine translation?

I still find it mind-blowing, but unsupervised machine translation is a thing.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-03-20T13:36:45.783Z · score: 4 (2 votes) · LW(p) · GW(p)

Holy shit, that's awesome. I wonder if it would work to figure out what dolphins, whales, etc. are saying.

comment by Vaniver · 2020-03-26T20:20:02.597Z · score: 2 (1 votes) · LW(p) · GW(p)

I think you run into a problem that most animal communication is closer to a library of different sounds, each of which maps to a whole message, than it is something whose content is determined by internal structure, so you don't have the sort of corpus you need for unsupervised learning (while you do have the ability to do supervised learning).

comment by TAG · 2020-03-28T17:06:17.104Z · score: 1 (1 votes) · LW(p) · GW(p)

>Thanks to ongoing technology changes, both of these constraints are becoming more and more slack over time - compute and information are both increasingly abundant and cheap.

>Immediate question: what happens in the limit as the prices of both compute and information go to zero?

>Essentially, we get omniscience: our software has access to a perfect, microscopically-detailed model of the real world.

Nope. A finite-sized computer cannot contain a fine-grained representation of the entire universe. Note that while the *marginal* cost of processing and storage might approach zero, that doesn't mean that you can have infinite computers for free, because marginal costs rise with scale. It would be extremely *expensive* to build a planet-sized computer.

comment by johnswentworth · 2020-03-30T20:16:10.476Z · score: 2 (1 votes) · LW(p) · GW(p)
A finite sized computer cannot contain a fine-grained representation of the entire universe.

$e^{-x}$ cannot ever be zero for finite $x$, yet it approaches zero in the limit of large $x$. The OP makes exactly the same sort of claim: our software approaches omniscience in the limit.

comment by TAG · 2020-03-31T10:50:09.874Z · score: 1 (1 votes) · LW(p) · GW(p)

It takes more than one atom to represent one atom computationally, so the limit can't be reached. Really, the issue is going beyond human cognitive limitations.

comment by johnswentworth · 2020-03-31T17:32:34.393Z · score: 2 (1 votes) · LW(p) · GW(p)

Of course the limit can't be reached, that's the entire reason why people use the phrase "in the limit".

comment by TAG · 2020-04-01T08:52:50.266Z · score: 1 (1 votes) · LW(p) · GW(p)

But it can't be approached like $e^{-x}$ either, because the marginal cost of hardware starts to rise once you get low on resources.

Edit:

Exponential decay looks like this

Whereas the marginal cost curve looks like this

comment by johnswentworth · 2020-04-01T17:12:18.414Z · score: 2 (1 votes) · LW(p) · GW(p)

That's a marginal cost curve at a fixed time. Its shape is not directly relevant to the long-run behavior; what's relevant is how the curve moves over time. If any fixed quantity becomes cheaper and cheaper over time, approaching (but never reaching) zero as time goes on, then the price goes to zero in the limit.

Consider Moore's law, for example: the marginal cost curve for compute looks U-shaped at any particular time, but over time the cost of compute falls like $e^{-kt}$, with $k$ around ln(2)/(18 months).
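A quick numerical check of that decay rate, as a sketch: it assumes a clean 18-month halving time and measures cost relative to today (the function name is made up for illustration). With $k = \ln(2)/18$ per month, the relative cost halves every 18 months and keeps shrinking toward zero without ever reaching it.

```python
import math

HALVING_MONTHS = 18.0                # assumed halving time from the comment above
k = math.log(2) / HALVING_MONTHS     # decay rate, per month


def relative_cost(months):
    """Cost of a fixed amount of compute, relative to its cost at t = 0."""
    return math.exp(-k * months)


after_18_months = relative_cost(18)    # one halving
after_15_years = relative_cost(180)    # ten halvings, i.e. 2**-10
print(after_18_months, after_15_years)
```

The value is strictly positive for every finite time, which is the same sense of "in the limit" used earlier in the thread.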

comment by TAG · 2020-04-01T17:42:28.675Z · score: 1 (1 votes) · LW(p) · GW(p)

Until you hit a hard limit, like lack of resources.