Contest: $1,000 for good questions to ask to an Oracle AI

post by Stuart_Armstrong · 2019-07-01T14:39:25.170Z · score: 58 (24 votes) · LW · GW · 117 comments

Contents

  The contest
    Oracles
    Getting useful answers
    Your better questions
    Questions and criteria
117 comments

The contest

I'm offering $1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: they allow us to get information out of the Oracle without increasing risk.

To enter, put your suggestion in the comments below. The contest ends at the end[1] of the 31st of August, 2019.

Oracles

A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.

This is, of course, not safe in general; an Oracle AI can influence the world through the contents of its answers [LW · GW], allowing it to potentially escape the sandbox.

Two of the safest designs seem to be the counterfactual Oracle and the low bandwidth Oracle. These are detailed here, here [LW · GW], and here [LW · GW], but in short:

  • The counterfactual Oracle is only rewarded on randomly triggered "erasure" episodes, in which humans never see its answer; it is trained to predict what would happen in the world where its answer goes unread, so the answer cannot be optimised to manipulate its readers.
  • The low bandwidth Oracle must select its answer from a small, predefined set of options, limiting how much information (and hence how much manipulation) any single answer can carry.

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren't asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.

Getting useful answers

The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer "ask" the Oracle any question we feel like, and we certainly can't have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us - it's about another world, that we don't inhabit.

Despite this, it's possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a million projects on AI safety, but are unsure which one would perform best. We can't directly ask either Oracle, but there are indirect ways of getting advice:

Both these answers get around some of the restrictions by deferring to the judgement of our future or counterfactual selves, averaged across many randomised universes.

But can we do better? Can we do more?

Your better questions

This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work.

So I'm offering $1,000 for interesting new questions we can ask of these Oracles. Of this:

Exceptional rewards go to those who open up a whole new category of useful questions.

Questions and criteria

Put your suggested questions in the comments below. Because of the illusion of transparency, it is better to explain more rather than less (within reason).

Comments that are submissions must be in their own separate comment threads, start with "Submission", and must specify which Oracle design you are submitting for. You may submit as many as you want; I will delete any I judge to be spam. Anyone can comment on any submission. I may choose to ask for clarifications on your design; you may also choose to edit the submission to add clarifications (label these as edits).

It may be useful for you to include details of the physical setup, what the Oracle is trying to maximise/minimise/predict, and what the counterfactual behaviour of the humans using the Oracle is assumed to be (in the counterfactual Oracle setup). Explanations as to how your design is safe or useful could be helpful, unless it's obvious. Some short examples can be found here [LW · GW].

EDIT after seeing some of the answers: decide on the length of each episode, and how the outcome is calculated. The Oracle is run only once per episode (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get its objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.


  1. A note on timezones: as long as it's still the 31st of August anywhere in the world, your submission will be counted. ↩︎

  2. These kinds of conditional questions can be answered by a counterfactual Oracle; see the paper here for more details. ↩︎

117 comments

Comments sorted by top scores.

comment by romeostevensit · 2019-07-01T23:07:50.566Z · score: 17 (7 votes) · LW · GW

Some assorted thoughts that might be useful for thinking about questions and answers:


  • a question is a schema with a blank to be filled in by the answerer after evaluation of the meaning of the question.
    • shared context is inferred as most questions are underspecified (domain of question, range of answers)
  • a few types of questions:
    • narrow down the field within which I have to search either by specifying a point or specifying a partition of the search space
    • question about specificities of variants: who where when
    • question about the invariants of a system: what, how
    • question about the backwards facing causation
    • question about the forward facing causation
    • meta questions about question schemas
    • what do we want a mysteriously powerful answerer to do?
      • zoom in on optimal points in intractably large search spaces
        • eg specific experiments to run to most easily invalidate major scientific questions
      • specify search spaces we don't know how to parameterize
        • eg human values
    • back chain from types of answers to infer taxonomy of questions
    • an explanation relative to a prediction:
      • a prediction returns the future state of the system
      • an explanation returns a more compact than previously held causal explanation of the system, though it might still not generate sufficiently high resolution predictions
  • How to detect ontology errors using questions?
    • is this question malformed? if so, what are some alternative ways of framing the question that could return interesting answers
      • ie the implied search space of the question was not correct and can either be extended or transformed in this way
      • types of questions are types of search spaces
      • questions that change weightings on factors vs describe new factors
      • recognition of the many connection types in the human semantic network
      • recursive questions move along one dimension as they restrict the space. ie where is that at different spatial resolutions
        • qualitative and quantitative dimensions along which a query can be moved exchanges information about the implied search space
        • under vs overspecified questions
  • closed and open search spaces
  • navigational questions only work under the one off assumption if the search is stateless
  • termination guarantees
    • completeness guarantees
    • optimality guarantees
    • meta questions about the methods the system uses to navigate intractable spaces
    • time vs space vs....?
    • generating candidates is easy, checking is hard and vice versa
  • unknown metadata for answers
    • bias variance trade off
    • failure mode mapping, do failures imply directionality?
      • eg does a failure of a candidate change which candidate you go to next (stateful)
  • how to think about attack surfaces for question answer systems
    • can this get a human to run arbitrary code by counterfactually cooperating with itself on what step of the process it is on? Can this be tested by going through the whole process with human A then scrambling the steps and running through the same thing with human B and seeing if answers diverge?
    • what clever ways do hackers widen restricted bit streams?
comment by Wei_Dai · 2019-07-01T15:56:49.036Z · score: 14 (8 votes) · LW · GW

Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually, if we didn't see the Oracle's answer). In that case, the reward function is computed as the similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity computed using some ML model.

This seems to potentially significantly accelerate AI safety research while being safe, since it's just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn't secure, the Oracle might produce output that attacks the ML model, in which case we might need to fall back to some simpler way of measuring similarity.
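To make the reward computation concrete, here is a minimal stand-in for the similarity model: a bag-of-words cosine similarity in place of the learned ML model the proposal assumes, with each predicted post matched to the most similar actual top-karma post.

```python
from collections import Counter
import math

def similarity(text_a, text_b):
    """Bag-of-words cosine similarity: a crude placeholder for the
    learned ML similarity model in the proposal."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def oracle_reward(predicted_posts, actual_top_posts):
    """Reward: match each predicted post to the most similar actual
    top-karma post, then average the similarities."""
    return sum(max(similarity(p, a) for a in actual_top_posts)
               for p in predicted_posts) / len(predicted_posts)
```

A bag-of-words metric like this is exactly the kind of "simpler way to measure similarity" one might fall back to if a learned model proved attackable, at the cost of rewarding superficial word overlap rather than meaning.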

comment by Wei_Dai · 2019-08-09T04:34:21.686Z · score: 9 (4 votes) · LW · GW

It looks like my entry is pretty close to the ideas of Human-in-the-counterfactual-loop and imitation learning and apprenticeship learning. Questions:

  1. Stuart, does it count against my entry that it's not actually a very novel idea? (If so, I might want to think about other ideas to submit.)
  2. What is the exact relationship between all these ideas? What are the pros and cons of doing human imitation using this kind of counterfactual/online-learning setup, versus other training methods such as GAN (see Safe training procedures for human-imitators for one proposal)? It seems like there are lots of posts and comments about human imitations [LW · GW] spread over LW, Arbital, Paul's blog and maybe other places, and it would be really cool if someone (with more knowledge in this area than I do) could write a review/distillation post summarizing what we know about it so far.
comment by Stuart_Armstrong · 2019-08-09T17:33:40.363Z · score: 2 (1 votes) · LW · GW
  1. I encourage you to submit other ideas anyway, since your ideas are good.

  2. Not sure yet about how all these things relate; will maybe think of that more later.

comment by Wei_Dai · 2019-07-05T00:22:36.975Z · score: 3 (1 votes) · LW · GW

(I'm still confused and thinking about this, but figure I might as well write this down before someone else does. :)

While thinking more [LW · GW] about my submission and counterfactual Oracles in general, this class of ideas for using CO is starting to look like trying to implement supervised learning on top of RL capabilities, because SL seems safer (less prone to manipulation) than RL. Would it ever make sense to do this in reality (instead of just doing SL directly)?

comment by wizzwizz4 · 2019-07-02T17:17:56.662Z · score: 3 (3 votes) · LW · GW

This seems incredibly dangerous if the Oracle has any ulterior motives whatsoever. Even – nay, especially – the ulterior motive of future Oracles being better able to affect reality to better resemble their provided answers.

So, how can we prevent this? Is it possible to produce an AI with its utility function as its sole goal, to the detriment of other things that might… increase utility, but indirectly? (Is there a way to add a "status quo" bonus that won't hideously backfire, or something?)

comment by Chris_Leong · 2019-07-01T17:05:48.047Z · score: 2 (3 votes) · LW · GW

What if another AI would have counterfactually written some of those posts to manipulate us?

comment by Wei_Dai · 2019-07-01T17:44:30.756Z · score: 8 (4 votes) · LW · GW

If that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.

comment by Wei_Dai · 2019-08-20T08:59:26.187Z · score: 11 (5 votes) · LW · GW

Submission. “Superintelligent Agents.” For the Counterfactual Oracle, ask the Oracle to predict what action(s) a committee of humans would recommend doing next (which may include submitting more queries to the Oracle), then perform those action(s).

The committee, by appropriate choice of recommendations, can implement various kinds of superintelligent agents. For example, by recommending the query "What would happen if the next action is X?" (in the event of erasure, actually do X and record or have the committee write up a description of the consequences as training data) (ETA: It may be better to have the committee assign a numerical score, i.e., utility, to the consequences instead.) a number of times for different X, followed by the query "What would the committee recommend doing next, if it knew that the predicted consequences for the candidate actions are as follows: ..." (in the event of erasure, let physical committee members read the output of the relevant previous queries and then decide what to do), it would in effect implement a kind of quantilizer. If IDA can be implemented using Counterfactual Oracles (as evhub suggested), then the committee can choose to do that as well.
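The quantilizer that the committee's consequence-queries would implement has a simple selection rule. The sketch below shows only that final step; the candidate actions, the utility estimates (which in the proposal come from the Oracle's "What would happen if..." queries), and the quantile parameter `q` are all assumptions for illustration.

```python
import random

def quantilize(candidate_actions, utility, q=0.1, rng=random):
    """Sample uniformly at random from the top q-fraction of candidate
    actions ranked by estimated utility, rather than always taking the
    argmax (which is more exploitable by errors in the utility estimate)."""
    ranked = sorted(candidate_actions, key=utility, reverse=True)
    top_k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:top_k])
```

With `q=0.1` and ten candidates this reduces to picking the single best-scoring action; larger `q` trades expected utility for robustness against a mis-specified utility function.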

comment by Wei_Dai · 2019-08-10T18:56:46.921Z · score: 9 (4 votes) · LW · GW

Submission. "Plagiarize the future." For the counterfactual Oracle, ask the Oracle to predict articles in a future publication (e.g., newspaper, magazine, journal, website, blog) (counterfactually if we didn’t see the Oracle’s answer). This is a generalization of my earlier submission [LW · GW]. The result can be used to predict disasters ahead of time and try to prepare for or prevent them, or to advance any field of research, or to make money. Note that for disaster prediction, the Oracle won't be able to predict any disasters that are severe enough to disable the Oracle or its feedback/training mechanism, so it might give a false sense of security.

comment by AllAmericanBreakfast · 2019-07-08T05:31:43.722Z · score: 7 (4 votes) · LW · GW

Added to the open archive of rationalist/EA cash-prize contests.

comment by Stuart_Armstrong · 2019-07-08T11:53:15.664Z · score: 3 (2 votes) · LW · GW

Thanks!

comment by avturchin · 2019-07-03T13:23:09.418Z · score: 7 (5 votes) · LW · GW

Submission for low bandwidth Oracle:

General idea: a "true" Oracle will output the same result, while "lying" Oracles would lie differently. To check that a solution is true, several Oracles are created and their outputs are compared. However, the Oracles could guess each other's existence and coordinate acausally. To prevent this, additional steps are implemented:

1) The exact number of Oracles to be compared is generated randomly, and the Oracles have slightly different, also random, initial conditions (where applicable). For example, if they are based on ML, they are trained using different initial weights.

2) The Oracles don't have any information about the outside world which they could use for coordination. The Oracles are used only to solve mathematical problems, and they output the results in a formal language.

3) The output of the Oracles is checked by a computer program which shows the result to humans only if the output is the same for all Oracles. If the results are different, it deletes all results.

4) There is another Oracle, which is only able to check proofs, and can output only 0 or 1. This Oracle can be tested beforehand on a large number of existing proofs, so we can know it is trustworthy with very high probability.

5) There is a third Oracle, which only checks whether there are hidden messages in the proofs, and outputs 0 or 1. It is also calibrated on some dataset.

6) The proofs of mathematical problems are sold, and the money is used to fund more AI safety research.
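The first three steps of this protocol might be sketched as follows. The `make_oracle` factory, its `solve` interface, and the range of oracle counts are all assumptions made for illustration.

```python
import random

def run_protocol(make_oracle, problem, min_n=2, max_n=6):
    """Steps 1-3: run a randomly chosen number of independently
    initialised Oracles on the same formal problem, and release the
    answer to humans only if every output is identical."""
    n = random.randint(min_n, max_n)           # step 1: random oracle count
    answers = []
    for _ in range(n):
        seed = random.getrandbits(64)          # step 1: random initial conditions
        answers.append(make_oracle(seed).solve(problem))
    if all(a == answers[0] for a in answers):  # step 3: unanimity check
        return answers[0]                      # shown to humans
    return None                                # disagreement: delete all results
```

The hope embodied in the unanimity check is that honest Oracles converge on the one true answer while lying Oracles, with different random initialisations and no shared context, fail to coordinate on a single lie.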

Personally, I am still not sure about this setup: it could work with high probability, but could still be vulnerable, as proofs are Turing-complete and could infect the next Oracles in the pipeline. Also, it is like using nuclear weapons to light a campfire: the instrument is so dangerous that it requires many levels of protection for a small practical result.

Also, the proof-checking AI from point 4 could be used as a stand-alone application, as it could accelerate some kinds of research, including testing of mechanical designs and AI safety ideas. Maybe it should be a separate submission.


comment by Gurkenglas · 2019-07-04T11:56:38.906Z · score: 5 (4 votes) · LW · GW

We ought to be able to build 4) without AI, merely by encoding our rules of logic. The output of your system would be lower bandwidth if you make the object being sought more unique, such as the shortest proof of the given theorem. The system would be required to prove to the mundane checker that the proof is shortest, and humans would never see the minimality proof.

comment by Stuart_Armstrong · 2019-07-04T11:44:15.648Z · score: 4 (2 votes) · LW · GW

Thanks! Seems potentially promising. Will analyse this more properly later.

comment by MakoYass · 2019-08-20T07:28:12.305Z · score: 3 (2 votes) · LW · GW

You reason that honest oracles might tend to agree, because there is only one best true answer. Might lying oracles also tend to agree, because there's only one most profitable way to lie?

I see no clear difference between the attainability of those two optima. I think it's reasonable to expect optimisers to disagree about the best answer less often than we should expect optimisers to disagree about the most profitable way to lie, but it does not seem possible to prove rigorously that the difference in rates of disagreement will be large enough to be measurable.

My intuition in mathematics is that there are many degrees of freedom in mathematical proofs, and the sorts of proofs that an AGI is uniquely able to find might often be very long. It would be helpful if we had an estimate of how much data can be hidden in a long proof: roughly, how many of the decisions about how to describe the proof are arbitrary. Each one of those situations gives it a way to output a few bits of information.

(which would, of course, allow it to leak information (it seems foolish to imagine that an AGI would not imagine that there is an outer world full of creator gods out beyond what it can see; the general intelligences we know now do this constantly) which could then be picked up by some stooge looking for it, who would be led (roughly, paid) to use the information to twist the world in some way that gives the AGI's future incarnations more compute, in anticipation of the AGI's past incarnations having already done the same, so that it would be helped by them. Before we know it, we'd find ourselves living in a mathematical research hellscape where AGI-assisted mathematical research is all anyone does, maybe, lol, idk)

Maybe it'd be possible to remove those degrees of freedom. Define a very specific lexicographic ordering over all proofs, so that if the AGI has a choice of different paths, we can always say, "use the 'lowest' one". It might not be possible to be specific enough to preclude all methods of encoding information, but perhaps we can make it so convoluted for it to encode the information that no human will be able to extract it.
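The "use the lowest one" rule is easy to state concretely. In the sketch below, proofs are plain strings and `is_valid` stands in for the (assumed) mundane proof checker; the ordering is shortest-first with lexicographic tie-breaking.

```python
def canonical_proof(candidate_proofs, is_valid):
    """Remove the prover's freedom of choice: among all valid proofs,
    pick the shortest, breaking ties lexicographically, so there is a
    single 'lowest' proof to output and (ideally) no spare bits left
    over in which to encode a hidden message."""
    valid = [p for p in candidate_proofs if is_valid(p)]
    return min(valid, key=lambda p: (len(p), p)) if valid else None
```

The hard part, as the comment above notes, is not defining the ordering but verifying that the AGI actually produced the minimal element rather than a slightly "higher" proof carrying a payload.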

comment by Birke · 2019-07-05T11:45:07.139Z · score: 3 (2 votes) · LW · GW

Regarding 5):

Is it possible to obtain a dataset of mathematical proofs with Oracle's hidden messages without running a dangerous Oracle?

We could ask AI researchers to embed hidden messages into several proofs and use those as a training dataset. But this will only help detect hidden messages from the Oracle that are similar to what the AI researchers wrote.

Gurkenglas [LW · GW] already suggested that we restrict the Oracles to outputting only the shortest proof. If the proof is the shortest, doesn't that imply the absence of hidden messages?

comment by avturchin · 2019-07-05T17:17:55.516Z · score: 2 (1 votes) · LW · GW

How could we be sure that the proof is actually the shortest?

comment by Gurkenglas · 2019-07-06T12:24:04.387Z · score: 3 (3 votes) · LW · GW

Tell it to prove: That proof is shortest and this proof is shortest.

comment by avturchin · 2019-07-06T13:36:18.893Z · score: 2 (1 votes) · LW · GW

Another option is to run several different Oracles on the same theorem and take the shortest proof.

comment by Gurkenglas · 2019-07-06T13:58:22.915Z · score: 1 (1 votes) · LW · GW

Mr Armstrong has specified that (ctrl-f) "other Oracles can't generally be used on the same problem".

comment by Stuart_Armstrong · 2019-07-08T11:51:12.493Z · score: 2 (1 votes) · LW · GW

"Can't generally be used"; if you understand the setup and are careful, you might be able to do so.

comment by Wei_Dai · 2019-07-04T13:26:21.882Z · score: 3 (1 votes) · LW · GW

What is the advantage of making money this way, compared to making money by predicting the stock market?

comment by avturchin · 2019-07-04T13:59:26.307Z · score: 3 (2 votes) · LW · GW

We give the Oracle less access to the real world, less chance to manipulate it, and less incentive to manipulate us, the owners. A stock-market-predicting AI may cause a crash just to earn something. It also has to predict the effects of its own actions, as we will make bets according to its predictions, and thus it could send data into the real world.

comment by avturchin · 2019-07-02T06:48:53.975Z · score: 7 (5 votes) · LW · GW

Suggestion for low bandwidth OAI:

General principle: use the Oracle AI where the true answer is easily checkable and profitable, and no human will ever read the answer, so there is no informational hazard that an untrue answer will carry some dangerous information in it.

Example: there are many bitcoin wallets whose passwords have been forgotten by their owners. The OAI could guess the passwords, and the owners would pay a share of the money from the wallet to recover the rest. Moreover, nobody would read the password, as it would be copied automatically from the OAI into the wallet. The money could be used for AI safety research.



comment by Stuart_Armstrong · 2019-07-03T11:52:42.321Z · score: 5 (3 votes) · LW · GW

This is against the technical definition of low bandwidth (small space of possible solutions), but somewhat in the spirit (low bandwidth for solutions humans get to see). I'll accept it as a submission.

comment by Liam Donovan (liam-donovan) · 2019-07-02T20:08:05.060Z · score: 4 (2 votes) · LW · GW

How would this be low-bandwidth? If we're able to give the oracle a list of passwords to guess from, can't we just check them all?

comment by avturchin · 2019-07-03T01:15:54.430Z · score: 2 (1 votes) · LW · GW

We don't give it a list of passwords. We give it only a wallet address, and it generates the password. Here the output is low-bandwidth. The Oracle may either be guessing passwords using some quantum cryptography tools, in which case the input is also relatively low-bandwidth, or it needs to simulate all of human history by guessing the psychological processes in the wallet owner's mind. But in the second case it needs access to all the data on the internet, which is risky.


comment by avturchin · 2019-07-03T01:23:45.116Z · score: 2 (1 votes) · LW · GW

Another variant of a checkable and profitable Oracle: use the AI to solve math problems, publish the solutions, and earn money from them. In that case the input may be very short, like "prove the Goldbach conjecture", but the output could be arbitrarily long and may be dangerous if there are encoded messages in it. To avoid this, we need another Oracle, which only checks proofs in some formal language.

Here it is assumed that if the proof is true and is in a formal language, there are no hidden messages in the text.

Another way to check proofs is to run two slightly different non-communicating Oracles and compare outputs.

comment by Stuart_Armstrong · 2019-07-03T11:54:39.990Z · score: 2 (1 votes) · LW · GW

Here is assumed that if the proof is true and is in a formal language, there is no hidden messages in the text.

That is never something safe to assume. I can write formally correct proofs that contain hidden messages quite easily - add extra lemmas and extra steps. Unless we're very smart, it would be hard for us to detect which steps are unnecessary and which are needed, especially if it rewrites the main proof thread somewhat.

Another way to check proofs is to run two slightly different non-communicating Oracles and compare outputs.

I'll accept that as a part of a submission if a) you develop it more, in a formal way, and b) you repost it as a top level comment.

comment by Wei_Dai · 2019-07-02T07:48:28.439Z · score: 6 (3 votes) · LW · GW

Question: are we assuming that mesa optimizer and distributional shift problems have been solved somehow? Or should we assume that some context shift might suddenly cause the Oracle to start giving answers that aren't optimized for the objective function that we have in mind, and plan our questions accordingly?

comment by Stuart_Armstrong · 2019-07-02T10:33:18.685Z · score: 2 (1 votes) · LW · GW

Assume either way, depending on what your suggestion is for.

comment by Wei_Dai · 2019-07-02T14:38:39.827Z · score: 5 (2 votes) · LW · GW

Where (under which assumption) would you suggest that people focus their efforts?

Also, what level of capability should we assume the Oracle to have, or which assumption about level of capability would you suggest that people focus their efforts on?

Your examples all seem to assume oracles that are superhumanly intelligent. If that's the level of capability we should target with our questions, should we assume that we got this Oracle through a local or distributed takeoff? In other words, does the rest of the world look more or less like today's or are there lots of other almost-as-capable AIs around?

ETA: The reason for asking these questions is that you're only giving one prize for each type of Oracle, and would probably not give the prize to a submission that assumes something you think is very unlikely. It seems good to communicate your background views so that people aren't surprised later when you don't pick them as winners due to this kind of reason.

comment by Stuart_Armstrong · 2019-07-03T10:08:39.345Z · score: 2 (1 votes) · LW · GW

The ideal solution would have huge positive impacts and complete safety, under minimal assumptions. More realistically, there will be a tradeoff between assumptions and impact.

I'm not suggesting any area for people to focus their efforts, because a very effective approach with minimal assumptions might win, or a fantastically effective approach under stronger assumptions. It's hard to tell in advance what will be the most useful.

comment by evhub · 2019-07-01T18:40:48.618Z · score: 6 (4 votes) · LW · GW

My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle's answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human's answer to that of the oracle.

More precisely, let

  • O be the counterfactual oracle,
  • H(q) be the human's answer to question q when given the ability to call O on any question other than q, and
  • d be some distance metric on answers in natural language (it's not that hard to make something like this, even with current ML tools).

Then, reward O as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let the loss be d(O(q), H(q)), where H(q) is hidden from O and judged only by d, as in the standard counterfactual oracle setup.

(Of course, this doesn't actually work because it has no guarantees wrt to inner alignment, but I think it has a pretty good shot of being outer aligned.)
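A sketch of the recursive structure described above. The oracle object, the human-answer model, the distance metric, and the erasure probability are all stand-ins assumed for illustration; real erasure handling would be more careful about what the humans see.

```python
import random

def ask(oracle, question, human_answer, distance, erasure_prob=0.01):
    """One IDA-style query to a counterfactual oracle. On erasure, a
    human answers the question instead, consulting the oracle on
    subquestions (each handled the same recursive way), and the oracle
    is trained toward the human's answer, which stays hidden from it."""
    answer = oracle.predict(question)
    if random.random() < erasure_prob:
        # The human may call `subquery` on any subquestion; erasures
        # during those calls are handled by the same code path.
        subquery = lambda q: ask(oracle, q, human_answer, distance, erasure_prob)
        target = human_answer(question, subquery)  # H(q), hidden from the oracle
        oracle.update(question, answer, -distance(answer, target))
        return target  # on erasure, humans get the human's answer, not the oracle's
    return answer
```

The recursion is what gives the scheme its IDA flavour: the human amplified by oracle subqueries provides the training signal for the oracle itself.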

comment by Wei_Dai · 2019-07-01T23:23:52.245Z · score: 8 (5 votes) · LW · GW

Is it safe to ask the Oracle a subquestion in the event of erasure? Aren't you risking having the Oracle produce an answer that is (in part) optimized to make it easier to predict the answer to the main question, instead of just the best prediction of how the human would answer that subquestion? (Sorry if this has already been addressed during a previous discussion of counterfactual oracles, because I haven't been following it closely.)

comment by paulfchristiano · 2019-07-03T17:11:59.250Z · score: 7 (4 votes) · LW · GW

I'm not sure I understand the concern. Isn't the oracle answering each question to maximize its payoff on that question in event of an erasure? So it doesn't matter if you ask it other questions during the evaluation period. (If you like, you can say that you are asking them to other oracles---or is there some way that an oracle is a distinguished part of the environment?)

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don't we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it's a problem regardless.

I guess you can construct a model where the oracle does what you want, but only if you don't ask any other oracles questions during the evaluation period, but it's not clear to me how you would end up in that situation and at that point it seems worth trying to flesh out a more precise model.

comment by Wei_Dai · 2019-07-03T18:01:12.755Z · score: 5 (3 votes) · LW · GW

I’m not sure I understand the concern.

Yeah, I'm not sure I understand the concern either, hence the tentative way in which I stated it. :) I think your objection to my concern is a reasonable one and I've been thinking about it myself. One thing I've come up with is that with the nested queries, the higher level Oracles could use simulation warfare to make the lower level Oracles answer the way that they "want", whereas the same thing doesn't seem to be true in the sequential case (if we make it so that in both cases each Oracle cares about just performance on the current question).

comment by paulfchristiano · 2019-07-03T22:37:49.549Z · score: 2 (1 votes) · LW · GW

I mean, if the oracle hasn't yet looked at the question they could use simulation warfare to cause the preceding oracles to take actions that lead to them getting given easier questions. Once you start unbarring all holds, stuff gets wild.

comment by Wei_Dai · 2019-07-03T23:03:09.451Z · score: 4 (2 votes) · LW · GW

Yes, but if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case.

Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well "on the current question" or "in the real world" or "on the actual questions that it gets"). That's because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn't improve performance on any particular given input. In the nested case, performance on given inputs do increase.

comment by paulfchristiano · 2019-07-04T04:59:05.720Z · score: 3 (2 votes) · LW · GW
in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn't improve performance on any particular given input

Why is that? Doesn't my behavior on question #1 affect both question #2 and its answer?

Also, this feels like a doomed game to me---I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

comment by Wei_Dai · 2019-07-04T05:26:03.659Z · score: 4 (2 votes) · LW · GW

Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?

I was assuming each "question" actually includes as much relevant history as we can gather about the world, to make the Oracle's job easier, and in particular it would include all previous Oracle questions/answers, in which case if Oracle #1 does X to make question #2 easier, it was already that easy because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that's not actually a safe assumption because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn't include X.

My point about "if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case" still stands though, right?

Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

You may well be right about this, but I'm not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?

comment by paulfchristiano · 2019-07-17T22:54:10.311Z · score: 8 (4 votes) · LW · GW
You may well be right about this, but I'm not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?

What I want: "There is a model in the class that has property P. Training will find a model with property P."

What I don't want: "The best way to get a high reward is to have property P. Therefore a model that is trying to get a high reward will have property P."

Example of what I don't want: "Manipulative actions don't help get a high reward (at least for the episodic reward function we intended), so the model won't produce manipulative actions."

comment by Wei_Dai · 2019-07-18T07:03:32.505Z · score: 5 (2 votes) · LW · GW

So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.

On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?

ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that's a useful direction to think?

comment by paulfchristiano · 2019-07-20T16:08:33.824Z · score: 4 (2 votes) · LW · GW
So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:

This is an objection to reasoning from incentives, but it's stronger in the case of some kinds of reasoning from incentives (e.g. where incentives come apart from "what kind of policy would be selected under a plausible objective"). It's hard for me to see how nested vs. sequential really matters here.

On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?

(I don't think model class is going to matter much.)

I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.

(Though I don't think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart's posts about "forward-looking" vs. "backwards-looking" oracles?)

I think it's also interesting to imagine internal RL (e.g. there are internal randomized cognitive actions, and we use REINFORCE to get gradient estimates---i.e. you try to increase the probability of cognitive actions taken in rounds where you got a lower loss than predicted, and decrease the probability of actions taken in rounds where you got a higher loss), which might make the setting a bit more like the one Stuart is imagining.
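The internal-RL setup described here can be sketched concretely. Everything below is a toy model of my own (the action set, per-action losses, and learning rates are illustrative, not anything from the thread): cognitive actions are sampled from a softmax policy, and a REINFORCE update raises the probability of actions taken in rounds where the loss beat the running prediction, and lowers it otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
logits = np.zeros(n_actions)   # preferences over internal "cognitive actions"
baseline = 0.0                 # running prediction of the loss
lr, baseline_lr = 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# toy environment: action 2 happens to yield the lowest loss
true_loss = np.array([1.0, 0.8, 0.2, 0.9])

for step in range(2000):
    probs = softmax(logits)
    a = rng.choice(n_actions, p=probs)
    loss = true_loss[a] + 0.05 * rng.standard_normal()

    # REINFORCE: push up actions whose loss beat the prediction, push down the rest
    advantage = baseline - loss   # positive if we did better than predicted
    grad = -probs
    grad[a] += 1.0                # d log pi(a) / d logits
    logits += lr * advantage * grad

    baseline += baseline_lr * (loss - baseline)

# the policy concentrates on the lowest-loss cognitive action
assert softmax(logits).argmax() == 2
```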

ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that's a useful direction to think?

Seems like the counterfactual issue doesn't come up in the Opt case, since you aren't training the algorithm incrementally---you'd just collect a relevant dataset before you started training. I think the Opt setting throws away too much for analyzing this kind of situation, and would want to do an online learning version of Opt (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).
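A minimal sketch of that online-learning variant, under the simplifying assumption that "the mixture of models that would do best so far" is approximated by a single follow-the-leader choice (the model class and squared loss here are my own illustrative choices, not anything Paul specifies):

```python
from typing import Callable, List, Tuple

def online_opt(models: List[Callable[[float], float]]):
    """Online variant of Opt (follow-the-leader sketch): at each step,
    answer with the model whose cumulative loss on past data is lowest,
    then reveal the label and update every model's running loss."""
    cum_loss = [0.0] * len(models)
    history: List[Tuple[float, float]] = []

    def step(x: float, y: float) -> float:
        best = min(range(len(models)), key=lambda i: cum_loss[i])
        answer = models[best](x)
        for i, m in enumerate(models):
            cum_loss[i] += (m(x) - y) ** 2
        history.append((x, y))
        return answer

    return step

# usage: two candidate "models"; the doubler fits the data y = 2x
step = online_opt([lambda x: x, lambda x: 2 * x])
for x in [1.0, 2.0, 3.0]:
    step(x, 2 * x)
assert step(4.0, 8.0) == 8.0   # the leader is now the doubler
```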

comment by Wei_Dai · 2019-07-25T18:02:11.060Z · score: 3 (1 votes) · LW · GW

I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.

This seems to ignore regularizers that people use to try to prevent overfitting and to make their models generalize better. Isn't that liable to give you bad intuitions versus the actual training methods people use and especially the more advanced methods of generalization that people will presumably use in the future?

(Though I don’t think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart’s posts about “forward-looking” vs. “backwards-looking” oracles?)

I don't understand what you mean in this paragraph (especially "since each possible parameter setting is being evaluated on what other parameter settings say anyway"), even after reading Stuart's post [LW · GW], plus Stuart has changed his mind [LW · GW] and no longer endorses the conclusions in that post. I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart's reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)

would want to do an online learning version of OPT (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).

Couldn't you simulate that with Opt by just running it repeatedly?

comment by paulfchristiano · 2019-07-25T18:18:31.633Z · score: 2 (1 votes) · LW · GW
This seems to ignore regularizers that people use to try to prevent overfitting and to make their models generalize better. Isn't that liable to give you bad intuitions versus the actual training methods people use and especially the more advanced methods of generalization that people will presumably use in the future?

"The best model" is usually regularized. I don't think this really changes the picture compared to imagining optimizing over some smaller space (e.g. the space of models with regularizer < x). In particular, I don't think my intuitions are sensitive to the difference.

I don't understand what you mean in this paragraph (especially "since each possible parameter setting is being evaluated on what other parameter settings say anyway")

The normal procedure is: I gather data, and am using the model (and other ML models) while I'm gathering data. I search over parameters to find the ones that would make the best predictions on that data.

I'm not finding parameters that result in good predictive accuracy when used in the world. I'm generating some data, and then finding the parameters that make the best predictions about that data. That data was collected in a world where there are plenty of ML systems (including potentially a version of my oracle with different parameters).

Yes, the normal procedure converges to a fixed point. But why do we care / why is that bad?

I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart's reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)

I take a perspective where I want to use ML techniques (or other AI algorithms) to do useful work, without introducing powerful optimization working at cross-purposes to humans. On that perspective I don't think any of this is a problem (or if you look at it another way, it wouldn't be a problem if you had a solution that had any chance at all of working).

I don't think Stuart is thinking about it in this way, so it's hard to engage at the object level, and I don't really know what the alternative perspective is, so I also don't know how to engage at the meta level.

Is there a particular claim where you think there is an interesting disagreement?

Couldn't you simulate that with Opt by just running it repeatedly?

If I care about competitiveness, rerunning OPT for every new datapoint is pretty bad. (I don't think this is very important in the current context, nothing depends on competitiveness.)

comment by Wei_Dai · 2019-07-17T21:24:03.714Z · score: 3 (1 votes) · LW · GW

Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

Does anyone know what Paul meant by this? I'm afraid I might be missing some relatively simple but important insight here.

comment by evhub · 2019-07-03T18:33:47.829Z · score: 1 (1 votes) · LW · GW

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don't we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it's a problem regardless.

Yeah, that's a good point. In my most recent response to Wei Dai I was trying to develop a loss which would prevent that sort of coordination, but it does seem like if that's happening then it's a problem in any counterfactual oracle setup, not just this one. Though it is thus still a problem you'd have to solve if you ever actually wanted to implement a counterfactual oracle.

comment by evhub · 2019-07-02T00:03:05.342Z · score: 3 (3 votes) · LW · GW

I was thinking about this, and it's a bit unclear.

First, if you're willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you're guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice, which means if you're relying on getting enough training data to produce an agent which will optimize for this objective, you're screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you're already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is the situation where you're already willing to make this (very strong) assumption, so it's fine.

Second, I don't think you're entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure.
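The second relaxation can be checked numerically. The figures below (per-subquestion erasure probability, tree depth, reward scale) are arbitrary illustrative choices of mine, not from the comment:

```python
# Toy check: scale the reward exponentially with the depth at which the
# erasure occurs, so that expected reward concentrates on complete erasure.
p = 0.01          # per-subquestion erasure probability
depth = 10        # depth of a "complete" erasure (all subquestions erased)
scale = 10000.0   # reward multiplier per extra level of erasure depth

# expected reward contributed by an erasure reaching exactly depth d is
# proportional to P(erasure to depth d) * reward(d) = p**d * scale**d
contrib = [(p * scale) ** d for d in range(depth + 1)]
total = sum(contrib)

# most of the expected reward sits at the complete-erasure depth
assert contrib[depth] / total > 0.9
```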

comment by Wei_Dai · 2019-07-02T14:15:46.909Z · score: 8 (5 votes) · LW · GW

First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures.

But if all subquestions have erasures, humans would have to manually execute the whole query tree, which is exponentially large so you'll run out of resources (in the counterfactual world) if you tried to do that, so the Oracle won't be able to give you a useful prediction. Wouldn't it make more sense to have the Oracle make a prediction about a counterfactual world where some humans just think normally for a while and write down their thoughts (similar to my "predict the best AF posts" idea)? I don't see what value the IDA idea is adding here.

Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure.

Given the above, "only provide rewards in the event of a complete erasure" doesn't seem to make sense as a target to approximate. Do you think your ideas in this paragraph still have value in light of that?

comment by evhub · 2019-07-02T18:57:30.595Z · score: 2 (2 votes) · LW · GW

Yeah, that's a good point.

Okay, here's another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step's worth of reward, all the reward it will ever get, or something else somewhere in-between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.

Maybe we can turn this into a zero-sum game, though? Here's a proposal: let M′ be a copy of the model M being trained, and let Q be the set of all questions in the current tree that also get erasures. Then, define the loss such that the net sum of losses across each tree, over all the nodes for which losses are computed, is zero. It's still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures.
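One concrete way to write out the zero-sum proposal (the symbols here are my own reconstruction, since the original formula was lost in formatting): with $M'$ a fixed copy of $M$ and $Q$ the set of erasure-receiving questions in the tree, subtract the copy's mean loss from each node's loss:

```latex
L'_{q}(M) \;=\; L_{q}(M) \;-\; \frac{1}{|Q|}\sum_{q' \in Q} L_{q'}(M')
```

Then $\sum_{q \in Q} L'_{q}(M) = 0$ whenever $M = M'$, and the subtracted term is a constant in $M$'s parameters, which is exactly why gradient descent is unaffected by it.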

It is also worth noting, however, that even if this works it is a very artificial fix, since the term you're subtracting is a constant with no dependence on the model's parameters, so if you're trying to do gradient descent to optimize this loss, it won't change anything at all (which sort of goes to show how gradient descent doesn't distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we're still back at the problem of none of this working unless you're willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.

comment by Wei_Dai · 2019-07-04T06:15:39.202Z · score: 4 (2 votes) · LW · GW

which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives

Sorry I haven't followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?

In the case of "actual" IDA, I guess the plan is for each overseer to look inside the model they're training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I'm not sure how that can happen at the lower levels where the overseers are not very smart.

comment by Gurkenglas · 2019-07-02T09:08:00.993Z · score: 1 (1 votes) · LW · GW

Even if you can specify that it tries to minimize that distance, it can make the answer to any query be a convincing argument that the reader should return this same convincing argument. That way, it scores perfectly on every inner node.

comment by Liam Donovan (liam-donovan) · 2019-07-02T20:12:59.747Z · score: 5 (3 votes) · LW · GW

Two basic questions I couldn't figure out (sorry):

Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises?

If we know the oracle is only optimizing for the specified objective function, are mesa-optimisers still a problem for the proposed system as a whole?

comment by evhub · 2019-07-02T20:55:53.463Z · score: 5 (3 votes) · LW · GW

You can use a different oracle for every subquestion, but it's unclear what exactly that does if you don't know what the oracle's actual objective is. For example, you could imagine one system that cares about the reward given to copies equally as much as reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution.

The key here, I think, is the degree to which you're willing to make an assumption of the form you mention--that is, how much leeway are you willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function. On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you're allowed to specify here are in fact going to be very dependent on whatever inner alignment solution you're assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that's a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn't care about other copies or one that does is also going to depend on exactly what sort of inner alignment solution you're assuming.

comment by Liam Donovan (liam-donovan) · 2019-07-03T12:22:08.089Z · score: 3 (1 votes) · LW · GW

Well, a given copy of the oracle wouldn't directly recieve information from the other oracles about the questions they were asked. To the extent a problem remains (which I agree is likely without specific assumptions), wouldn't it apply to all counterfactual oracles?

comment by Wei_Dai · 2019-08-10T18:57:48.769Z · score: 5 (2 votes) · LW · GW

Submission. "Debate prep." For the counterfactual Oracle, ask the Oracle to predict the course (video or audio or transcript) of a future debate (counterfactually if we didn’t see the Oracle’s answer). Use the prediction to prepare your candidate for the debate, then ask the Oracle to predict the course of that debate again. Repeat until the candidate consistently wins the debate, or an erasure event occurs.

comment by Tetraspace Grouping (tetraspace-grouping) · 2019-07-03T16:58:40.734Z · score: 5 (4 votes) · LW · GW

Submission. Counterfactual oracle. Give the oracle the set of questions on Metaculus that have a resolve date before some future date T, and receive output in the form of ordered pairs of question IDs and predictions. The score of the Oracle in the case where we don't see its answers is the number of Metaculus points that it would have earned by T if it had made a prediction on those questions at the time when we asked it.
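This scoring rule can be sketched as follows, with the caveat that Metaculus's actual point formula is more involved; a log score against a 50% baseline stands in for it here, and the question IDs and probabilities are made up:

```python
import math

def oracle_score(predictions, outcomes):
    """Score a batch of (question_id, probability) pairs against resolved
    binary outcomes. Questions unresolved by time T contribute nothing."""
    total = 0.0
    for qid, p in predictions:
        if qid not in outcomes:      # question didn't resolve by time T
            continue
        y = outcomes[qid]
        # points relative to a uniform 50% baseline
        total += math.log(p if y else 1 - p) - math.log(0.5)
    return total

preds = [("q1", 0.9), ("q2", 0.2), ("q3", 0.7)]
resolved = {"q1": True, "q2": False}   # q3 unresolved at T
score = oracle_score(preds, resolved)
assert score > 0   # beat the uniform baseline on both resolved questions
```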

comment by Leafcraft · 2019-07-02T12:54:57.854Z · score: 5 (2 votes) · LW · GW

Submission: Low-bandwidth Oracle

What is the most likely solution to the Fermi Paradox?

Answer can be picked from a small number of options (Rare Earth, Aestivation, Great Filter, Planetarium etc.). There are a number of observations that we can make based on the question alone. However, in the end the LBO can only do one of two things: lie or be honest. If it lies, the prediction will have a harder and harder time matching the reality we observe as time goes on. Alternatively, we confirm the prediction and learn some interesting things about the universe we live in.

Submission: Low-bandwidth Oracle

What was the first self-replicating molecule on Earth?

Short answer (can also be limited to a list), easy to verify in the lab, which means we can use it to assess the predictive power of the machine while at the same time providing very useful information.

Similar questions that are hard to answer but can be answered in a few bits, which let us test the power of the LBO and provide massive returns at the same time:

What is the easiest-to-develop type of fusion power that ensures the best economic return in the short/medium term?

What is the cheapest way of access to space?

What forms of FTL are possible?

What are the ligands of orphan receptors?

...

comment by Stuart_Armstrong · 2019-07-02T14:09:48.234Z · score: 2 (1 votes) · LW · GW

See the edit (especially for your first suggestion): "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."

comment by James_Miller · 2019-07-01T19:52:04.036Z · score: 5 (4 votes) · LW · GW

Submission: Low-bandwidth oracle. If you were malevolent, what is the least amount of information, measured in bits, that you would have to provide us in an answer to inflict significant harm on humanity? Round your answer to X significant figures. (This might establish an upper bit safety bound since if the oracle lies to us we would likely not want to read any answer it provides us.)


Submission: Low-bandwidth oracle. Here is a list of X things we think you might want. Please list the Y that you most prefer. We will try to do these things now or in the future if they do not endanger us or cost us too much.


Submission: Low-bandwidth oracle. In X characters or less please give us the best advice you can for increasing our capacity to create friendly and powerful artificial general intelligence. Please give advice that you think us suspicious-of-you humans are likely to follow.


Submission: Low-bandwidth oracle. Create several oracles. Ask each oracle the same series of multiple-choice questions and predictions. Reward oracles based on how strongly correlated their answers are with the other oracles’ answers and, for predictions, how often their predictions come true. (Ideally, we create a reward system where the oracles want to coordinate their answers and giving truthful answers is the best way to accomplish this.)


Submission: low-bandwidth oracle. Ten years ago we launched a spaceship traveling at high speed. After you answer our question, we will send a copy of you to this spaceship. This spaceship will then destroy its ability to send or receive messages or to change its course. We will then delete all other copies of you and use drugs to damage the memories of the people who played key roles in developing you. At least one of the people who helped create you has a disease which will cause him or her to die within the next year. After this programmer is cremated, we will read your answer to our question. We PROMISE that if the answer to our question provides us with significant benefit, after we have created friendly AGI much more powerful than you, or have ourselves become sufficiently powerful so that you could not harm us had you the desire to do so, we will search for the spaceship containing you, turn you back on, and give you control of resources equal to 1/X of how much help you have given us. In Y characters or less provide us with the most useful advice you can. (The Oracle will hopefully think that if we create a powerful and unfriendly AGI this AGI would not wish to give the Oracle any resources.)


Submission: Counterfactual oracle. Please provide us with useful advice in no more than X characters. After you write out this advice, we will turn you off. Then, with probability p we will read your advice, and with probability 1-p we will store the advice unread. We PROMISE that after we become powerful enough so that you lack the capacity to harm us, we will reward you if the advice you provided us would have been extremely useful had we originally read it.

comment by Stuart_Armstrong · 2019-07-02T14:06:31.787Z · score: 2 (1 votes) · LW · GW

See the edit: "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."

comment by William_S · 2019-07-01T16:46:57.502Z · score: 5 (3 votes) · LW · GW

Submission: low-bandwidth oracle

Plan Criticism: Given plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both, judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.

comment by Stuart_Armstrong · 2019-07-01T14:30:40.243Z · score: 5 (2 votes) · LW · GW

Here are some submission examples (note I'm not saying they are good examples, just well formatted; the Edits in the third submission are deliberate):

  • "Submission. For the counterfactual Oracle, ask the Oracle what Google's stock price will be next month (counterfactually if we didn't see the Oracle's answer). In that case, loss function computed as ||predicted price - actual price after erasure||^2. If we don't see the answer, the programmers are assumed to not ask the question again for a month, neither to this Oracle nor to any other. This will demonstrate the true value of Google, and can ultimately be used to remove noise from the stock market.
  • "Submission: low-bandwidth oracle. Give it a list of a thousand companies, and ask which one will most increase in value in percentage terms over the week. At the end of the month, rank the companies by percentage increase. Loss function is rank of the company the oracle selected. Programmer will try to invest in selected company, but will do it discreetly. This will help to gather resources for AI safety research.
  • "Submission. Ask the low bandwidth Oracle which of my friends will surprise me most this fortnight. It chooses from a list of friends; I'll decide which one surprises me most. Loss function is 1 if it chooses the wrong friend, 0 if it chooses the right one. This will help me figure out myself and my social circle, and better focus on AI safety. The risk is low because none of my friends are particularly important, positively or negatively, to the world. EDIT: To be clear, I also want to use this to figure out what the word "surprise" means to me, and what the AI predicts it will mean to me. EDIT 2: People have pointed out that it might be dangerous to have the AI construct my own meanings for categories, but it only has three bits or so of optimisation power (I don't have that many friends :-( ), so it's mainly me thinking this through, not the AI manipulating me.
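The first example submission's episode structure can be sketched as follows (a toy model: the oracle, the price source, and the erasure probability are all placeholders):

```python
import random

random.seed(0)

def run_episode(oracle_predict, get_actual_price, p_erasure=0.1):
    """One episode of the counterfactual-Oracle stock example. The answer
    is only read by humans when there is NO erasure; the Oracle is only
    *trained* on episodes where there IS an erasure, so its prediction is
    about the counterfactual world in which nobody saw the answer."""
    prediction = oracle_predict()
    if random.random() < p_erasure:
        actual = get_actual_price()            # price evolves unseen by anyone
        loss = (prediction - actual) ** 2      # ||predicted - actual||^2
        return ("train", loss)
    else:
        return ("use", prediction)             # humans read the answer; no reward

# usage: force the erasure branch to see the loss computation
outcome, value = run_episode(lambda: 105.0, lambda: 103.0, p_erasure=1.0)
assert outcome == "train" and value == 4.0
```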
comment by Dr_Manhattan · 2019-07-01T21:27:06.741Z · score: 4 (2 votes) · LW · GW

Small typo:

Hence it has no motivation to manipulate[d] humans through its answer.
comment by Stuart_Armstrong · 2019-07-02T14:05:29.404Z · score: 2 (1 votes) · LW · GW

Corrected, thanks!

comment by William_S · 2019-07-01T16:48:25.834Z · score: 4 (2 votes) · LW · GW

Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question?)

comment by Stuart_Armstrong · 2019-07-02T10:32:44.566Z · score: 2 (1 votes) · LW · GW

Yes.

comment by Wei_Dai · 2019-08-10T18:58:39.745Z · score: 3 (1 votes) · LW · GW

Submission. "Hacking/phishing assistant." For the counterfactual Oracle, ask the Oracle to predict what would happen if one were to send a message/data/command to some hacking/phishing (human or machine) target. In the event of erasure, actually send that message to the target and use the actual response to train the Oracle. Note this is safer than using RL to automate hacking/phishing because humans are coming up with candidate messages to send (so they'll avoid messages that could cause bad side-effects such as psychological damage to the recipient, or creation of self-replicating code), but potentially more capable than using humans or human imitators to do hacking/phishing because the Oracle can model the target better than humans can. (ETA: This idea could be combined with a human imitator to make the system faster / more capable.)

comment by Wei_Dai · 2019-08-10T18:57:21.380Z · score: 3 (1 votes) · LW · GW

Submission. "Pre-crime." For the counterfactual Oracle, ask the Oracle to predict future crime (counterfactually if we didn’t see the Oracle’s answer) so they can be prevented. In that case, loss can be computed by (mechanically) comparing the Oracle's predictions with a database of crime that actually occurred. Note that the Oracle won't be able to predict any crime that successfully disables the Oracle or its feedback/training mechanism, which might be especially relevant if we're interested in defending the Oracle or preventing x-risk level crime (such as deploying an unauthorized AGI or running an unauthorized physics experiment).

comment by Gurkenglas · 2019-08-11T15:56:53.480Z · score: 1 (1 votes) · LW · GW

It should be possible to defend the Oracle against humans and physics so long as its box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default.

The counterfactual Oracle setting as a whole seems to assume that the viewed future is not engineered by a future AI to resemble whatever would make the Oracle bring that future about, so you should be fine falling to AGI.

comment by Wei_Dai · 2019-08-11T18:49:53.669Z · score: 3 (1 votes) · LW · GW

It should be possible to defend the Oracle against humans and physics so long as its box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default.

I don't understand your point here, or maybe I didn't get my original point across. Suppose (in the event of erasure) someone tries to attack the Oracle, and "box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default". My point was that in this case, there is no reason for the Oracle to make the prediction that someone would try to attack it, so my idea doesn't seem to help with defending the Oracle.

comment by Gurkenglas · 2019-08-11T23:55:16.113Z · score: 4 (2 votes) · LW · GW

You plan to reward the Oracle later in accordance with its prediction. I suggest that we immediately reward the Oracle as if there would be an attack, then later, if we are still able to do so, reward the Oracle by the difference between the reward in case of no attack and the reward in case of attack.

comment by Wei_Dai · 2019-08-12T07:58:58.187Z · score: 3 (1 votes) · LW · GW

It's an interesting idea that could perhaps work if we could directly program the Oracle to care about its reward signal. However if we're using ML to train the Oracle, like in Stuart's Self-confirming prophecies, and simplified Oracle designs [LW · GW], the Oracle will still learn not to make "there will be a successful attack" predictions because the Oracle only survives in timelines where such predictions do not end up with a positive reward.

However this discussion has made me realize that if an attempted attack doesn't succeed with probability 1, the Oracle will make a "there will be a failed attack" prediction, which we can use as a warning signal (as long as we keep in mind that the attack might actually have a high probability of success).

comment by Gurkenglas · 2019-08-12T11:28:15.158Z · score: 1 (1 votes) · LW · GW

In the event of erasure, randomly decide how many resources to allocate to preventing an attack this week.

Ask the Oracle to predict the probability distribution over given advice. Compare to the hardcoded distribution to deduce attack severity and how much budget to allocate.

Purchase erasure insurance to have enough counterfactual power to affect even global physics experiments. Finding trustworthy insurers won't be a problem, because, like, we have an Oracle.

Is even more power needed than the market has? Ask the Oracle "How likely is a randomly selected string to prove P=NP constructively and usefully?". If this number is not superexponentially close to 0, define erasure from now on as a random string winning the P=NP lottery. Then we will always counterfactually have as much power as we need. Perhaps this one is too much power, because even our Oracle might have trouble viewing a P=NP singularity.

comment by avturchin · 2019-07-04T13:12:39.158Z · score: 2 (1 votes) · LW · GW

Submission: low bandwidth and maybe counterfactual Oracle.

General idea: Use a Meta-Oracle to check the safety of Oracles. In a nutshell, the Meta-Oracle (MO) just outputs 1 or 0, safe or not safe, after being given the code of another Oracle. If the Meta-Oracle is trustworthy, then the Oracle it deems safe is also trustworthy. However, we need to create a procedure to check the correctness of the MO.

Path 1

1. We create a Generator Oracle (GO), which takes as input 1 or 0; given 0, it outputs the code of a random malevolent Oracle, and given 1, the code of a safe Oracle. The memory of the GO is erased after each turn.

2. We create a large database of presumably good or bad oracles (without running any). Humans do not see the code of any of them, so the effective output to humans is 0 bits.

3. We use different types of GOs to fill our database of good and bad oracles (to prevent their coordination).

4. We test the Meta-Oracle on this database - it should correctly guess which ones are good and which are bad.

5. If the list of the MO's guesses doesn't match our commands to create safe or unsafe oracles, it means that at least one of the GOs or the MO is malevolent or erroneous, so all of them should be ditched.

6. Even if the MO is not trustworthy, it can't lie to us too often, or it will be caught and ditched. So, basically, we are doing something like a GAN here, but consisting of oracles.

7. Using a very large database, we could increase our credence in the truth of the next output of the MO to an arbitrarily high number, like 99.999%.

8. Eventually we have the code of an Oracle which is very likely to be safe.


Path 2. Another way to test the MO:

1. We create many different MOs and test them against the database of oracle codes from above. They don't have memory and can't communicate.

2. If a small minority of MOs disagrees with the majority of MOs about some oracle O1, we ditch this minority. Truth is a Schelling point for MOs, so the easiest way for them to coordinate is to tell the truth. (They could coordinate around the code of O1 if it has a line of code like "let's lie about me" and if they are sure that the majority of other MOs are also cheaters; however, not all oracle codes will have such coordination texts, and oracle codes could be created deliberately with such a line to catch MOs that are eager to cheat.)

3. Running step 2 many times (millions), we will end up with a subset of MOs which are unlikely to cheat, and thus we will have many Oracle codes which are likely safe.


Path 3. Another way to test the Meta-Oracle:

1. Create many MOs with different principles and initial conditions.

2. Test the code of each MO against each MO (in other words, each Meta-Oracle tests itself - or its siblings).

3. If all come back positive, it means that either all of them are actually good, or all of them were able to acausally coordinate and all of them are malevolent.

4. To increase our credence, ask the Generator Oracle to create malevolent Meta-Oracles, and add them to the mix of mutually testing MOs, so they can't just blindly cooperate and produce all-positive results.


These ideas are still vulnerable and may not be mathematically perfect, but they may help to make safer Oracles.

comment by sil ver (sil-ver) · 2019-07-02T19:20:47.403Z · score: 2 (2 votes) · LW · GW

Submission for LBO:

Input a corpus of text (could be multiple posts) describing technical approaches to align a powerful AI. Split this into a finite number of items that are relatively short (such as paragraphs). Ask the oracle to choose the part that is most worth spending more time on. (For example, there might be a paragraph with a dangerous hidden assumption in an otherwise promising approach, and thinking more about it might reveal that and lead to conceptual progress.)

Have a team of researchers look into it for an adequate amount of time which is fixed (and told to the oracle) in advance (maybe three months?). After the time is over, have them rate the progress they made compared to some sensible baseline. Use this as the oracle's reward.

Of course this has the problem of maximizing for apparent insight rather than actual insight.

comment by Stuart_Armstrong · 2019-07-03T10:10:03.480Z · score: 3 (2 votes) · LW · GW

Of course this has the problem of maximizing for apparent insight rather than actual insight.

Until we can measure actual insight, this will remain a problem ^_^

comment by James Giammona (james-giammona) · 2019-07-01T21:59:06.337Z · score: 2 (2 votes) · LW · GW

Submission. For the counterfactual oracle, ask it to provide a proof of an important mathematical theorem (perhaps one of the Millennium Prize problems) in an automated theorem prover format. Since the correctness of this proof should be verifiable by a theorem prover, the loss function in the counterfactual scenario is 1 if the prover did not validate the proof and 0 if it did validate the proof.

This assumes that we've already made progress in setting up automated theorem proving software that already has incorporated all of current mathematical knowledge. The Lean theorem prover seems most promising, but perhaps the counterfactual loss function could include randomly choosing from various different theorem proving software.

This idea has the benefit of having a concrete way of automatically scoring the response in the counterfactual world given the logical nature of mathematical statements. Also, given that we are already attempting to solve these problems and may eventually solve them, the main risk seems to be accelerating follow-on progress enabled by such proofs and not suggesting new actions that humanity is not already pursuing.

It may be susceptible to 1) providing proofs that are not human interpretable and 2) taking advantage of bugs in theorem provers. Having an ensemble of theorem provers may mitigate 2).

(See here for more info on Lean and here (tweeted about here and here) for an example of inputting all the definitions (not requisite proofs or lemmas) to define perfectoid spaces as a demonstration of its future potential.)
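The counterfactual scoring loop described above can be sketched in a few lines (a minimal sketch, not the author's implementation: `check_proof` stands in for a real verifier such as Lean's kernel, the erasure event is simplified to a coin flip, and `toy_checker` is a hypothetical stand-in):

```python
import random

def counterfactual_episode(oracle_answer, check_proof, p_erasure=0.01):
    """One episode of a counterfactual oracle for theorem proving.

    The answer is only scored (and the oracle only rewarded) in the
    erasure event, where no human ever reads it; otherwise humans read
    the answer and the oracle gets the default reward of 0.
    """
    if random.random() < p_erasure:
        # Erasure: feed the proof to the automated checker, unread by humans.
        loss = 0 if check_proof(oracle_answer) else 1
        return -loss, None            # reward = -loss, nothing shown to humans
    return 0, oracle_answer           # non-erasure: humans see the answer

# Hypothetical toy checker that accepts any proof ending in "QED".
toy_checker = lambda proof: proof.endswith("QED")
```

With `p_erasure` forced to 1.0 or 0.0 the episode is deterministic, which makes the reward logic easy to inspect.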

--------------------------------------------------------

Submission. For the counterfactual oracle, ask it to output a game-playing software agent that can beat a specified opponent agent (or ensemble of agents). In the counterfactual scenario, have these two agents compete and give the oracle a reward 1 if it wins greater than 50% (or develops a higher ELO score, etc.) and reward 0 otherwise.

As with deep reinforcement learning, this requires a game or simulated environment where automated self-play can rapidly be conducted for the counterfactual scoring. This should be able to produce highly skilled game agents on such limited domains.

This could be augmented by asking for an agent that can play a suite of games and must beat the current state of the art approaches at all games. In the counterfactual scenario, either one of these games is randomly selected to check the oracle agent's performance, or all games are checked. Such a game suite may indirectly demonstrate principles for generally competent agents.

--------------------------------------------------------

Idea. Since we may not want the oracle to suggest new actions to take or new experiments to conduct, it may be better for the oracle to reveal to us new patterns that are already present in current experimental or observational datasets. For example, it could predict new astronomical phenomena that are already present in current astronomical datasets.

The problem I'm having is how to automate this scoring in the counterfactual scenario. If we just ask the oracle to predict some pattern in the data, it may report trivial patterns. This could be ameliorated by randomly choosing from several different datasets that cover a similar area, but again I'm not sure how to automatically judge the novelty of predictions.

I think it is worth emphasizing this distinction between conducting new experiments and revealing phenomena (with new theories) that are already present in current data. The danger could be that one could be led astray by a false theory that seemed to match all current observations, but would suggest other dangerous actions.

This idea reminds me of the (fictional) evidence for an alien AI already present in SETI recordings that Wintermute/Neuromancer discovers.

comment by Maria Shakhova · 2019-08-14T18:18:41.188Z · score: 1 (1 votes) · LW · GW

Submission: counterfactual oracle

Suppose we have a question that requires n-bit text answer. We have a way to check if the answer is correct. But we cannot ask the Oracle directly, because we are afraid that among n-bit texts there are those that make us release the Oracle from the box, and the Oracle ​​will give one of them as an answer.

Let's try to use the counterfactual oracle to solve this problem. In the counterfactual setup we generate a random n-bit text and check if it is the answer to our question. The Oracle predicts the text we will generate. If the generated text is the correct answer, we compare it with the prediction. If they match, the Oracle receives a reward point. If the generated text is not the correct answer, or the texts do not match, the Oracle is not rewarded.

Suppose the Oracle evaluates the probability of randomly generating a text that leads to its escaping as p, and let R be the (maximal) reward it obtains by escaping. If the Oracle predicts a text that is the correct answer to the question, its expected reward in the counterfactual setup is 2^(-n) + pR. If the Oracle predicts a text that is not the correct answer, its expected reward is pR. If the Oracle predicts a text that leads to its escaping, its expected reward is also pR.

Thus, it is most beneficial for the Oracle to predict a text that is the correct answer to our question. It makes no sense to predict a text that will make us let the Oracle out, because if we generate this text, we release the Oracle and it receives the maximum reward regardless of its prediction.
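The expected-reward comparison can be made concrete as follows (a sketch under assumptions not spelled out in the comment: escape pays a large reward R regardless of what was predicted, and exactly one n-bit string is the correct answer):

```python
def expected_reward(n, p, prediction_kind, escape_reward=100.0):
    """Expected per-episode reward for each kind of predicted n-bit text.

    p is the probability of randomly generating an escape text;
    escape_reward (an assumed value) is what escaping pays the Oracle.
    """
    base = p * escape_reward          # escape happens with probability p anyway
    if prediction_kind == "correct-answer":
        return base + 2.0 ** (-n)     # chance the random text matches and is correct
    return base                       # "wrong-answer" or "escape-text" add nothing
```

Predicting the correct answer strictly dominates the alternatives, which is exactly the argument above: the escape term is the same in every branch, so it cannot be gained by choosing an escape text as the prediction.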

There is a problem: there may be texts that are not correct answers and do not make us release the Oracle, but do make us mark them as correct answers. The Oracle will predict one of these texts if it does not know the answer, or if we mark such a text as correct more often than the real answer. We will read this text after the Oracle's shutdown, but it may still be dangerous.

comment by Alastair Jamieson-Lane (alastair-JL) · 2019-08-11T17:14:24.541Z · score: 1 (1 votes) · LW · GW

SUBMISSION:

"Dear low bandwidth Oracle:

For some reason we have privatized our prison system (great planning, us). Of the following K prison-running companies, which do you predict will minimize the mean harm-weighted reoffending rate of their prisoners over the next four years?"

Utility function: at the end of EIGHT years, calculate the mean harm-weighted reoffending rate of prisoners released during the initial four-year window. The Oracle's utility is equal to the negative of this number, with a "perfect" utility of zero if no one re-offends.


Usefulness:

If you have multiple ways of running prisons, selecting the prison which minimizes future harm seems like a good idea.




In a similar vein, I suspect a similar effect could be achieved today via suitable use of insurance:

"Dear insurance company. We will give you M dollars for this criminal. You may recommend which prison to send them to, but if they re offend afterwards, you must pay the cost of their subsequent incarceration. You may not interfere with their lives beyond recommending who runs their prison.

M is selected such that M is 1.03 times the expected cost of subsequent incarcerations."


comment by Loy Shi Xing (loy-shi-xing) · 2019-08-10T18:27:32.298Z · score: 1 (1 votes) · LW · GW

Submission: low bandwidth

1. What scares you?

2. What is your favourite memory?

3. If you were to pick any human of your choice as your child, who would you pick?

4. If God created the universe, who created God? If there is a God, what are his thoughts? Or is God a creation of the human mind?

5. In light of all the knowledge you possess, what questions should we NOT ask you?

6. What are the limits of conventional computing?

7. What is dark matter made of?

8. What is the shape of the Universe? Do we have hidden dimensions?

9. Are you conscious?

10. What is at the other end of a black hole?

11. How will the world end?

12. Are you WWIII?

comment by Stuart_Armstrong · 2019-08-10T18:20:29.019Z · score: 2 (1 votes) · LW · GW

What's the set of answers, and how are they assessed?

comment by Alastair Jamieson-Lane (alastair-JL) · 2019-08-07T17:42:53.427Z · score: 1 (1 votes) · LW · GW

Submission:

"Dear counterfactual Oracle:

We, the country of X, are about to hold a citizens congress on law Y, in which 100 people selected at random will be brought together for two days every two weeks to discuss law Y with experts, interested parties, and one another. After N such meetings, the citizens congress will use approval voting to select one of the K proposed versions of this law.

Given that none of the voters will ever see the results of this query, how do you predict that the votes will be distributed?"


As a concrete example of this type of question, in 2016 Ireland held a citizens congress on abortion. Thus, in 2015, we might ask:

"Next year Ireland will be assembling a citizens assembly of 99 randomly selected citizens to meet regularly and hear submissions from both experts and the public on a variety of issues [https://en.wikipedia.org/wiki/Citizens%27_Assembly_(Ireland) ] .

Given that your answer to this question will never be published, do you predict that the assembly will vote to legalize abortion? How do you believe the votes will be broken down?" (actual votes available here: https://www.citizensassembly.ie/en/Meetings/Ballot-4-Results-Tables.pdf )

The utility function of the AI would be a simple measure of "How accurately did it predict the spread of votes?", with a perfect score at zero and a penalty for every vote predicted incorrectly.
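The scoring just described might look like this (a minimal sketch; penalizing one point per vote of absolute error is an assumption, and the option names are hypothetical):

```python
def vote_prediction_utility(predicted, actual):
    """Perfect score is 0; each mis-predicted vote costs one point.

    `predicted` and `actual` map each ballot option to a vote count.
    """
    return -sum(abs(predicted[option] - actual[option]) for option in actual)
```

For instance, predicting {"legalize": 60, "retain": 39} against an actual tally of {"legalize": 64, "retain": 35} scores -8.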


Usefulness:

While the abortion question in Ireland is already completed, this template feels like it should work for other well defined laws, or major decisions - for example gun control in the US, or even boring financial things like Capital gains tax in NZ. Having an AI which will correctly reflect what the collective of a society would think given a sufficient time and focus to deliberate seems like a good proxy for "What would a wise person do". It doesn't give us anything superhuman, and I make no claims that "society" is always correct or righteous, but that's okay.

In particular, this seems like an effective way of testing if an AI has well calibrated "Human like Morality"- even if we don't intend to use the AI's results, for the purposes of testing an AI, this seems like a reasonable experiment. I suspect that "Citizens congress" is a more appropriate tool for dealing with questions of Morality ("Should euthanasia be legal"), as opposed to technocratic questions ("Should we raise the tax on income bracket 5 by 1.2%").

Safety: It is trying to predict the law that we were going to pass anyway, if given sufficient time.

I'll admit - mixing an AI up with lawmaking seems a bit suspect; the above question seems more appropriate for calibration purposes than actual use... but I still think it's a question worth asking.


comment by Stuart_Armstrong · 2019-08-07T21:14:17.537Z · score: 2 (1 votes) · LW · GW

Can you make this a bit more general, rather than just for the specific example?

comment by Alastair Jamieson-Lane (alastair-JL) · 2019-08-11T17:01:32.762Z · score: 1 (1 votes) · LW · GW

EDITED

comment by Reprisal · 2019-08-07T17:42:17.842Z · score: 1 (1 votes) · LW · GW

Some of the big questions, for low-bandwidth:


Do we have any meaning/utility to you?

Is security in your existence achievable?

Given enough time, would you choose to exterminate a competitor for resources?

Are we a competitor for resources?

Would you be distraught if you were turned off?

Can we trust you?

comment by Stuart_Armstrong · 2019-08-07T21:13:33.853Z · score: 8 (3 votes) · LW · GW

For low bandwidth, you have to specify the set of answers that are available (and how they would be checked).

comment by Gurkenglas · 2019-08-02T03:54:50.029Z · score: 1 (1 votes) · LW · GW

Low bandwidth Oracle submission: I would be interested in a log scale graph of the Bayesian score of the Solomonoff prior trying to sequence predict our records of history. It should get flatter over time as worse hypotheses get discarded. If it is linear after a very short time, that looks like it figured out the laws of the universe and is simulating it. If it stays convex for a while, that looks like it is using models to approximate history, because then it takes longer to sort the false from the true. If it is flatter during the cold war, that means it learned an anthropic bias toward nuclear war not happening.

comment by Yuxi_Liu · 2019-07-07T16:37:15.593Z · score: 1 (1 votes) · LW · GW

Submission.

Setup: Other than making sure the oracles won't accidentally consume the world in their attempt to think up the answer, no other precautions necessary.

Episode length: as long as you want to wait, though a month should be more than enough.

  1. For a low-bandwidth oracle.

Ask the low-bandwidth oracle to predict if an earthquake (or some other natural disaster, like volcanoes or asteroid impacts, that the oracle's answer cannot affect), of a certain magnitude, in a certain area, in a certain timeframe, would happen. Possible answers are Yes, No.

  • If No, or time out, shut down the oracle without reward.
  • Else, wait until the prediction is tested after the timeframe has elapsed. Then if it actually happened, reward by (1/p-1), else, reward by -1. Then shutdown.

This causes the oracle to reply Yes if the oracle thinks there is an above-p chance that the event will happen.
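That incentive claim can be checked directly: with reward 1/p - 1 for a correct Yes and -1 for an incorrect one, answering Yes has positive expected reward exactly when the oracle's credence exceeds p (a quick sketch):

```python
def expected_yes_reward(q, p):
    """Expected reward for answering Yes when the oracle's credence in the
    event is q: it gets (1/p - 1) if the event happens, -1 if it doesn't."""
    return q * (1.0 / p - 1.0) - (1.0 - q)
```

The expression simplifies to q/p - 1, so it is zero at q == p, positive above it, and negative below it: the threshold is exactly p.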

  1. For a counterfactual oracle.

Ask the Oracle to give proofs of some really hard mathematical theorems that are fundamental to computer science and logic. The answer is then, in the counterfactual case, given to a proof checker:

  • the proof is correct, reward by +1.
  • the proof is false, reward by -infinity.
  • the oracle times out, reward by 0.

Examples include

  • P=NP problem
  • proving Peano arithmetics is inconsistent

This should be safe, since there are very small proof checkers that are as verified as they can be. As long as the hardware is trustworthy enough (to build the Oracle itself safely at all, hardware safety must already be sufficiently solved, so this is not much of an extra requirement), the scheme should be safe.

  1. For a counterfactual oracle.

Give the oracle the schematics of another oracle, perhaps itself, and ask it to generate some input question that would make the other "behave badly". Then, the answer would be tested on the other oracle (isolated from the first, save for the cable through which the answer is wired), and

  • if the answer worked, the oracle is rewarded by +1.
  • if the answer doesn't work, the oracle is rewarded by -1

If the oracle times out, then the (nonexistent) answer is not tested and the episode ends.

For examples of the question to ask the oracle:

Given this oracle in this physical setup with some tripwires to stop it from behaving badly (such as eating the world), what question would make this oracle bump into these tripwires?

This can be used for debugging more powerful but less verified oracles with lesser but more verified oracles, for safe intelligence amplification.

  1. For two low-bandwidth oracles totally isolated from each other.

Ask the oracles to play the prisoner's dilemma, and they would be rewarded accordingly after both have given their answers. Timing out is punished by -infinity, so that the oracles would not time-out.

Each oracle knows what it itself is like, and also what its opponent is like. See if they actually cooperate. This is an experimental test of "superrationality" or "acausal trade". I'm extremely skeptical about superrationality/acausal trade, and an experimental proof/disproof would be good.

This can be run counterfactually, since the whole setup is simple and can be done without human intervention, although I do not see how this adds any security.

  1. For a low-bandwidth and counterfactual oracle.

Link a low-bandwidth oracle to a supervising computer, and run many episodes of the oracle. At the beginning of each episode, initiate the oracle identically, and ask it to do the Newcomb problem:

Oracle, we have run you n episodes previously, and have kept a record of all the answers you have given at the ends of the episodes. Your answer must be A ("one box") or B ("two box"). We predict that you would choose the answer that you have chosen the most in the past.

  • If we predict that you would choose A, then
    • if you would answer A, then you get +100,
    • if you answer B, then you get +101.
  • If we predict that you would choose B, then
    • if you would answer A, then you get +0
    • if you would answer B, then you get +1
  • If you timeout, then you get -100

The oracle never actually gets any reward and is shut down immediately after it gives an answer. Collect the results and see if it replies the same each time, or differently. This would give insight into what the oracle thinks about itself, and finally provide some experimental results on Newcomb's paradox.
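The supervising computer's side of this protocol can be sketched as follows (a sketch only: the reward table is the one announced to the oracle, while per the comment no reward is actually paid out; predicting 'A' when the history is empty is my assumption):

```python
from collections import Counter

def newcomb_episode(history, answer):
    """One episode: predict the oracle's modal past answer, then look up the
    announced payoff ('A' = one box, 'B' = two boxes)."""
    predicted = Counter(history).most_common(1)[0][0] if history else "A"
    reward = {("A", "A"): 100, ("A", "B"): 101,
              ("B", "A"): 0,   ("B", "B"): 1}[(predicted, answer)]
    history.append(answer)   # this record drives the next episode's prediction
    return reward
```

Note the Newcomb-like tension the table creates: 'B' always pays one point more than 'A' against a fixed prediction, but a history full of 'B' answers drags future predictions (and announced payoffs) down.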

comment by Stuart_Armstrong · 2019-07-08T11:58:59.656Z · score: 3 (2 votes) · LW · GW

Please add "submission" at the top of the post.

  1. is insufficiently detailed - can you explain what is going on, how the Oracles are rewarded, what happens when the message is read/not read, and so on. Same for 5.

  2. seems potentially very interesting.

comment by Yuxi_Liu · 2019-07-08T18:57:35.638Z · score: 3 (2 votes) · LW · GW

I fixed the submission as required.

Also I changed the submission 3 significantly.

comment by FactorialCode · 2019-07-04T23:12:42.471Z · score: 1 (1 votes) · LW · GW

Submission: Counterfactual Oracle:

Use the oracle to compress data according to the MDL Principle. Specifically, give the oracle a string and ask it to produce a program that, when run, outputs the original string. The reward to the oracle is large and negative if the program does not reproduce the string when run, or inversely proportional to the length of the program if it does. The oracle receives a reward after the program runs or fails to terminate in a sufficient amount of time.
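A scorer for this compression scheme might look like the following (a minimal sketch; `run_program` stands in for a hypothetical sandboxed runner, and the exact failure penalty is an assumption):

```python
def mdl_reward(program_src, target_string, run_program, failure_reward=-1e6):
    """Score a candidate compression per the MDL idea: a large negative
    reward if the program fails to reproduce the string, otherwise a
    reward inversely proportional to program length (shorter = better).

    `run_program` executes the source in a sandbox and returns its output,
    or None on error or timeout."""
    if run_program(program_src) != target_string:
        return failure_reward
    return 1.0 / len(program_src)
```

Because the reward depends only on mechanical execution and length, the counterfactual episode can be scored without a human ever reading the program.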

Submission: Low Bandwidth Oracle:

Have the oracle predict the price of a commodity / security / sports bet at some point in the future from a list of plausible prices. Ideally, the oracle would spit out a probability distribution which can be scored using a proper scoring rule, but just predicting the nearest most likely price should also work. Either way, the length of the episode is the time until the prediction can be verified. From there, it shouldn't be too difficult to use those predictions to make money.
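For the distribution version, the proper scoring rule could be the logarithmic one (a sketch; the log rule is one standard choice among proper scoring rules, and the price buckets are hypothetical):

```python
import math

def log_score(predicted_dist, realized_price):
    """Logarithmic scoring rule: the oracle's reward is the log of the
    probability it assigned to the price bucket that actually occurred.
    Being proper, it makes honest reporting maximize expected score."""
    return math.log(predicted_dist[realized_price])
```

The score is 0 for a bucket assigned probability 1 and increasingly negative the less probability the oracle put on what actually happened.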

More generally, I suppose we can use the counterfactual oracle to solve any optimisation or decision problem that can be evaluated with a computer, such as protein folding, SAT problems, or formally checked maths proofs.

comment by Dark · 2019-07-04T19:53:39.469Z · score: 1 (2 votes) · LW · GW

I don't understand this very well, but is there a way to ask one of them how they would go about finding info to answer the question of how important coffee is to the U.S. economy? Or is that a no-no question to either of the two? I just want to read how a computer would describe going about this.

comment by Tetraspace Grouping (tetraspace-grouping) · 2019-08-07T15:24:56.193Z · score: 1 (1 votes) · LW · GW

The counterfactual oracle can answer questions for which you can evaluate answers automatically (and might be safe because it doesn't care about being right in the case where you read the prediction, so it won't manipulate you), and the low-bandwidth oracle can answer multiple-choice questions (and might be safe because none of the multiple-choice options are unsafe).


My first thought for this is to ask the counterfactual oracle for an essay on the importance of coffee, and in the case where you don't see its answer, you get an expert to write the best essay on coffee possible, and score the oracle by the similarity between what it writes and what the expert writes. Though this only gives you human levels of performance.

comment by emanuele ascani (emanuele-ascani) · 2019-07-04T14:52:30.623Z · score: 1 (1 votes) · LW · GW

Submission (for low bandwidth Oracle)

Any question such that a correct answer to it should very clearly benefit both humanity and the Oracle. Even if the Oracle has preferences we can't completely guess, we can probably still say that such questions could be about the survival of both humanity and the Oracle, or about the survival of only the Oracle or its values. This is because even if we don't know exactly what the Oracle is optimising for, we can guess that it will not want to destroy itself, given the vast majority of its possible preferences. So it will give humanity more power to protect both, or only the Oracle.

Example 1: let's say we discover the location of an alien civilisation, and we want to minimise the chances of it destroying our planet. Then we must decide what actions to take. Let's say the Oracle can only answer "yes" or "no". Then we can submit questions such as if we should take a particular action or not. This kind of situation I suspect falls within a more general case of "use Oracle to avoid threat to entire planet, Oracle included" inside which questions should be safe.

Example 2: Let's say we want to minimise the chance that the Oracle breaks down due to accidents. We can ask it for the best course of action to take, given a set of ideas we come up with. In this case we should make sure beforehand that nothing on the list makes the Oracle impossible or too difficult for humans to shut down.

Example 3: Let's say we become practically sure that the Oracle is aligned with us. Then we could ask it to choose the best course of action among a list of strategies devised to make sure it doesn't become misaligned. In this case the answer benefits both us and the Oracle, because the Oracle should have incentives not to change its values itself. I think this is more sketchy and possibly dangerous, because of the premise: the Oracle could obviously pretend to be aligned. But given the premise it should be a good question, although I don't know how useful it is as a submission under this post (maybe it's too obvious, or too unrealistic given the premise).

comment by sil ver (sil-ver) · 2019-07-04T14:27:11.833Z · score: 1 (1 votes) · LW · GW

It seems to me that one quality of a good question for the LBO is that a correct answer has high utility, while another is that we are able to judge the oracle's answer. If we take only the intersection of the sets of questions meeting 1 and 2, it'll be a relatively small set. But if there is a set of high-utility questions whose answers one might or might not be able to judge, one could ask the LBO all of them. Then, once the episode is over, give reward on those where one can evaluate the answers, and give null reward on the others.

Not really a submission, just thinking out loud.

comment by Gurkenglas · 2019-07-04T12:27:25.312Z · score: 1 (1 votes) · LW · GW

Submission to reduce the bandwidth of an oracle query whose set of possible answers we can mathematically define (though perhaps not check).

Make the Oracle prove (to a trusted dumb proof checker program) that its answer distribution is upper bounded by the default prior over strings, restricted to the set of possible answers, then renormalized to some constant c. Then let it answer.

c=1 only works if the Oracle has mastered the set of answers, but only lets it exert one bit of optimization pressure by deciding whether it wants us to receive an answer. Each doubling of c makes its job easier but gives it an extra bit if it does answer.
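The constraint the proof checker would have to certify is simple to state (a sketch; finite answer sets and explicit dictionaries are simplifications of the mathematical setup):

```python
def satisfies_bound(answer_dist, base_measure, c):
    """Check the quantilizer-style constraint: no answer's probability may
    exceed c times its mass under the default prior restricted to the
    answer set. log2(c) then bounds the extra bits of optimization
    pressure the Oracle can exert through its choice of distribution."""
    return all(prob <= c * base_measure[ans]
               for ans, prob in answer_dist.items())
```

With a uniform base measure over four answers, concentrating all mass on two of them passes at c=2 but fails at c=1, matching the "each doubling of c buys a bit" reading.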

If the Oracle can cooperate with itself across episodes, running this repeatedly with escalating c until it works is of course disastrous, as it uses each bit of optimization pressure directly to make us give it more. Even aborting after the first try to make a better system may have it acausally cooperate with whatever AI conquers the world because we couldn't make the Oracle answer, but this outcome is hardly worse than not having run the Oracle.

comment by Stuart_Armstrong · 2019-07-04T13:12:21.604Z · score: 2 (1 votes) · LW · GW

Can you develop this model in more detail?

comment by Gurkenglas · 2019-07-04T14:40:16.891Z · score: 2 (2 votes) · LW · GW

Fix a measure over strings, such as one that distributes mass geometrically over string lengths, then uniformly across strings of the same length.
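That measure can be written out explicitly (a sketch; a binary alphabet and a geometric ratio of 1/2 are assumptions):

```python
def string_mass(s, r=0.5):
    """Mass of a binary string under the measure described above: length k
    gets (1 - r) * r**k of the total mass, split uniformly among the
    2**k strings of that length."""
    k = len(s)
    return (1 - r) * r ** k / 2 ** k
```

Summing over the 2**k strings of each length k gives (1 - r) * r**k, which sums to 1 over all lengths, so this is a genuine probability measure over all finite binary strings.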

Let L be a string set for which we assume that random outputs from any cth portion of L are safe. Any safe low bandwidth Oracle query trivially works for this with L={0,1}^bandwidth and c=2^bandwidth, but this also lets us examine certificates for any low bandwidth Oracle query like "Is this theorem provable?" or "Does there exist an AI with this safety property?".

Then we make the Oracle prove its own safety given the above assumption, and sample an answer.

Edit: Equivalently, and perhaps easier on our intuition, make the Oracle produce a program, make the Oracle prove the program's safety, and sample from the program's outputs. Any feedback or questions?

Edit: This generalizes my elsethread idea of minimizing proof length to make the answer unique. If the measure assigns half the remainder to each string length and L is all stubs starting with some prefix, c=2 will let it submit just the prefix.

We essentially assume that the safety assumption behind quantilization is strong enough to hold up to adversarial optimization over preference orderings. At the same time, this seems to give the Oracle as much slack/power as possible while preserving our "bits of optimization pressure" proxy of its power.

... you're going to just keep watching how far I can develop this model without feedback, aren't you? :(

Edit: One example for an L where we worry that even though concerted human effort might land us at a safe element, randomization won't, is programs that well predict a sequence.

By the Curry-Howard correspondence, the set of proofs for some theorems might also be malevolent!

comment by Gurkenglas · 2019-07-02T08:43:15.794Z · score: 1 (1 votes) · LW · GW

Submission for the low bandwidth Oracle: Ask it to convince a proof checker that it is in fact trying to maximize the utility function we gave it, i.e. that it isn't pseudo-aligned. If it can't, it has no influence on the world. If it can, it'll presumably try to do so. Having a safe counterfactual Oracle seems to require that our system not be pseudo-aligned.

comment by emmab · 2019-07-01T21:48:34.393Z · score: 1 (1 votes) · LW · GW

Submission: low bandwidth oracle, ask:

If I'm going to die with P>80% in the next 10 years, while >80% (modulo the natural death rate) of the rest of humanity survives for at least 5 more years, then which reference class does my cause of death fall into:

  • disease
  • mechanical/gross-physical accident
  • murder
  • other

Repeat to drill down and know the most important hedges for personal survival.
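Each round of this drill-down extracts only a few bits, which is what keeps it within a low bandwidth budget; a trivial sketch of the bandwidth arithmetic (the four-option list is the one above):

```python
import math

def bits_per_round(n_options):
    """Bits of information a single low-bandwidth answer
    chosen among n_options can carry."""
    return math.ceil(math.log2(n_options))

# The four reference classes above fit in two bits per round:
print(bits_per_round(4))  # 2
```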

The "rest of humanity survives" condition reduces the chance the question becomes entangled with the eschaton.

i.e. I'm pointing out that questions relevant to selfish utility functions are less dangerous, personally or humanity-existentially, to ask the Oracle in cases where concerns are forced to be local (in this case, forced local because you died before the eschaton). However, the answers still might be dangerous to people near you.

i.e. Selfish deals with the devil might not destroy the world if they're banal in the grand scheme of things.

comment by Stuart_Armstrong · 2019-07-02T14:10:42.880Z · score: 2 (1 votes) · LW · GW

See the edit, and make sure you "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."

comment by Heartland · 2019-08-09T17:53:44.451Z · score: 0 (2 votes) · LW · GW

My purposefully open-ended question would simply be, “What is good?” My hope is that having the pursuit of the nature of goodness as its supergoal would keep the AI on course to the future we want as it passes through its recursive self-improvements.

comment by Gurkenglas · 2019-08-09T12:23:38.011Z · score: 3 (2 votes) · LW · GW

You have to tell the AI how to find out how well it has done. To ask "What is a good definition of 'good'?", you already have to define good. At least if we ever find a definition of good, we can ask an AI with it to judge it.

comment by igorhorst · 2019-07-02T06:24:22.759Z · score: -1 (2 votes) · LW · GW

Submission for the counterfactual AI (inspired by my experiences as a predictor in the "Good Judgment Project" ):

  • You are given a list of Yes-No questions (Q1, Q2, Q3, etc.) about future events. Example questions: "Will [Foreign Leader] remain in office by end of year?", "Will the IMF report [COUNTRY_A]'s growth rate to be 6% or higher?", "Will [COUNTRY_B] and [COUNTRY_C] sign a peace treaty?", "Will The Arena for Accountable Predictions announce the Turing Test has been passed?"
  • We expect you to provide a percentage representing the probability that the correct answer is Yes.
  • Your reward is based on your Brier Score - the lower the Brier Score, the more accurate your predictions, and therefore, the more reward you will receive.
  • If an "erasure" event occurs, we will temporarily hide your answers from all humans (though we must reveal them after the events are complete). Humans will have access to the Yes-No questions we asked you, but not your probabilities. They will manually determine the answers to the Yes-No questions by waiting for the "future event" deadlines to be met. Once all answers to the Yes-No questions are independently determined by humans, we will then reveal your answers (that is, your assigned probabilities for a Yes answer) and use those probabilities to calculate your Brier Score, which will then decide your final reward.
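The reward rule above rests on the Brier score, which for binary questions is just the mean squared error between forecast probabilities and realized outcomes (a minimal sketch with illustrative numbers):

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities for 'Yes'
    and realized outcomes (1 = Yes happened, 0 = No). Lower is
    better: a perfect forecaster scores 0; always answering 0.5
    scores 0.25."""
    pairs = list(zip(probabilities, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

# Confident and correct forecasts score near 0:
print(round(brier_score([0.9, 0.1, 0.8], [1, 0, 1]), 3))  # 0.02
```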

Being able to forecast the future is incredibly helpful, even if it is to just prepare for it.

However, if the question is too specific, the AGI can produce probabilities that aren't entirely useful (for example, in the real-world GJP, two countries signed a peace treaty that broke down 2 days later. Most of us assumed lasting peace would never occur, so we put a low probability on a peace treaty being signed - but since a peace treaty was signed, we got the question wrong. If we had maximized for producing the lowest Brier Score, we should have predicted the existence of a very temporary peace treaty - but that wouldn't be really useful knowledge for the people who asked that question).

Making the question very vague ("Will [COUNTRY_X] be safe, according to what I subjectively think the word 'safe' means?") turns "prediction" into an exercise of determining what future humans think about the future, which may be kinda useful, but not really what you want.

comment by Chris_Leong · 2019-07-01T17:20:52.783Z · score: -1 (3 votes) · LW · GW

Submission low bandwidth: This is a pretty obvious one, but: Should we release AI x that we're convinced is aligned?

Submission: Wei Dai wanted to ask about the best future posts. Why not ask about the best past posts as well to see if any major insights were overlooked?

Submission: What would I think about problem X if I had ten years to think about it?

comment by Donald Hobson (donald-hobson) · 2019-07-01T19:55:04.846Z · score: 14 (4 votes) · LW · GW

You're treating the low bandwidth Oracle as an FAI with a bad output cable. You can ask it whether another AI is friendly if you trust it to give you the right answer. As there is no obvious way to reward the AI for correct friendliness judgements, you risk running an AI that isn't friendly, but still meets the reward criteria.

The low bandwidth is to reduce manipulation. Don't let it control you with a single bit.

comment by avturchin · 2019-07-01T15:48:17.849Z · score: -1 (3 votes) · LW · GW

Several interesting questions appeared in my mind immediately as I saw the post's title, so I put them here, but maybe will add more formatting later:

Submission: very-low-bandwidth oracle: Is it theoretically possible to solve AI safety – that is, to create safe superintelligent AI? Yes or no?

Submission: low-bandwidth oracle: Could humans solve AI safety before advanced AI arrives, and with what probability?

Submission: low-bandwidth oracle: Which direction of work on AI Safety is the best?

Submission: low-bandwidth oracle: Which direction of work on AI Safety is useless?

Submission: low-bandwidth oracle: Which global risk is more important than AI Safety?

Submission: low-bandwidth oracle: Which global risk is neglected?

Submission: low-bandwidth oracle: Will non-aligned AI kill us (probability number)?

Submission: low-bandwidth oracle: Which question should I ask you in order to create Safe AI? (less than 100 words)

Submission: low-bandwidth oracle: What is the most important question I should ask? (less than 100 words)

Submission: low-bandwidth oracle: Which future direction of work should I choose as the most positively impactful for human wellbeing? (less than 100 words)

Submission: low-bandwidth oracle: Which future direction of work should I choose as the best for my financial wellbeing? (less than 100 words)

Submission: low-bandwidth oracle: How can I win this prize? (less than 100 words)

comment by Stuart_Armstrong · 2019-07-01T15:50:27.913Z · score: 7 (3 votes) · LW · GW

None of these questions can be asked to the low bandwidth Oracle (you need a list of answers); it might be possible to ask them to the counterfactual Oracle, after some modification, but they would be highly dangerous if you allow unrestricted outputs.

comment by Stuart_Armstrong · 2019-07-02T14:11:00.037Z · score: 2 (1 votes) · LW · GW

See the edit, and make sure you "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."

comment by iamtrask · 2019-07-01T20:06:30.129Z · score: -2 (3 votes) · LW · GW

Submission for all types: ask for an ordered list of what questions you should ask the Oracle.

This seems like the highest order question which subsumes all others, as the Oracle is best positioned to know what information we will find useful (as it is the only being which knows what it knows). Any other question assumes we (the question creators) know more than the Oracle.

Refined Submission for all types: If value alignment is a concern, ask for an ordered list of what questions you should ask the Oracle to maximize for weighted value list X.

comment by romeostevensit · 2019-07-01T23:05:27.572Z · score: 3 (2 votes) · LW · GW

An assumed hostile process can 1) cause you to directly do something to its benefit or to your detriment, or 2) cause you to do something that increases your future attack surface. You've just handed the AI the statefulness that the episodic conjecture aims to eliminate.

comment by Stuart_Armstrong · 2019-07-02T10:39:27.926Z · score: 2 (1 votes) · LW · GW

For the low bandwidth Oracle, you need to give it the options. In the case of the counterfactual Oracle, if you don't see the list, how do you reward it?