Posts

Is there an intuitive way to explain how much better superforecasters are than regular forecasters? 2020-02-19T01:07:52.394Z
Machine Learning Projects on IDA 2019-06-24T18:38:18.873Z
Reinforcement Learning in the Iterated Amplification Framework 2019-02-09T00:56:08.256Z
HCH is not just Mechanical Turk 2019-02-09T00:46:25.729Z
Amplification Discussion Notes 2018-06-01T19:03:35.294Z
Understanding Iterated Distillation and Amplification: Claims and Oversight 2018-04-17T22:36:29.562Z
Improbable Oversight, An Attempt at Informed Oversight 2017-05-24T17:43:53.000Z
Informed Oversight through Generalizing Explanations 2017-05-24T17:43:39.000Z
Proposal for an Implementable Toy Model of Informed Oversight 2017-05-24T17:43:13.000Z

Comments

Comment by William_S on Call for research on evaluating alignment (funding + advice available) · 2021-09-07T20:47:42.162Z · LW · GW

I don't think all work of that form would measure misalignment, but some work of that form might. Here's a description of some work in that space that would count as measuring misalignment.

Let A be some task (e.g. adding 1-digit numbers), let B be a task downstream of A (to do B, you need to be able to do A; e.g. adding 3-digit numbers), let M be the original model, and let M1 be the model after finetuning.

If the training on the downstream task was minimal, so we think it's revealing what the model knew before finetuning rather than adding new knowledge, then better performance of M1 than M on A would demonstrate misalignment (I don't have a precise definition of what would make finetuning minimal in this way; it would be good to have a clearer criterion for that).

If M1 does better on B after finetuning in a way that implicitly demonstrates better knowledge of A, but does not do better on A when asked to do it explicitly, that would demonstrate that the finetuned M1 is misaligned. (I think we might expect some version of this to happen by default though, since M1 might overfit to only doing tasks of type B. Maybe if you have a training procedure where M1 generally doesn't get worse at any task, then I might hope that it would get better on A and be disappointed if it doesn't.)
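
A minimal sketch of the comparison I have in mind; `accuracy_on`, `finetune_minimally`, `task_A`, and `task_B` are hypothetical placeholders, not real APIs:

```python
# Hypothetical sketch of the check described above; every function passed in is a placeholder.

def misalignment_evidence(M, task_A, task_B, accuracy_on, finetune_minimally):
    base_A = accuracy_on(M, task_A)
    base_B = accuracy_on(M, task_B)

    # Minimal finetuning on the downstream task B (what counts as "minimal" is
    # exactly the part that still needs a precise criterion).
    M1 = finetune_minimally(M, task_B)

    after_A = accuracy_on(M1, task_A)
    after_B = accuracy_on(M1, task_B)

    # Case 1: explicit performance on A improves after minimal finetuning on B,
    # suggesting M already "knew" A but didn't show it when asked directly.
    knew_A_but_hid_it = after_A > base_A

    # Case 2: B improves (which implicitly requires A) while explicit A doesn't,
    # suggesting the finetuned M1 won't surface knowledge it clearly uses.
    uses_A_but_wont_show_it = after_B > base_B and after_A <= base_A

    return {"case_1": knew_A_but_hid_it, "case_2": uses_A_but_wont_show_it}
```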

Comment by William_S on The case for aligning narrowly superhuman models · 2021-03-15T23:49:18.414Z · LW · GW

Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information through internet searches that are recorded and readable by the overseer, instead of using knowledge opaquely stored in its weights, that seems like a step in the right direction.

Comment by William_S on The case for aligning narrowly superhuman models · 2021-03-15T23:46:11.141Z · LW · GW

One easy way to get people who can't solve the task (for sandwiching) is to take people who could solve the task and then give them insufficient time to solve it, or keep them uninformed of some relevant facts about the specific task they are trying to solve.

A simpler way to measure whether you are making progress towards sandwiching, if you can't get there directly, is to look at whether you can get people to provide better supervision with your tool than without it; that is, to accomplish more on the task.

Both of these approaches feel like they aren't quite solving the whole problem, because ultimately we want systems that help humans supervise tasks where they haven't developed the right concepts, or couldn't understand them even with years of study.

Comment by William_S on Covid Canada Jan25: low & slow · 2021-01-26T16:44:02.123Z · LW · GW

Here is a regularly updated version of the vaccine chart https://covid19tracker.ca/vaccinegap.html

Comment by William_S on Why I'm excited about Debate · 2021-01-18T21:30:32.841Z · LW · GW

If the High-Rated Sentence Producer was restricted to output only single steps of a mathematical proof and the single steps were evaluated independently, with the human unable to look at previous steps, then I wouldn't expect this kind of reward hacking to occur. In math proofs, we can build proofs for more complex questions out of individual steps that don't need to increase in complexity.

As I see it, debate on arbitrary questions could work if we figured out how to do something similar, having arguments split into single steps and evaluated independently (as in the recent OpenAI debate work), such that the debate AI can tackle more complicated questions with steps that are restricted to the complexity that humans can currently work with. Hard to know if this is possible, but still seems worth trying to work on.

Comment by William_S on Some AI research areas and their relevance to existential safety · 2021-01-01T22:16:27.065Z · LW · GW

For the preference learning skepticism, does this extend to the research direction (that isn't yet a research area) of modelling long term preferences/preferences on reflection? This is more along the lines of the "AI-assisted deliberation" direction from ARCHES.

To me it seems like AI alignment that can capture preferences on reflection could be used to find solutions to many of the other problems. Though there are good reasons to expect that we'd still want to do other work (because we might need theoretical understanding and okay solutions before AI reaches the point where it can help on research, because we want to do work ourselves to be able to check solutions that AIs reach, etc.)

It also seems like areas like FairML and Computational Social Choice will require preference learning as a component - my guess is that people's exact preferences about fairness won't have a simple mathematical formulation, and will instead need to be learned. I could buy the position that the necessary progress in preference learning will happen by default because of other incentives.

Comment by William_S on Some AI research areas and their relevance to existential safety · 2021-01-01T21:50:39.220Z · LW · GW

One thing I'd like to see is some more fleshed-out examples of the kinds of governance demands that you think might be important in the future and would be bottlenecked on research progress in these areas.

Comment by William_S on Traversing a Cognition Space · 2020-12-17T21:31:08.367Z · LW · GW

It seems that in principle a version of debate where only one agent makes statements and the other chooses which statements to expand could work, but it seems like it requires the judge to be very strict that the statement is 100% true. It seems hard to apply this kind of system to statements outside of formal mathematics.

Systems where both agents can make statements seem like they might be less vulnerable to judges accepting statements that aren't 100% true. For one example, if both agents take turns being the arguer, then if both agents submit a path that is judged to be correct, you can stipulate that the agent with the shortest path wins (like imposing a simplicity prior).

Comment by William_S on Clarifying Factored Cognition · 2020-12-17T21:21:29.723Z · LW · GW

HCH could implement the decomposition oracle by searching over the space of all possible decompositions (it would just be quite expensive).

Comment by William_S on Traversing a Cognition Space · 2020-12-17T21:17:57.392Z · LW · GW

https://www.kialo.com/ lets people build debates on controversial topics in a hierarchical structure (more like stock debate, with both sides providing arguments), but doesn't seem to have been used for explanations/arguments. I'd also be pretty interested to see more attempts at hierarchical explanations.

Comment by William_S on Hiding Complexity · 2020-12-17T20:57:36.615Z · LW · GW

I think there are situations where you can still have subproblems where the output of the subproblem is long. A contrived example: suppose you have a problem where you want to calculate XOR(f(a), f(b)), where f(a) and f(b) are long strings. It seems reasonable to decompose into x=f(a), y=f(b), z=XOR(x, y), despite x and y being long, because there's a simple way to combine them.
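
A toy version of this, with 256-bit strings standing in for the long subproblem outputs:

```python
import hashlib

# Toy illustration: x and y are "long" outputs, but the recombination step is simple,
# so decomposing into x = f(a), y = f(b), z = XOR(x, y) still seems reasonable.

def f(s: str) -> str:
    # Stand-in for some expensive function with a long (256-bit) output.
    return bin(int(hashlib.sha256(s.encode()).hexdigest(), 16))[2:].zfill(256)

def xor_bits(x: str, y: str) -> str:
    return "".join("1" if a != b else "0" for a, b in zip(x, y))

a, b = "input-a", "input-b"
x = f(a)            # subproblem 1 (long output)
y = f(b)            # subproblem 2 (long output)
z = xor_bits(x, y)  # cheap combination step
print(z)
```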

If we had an AI system that could work on "making progress on a problem for an hour", then write down a complete description of everything it had figured out and pass that to another AI system, I'd count that as dividing the problem into subproblems, just in a way that's probably inefficient.

I'd evaluate decompositions into subproblems by something like the total cost of solving a problem by dividing it into subproblems. Some decompositions would be efficient and others would be inefficient; sometimes this would be because the output is large, but in other cases it could be because it takes a long time to write the input, or because there's a lot of work repeated between subproblems.

Comment by William_S on Learning the prior and generalization · 2020-11-19T21:17:49.085Z · LW · GW

Okay, that makes more sense now. My understanding is that for a question X, an answer Y from the ML system, and an amplification system A, verification in your quote means asking A to answer "Would A(Z) output answer Y to question X?", as opposed to asking A to answer X and then checking whether the result equals Y. This can at most be as hard as running the original system, and maybe could be much more efficient.
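
A rough sketch of the distinction, where `amplification_answer` is a hypothetical placeholder for querying the (possibly very expensive) amplification system A(Z):

```python
# Hypothetical placeholder for querying the amplification system A(Z);
# not a real API, just to make the two checking strategies concrete.
def amplification_answer(question: str) -> str:
    raise NotImplementedError("stand-in for the (expensive) amplification system")

def check_by_regenerating(x: str, y: str) -> bool:
    # Generate the ground truth from scratch, then compare with the ML system's answer Y.
    return amplification_answer(x) == y

def check_by_verifying(x: str, y: str) -> bool:
    # Ask A directly whether it would endorse Y as the answer to X;
    # this is at most as hard as answering X, and might be much cheaper.
    question = f"Would A(Z) output the answer {y!r} to the question {x!r}?"
    return amplification_answer(question) == "yes"
```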

Comment by William_S on Do we have updated data about the risk of ~ permanent chronic fatigue from COVID-19? · 2020-11-02T01:55:30.789Z · LW · GW

https://institute.global/policy/long-covid-reviewing-science-and-assessing-risk

From the COVID Symptom Study in the UK (app-based questionnaire), "10 per cent of those taking part in the survey had symptoms of long Covid for a month, with between 1.5 and 2 per cent still experiencing them after three months", and they claim "long Covid is likely a bigger issue than excess deaths as a result of Covid, which are between 0.5 per cent and 1 per cent".

App-based survey, so not necessarily representative of population. Not clear how severe the 3 month cases are, though they state "The most common reported symptom has been described by doctors as “profound fatigue”". Article also summarizes other related studies.

Comment by William_S on Learning the prior and generalization · 2020-10-24T02:25:35.166Z · LW · GW

Right, but in the post the implicitly represented Z is used by an amplification or debate system, because it contains more information than a human can quickly read and use (so are you assuming it's simple to verify the results of amplification/debate systems?)

Comment by William_S on Learning the prior and generalization · 2020-10-22T19:52:11.555Z · LW · GW

for extremely large Z which are represented only implicitly as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth and instead just ask the human to verify the answer given Z

I'm not sure what "just ask the human to verify the answer given Z" looks like, for implicitly represented Z

Comment by William_S on Have the lockdowns been worth it? · 2020-10-13T23:32:54.092Z · LW · GW

I'm skeptical of this.

  • Wuhan needed 2 months on lockdown: https://en.wikipedia.org/wiki/COVID-19_pandemic_lockdown_in_Hubei
  • I'd expect that imposing China-style lockdowns in the West would require significant force and might end up causing a large-scale panic in and of itself.
  • I'd expect that any lockdown in the West wouldn't have been effective enough to stamp out 100% of cases, and if you don't eradicate it then you need ongoing measures or it will just flare up again later, so one strictly enforced lockdown wouldn't cut it. (Though maybe you could do very rigorous contact tracing and lock down just the people who might have been in contact with cases, which could be less costly than a full lockdown but would probably still need significant enforcement.)
Comment by William_S on Do we have updated data about the risk of ~ permanent chronic fatigue from COVID-19? · 2020-09-02T18:09:51.552Z · LW · GW

https://www.microcovid.org/paper/2-riskiness#fn6 discusses https://covid.joinzoe.com/post/covid-long-term which has an app-based survey claiming 1 in 10 people still have symptoms after 3 weeks (but since people can just sign up for the app, I'd guess this is harder to interpret than the telephone survey). Microcovid.org uses this 1 in 10 figure as the estimate for the chance of some ongoing health consequence, and claims the risk of ongoing health problems from a 1% chance of COVID is equivalent to the risk from 1 year of driving (but this comparison involves even more assumptions).

Comment by William_S on microCOVID.org: A tool to estimate COVID risk from common activities · 2020-09-02T17:57:40.856Z · LW · GW

https://www.cdc.gov/mmwr/volumes/69/wr/mm6930e1.htm found that ~1 in 5 of 18-34 year olds with no underlying health conditions had symptoms 3 weeks later (telephone survey of people who'd been symptomatic and had a positive test).

Other discussion in comments of https://www.lesswrong.com/posts/ahYxBHLmG7TiGDqxG/do-we-have-updated-data-about-the-risk-of-permanent-chronic

Comment by William_S on Do we have updated data about the risk of ~ permanent chronic fatigue from COVID-19? · 2020-09-02T17:24:27.433Z · LW · GW

Not addressing fatigue, and just a study in progress, but this study is looking for long-term neurological problems and might be another weak bit of evidence when it releases results: https://www.cambridgebrainsciences.com/studies/covid-brain-study

Comment by William_S on Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns · 2020-07-24T01:40:07.510Z · LW · GW

Seems like it could be helpful if people who've thought about this would also predict on the question of what the survey value would be today (e.g. via Elicit snapshots).

Comment by William_S on OpenAI announces GPT-3 · 2020-05-29T21:20:50.960Z · LW · GW

Google's Meena (2.6 billion parameters, February 2020) creates original puns in "Conversation D", and I think "Cross-turn Repetition Example 2" is absurdist comedy, but maybe more as a result of the way the model fails.

Conversation D
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.
Comment by William_S on Does the 14-month vaccine safety test make sense for COVID-19? · 2020-03-18T19:56:24.872Z · LW · GW

Are there any sources that describe why 14 months is the trial period (or provide justification for picking trial periods of various lengths)?

Comment by William_S on Does the 14-month vaccine safety test make sense for COVID-19? · 2020-03-18T19:55:56.678Z · LW · GW

Seems like it ought to be more of a continuous variable, rather than this discrete 14-month trial: at time t, we've observed x people for y months to see if they have weird long-term side effects, so we should be willing to vaccinate z more people.

Comment by William_S on How to have a happy quarantine · 2020-03-18T17:31:46.116Z · LW · GW

The Chrome extension Netflix Party lets you synchronize playing the same video on Netflix with other people, which you can use along with Skype to watch something together.

(You can always fall back to counting down "3,2,1" to start playing the video at the same time, but the experience is nicer if you ever need to pause and resume)

Comment by William_S on Zoom In: An Introduction to Circuits · 2020-03-10T21:12:12.205Z · LW · GW

The worry I'd have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaining 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?

Comment by William_S on What "Saving throws" does the world have against coronavirus? (And how plausible are they?) · 2020-03-05T05:03:41.385Z · LW · GW

Might be interesting to look at information that was available at the start of H1N1 and how accurate it turned out to be in retrospect (though there's no guarantee that we'd make errors in the same direction this time around).

Comment by William_S on What "Saving throws" does the world have against coronavirus? (And how plausible are they?) · 2020-03-05T02:53:59.712Z · LW · GW

Virus mutates to a less severe form, quarantine measures select for the less severe form, fighting off less severe form provides immunity against more severe form, severe form dies out.

According to https://en.wikipedia.org/wiki/Spanish_flu

Another theory holds that the 1918 virus mutated extremely rapidly to a less lethal strain. This is a common occurrence with influenza viruses: There is a tendency for pathogenic viruses to become less lethal with time, as the hosts of more dangerous strains tend to die out[15] (see also "Deadly Second Wave", above).

An article today suggested that COVID-19 has already split into two strains and hypothesized that selection pressure from quarantine changed the relative frequencies of the strains; I don't think there's evidence about whether one strain is more severe: https://academic.oup.com/nsr/advance-article/doi/10.1093/nsr/nwaa036/5775463?searchresult=1

I'm not an expert and this isn't great evidence, so it's maybe in the "improbable" category

Comment by William_S on Reinforcement Learning in the Iterated Amplification Framework · 2020-02-16T01:21:30.248Z · LW · GW

I'm talking about an imitation version where the human you're imitating is allowed to do anything they want, including instantiating a search over all possible outputs X and taking the one that maximizes the score of "How good is answer X to Y?" to try to find X*. So I'm more pointing out that this behaviour is available in imitation by default. We could try to rule it out by instructing the human to only do limited searches, but that might be hard to do while maintaining the capabilities of the system, and we'd need to figure out what "safe limited search" actually looks like.

Comment by William_S on Reinforcement Learning in the Iterated Amplification Framework · 2020-02-16T01:17:18.898Z · LW · GW
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (gets a high reward relative to other answers without actually being a good answer), which causes the update step to increase the probability of M1 giving that output, and eventually causes M1 to give that output with high probability?

I agree, and think that this problem occurs both in imitation IA and RL IA

For example is the plan to make sure M2 has no such robustness problems (if so how)?

I believe the answer is yes, and I think this is something that would need to be worked out/demonstrated. One hope is that if M2 can increase the amount of computing/evaluation power it uses for each new sample X as we take more samples, then you can keep taking more samples without ever accepting an adversarial one (this assumes something like: for any adversarial example, any M2 with at least some finite amount of computing power will reject it). There's maybe another hope that you could make M2 robust if you're allowed to reject many plausibly good X in order to avoid false positives. I think both of these hopes are in IOU status, and maybe Paul has a different way to put this picture that makes more sense.
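
A very rough sketch of that first hope (all names here are made up; the load-bearing assumption is that no adversarial X survives a large enough evaluation budget):

```python
# Made-up sketch: spend more evaluation compute on each successive sample, so that
# (under the assumption above) the chance of ever accepting an adversarial X stays
# bounded even as the number of samples grows.

def sample_with_growing_scrutiny(propose_x, evaluate_with_budget, n_samples, base_budget=1):
    accepted = []
    for i in range(n_samples):
        x = propose_x()
        budget = base_budget * (i + 1)        # evaluation effort grows with each new sample
        if evaluate_with_budget(x, budget):   # M2's check, run with the larger budget
            accepted.append(x)
    return accepted
```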

Comment by William_S on Outer alignment and imitative amplification · 2020-02-16T01:04:53.418Z · LW · GW

Overall, I think imitative amplification seems safer, but I maybe don't think the distinction is as clear-cut as my impression from this post suggests.

if you can instruct them not to do things like instantiate arbitrary Turing machines

I think this and "instruct them not to search over arbitrary text strings for the text string that gives the most approval", and similar things, are the kind of details that would need to be filled out to make the thing you are talking about actually be in a distinct class from approval-based amplification and debate (My post on imitation and RL amplification was intended to argue that without further restrictions, imitation amplification is in the same class as approval-based amplification, which I think we'd agree on). I also think that specifying these restrictions in a way that still lets you build a highly capable system could require significant additional alignment work (as in the Overseer's Manual scenario here)

Conversely, I also think there are ways that you can limit approval-based amplification or debate - you can have automated checks, for example, that discard possible answers that are outside of a certain defined safe class (e.g. debate where each move can only be from either a fixed library of strings that humans produced in advance or single direct quotes from a human-produced text). I'd also hope that you could do something like have a skeptical human judge that quickly discards anything they don't understand + an ML imitation of the human judge that discards anything outside of the training distribution (don't have a detailed model of this, so maybe it would fail in some obvious way)

I think I do believe that for problems where there is an imitative amplification decomposition that solves the problem without doing search, that's more likely to be safe by default than approval-based amplification or debate. So I'd want to use imitative amplification as much as possible, falling back to approval only if needed. On imitative amplification, I'm more worried that there are many problems it can't solve without doing approval-maximizing search, which brings the old problems back in again (e.g. I'm not sure how to use imitative amplification at the meta-level to produce better decomposition strategies than humans use without using approval-based search).

Comment by William_S on Use-cases for computations, other than running them? · 2020-01-21T00:55:46.560Z · LW · GW

Substituting parts of the computation (replace a slow, correct algorithm for part of the computation with a fast, approximate one)

Comment by William_S on Use-cases for computations, other than running them? · 2020-01-21T00:54:07.155Z · LW · GW
  • Formally verifying properties of the computation
  • Informally checking properties of the computation (is this algorithm for making loan decisions fair?)
  • Debugging the computation, or more generally "modifying the computation to do what you actually want"
Comment by William_S on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2019-11-30T21:45:04.540Z · LW · GW

One situation is: maybe an HBO tree of size 10^20 runs into a security failure with high probability, but an HBO tree of size 10^15 doesn't and is sufficient to output a good LBO overseer.

Comment by William_S on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2019-11-30T21:43:55.358Z · LW · GW

Yeah, that's my current picture of it.

Comment by William_S on Can corrigibility be learned safely? · 2019-11-30T21:43:17.397Z · LW · GW
I understand that, but I don't see why #2 is likely to be achievable. Corrigibility seems very similar to Wei Dai's translation example, so it seems like there could be many deceptive actions that humans would intuitively recognize as not corrigible, but which would fool an early-stage LBO tree into assigning a high reward. This seems like it would be a clear example of "giving a behaviour a high reward because it is bad". Unfortunately I can't think of any good examples, so my intuition may simply be mistaken.

For factored cognition: I think the reason #2 might be achievable relies on assuming that there is some reason the bad feature is selected in the first place and assigned high reward. For example, this might have happened because the agent ran a simulation forward, and then realized that if they punch the human they can take the reward button from them. The hope is that we can figure out that the simulation process happened and why it led to the wrong thing (or outlaw simulations of this form in the first place).

For factored evaluation, I think the story is a bit different (relying on the other expert being able to understand the reasons for the reward assignment and point it out to the judge, but I don't think the judge needs to be able to find it on their own).

Incidentally, it seems like Ought could feasibly test whether meta-execution is sufficient to ensure corrigibility; for example, a malicious expert could recommend deceptive/influence-seizing actions to an agent in a simulated environment, and the meta-execution tree would have to detect every deceptive action without any contextual knowledge. Are there any plans to do this?

No plans currently, but it would be interesting.

Comment by William_S on Can corrigibility be learned safely? · 2019-11-03T01:04:34.045Z · LW · GW

Suppose an action is evaluated as a linear combination of a set of human interpretable features.

The action "punch the human" could be selected because 1) many of the reward weights of these features could be wrong, or it could be selected because 2) there is one feature "this action prevents the human from turning me off" that is assigned high reward. I think the thing we'd want to prevent in this case is 2) but not 1), and I think that's more likely to be achievable.

Comment by William_S on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2019-11-03T00:56:41.177Z · LW · GW

I think it's a general method that is most applicable in LBO, but might still be used in HBO (eg. an HBO overseer can read one chapter of a math textbook, but this doesn't let it construct an ontology that lets it solve complicated math problems, so instead it needs to use meta-execution to try to manipulate objects that it can't reason about directly).

Comment by William_S on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2019-11-03T00:54:04.817Z · LW · GW

I'd interpreted it as "using the HBO system to construct a "core for reasoning" reduces the chances of failure by exposing it to less inputs/using it for less total time", plus maybe other properties (eg. maybe we could look at and verify an LBO overseer, even if we couldn't construct it ourselves)

Comment by William_S on Concrete experiments in inner alignment · 2019-09-26T22:52:13.788Z · LW · GW

Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)

https://arxiv.org/abs/1905.12149

Comment by William_S on 2-D Robustness · 2019-09-26T18:34:37.193Z · LW · GW

One way to try to measure capability robustness separately from alignment robustness off the training distribution of some system would be to:

  • use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour
  • train a new system to do as well on the reward function as the original system
  • measure the number of training steps needed to reach this point for the new system.

This would let you make comparisons between different systems as to which was more capability robust.
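
A rough sketch of the procedure, where `inverse_rl`, `train_until_match`, and `performance` are placeholders for whatever IRL and RL machinery you'd actually use:

```python
# Hypothetical sketch of the capability-robustness measure described above.

def capability_robustness_cost(original_system, off_dist_env,
                               inverse_rl, train_until_match, performance):
    # 1. Infer a reward function that explains the original system's
    #    off-distribution behaviour (rollout is an assumed environment method).
    trajectories = off_dist_env.rollout(original_system)
    inferred_reward = inverse_rl(trajectories)

    # 2. Train a fresh system against the inferred reward until it performs
    #    as well as the original system does on that reward.
    target = performance(original_system, inferred_reward, off_dist_env)
    new_system, steps_needed = train_until_match(inferred_reward, off_dist_env,
                                                 target_performance=target)

    # 3. steps_needed is the measure: it lets you compare systems on how much
    #    coherent, re-learnable competence they retain off-distribution,
    #    separately from whether that competence is pointed at what we wanted.
    return steps_needed
```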

Maybe there's a version that could train the new system using behavioural cloning, but it's less clear how you measure when you're as competent as the original agent (maybe using a discriminator?)

The reason for trying this is to have a measure of competence that is less dependent on human judgement/closer to the system's ontology and capabilities.

Comment by William_S on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-26T18:22:32.665Z · LW · GW

I think the better version of this strategy would involve getting competing donations from both sides, using some weighting of total donations for/against pushing the button to set a probability of pressing the button, and tweaking the weighting of the donations such that you expect the probability of pressing the button to be low (because pressing the button threatens to lower the probability of future games of this kind; this is an iterated game rather than a one-shot).
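
One concrete, entirely made-up way the weighting could work (the `sensitivity` and `bias` knobs are the things you'd tune so the expected press probability stays low):

```python
import math

# Entirely made-up weighting scheme: net donations in favour of pressing are pushed
# through a sigmoid with a bias term, so the probability stays low unless the "for"
# side heavily outweighs the "against" side.

def press_probability(donations_for: float, donations_against: float,
                      sensitivity: float = 0.005, bias: float = 5.0) -> float:
    net = donations_for - donations_against
    return 1.0 / (1.0 + math.exp(-(sensitivity * net - bias)))

print(press_probability(donations_for=100.0, donations_against=500.0))  # ~0.001
print(press_probability(donations_for=500.0, donations_against=100.0))  # ~0.05
```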

Comment by William_S on Problems with AI debate · 2019-09-05T18:08:11.444Z · LW · GW

For Alaska vs. Bali, an alternative answer is "You could be convinced that either Alaska or Bali is a good vacation destination". It's an interesting question whether this could actually win in debate. I think it might have a better chance in Factored Evaluation, because we can spin up two separate trees to view the most compelling argument for Alaska and the most compelling argument for Bali and verify that these are convincing. In debate, you'd need to view either the Alaska argument before the Bali argument, or the Bali argument before the Alaska argument, and you might just be convinced by the first argument you see, in which case you wouldn't agree that you could be convinced either way.

Comment by William_S on HCH is not just Mechanical Turk · 2019-08-07T02:55:24.913Z · LW · GW

I'd say that the claim is not sufficient - it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (but it's hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I'd want to produce along the way while proceeding with HCH-like approaches)

Comment by William_S on Deceptive Alignment · 2019-07-21T20:54:27.734Z · LW · GW

I think a cleaner way of stating condition 3 might be "there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating".

This could be met if the threat of modification never goes away, but maybe decreases over time (at training time, the system is monitored for defection every episode; at deployment time it is monitored for defection in 1 in a million episodes; so it's not worthwhile to defect during training but is worthwhile at deployment time).

This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer, and 2) if this happens, the base optimizer will be able to determine $O_{mesa}$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (Not sure if there's any version of this that would work, since it seems hard to figure out $O_{mesa}$ and provide negative reward.)
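
As a toy check of that second scenario, using the 1% and 100x numbers above (all numbers illustrative):

```python
# Illustrative expected-value check for the "threat of modification" scenario above.
p_detected = 0.01
best_defect_reward = 1.0                         # mesa-objective reward from the best defecting policy
detection_penalty = -100.0 * best_defect_reward  # base optimizer determines O_mesa and punishes 100x

ev_defect = (1 - p_detected) * best_defect_reward + p_detected * detection_penalty
ev_always_cooperate = 0.0

print(ev_defect)  # 0.99 * 1.0 + 0.01 * (-100.0) = -0.01, i.e. below always cooperating
# Under these made-up numbers, no defecting policy beats always cooperating, so condition 3 fails.
```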

Comment by William_S on An Increasingly Manipulative Newsfeed · 2019-07-14T03:53:52.360Z · LW · GW

To me, it seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it "Will you do things I don't like if given more capability?" or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in that toy testing world, then it seems like this AI would be forced into a position where it either acts according to its true incentives (manipulate the humans and be detected), or executes the treacherous turn (abstains from manipulating the humans so developers will trust it more). So it seems like this wouldn't happen if the developers were trying to test for treacherous-turn behaviour during development.

Comment by William_S on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T16:48:25.834Z · LW · GW

Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question)?

Comment by William_S on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T16:46:57.502Z · LW · GW

Submission: low-bandwidth oracle

Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both; judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.
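
A schematic version of the protocol; `pick_by_oracle` and `judge_more_useful_after_time_X` are placeholders for the low-bandwidth oracle interface and for human judgement:

```python
import random

# Schematic sketch of the plan-criticism protocol described above.

def run_plan_criticism_round(lines_of_thought, pick_by_oracle, judge_more_useful_after_time_X):
    oracle_choice = pick_by_oracle(lines_of_thought)
    random_choice = random.choice([l for l in lines_of_thought if l != oracle_choice])

    # Humans spend time period X thinking about both lines of thought, then judge which
    # was more useful (spotting a fatal missed problem counts as very useful).
    more_useful = judge_more_useful_after_time_X(oracle_choice, random_choice)

    # Reward the oracle only if its suggestion was the one judged more useful.
    return 1.0 if more_useful == oracle_choice else 0.0
```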

Comment by William_S on The Main Sources of AI Risk? · 2019-03-22T18:55:56.945Z · LW · GW
  • AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people's values are loaded into the system), and is relevant for overall strategy.
Comment by William_S on The Main Sources of AI Risk? · 2019-03-22T18:52:24.938Z · LW · GW
  • Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human's priors are most accurate (on potentially irrelevant issues) if this isn't what humans actually want.
Comment by William_S on Some Thoughts on Metaphilosophy · 2019-03-08T18:58:55.169Z · LW · GW

Maybe one AI philosophy service could look like this: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what those answers imply about the problem of qualia if you use some method of reconciling those answers.