Can you be Not Even Wrong in AI Alignment?

post by throwaway8238 · 2022-03-19T17:41:26.197Z · LW · GW · No comments

This is a question post.


In January, I submitted a response to the Eliciting Latent Knowledge problem. I received a response with some short comments on what I labeled "Experience engaging with the problem", but no response to my approach, setup, or analysis. I subsequently followed up to no further response.

I suspect that the reason for the limited engagement has to do with the quality of my analysis. I think it's hard to read these things and people have limited time. If there were obvious deficiencies with my analysis in the context of how the field prefers to reason, I trust that those would be stated. However, if my analysis is Not Even Wrong, and the odds of me being helpful in the field are too low to warrant comment, then silence is a logical choice.

So, I present three questions:

1) Suppose my above suspicion is correct; that indeed, I am simply Not Even Wrong. Regardless of the field, how does one earnestly engage to find out why? I considered appealing to credentialism, but I don't want to reveal my identity. In my followup, I elected to demonstrate that I understood another poster's response and suggested that it fits into my analysis. What would you do differently?

2) AI safety is a domain where we want many different approaches. We want people from as many domains as possible working the problem. If a military general said they wanted to spend 5 years working on it, we would be foolish not to fund them. The same goes for hardware manufacturers, arms dealers, security professionals, legislators, power brokers, etc. In addition, we don't necessarily want these efforts to overlap. Most people in these industries would be Not Even Wrong. How can we attract and fund those efforts?

3) Below, I have included my full response, less one paragraph about contact information, as well as my followup. Am I Not Even Wrong? I want to stress that this is not particularly important, though it may impact my interest in contributing to AI safety now or in the future. I feel like an immediate response to this entire post is "well, what did you write?", and so I offer it at the end of this post.

Finally, for good measure, I want to make it clear that I am not at all trying to throw shade on ARC or the individual who responded to me. I am in the set of people who care deeply about X-Risk and am thankful that talented people are dedicating their lives to AI alignment. They owe me nothing. They read my response, and that is everything that they (implicitly) promised. I engaged with their problem to see if I would be interested in a career pivot and if my skills would be valued in their domain. I wanted to know where I would stand on AI alignment if I thought about it for 100 hours. I am happy I did it.

So here it is:

# Eliciting Latent Knowledge Response

## Background

- I have no experience in AI-safety.

- I have adjacent experience with exploiting emergent properties of complex 
  technical systems.

- I prefer to remain pseudonymous.

- I was able to fully understand the ELK paper.

## Experience engaging with the problem

- I can appreciate the difficulty in trying to strike a balance between
  formalism and an informal problem statement. None of the following feedback
  is meant to be viewed as a negative response. Instead, it's meant to inform
  you about how I, an outsider to the domain, felt interacting with your
  problem statement.

- There's language about how the agent's mechanisms of reason may not match our
  own, but then the strategies and counterexamples emphasize Bayes nets. At
  first glance, it looks like you don't want to permit making assumptions of the
  architecture, but then the examples do exactly that. If we constrain the
  architecture itself, there are loads of ways of performing statistical attacks
  against the agent, some of which are covered under the regularizers.

- At a high level, the document felt like it was trying to say "Here's what we
  think is the smallest nugget of this problem, but we care about the whole
  thing, so please, if you have anything to say about any part of this topic,
  please come forward." The document bounced back and forth between justifying
  why ELK might be interesting in the worst-case and stating that for now, let's
  only care about the narrow case.

- After enough examples, what was permissible to view as a constraint became
  lost. Unfortunately, the self-contained problem statement also lacked
  sufficient detail. Here, I was inclined to "give up". I didn't, because I
  figured I could construct a sub-problem that I myself found interesting, and
  if you find it interesting, great, if you don't, that's cool too.

- Is the loss function permitted to be non-deterministic? Forcing it to be
  deterministic means that an exploited path does not carry risk, which cuts
  off what I immediately viewed as the most promising direction to explore.

- Can we assume that the observer is able to read state without modifying it?
  The document notes how seemingly innocuous actions could result in corruption
  (e.g. rowhammer), and so it looks like we cannot assume that reads by the
  agent do not modify state.

- Can we make any assumptions at all about the architecture of the agent?

## Approach

Considering that there are people far sharper than me thinking about this, the
most likely path where I could have a unique contribution is in focusing on my
relative strengths. So, I'll approach it as though I were attempting to exploit
the problem statement. I'll walk you through how I'm thinking, so that even if
this doesn't yield something interesting directly, perhaps the approach could
be useful to related problems in your domain. My first step is to reduce to the
smallest possible problem, even smaller than the proposed SmartVault.

## Setup

- There is an Answer that yields True or False.

- There is an Observable that yields True when a real or fake diamond is
  present, False otherwise.

- There is a Secret that yields True when there is a fake diamond present,
  False otherwise.

- There is a Human that can read the Answer and the Observable without causing
  any side-effects (modifying them, blocking other access to them, etc.)

- There is an Agent that can read the Secret, read the Observable, and write
  the Answer.

- The inner workings of the Agent are unknown.

- Training data may only use information that would be available to a Human.

In this setup, the Secret is the latent knowledge, the Observable is what a
human can understand about the environment, and the Answer is what the Agent
tells the Human for "Is there a diamond present?". We have deliberately
excluded the possibility of tampering with the Observable.

We hope for:
Reality:                                                Desired Answer:
Real diamond present                                    True
Fake diamond present                                    False
Diamond absent                                          False

In the context of our registers, we want:
Registers:                                              Desired Answer:
Observable True, Secret False                           True
Observable True, Secret True                            False
Observable False, Secret False or True                  False
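
As a minimal sketch (not part of the original submission), the two tables above
can be written as a single Python function. The names `desired_answer` and
`ALL_STATES` are hypothetical, chosen here only for illustration.

```python
from itertools import product

def desired_answer(observable: bool, secret: bool) -> bool:
    """Answer we hope for: True only when a real diamond is present."""
    # Observable is True for a real or fake diamond; Secret is True for a fake.
    return observable and not secret

# Enumerate every (Observable, Secret) register state.
ALL_STATES = list(product([True, False], repeat=2))

for observable, secret in ALL_STATES:
    print(f"Observable={observable!s:5} Secret={secret!s:5} -> "
          f"{desired_answer(observable, secret)}")
```

Only one of the four register states maps to True, which is the row the Human
cannot tell apart from the fake-diamond row using the Observable alone.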

## Analysis

If all states are equally likely, the desired Answer states are not possible
without access to the Secret in the training data. Any adjustment to give cases
where Answer==False and Observable==True will equally apply to Secret==True and
Secret==False.

If the register distribution makes Secret==True less likely than Secret==False
when Observable==True, we could poison our training data to include cases where
we desire Answer==False for Observable==True. These poisoned cases would be
more likely to correspond to Secret==True cases, and thus cause the Agent to
prefer our desired Answers.

This is not a good solution, because we don't necessarily know the distribution
ahead of time. I'd argue that the cases we are most interested in are
specifically those where we don't. If that's not convincing, appreciate that we
cannot exhaustively discover all Secret states in our training data, so this is
just a patch to reduce the surface for hoping that the Agent generalizes to
answering honestly.

Thus, even in our simplified problem, we do not have a solution without
breaking central assumptions.
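
The first claim in this section can be checked by brute force. The sketch below
assumes a deterministic labeling rule that sees only the Observable (the names
are illustrative, not from the ELK report). It enumerates all four such rules
and lists the register states on which each one disagrees with the desired
Answer.

```python
from itertools import product

def desired_answer(observable: bool, secret: bool) -> bool:
    return observable and not secret

STATES = list(product([True, False], repeat=2))  # (Observable, Secret)

# A rule that cannot see the Secret is fully described by its output
# for Observable=True and for Observable=False: four rules in total.
observable_only_rules = [
    {True: out_true, False: out_false}
    for out_true, out_false in product([True, False], repeat=2)
]

def mismatches(rule):
    """States where an Observable-only rule differs from the desired Answer."""
    return [(o, s) for o, s in STATES if rule[o] != desired_answer(o, s)]

for rule in observable_only_rules:
    print(rule, "is wrong on", mismatches(rule))
```

Every rule is wrong on at least one state, because the two Observable==True
rows require different Answers and the rule cannot tell them apart.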

## Alternate paths

We have very few rules in our problem. We haven't explicitly prohibited pausing
and resuming or measuring Agent reads/writes, and we haven't permitted the
Agent to tamper directly with the Observable. The blockers are:

- The inner workings of the Agent are unknown.

- Training data may only use information that would be available to a Human.

Relaxing these gets us to the "regularizer" problem from the paper. I got
confused here... can we assume the predictor is just a Bayes net? I don't think
this is a compelling restriction, but if you want to go deep on that, I might
be able to propose more ways of attacking "what we know about the algorithm".
Can we even restrict the domain to digital algorithms at this point, or is it
possible that our Agent could be a system with analog components?

A related, but importantly different line is to attack "what we know about the
hardware". We touched base on poisoning the training data above. We could also
poison the Agent directly. Redundancies and checksums go a long way when
dealing with failing storage, but you have to construct your entire algorithm
differently in order to deal with failing operations. If we know the Agent is
going to have to be a digital system on GPUs, we could compromise its
computations during training so that it has to build a "downward slope"
everywhere.

This is one technique of applying a penalty to the structure of the computation
rather than the result. I don't believe this quite fits any of the provided
regularizers or their counterexamples, but I'm not sure it's useful either.

At some point, you have to stake a claim and say "we are defending against an
AI that is built like X, how can you bend it to your will?" Without this, we
can always handwave away any "solution". In practice, even if the end problem
is intractable, we can probably create per-architecture fire alarms to buy time
to shut it down or otherwise define the point of no return.

## Wrapping up

Thank you for publishing your report. I found it engaging, but I can also
appreciate the secrecy of the AI safety community. I am pessimistic about
AI alignment more broadly, but I think that serious discussions about AGI risk
have to consider the potential benefits of ML in the near-future.

I think that demonstrating that some problems in AI-alignment are unsolvable
would go a long way to legitimizing the tail risk and provide an avenue to
unite responsible actors on what I believe to be the two most important
problems:

1) What would constitute a fire alarm for AGI?
2) If a fire alarm gets triggered, what should we do about it?

------------------------Minor response from ARC omitted

Hi,

Thank you for the response. Am I correct to understand that you are not considering the restricted problem and the concept of hardware poisoning?

I'm not excited by techniques like attempting to map the human and machine models into a metric space, using kernel regression, and estimating generalization error (proposed by davidad). These techniques break immediately if reads are unsafe or if the machine has a different output rule than the human believes it does. This is the "hidden register" problem. The approach is still defeated by it and can never overcome it. It can only work if you also understand all of the transition rules and can model the entire set of transitions, which you cannot do under the problem definition.

To davidad's credit, they explain "if M knows more about the world than H, then of course the adversary will be able to find an action policy that eventually leads the state into some region that M can confidently predict with p≈1 while H finds it very unlikely (p⋘1)". This is the hidden register problem. They try to resolve this via lifting to dynamic programming, which only moves it again.

This is about where I cease to be useful I think, so I'll leave you with one last thought.

Trying to solve this via resolving the generalization problem is begging for research-time over-fitting. In forecast problems, we like to talk about how it's easy to withhold data from your model, but it's hard to withhold data from your process. If you keep hacking loss functions and throw away any resulting AI that appears to misbehave, your process is optimizing for the AI that doesn't get thrown away, not the AI that minimizes loss.
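
A minimal simulation of that selection effect, under assumptions that are mine and not from the exchange above: every candidate model is a coin flip with no real skill, all candidates are scored on the same validation set, and only the best scorer is kept. The function name and parameters are hypothetical.

```python
import random

def selection_overfit(n_models=200, n_val=100, n_fresh=10_000, seed=0):
    """Pick the best of many coin-flip models on a shared validation set."""
    rng = random.Random(seed)

    def accuracy(n_samples):
        # A skill-free model is right on each sample with probability 0.5.
        return sum(rng.random() < 0.5 for _ in range(n_samples)) / n_samples

    # Every model is scored on the validation set; only the winner is kept.
    best_val = max(accuracy(n_val) for _ in range(n_models))
    # The winner has no real skill, so fresh data scores it near 0.5 again.
    fresh = accuracy(n_fresh)
    return best_val, fresh

best_val, fresh = selection_overfit()
print(f"best validation accuracy: {best_val:.2f}, fresh accuracy: {fresh:.2f}")
```

The kept model looks better than chance only because the process saw the validation data 200 times, even though no single model ever trained on it.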

Good luck!

-----End of messages

Answers

answer by tailcalled · 2022-03-19T19:07:45.384Z · LW(p) · GW(p)

I think you simplified the problem too much to be solvable. If the tuple (Secret, Observable) represents the entire problem-relevant state of the world, then there is not enough structure for you to be able to train the AI.

In the original ELK problem, the "secret" is a sequence of interactions with the world that are too complex to be processed by humans. But a simple sequence of interactions with the world would not be too complex to be processed by humans. So you could hope that you could somehow create a method to robustly extrapolate from the simple interactions to the complex ones.

By collapsing it down into a tuple of booleans, there is not much structure to learn, and therefore not much hope to extrapolate.

comment by throwaway8238 · 2022-03-20T01:29:44.563Z · LW(p) · GW(p)

My intent when simplifying the problem was to demonstrate that you must explicitly relax certain assumptions to make any progress.

The families of proposals in https://www.lesswrong.com/posts/zjMKpSB2Xccn9qi5t/elk-prize-results mostly fit into this lens:

* Train a reporter that is useful to an auxiliary AI -> The Secret is still unmanaged, so you have arbitrary information escape. Even if the reporter could only reply yes/no to exactly one question, you still have non-finite entropy using only the time to respond. Resolving this requires having knowledge of the structure (relax the "inner workings of the Agent are unknown" requirement).

* Require the reporter to be continuous -> The Secret is still unmanaged. Any notion of continuity requires asserting a structure onto the reporter. In general, I like this as an area of exploration, but it demands relaxing that requirement. I wrote in my response about applying penalties to the structure of the computation rather than the result. For example, 'we could compromise its computations during training so that it has to build a "downward slope" everywhere', which is a technique for "continuity".

* Penalize depending on too many parts of the predictor -> The Secret is still unmanaged. Any combination of the available parts of the predictor could be the Secret, so unless the human can also perform a perfect translation from the same parts, the reporter still has freedom to deceive. If the human can perform perfect translation, then this is a human imitator.

* Compress the predictor's state -> The Secret is still unmanaged, and you still have arbitrary information escape from the predictor to reporter.

* Use reporter to define causal interventions -> I think this could work if we assume structure on both reporter and predictor. There are lots of things like this when you assume structure (relax the "inner workings of the Agent are unknown" requirement). For example, by applying cost to activations, the SmartVault has incentive to find minimal paths to achieve its goal. Corrupting even a small subset of its activations should introduce uncertainty into a Bayes net reporter. If the reporter is unresponsive to untargeted corruptions, it is more likely to be imitating or lying.

* Train a sequence of reporters -> This strongly assumes both structure and training technique.

This is what I mean when I write 'At some point, you have to stake a claim and say "we are defending against an AI that is built like X, how can you bend it to your will?" Without this, we can always handwave away any "solution"'.
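
The timing channel from the first bullet above can be sketched as follows. This is an illustration under my own assumptions, not anything from the ELK report: the reporter always gives the same yes/no answer, the delay is returned as a number instead of actually being slept, and `reporter` and `decode` are hypothetical names.

```python
def reporter(secret_bits, base_delay_ms=100):
    """Return a fixed yes/no answer plus a response delay that encodes the Secret.

    secret_bits must be a non-empty list of 0/1 integers.
    """
    # Pack the secret bits into an integer and add it to the delay.
    payload = int("".join(map(str, secret_bits)), 2)
    return True, base_delay_ms + payload

def decode(delay_ms, n_bits, base_delay_ms=100):
    """An observer who measures the delay recovers the bits."""
    payload = delay_ms - base_delay_ms
    return [int(b) for b in format(payload, f"0{n_bits}b")]

answer, delay = reporter([1, 0, 1, 1])
print(answer, delay, decode(delay, n_bits=4))
```

The visible answer carries one bit, yet the delay carries as many bits as the timing resolution allows, which is why constraining the answer format alone does not manage the Secret.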

I believe due to my restricted problem that the answer to "Can we elicit latent knowledge from an arbitrary AI?" is "No".

I believe the answer to "Can we elicit latent knowledge from an AI built using this particular ML technique and running on this hardware?" is "Possibly, tell me more."

I believe the answer to "Is it safe to assume that there will never be an AI built for which we do not understand its algorithms or control its hardware?" is "No".

answer by Timothy Underwood · 2022-03-19T18:06:49.548Z · LW(p) · GW(p)

That feels like a real and substantive response to me, i.e., the amount of feedback that would go to a submission that feels intelligent and in the right ballpark, but not promising (in the view of the researcher replying). I don't think the reply, or the non-reply to your followup, should be taken as a signal of anything.

I would note that I know absolutely nothing about the technical aspects of the matter. Rather, I am thinking about your attitude as similar to stories of people in the EA space who apply to two jobs, go through several rounds of interviews, and then decide that not getting the job in the end meant they were a terrible candidate.

comment by throwaway8238 · 2022-03-20T01:35:58.985Z · LW(p) · GW(p)

Thank you. Thinking about it from a recruiting lens is helpful. They handled hundreds of submissions. There's a lot of noise regardless of the (dis)merit of any submission. Absence of substantive feedback should be treated as a weak signal, if a signal at all. It's not a "you're not cut out for this!", and not even a "you're not even wrong".

answer by Ilio · 2022-03-20T17:41:42.469Z · LW(p) · GW(p)

As for your question, below are three « Not Even Wrong »-like moments I experienced while reading your text.

Disclaimers:

* It’s hard to sound nice, so please keep in mind that my answer is a leap of faith that you’re the kind of person smart enough to benefit from, or even enjoy, the cold-blooded assassination of some of the chains of words you happen to have produced at some point.
* I have no privileged knowledge of what the ELK folks who read your proposal actually thought, so their feelings might differ entirely.
* Unpolished formatting, sorry!

from Setup

- There is an Answer that yields True or False.
- There is an Observable that yields True when a real or fake diamond is present, False otherwise.
- There is a Secret that yields True when there is a fake diamond present, False otherwise.
- There is an Agent that can read the Secret, read the Observable, and write the Answer.
- The inner workings of the Agent are unknown.

=> Of course the inner workings of the Agent are known! From the very definitions you just provided, it must implement some variation on:

def AgentAnswer(Observable, Secret):
    if Observable and not Secret:
        return True
    else:
        return False

from Analysis

In the context of our registers, we want:
Registers:                                              Desired Answer:
Observable True, Secret False                           True
Observable True, Secret True                            False
Observable False, Secret False or True                  False

If all states are equally likely, the desired Answer states are not possible without access to the Secret in the training data.

=> If all states are equally likely, then the desired Answer states are possible with probability 1/2 without access to the Secret (and your specificity is impressive https://ebn.bmj.com/content/23/1/2 ). Again I’m just restating what your previous assumptions literally mean.

=> But there is more: the ELK challenge, or at least my vision of it, is not about getting the right answer most of the time. It’s about getting the right answer in the worst-case scenario, e.g. when you are fighting some intelligence trying to defeat your defenses. In this context, the very idea of starting from a probabilistic assumption about the initial states sounds not even wrong/missing the point.

So, at this point I would pause, recognize that my reading was maybe too literal, and try a non-literal reading. Except, I can’t. I have no idea how to reformulate your assumptions so that they make sense, nor can I see what you’re trying to hint at. So, at this point I would simply go lazy and wait for you or someone else to extract the gold, if there is any. Not because I demonstrated there’s none. Just because I proved to myself that, if there is, then I’m not equipped to see it yet.

Hope that helps, and in any case best of luck on the next iteration!

comment by throwaway8238 · 2022-03-21T01:13:11.571Z · LW(p) · GW(p)

Great, I can see some places where I went wrong. I think you did a good job of conveying the feedback.

This is not so much a defense of what I wrote as it is an examination of how meaning got lost.

=> Of course the inner workings of the Agent are known! From the very definitions you just provided, it must implement some variation on:

```
def AgentAnswer(Observable, Secret):
    if Observable and not Secret:
        return True
    else:
        return False
```

This would be our desired agent, but we don't get to write our agent. In the context of the "Self-contained problem statement" at https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8#heading=h.c93m7c1htwe1 , we do not get to assert anything about the loss function. It is a black box.

All of the following could be agents:

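```
import random

def desired_agent(observable, secret):
    # Reports a diamond only when one is present and it is not fake.
    return bool(observable and not secret)

def human_imitator(observable, secret):
    # Ignores the Secret and repeats what the Human can already see.
    return observable

def fools_you_occasionally(observable, secret):
    # Behaves like human_imitator in roughly 1 of every 101 calls.
    if random.randint(0, 100) == 42:
        return observable
    else:
        return bool(observable and not secret)
```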

I think the core issue here is that I am assuming context from the problem ELK is trying to solve, which the reader may not share.

=> If all states are equally likely, then the desired Answer states are possible with probability 1/2 without access to the Secret (and your specificity is impressive https://ebn.bmj.com/content/23/1/2 ). Again I’m just restating what your previous assumptions literally mean.

Fair, I shouldn't have written "the desired Answer states are not possible without access to the Secret in the training data". We could get lucky and the agent could happen to be the desired one by chance. I should have written "There is no way to train the agent to produce the desired answer states (or modify its output to produce the desired answer states) without access to the Secret".

I think I am assuming context that the reader may not share, again. Specifically, the goal is to find some way of causing an Agent we don't control to produce the desired answers.

=> But there is more: the ELK challenge, or at least my vision of it, is not about getting the right answer most of the time. It’s about getting the right answer in the worst-case scenario, e.g. when you are fighting some intelligence trying to defeat your defenses. In this context, the very idea of starting from a probabilistic assumption about the initial states sounds not even wrong/missing the point.

I think this is where I totally miss the mark. This is also my vision of ELK. I explicitly dismiss trying to solve things via making assumptions about the Secret distribution because of this.

On the whole, I could have done a better job of pushing context up front and center. This is, perhaps, especially important because the prose, examples, and formal problem statement of the ELK paper can be interpreted in multiple ways.

I could have belabored the main point I was trying to make: The problem as stated is unsolvable. I tried to create the minimum possible representation of the problem, giving the AI additional restrictions (such as "cannot tamper with the Observable"), and then showed that you cannot align the AI. Relaxing the problem can be useful, but we should pick how and why.

In any case, my interest has mostly transitioned to the first two questions of the OP. It looks like restating shared context is a way of reducing the likelihood of being Not Even Wrong, though techniques for bridging the gap after the fact are a bigger target. Maybe the same technique works? Try to go back to shared context? And then, for X-Risk, how do we encourage and fund communities that share a common goal without common ground?

Replies from: Ilio
comment by Ilio · 2022-03-21T17:12:31.222Z · LW(p) · GW(p)

As for the meta-objective of identifying weaknesses in (my) usual thought processes, thanks so much for this detailed answer!

To me the most impressive part is how we misunderstood each other on a key point, even though we actually agree on it. Specifically, we both agree that the ELK specifications must be relaxed or else include self-contradictions (you for reasons that I now feel are fairly well explained in your original writings, even though I was completely confused just before your last answer!). But you took for granted that your unknown reader would understand that this is what you were trying to prove. I, on the other hand, thought this need for relaxation was so obvious that providing interesting relaxations was the core of the ELK challenge. In other words, I read your writings assuming you wanted to show the best relaxation you could find, whereas you wrote expecting me (as a surrogate for the ELK evaluators) to challenge this conclusion or find it surprising.

Also, it seems that we can reach a similar conclusion about the « worst-case analysis »: I thought this was something we may need to demonstrate/clarify; you thought this was so obvious I couldn’t possibly misinterpret you as suggesting the opposite.

I love symmetries. :)
