Some real examples of gradient hacking 2021-11-22T00:11:35.047Z


Comment by Oliver Sourbut on Relaxed adversarial training for inner alignment · 2021-11-24T17:33:34.962Z · LW · GW

I had a gripe about the difficulty of 'proposing' vs 'checking' pseudo-inputs. Classically proposing things is much 'harder' than checking things (e.g. P vs NP, proving vs verifying, decidable vs recognisable, ...).

It seems like the adversary has the really hard job of 'proposing' pseudo-inputs which are particularly likely to lead to unacceptable behaviour, while the (potentially unacceptable) model under scrutiny is only required to do something like 'checking' such inputs (at deploy/runtime) to determine whether it's going to behave acceptably or not. Importantly, it might not even 'know' ahead of time that those inputs would lead to unacceptable behaviour.
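To make the proposing/checking asymmetry concrete, here is a minimal sketch using subset-sum as a stand-in problem. This is only an analogy and an assumption on my part: `propose` and `check` are hypothetical names, and nothing here is taken from the relaxed adversarial training post itself. Checking a candidate takes time linear in its size, while the brute-force proposer may have to enumerate up to 2**n subsets.

```python
from itertools import combinations

def check(numbers, candidate, target):
    """Checking: confirm a proposed subset is drawn from `numbers` and hits the target."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False
        pool.remove(x)  # each element of `numbers` may be used at most once
    return sum(candidate) == target

def propose(numbers, target):
    """Proposing: brute-force search over subsets, counting candidates examined."""
    tried = 0
    for size in range(len(numbers) + 1):
        for candidate in combinations(numbers, size):
            tried += 1
            if sum(candidate) == target:
                return list(candidate), tried
    return None, tried

numbers = [3, 34, 4, 12, 5, 2]
subset, tried = propose(numbers, 9)
print(subset, "found after examining", tried, "candidates")
print("check passes:", check(numbers, subset, 9))
```

The analogy is loose: the adversary plays the role of `propose`, while the model at runtime only needs something like `check`.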

In a conversation with Evan about this, he rightly pointed to the section on suboptimality alignment, which is discussed in this post as the case where the model becomes unacceptable or deceptive with e.g. more information about its situation or more 'thinking time'. If we leave that aside, it might seem that the only way for the model to be unacceptable in a situation is for the model to already have done the hard work of 'proposing' (to itself in some way) that situation. In that case, the adversary, assuming enough transparency, hopefully has a relatively less difficult job again.

Some questions

  1. Does suboptimality alignment definitely capture all cases of unacceptability where the model didn't 'already propose' (in some way) the situation? Or is there another way to characterise such cases?
  2. What literature would readers suggest for this kind of suboptimality alignment (you might call it 'unplanned objective misalignment')?
  3. How big a challenge is this to relaxed adversarial training?

Comment by Oliver Sourbut on Relaxed adversarial training for inner alignment · 2021-11-24T17:12:30.634Z · LW · GW

For an alignment proposal you can ask about where value judgement ultimately bottoms out, and of course in this case at some point it's a human/humans in the loop. This reminds me of a discussion by Rohin Shah about a distinction one can draw between ML alignment proposals: those which load value information 'all at once' (pre-deploy) and those which (are able to) incrementally provide value feedback at runtime.

I think naively interpreted, RAT looks like it's trying to load value 'all at once'. This seems really hard for the poor human(s) having to make value judgements about future incomprehensible worlds, even if they have access to powerful assistance! But perhaps not?

e.g. perhaps one of the more important desiderata for 'acceptability' is that it only includes behaviour which is responsive (in the right ways!) to ongoing feedback (of one form or another)?

Comment by Oliver Sourbut on Relaxed adversarial training for inner alignment · 2021-11-24T17:00:46.685Z · LW · GW

A potential issue with Relaxed Adversarial Training, as factorised in the post: the deploy distribution is presumably dependent on the outcome of the training process itself (i.e. the training process has side-effects, most notably the production of a deployed ML artefact which might have considerable impact on the world!). Since the training process is downstream of the adversary, this means that the quality of the adversary's choice of pseudo-inputs to propose depends on the choice itself. This could lead to concerns about different fixed points (or even the existence of any fixed point?) in that system.

(My faint worry is that by being proposed, a problematic pseudo-input will predictably have some gradient 'training it away', making it less plausible to arise in the deploy distribution, making it less likely to be proposed... but that makes it have less gradient predictably 'training it away', making it more plausible in the deploy distribution, making it more likely to be proposed, .......)
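The loop described in that parenthetical can be caricatured in a few lines. This is a toy sketch under loudly stated assumptions: the threshold rule for the adversary, the multiplicative 'train away' step and the rebound step are all invented for illustration, as are the names `simulate`, `train_away` and `rebound`. It is not a model of any real training process.

```python
def simulate(steps=12, p=0.8, threshold=0.5, train_away=0.5, rebound=0.3):
    """Toy loop: p is how plausible the pseudo-input is under the deploy distribution."""
    history = []
    for _ in range(steps):
        proposed = p > threshold        # assumed: adversary proposes only plausible-enough inputs
        if proposed:
            p = p * (1 - train_away)    # assumed: gradient 'trains it away'
        else:
            p = p + rebound * (1 - p)   # assumed: with no gradient, plausibility drifts back up
        history.append((proposed, round(p, 3)))
    return history

for step, (proposed, p) in enumerate(simulate()):
    print(step, "proposed" if proposed else "not proposed", p)
```

With these made-up parameters the adversary keeps flipping between proposing and not proposing, which is one way the 'no stable fixed point' worry could look. Smoother update rules could just as easily converge, so this only shows that the concern is coherent, not that it is likely.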

Some ways to dissolve this

  1. In conversation with Evan, he already mentioned a preferred reframing of RAT which bypasses pseudo-inputs and prefers to directly inspect some property of the model (e.g. myopia)
  2. I wonder about maybe 'detecting weird fixpoints' by also inspecting the proposed pseudo-inputs for 'is this a weird and concerning pseudo-input?' (if so, the supervisor is predicting weird and concerning post-deployment worlds!)
  3. If we instead consider causal reasoning and the counterfactual of 'what if we did no more training and deployed now?' this dissolves the dependence (I wonder if this is actually the intended idea of the OP). This leaves open the question of how much harder/easier it is to do counterfactual vs predictive reasoning here.
  4. If we instead consider 'deployment' to be 'any moment after now' (including the remainder of the training process) it might cash out similar to 3? This chimes with one of my intuitions about embedded agency which I don't know an official name for but which I think of as 'you only get one action' (because any action affects the world which affects you so there's now a different 'you')

Interesting? Or basically moot? Or something in between?

Comment by Oliver Sourbut on Some real examples of gradient hacking · 2021-11-23T09:52:02.081Z · LW · GW

Affecting 'someone else's gradient'

A case which didn't make the shortlist, but perhaps domestication counts?

It's a deliberate attempt at affecting the (best understanding of the) outer adaptation process. But in the case of domestication, it's targeted primarily at the outer natural selection process of a different lineage. Of course the lineages interact, meaning it does affect the outer natural selection process of the self lineage, but that's not the main legible effect, nor presumably the intended one.

A more modern and 'competent' example might be the (proposed) use of artificial gene drives to perturb an existing genetic population. Again this acts on a different lineage primarily.

Comment by Oliver Sourbut on Some real examples of gradient hacking · 2021-11-23T09:38:58.252Z · LW · GW

That's a very interesting link, thank you! I suppose my reply would be that I don't claim that any of these attempts are particularly competent, merely that they qualify as (incomplete) recognition of an outer adaptation process and deliberate attempts at hacking it.

Comment by Oliver Sourbut on There’s no such thing as a tree (phylogenetically) · 2021-05-03T09:43:13.023Z · LW · GW

I identify strongly with the excitement of discovery and enquiry in this post!

OP or readers may enjoy some additional examples of extinct or living-fossil tree-strategizing clades: (extant, includes larger extinct tree species) (extinct 'seed fern' tree group) (a few extant, includes larger extinct tree species) (extinct tree 'club mosses' - not really mosses) (not even a plant probably!)

When I came across these facts, upon a little wider reading I had a similar additional mind-blowing moment around the whole set of circumstances of the 'alternation of generations' exhibited by plants, fungi and a few other groups. For me, this exploded my conception of what reproduction strategies can look like (and my conception was probably already not even that narrow by most standards). Wait till you read about seed development and ploidy!

Comment by Oliver Sourbut on AMA: Paul Christiano, alignment researcher · 2021-05-01T22:06:52.076Z · LW · GW

I'm talking about relationships like

AGI with explicitly represented utility function which is a reified part of its world- and self- model


sure, it has some implicit utility function, but it's about as inscrutable to the agent itself as it is to us

Comment by Oliver Sourbut on AMA: Paul Christiano, alignment researcher · 2021-05-01T21:52:50.403Z · LW · GW

What kind of relationships to 'utility functions' do you think are most plausible in the first transformative AI?

How does the answer change conditioned on 'we did it, all alignment desiderata got sufficiently resolved' (whatever that means) and on 'we failed, this is the point of no return'?

Comment by Oliver Sourbut on Your Cheerful Price · 2021-02-24T15:02:07.690Z · LW · GW

Interesting thought. Could I crudely summarize the above contribution like this?

If the mutual willing price range includes $0 for both parties, in some situations there is a discrete cheerfulness downside to settling on $nonzero

It has the interesting corollary that

Even if there exists a mutual cheerful price range excluding $0, in some situations it might be more net cheerful to settle on $0

Where does the discrete downside come from?

The following is pure speculation and introspection.

I guess we have 'willing price ranges' (our executive would agree in this range) and 'cheerful price ranges' (our whole being would agree in this range).

If we all agree (perhaps implicitly) that some collective fun thing should entail $0 transaction, then (even if we all say it's a cheerful price) some of us may be cheerful and others merely willing. It's a shame but not too socially damaging if someone is willing but pretending to be cheerful. There is at least common knowledge of a reasonable guarantee that everyone partaking (executively) agrees that the thing is intrinsically fun and worth doing which is a socially safe state.

On the other hand, if we agree that some alleged 'collective fun thing' should entail $nonzero transaction, similarly (even if we all say it's a cheerful price) some of us may be cheerful and others merely willing at that price point. But while it's still consistent that we all executively agree the thing is intrinsically fun and worthwhile it's no longer guaranteed (because it's consistent to believe that someone's willing price excludes $0 and they are only coming along because of the fee). Perhaps even bringing up the question of a fee raises that possibility? And countenancing that possibility can be socially/emotionally harmful? (Because it entails disagreement about preferences? Especially if the collective fun thing is an explicitly social activity, like your party example.)

Further speculative corollary

More cheerful outcomes can be expected if the mutual willing price range obviously (shared knowledge) excludes $0 than if it ambiguously excludes $0. So be careful about feeding your guests ambiguously-expensive pizza?

Comment by Oliver Sourbut on Great minds might not think alike · 2021-01-02T09:01:48.422Z · LW · GW

Good point. I guess a good manager in the right context might reduce that conflict by observing that having both a Constance and a Shor can, in many cases, be best of all? And working well together, such a team might 'grow the pie' such that salary isn't so zero-sum...?

In that model, being a Constance (or Shor) who is demonstrably good at working with Shors (Constances) might be a better strategy than being a Constance (or Shor) who is good at convincing managers that the other is a waste of money.

Comment by Oliver Sourbut on Is Success the Enemy of Freedom? (Full) · 2020-11-16T12:01:49.182Z · LW · GW

This resonated a lot with me! (And I'm far from as successful as I would 'like' to be - or would I??? :angst:)

Speculative and fuzzy comparison-drawing

I was reminded, I'm not sure exactly why, of this interesting entry I recently came across (I recall I was led there by a link buried in a comment in Slate Star Codex somewhere...)

While I wouldn't necessarily endorse all of it, it's an interesting read. As I understand it, the capability approach advocates a certain way of drawing lines in practical policy-making. Its emphases are on

  • ‘functionings’ ('beings' and 'doings')

    various states of human beings and activities that a person can undertake

    example beings: ...being well-nourished, being undernourished, being housed in a pleasantly warm but not excessively hot house, being educated, being illiterate, being part of a supportive social network, being part of a criminal network, and being depressed

    example doings: ...travelling, caring for a child, voting in an election, taking part in a debate, taking drugs, killing animals, eating animals, consuming lots of fuel in order to heat one's house, and donating money to charity

  • 'capabilities'

    a person's real freedoms or opportunities to achieve functionings

Here's a passage which we can hold alongside the premise of the original post

The ends of well-being freedom, justice, and development should be conceptualized in terms of people's capabilities. Moreover, what is relevant is not only which opportunities are open to me each by themselves, hence in a piecemeal way, but rather which combinations or sets of potential functionings are open to me.

For example, suppose I am a low-skilled poor single parent who lives in a society without decent social provisions. Take the following functionings: (1) to hold a job ... to properly feed myself and my family; (2) to care for my children at home and give them all the attention, care and supervision they need. ... (1) and (2) are opportunities open to me, but they are not both together open to me... forced to make some hard, perhaps even tragic choices between two functionings which both reflect basic needs and basic moral duties?

[emphases in source]

Although that summary of the approach and the excerpt I've copied don't articulate this 'success as enemy of freedom' idea, I wonder if it would be helpful to consider the idea with the lens of the capability approach? There's a certain paradoxical symmetry of the examples, which I think is what the OP is drawing attention to. The challenge would be to draw out whether the mechanism is societal or part of human nature or some combination thereof or (...) and what measures we might take (individually or collectively) to mitigate it!

Comment by Oliver Sourbut on Probability vs Likelihood · 2020-11-16T11:27:41.538Z · LW · GW

I like the emphasis on a type distinction between likelihoods and probabilities, thanks for articulating it!

You seem to ponder a type distinction between prior and posterior probabilities (and ask for English language terminology which might align with that). I can think of a few word-pairings which might be relevant.

Credibility ('credible'/'incredible')

Could be useful for talking about posterior, since it nicely aligns with the concept of a credible interval/region on a Bayesian parameter after evidence.

After gathering evidence, it becomes credible that...

...strongly contradicts our results, and as such we consider it incredible...

Plausibility ('plausible'/'implausible')

Not sure! To me it could connote a prior sort of estimate

It seems implausible on the face of it that the chef killed him. But let's consult the evidence.

The following are plausible hypotheses:...

But perhaps unhelpfully I think it could also connote the relationship between a model and a piece of evidence, which I think would correspond to likelihood.

Ah, that's a more plausible explanation of what we're seeing!

Completely implausible: the chef would have had to pass the housekeeper in the narrow hallway without her noticing...

Aleatoric and epistemic uncertainty

There's also a probably-meaningful type distinction between aleatoric uncertainty (aka statistical uncertainty) and epistemic uncertainty, where aleatoric uncertainty refers to things which are 'truly' random (at the level of abstraction we are considering them), even should we know the 'true underlying distribution' (like rolling dice), and epistemic uncertainty refers to aspects of the domain which may in reality be fixed and determined, but which we don't know (like the weighting of a die).

I find it helpful to try to distinguish these, though in the real world the line is not necessarily clear-cut and it might be a matter of level of abstraction. For example it might in principle be possible to compute the exact dynamics of a rolling die in a particular circumstance, reducing aleatoric uncertainty to epistemic uncertainty about its exact weighting and starting position/velocity etc. The same could be said about many chaotic systems (like weather).
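A small sketch of the die example may help separate the two. The assumptions are mine: there are only two hypotheses about the die (the `FAIR` and `LOADED` distributions below are made up), and I measure each kind of uncertainty with Shannon entropy. Epistemic uncertainty is the entropy of the posterior over which die we hold, and it shrinks as rolls are observed. Aleatoric uncertainty is the entropy of a single roll given the true die, and it stays put however many rolls we see.

```python
import math
import random

FAIR = [1 / 6] * 6
LOADED = [0.1] * 5 + [0.5]  # hypothetical loaded die favouring six

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def posterior_fair(rolls, prior_fair=0.5):
    """Posterior probability that the die is fair, given observed faces (1..6)."""
    log_fair = math.log(prior_fair) + sum(math.log(FAIR[r - 1]) for r in rolls)
    log_loaded = math.log(1 - prior_fair) + sum(math.log(LOADED[r - 1]) for r in rolls)
    m = max(log_fair, log_loaded)  # subtract the max for numerical stability
    w_fair, w_loaded = math.exp(log_fair - m), math.exp(log_loaded - m)
    return w_fair / (w_fair + w_loaded)

rng = random.Random(0)
rolls = rng.choices(range(1, 7), weights=LOADED, k=200)  # the die is in fact loaded

for n in (0, 10, 50, 200):
    p_fair = posterior_fair(rolls[:n])
    epistemic = entropy([p_fair, 1 - p_fair])
    print(f"after {n:3d} rolls: epistemic uncertainty = {epistemic:.4f} bits")

print(f"aleatoric uncertainty of one roll of the loaded die = {entropy(LOADED):.3f} bits")
```

The blurry line mentioned above shows up here too: if we modelled the physics of each throw, some of what this sketch counts as aleatoric would move into the epistemic column.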

Comment by Oliver Sourbut on When Money Is Abundant, Knowledge Is The Real Wealth · 2020-11-06T15:25:00.070Z · LW · GW

Great summary! A nit:

our lizard-brains love politics

it's more likely our monkey (or ape) brains that love politics. e.g.

On the note of monkey-business - what about investments in collective knowledge and collaboration? If you've not come across this, you might like it

EDIT to add some colour to my endorsement of the 80000hours link: I've personally found it beneficial in a few ways. One such is that although the value of coordination is 'obvious', I nevertheless have recognised in myself some of the traits of 'single-player thinking' described.