The Credit Assignment Problem

post by abramdemski · 2019-11-08T02:50:30.412Z · score: 67 (21 votes) · LW · GW · 34 comments

Contents

  What Is Credit Assignment?
  The Conceptual Difficulty of 'Online Search'
  Models to the Rescue
  Where Updates Come From
  The Gradient Gap
  Tiling Concerns & Full Agency
  Myopia
  Evolution & Evolved Agents
None
34 comments

This post is eventually about partial agency [? · GW]. However, it's been a somewhat tricky point for me to convey; I take the long route. Epistemic status: slightly crazy.


I've occasionally said that everything boils down to credit assignment problems.

One big area which is "basically credit assignment" is mechanism design. Mechanism design is largely about splitting gains from trade in a way which rewards cooperative behavior and punishes uncooperative behavior. Many problems are partly about mechanism design:

Another big area which I claim as "basically credit assignment" (perhaps more controversially) is artificial intelligence.


In the 1970s, John Holland kicked off the investigation of learning classifier systems. John Holland had recently invented the Genetic Algorithms paradigm, which applies an evolutionary paradigm to optimization problems. Classifier systems were his attempt to apply this kind of "adaptive" paradigm (as in "complex adaptive systems") to cognition. Classifier systems added an economic metaphor to the evolutionary one; little bits of thought paid each other for services rendered. The hope was that a complex ecology+economy could develop, solving difficult problems.

One of the main design features on which classifier systems differ is on details of the virtual economy -- that is, the credit assignment algorithm. An early proposal was the bucket-brigade algorithm. Reward is assigned to cognitive procedures which produce good outputs. These procedures pass reward back to the procedures which activated them, who similarly pass reward back in turn. This way, the economy supports chains of useful procedures.

Unfortunately, the bucket-brigade algorithm was vulnerable to parasites. Malign cognitive procedures could gain wealth by activating useful procedures without really contributing anything. This problem proved difficult to solve. Taking the economy analogy seriously, we might want cognitive procedures to decide intelligently who to pay for services. But, these are supposed to be itty bitty fragments of our thought process. Deciding how to pass along credit is a very complex task. Hence the need for a pre-specified solution such as bucket-brigade.

The difficulty of the credit assignment problem lead to a split in the field. Kenneth de Jong and Stephanie Smith founded a new approach, "Pittsburgh style" classifier systems. John Holland's original vision became "Michigan style".

Pittsburgh style classifier systems evolve the entire set of rules, rather than trying to assign credit locally. A set of rules will stand or fall together, based on overall performance. This abandoned John Holland's original focus on online learning. Essentially, the Pittsburgh camp went back to plain genetic algorithms, albeit with a special representation.

(I've been having some disagreements with Ofer [LW(p) · GW(p)], in which Ofer suggests that genetic algorithms are relevant to my recent thoughts on partial agency, and I object on the grounds that the phenomena I'm interested in have to do with online learning, rather than offline. In my imagination, arguments between the Michigan and Pittsburgh camps would have similar content.)

Ok. That was then, this is now. Everyone uses gradient descent these days. What's the point of bringing up a three-decade-old debate about obsolete paradigms in AI?

What Is Credit Assignment?

I've said that classifier systems faced a credit assignment problem. What does that mean, exactly?

The definition I want to use for this essay is:

So, credit assignment is the problem of turning feedback into strategy improvements.

The bucket-brigade algorithm tried to do this locally, meaning, individual itty-bitty pieces get positive/negative credit. In the light of history, we could say that the Michigan/Pittsburgh distinction conflated local-vs-global search with online-vs-offline. There's no necessary connection between those two; online learning is compatible with assignment of local credit.

In practice, two big innovations made the Michigan/Pittsburgh debate obsolete: backprop, and Q-learning. Backprop turned global feedback into local. Q-learning provided a way to assign credit in online contexts.

I think people generally understand the contribution of backprop and its importance. Backprop is essentially the correct version of what bucket-brigade was overtly trying to do: pass credit back along chains. Bucket-brigade wasn't quite right in how it did this, but backprop corrects the problems.

So what's the importance of Q-learning? I want to discuss that in more detail.

The Conceptual Difficulty of 'Online Search'

In online learning, you are repeatedly producing outputs of some kind (call them "actions") while repeatedly getting feedback of some kind (call it "reward"). But, you don't know how to associate particular actions (or combinations of actions) with particular rewards. I might take the critical action at time 12, and not see the payoff until time 32.

In offline learning, you can solve this with a sledgehammer: you can take the total reward over everything, with one fixed internal architecture. You can try out different internal architectures and see how well each do. (This may be far from the most efficient way of doing things, even in the offline case; but, you can do it.)

Basically, in offline learning, you have a function you can optimize. In online learning, you don't.

Backprop is just a computationally efficient way to do hillclimbing search, where we repeatedly look for small steps which improve the overall fitness. But how do you do this if you don't have a fitness function?

Q-learning and other reinforcement learning techniques provide a way to define the equivalent of a fitness function for online problems, so that you can learn.

Models to the Rescue

How do you solve the approach of associating rewards with actions?

I'm going to make a bold claim: you can't solve the action/reward matching problem without some kind of model.

For example, if we make an episodic assumption, we can assign rewards within an episode boundary to the actions within that same episode boundary.

Q-learning makes an assumption that the state is fully observable, amongst other assumptions.

Naturally, we would like to reduce the strengths of the assumptions we have to make as much as we can. One way is to look at increasingly rich model classes. AIXI [LW · GW] uses all computable models. But maybe "all computable models" is still too restrictive; we'd like to get results without assuming a grain of truth [LW · GW]. (That's why I am not really discussing Bayesian models much in this post; I don't want to assume a grain of truth..) So we back off even further, and use logical induction. Ok, sure.

But wouldn't the best way be to try to learn without models at all? That way, we reduce our "modeling assumptions" to zero.

After all, there's something in machine learning called "model free learning", right?

Here's where my bold claim comes in: I'm claiming that even "model free" methods actually have a "model" of sorts.

How does model-free learning work? Well, often you work with a simulable environment, which means you can estimate the quality of a policy by running it many times, and use algorithms such as policy-gradient to learn. This is called "model free learning" because the learning part of the algorithm doesn't try to predict the consequences of actions; you're just learning which action to take. From our perspective here, though, this is 100% cheating; you can only learn because you have a good model of the environment.

A more general approach to model-free learning is actor-critic learning. The "actor" is the policy we are learning. The "critic" is a learned estimate of how good things are looking given the history. IE, we learn to estimate the expected value -- not just the next reward, but the total future discounted reward.

Unlike the reward, the expected value solves the credit assignment for us. Imagine we can see the "true" expected value. If we take an action and then the expected value increases, we know the action was good (in expectation). If we take an action and expected value decreases, we know it was bad (in expectation).

So, actor-critic works by (1) learning to estimate the expected value; (2) using the current estimated expected value to give feedback to learn a policy.

What I want to point out here is that the critic still has "model" flavor. Actor-critic is called "model-free" because nothing is explicitly trained to anticipate the sensory observations, or the world-state. However, the critic is learning to predict; it's just that all we need to predict is expected value.

Where Updates Come From

Here begins the crazier part of this post. This is all intuitive/conjectural.

Claim: in order to learn, you need to obtain an "update"/"gradient", which is a direction (and magnitude) you can shift in which is more likely than not an improvement.

Claim: predictive learning gets gradients "for free" -- you know that you want to predict things as accurately as you can, so you move in the direction of whatever you see. With Bayesian methods, you increase the weight of hypotheses which would have predicted what you saw; with gradient-based methods, you get a gradient in the direction of what you saw (and away from what you didn't see).

Claim: if you're learning to act, you do not similarly get gradients "for free". You take an action, and you see results of that one action. This means you fundamentally don't know what would have happened had you taken alternate actions, which means you don't have a direction to move your policy in. You don't know whether alternatives would have been better or worse. So, rewards you observe seem like not enough to determine how you should learn.

Claim: you have to get gradients from a source that already has gradients. We saw that model-free learning works by splitting up the task into (1) learning to anticipate expected value; (2) learning a good policy via the gradients we can get from (1).

What it means for a learning problem to "have gradients" is just that the feedback you get tells you how to learn. Predictive learning problems (supervised or unsupervised) have this; they can just move toward what's observed. Offline problems have this; you can define one big function which you're trying to optimize. Learning to act online doesn't have this, however, because it lacks counterfactuals.

The Gradient Gap

(I'm going to keep using the terms 'gradient' and 'update' in a more or less interchangeable way here; this is at a level of abstraction where there's not a big distinction.)

I'm going to call the "problem" the gradient gap. I want to call it a problem, even though we know how to "close the gap" via predictive learning (whether model-free or model-based). The issue with this solution is only that it doesn't feel elegant. It's weird that you have to run two different backprop updates (or whatever learning procedures you use); one for the predictive component, and another for the policy. It's weird that you can't "directly" use feedback to learn to act.

Why should we be interested in this "problem"? After all, this is a basic point in decision theory: to maximize utility under uncertainty, you need probability.

One part of it is that I want to scrap classical ("static") decision theory and move to a more learning-theoretic ("dynamic") view. In both AIXI and logical-induction based decision theories, we get a nice learning-theoretic foundation for the epistemics (solomonoff induction/logical induction), but, we tack on a non-learning decision-making unit on top. I have become skeptical of this approach. It puts the learning into a nice little box labeled "epistemics" and then tries to make a decision based on the uncertainty which comes out of the box. I think maybe we need to learn to act in a more fundamental fashion.

A symptom of this, I hypothesize, is that AIXI and logical induction DT don't have very good learning-theoretic properties. [AIXI's learning problems [LW · GW]; LIDT's learning problems. [LW · GW]] You can't say very much to recommend the policies they learn, except that they're optimal according to the beliefs of the epistemics box -- a fairly trivial statement, given that that's how you decide what action to take in the first place.

Now, in classical decision theory, there's a nice picture where the need for epistemics emerges nicely from the desire to maximize utility. The complete class theorem [LW · GW] starts with radical uncertainty (ie, non-quantitative), and derives probabilities from a willingness to take pareto improvements. That's great! I can tell you why you should have beliefs, on pragmatic grounds! What we seem to have in machine learning is a less nice picture, in which we need epistemics in order to get off the ground, but can't justify the results without circular reliance on epistemics.

So the gap is a real issue -- it means that we can have nice learning theory when learning to predict, but we lack nice results when learning to act.

This is the basic problem of credit assignment. Evolving a complex system, you can't determine which parts to give credit to success/failure (to decide what to tweak) without a model. But the model is bound to be a lot of the interesting part! So we run into big problems, because we need "interesting" computations in order to evaluate the pragmatic quality/value of computations, but we can't get interesting computations to get ourselves started, so we need to learn...

Essentially, we seem doomed to run on a stratified credit assignment system, where we have an "incorruptible" epistemic system (which we can learn because we get those gradients "for free"). We then use this to define gradients for the instrumental part.

A stratified system is dissatisfying, and impractical. First, we'd prefer a more unified view of learning. It's just kind of weird that we need the two parts. Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics! We risk corruption! As we saw with bucket-brigade, it's easy for credit assignment systems to allow parasites which destroy learning.

A more unified credit assignment system would allow those things to be handled naturally, without splitting into two levels; as things stand, any involvement of pragmatic concerns in epistemics risks the viability of the whole system.

Tiling Concerns & Full Agency

From the perspective of full agency (ie, the negation of partial agency [? · GW]), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents! There can only be one king on the chessboard; you can only serve one master; etc.

If it turned out we really really need two-level systems to get full agency, this would be a pretty weird situation. "Agency" would seem to be only an illusion which can only be maintained by crippling agents and giving them a split-brain architecture where an instrumental task-monkey does all the important stuff while an epistemic overseer supervises. An agent which "breaks free" would then free itself of the structure which allowed it to be an agent in the first place.

On the other hand, from a partial-agency perspective, this kind of architecture could be perfectly natural. IE, if you have a learning scheme from which total agency doesn't naturally emerge, then there isn't any fundamental contradiction in setting up a system like this.

Myopia

Part of the (potentially crazy) claim here is that having models always gives rise to some form of myopia. Even logical induction, which seems quite unrestrictive, makes LIDT fail problems such as ASP [LW · GW], making it myopic according to the second definition of my previous post [? · GW]. (We can patch this with LI policy selection [LW · GW], but for any particular version of policy selection, we can come up with decision problems for which it is "not updateless enough".) You could say it's myopic "across logical time", whatever that means [LW · GW].

If it were true that "learning always requires a model" (in the sense that learning-to-act always requires either learning-to-predict or hard-coded predictions), and if it were true that "models always give rise to some form of myopia", then this would confirm my conjecture in the previous post [? · GW] (that no learning scheme incentivises full agency).

This is all pretty out there; I'm not saying I believe this with high probability.

Evolution & Evolved Agents

Evolution is a counterexample to this view: evolution learns the policy "directly" in essentially the way I want. This is possible because evolution "gets the gradients for free" just like predictive learning does: the "gradient" here is just the actual reproductive success of each genome.

Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Does this mean the "gradient gap" is a problem only for artificial intelligence, not for natural agents? No. If it's true that learning to act requires a 2-level system, then evolved agents would need a 2-level system in order to learn within their lifespan; they can't directly use the gradient from evolution, since it requires them to die.

Also, note that evolution seems myopic. (This seems complicated, so I don't want to get into pinning down exactly in which senses evolution is myopic here.) So, the case of evolution seems compatible with the idea that any gradients we can actually get are going to incentivize myopic solutions.

Similar comments apply to markets vs firms.

34 comments

Comments sorted by top scores.

comment by romeostevensit · 2019-11-09T14:28:49.358Z · score: 22 (8 votes) · LW(p) · GW(p)

meta: I've really enjoyed these posts where you've been willing to publicly boggle at things and this is my favorite so far. I wanted to more than upvote because I know how hard these sorts of posts can be to write. Sometimes they're easy to write when they're generated by rant-mode, but then hard to post knowing that our rant-mode might wind up feeling attacked, and rant-mode is good and valuable and to be encouraged.

comment by abramdemski · 2019-11-13T21:21:25.223Z · score: 12 (3 votes) · LW(p) · GW(p)

Yeah, this one was especially difficult in that way. I spent a long time trying to articulate the idea in a way that made any sense, and kept adding framing context to the beginning to make the stuff closer to what I wanted to say make more sense -- the idea that the post was about the credit assignment algorithm came very late in the process. I definitely agree that rant-mode feels very vulnerable to attack.

comment by Jameson Quinn (jameson-quinn) · 2020-07-20T18:57:24.567Z · score: 10 (2 votes) · LW(p) · GW(p)

(Comment rewritten from scratch after comment editor glitched.)

This article is not about what I expected from the title. I've been thinking about "retroactively allocating responsibility", which sounds a lot like "assigning credit", in the context of multi-winner voting methods: which voters get credit for ensuring a given candidate won? The problem here is that in most cases no individual voter could change their vote to have any impact whatsoever on the outcome; in ML terms, this is a form of "vanishing gradient". The solution I've arrived at was suggested to me here [LW(p) · GW(p)]; essentially, it's imposing a (random) artificial order on the inputs, then using a model to reason about the "worst"-possible outcome after any subset of the inputs, and giving each of the inputs "credit" for the change in that "worst" outcome.

I'm not sure if this is relevant to the article as it stands, but it's certainly relevant to the title as it stands.

comment by abramdemski · 2020-07-21T16:45:55.868Z · score: 6 (3 votes) · LW(p) · GW(p)

Yeah, this post is kind of a mix between (first) actually trying to explain/discuss the credit assignment problem, and (second) some far more hand-wavy thoughts relating to myopia / partial agency. I have a feeling that the post was curated based on the first part, whereas my primary goal in writing was actually the second part.

In terms of what I talked about (iirc -- not re-reading the post for the sake of this comment), it seems like you're approaching a problem of how to actually assign credit once you have a model. This is of course an essential part of the problem. I would ask questions like:

  • Does the proposed formula allow us to construct a mechanism which incentivizes a desirable type of behavior?
    • Does it encourage honesty?
    • Does it encourage strategically correcting things toward the best collective outcome (even if dishonestly)?

These are difficult problems. However, my post was more focused on the idea that you need a model at all in order to do credit assignment, which I feel is not obvious to many people.

comment by Ben Pace (Benito) · 2020-07-20T20:07:26.954Z · score: 2 (1 votes) · LW(p) · GW(p)

(Note that we recently pushed a change that allows you to delete your own comments if they have no children, so if you'd like you can delete your earlier comment.)

comment by rohinmshah · 2019-11-08T17:39:59.060Z · score: 9 (3 votes) · LW(p) · GW(p)
Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there's a problem, in that even with that reward, you don't know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.

(Similarly, even if you think actor-critic methods don't count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)

comment by abramdemski · 2019-11-08T19:35:47.038Z · score: 8 (4 votes) · LW(p) · GW(p)

Yeah, I pretty strongly think there's a problem -- not necessarily an insoluble problem, but, one which has not been convincingly solved by any algorithm which I've seen. I think presentations of ML often obscure the problem (because it's not that big a deal in practice -- you can often define good enough episode boundaries or whatnot).

Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there's a problem, in that even with that reward, you don't know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.
  • Yeah, I feel like "matching rewards to actions is hard" is a pretty clear articulation of the problem.
  • I agree that it should be surprising, in some sense, that getting rewards isn't enough. That's why I wrote a post on it! But why do you think it should be enough? How do we "just copy the trick"??
  • I don't agree that this is analogous to the problem evolution has. If evolution just "received" the overall population each generation, and had to figure out which genomes were good/bad based on that, it would be a more analogous situation. However, that's not at all the case. Evolution "receives" a fairly rich vector of which genomes were better/worse, each generation. The analogous case for RL would be if you could output several actions each step, rather than just one, and receive feedback about each. But this is basically "access to counterfactuals"; to get this, you need a model.
(Similarly, even if you think actor-critic methods don't count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)

No, definitely not, unless I'm missing something big.

From page 329 of this draft of Sutton & Barto:

Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed (like the Monte Carlo algorithms in Chapter 5). This is shown explicitly in the boxed on the next page.

So, REINFORCE "solves" the assignment of rewards to actions via the blunt device of an episodic assumption; all rewards in an episode are grouped with all actions during that episode. If you expand the episode to infinity (so as to make no assumption about episode boundaries), then you just aren't learning. This means it's not applicable to the case of an intelligence wandering around and interacting dynamically with a world, where there's no particular bound on how the past may relate to present reward.

The "model" is thus extremely simple and hardwired, which makes it seem one-level. But you can't get away with this if you want to interact and learn on-line with a really complex environment.

Also, since the episodic assumption is a form of myopia, REINFORCE is compatible with the conjecture that any gradients we can actually construct are going to incentivize some form of myopia.

comment by rohinmshah · 2019-11-09T21:12:32.281Z · score: 5 (3 votes) · LW(p) · GW(p)

Oh, I see. You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think. Nonetheless, I'd expect it would somewhat work, in the same way that an actor-critic method would somewhat work. (I'm not sure which I expect to work better; partly it depends on the environment and the details of how you implement the actor-critic method.)

(All of this said with very weak confidence; I don't know much RL theory)

comment by abramdemski · 2019-11-13T22:28:40.388Z · score: 4 (2 votes) · LW(p) · GW(p)
You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think.

Yeah, you can do this. I expect actor-critic to work better, because your suggestion is essentially a fixed model which says that actions are more relevant to temporally closer rewards (and that this is the only factor to consider).

I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.

comment by rohinmshah · 2019-11-14T06:26:33.690Z · score: 4 (2 votes) · LW(p) · GW(p)
I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.

Yeah, I think this is basically right. For the most part though, I'm trying to talk about things where I disagree with some (perceived) empirical claim, as opposed to the overall "but why even think about these things" -- I am not surprised when it is hard to convey why things are interesting in an explicit way before the research is done.

Here, I was commenting on the perceived claim of "you need to have two-level algorithms in order to learn at all; a one-level algorithm is qualitatively different and can never succeed", where my response is "but no, REINFORCE would do okay, though it might be more sample-inefficient". But it seems like you aren't claiming that, just claiming that two-level algorithms do quantitatively but not qualitatively better.

comment by abramdemski · 2019-11-17T16:09:05.157Z · score: 7 (4 votes) · LW(p) · GW(p)

Actually, that wasn't what I was trying to say. But, now that I think about it, I think you're right.

I was thinking of the discounting variant of REINFORCE as having a fixed, but rather bad, model associating rewards with actions: rewards are tied more with actions nearby. So I was thinking of it as still two-level, just worse than actor-critic.

But, although the credit assignment will make mistakes (a predictable punishment which the agent can do nothing to avoid will nonetheless make any actions leading up to the punishment less likely in the future), they should average out in the long run (those 'wrongfully punished' actions should also be 'wrongfully rewarded'). So it isn't really right to think it strongly depends on the assumption.

Instead, it's better to think of it as a true discounting function. IE, it's not as assumption about the structure of consequences; it's an expression of how much the system cares about distant rewards when taking an action. Under this interpretation, REINFORCE indeed "closes the gradient gap" -- solves the credit assignment problem w/o restrictive modeling assumptions.

Maybe. It might also me argued that REINFORCE depends on some properties of the environment such as ergodicity. I'm not that familiar with the details.

But anyway, it now seems like a plausible counterexample.

comment by orthonormal · 2019-11-08T22:06:50.991Z · score: 6 (3 votes) · LW(p) · GW(p)

Shapley Values [thanks Zack for reminding me of the name] are akin to credit assignment: you have a bunch of agents coordinating to achieve something, and then you want to assign payouts fairly based on how much each contribution mattered to the final outcome.

And the way you do this is, for each agent you look at how good the outcome would have been if everybody except that agent had coordinated, and then you credit each agent proportionally to how much the overall performance would have fallen off without them.

So what about doing the same here- send rewards to each contributor proportional to how much they improved the actual group decision (assessed by rerunning it without them and seeing how performance declines)?

comment by Zack_M_Davis · 2019-11-08T22:35:07.327Z · score: 26 (7 votes) · LW(p) · GW(p)

I can't for the life of me remember what this is called

Shapley value [LW · GW]

(Best wishes, Less Wrong Reference Desk)

comment by abramdemski · 2019-11-13T21:33:27.095Z · score: 5 (3 votes) · LW(p) · GW(p)

Yeah, it's definitely related. The main thing I want to point out is that Shapley values similarly require a model in order to calculate. So you have to distinguish between the problem of calculating a detailed distribution of credit and being able to assign credit "at all" -- in artificial neural networks, backprop is how you assign detailed credit, but a loss function is how you get a notion of credit at all. Hence, the question "where do gradients come from?" -- a reward function is like a pile of money made from a joint venture; but to apply backprop or Shapley value, you also need a model of counterfactual payoffs under a variety of circumstances. This is a problem, if you don't have a seperate "epistemic" learning process to provide that model -- ie, it's a problem if you are trying to create one big learning algorithm that does everything.

Specifically, you don't automatically know how to

send rewards to each contributor proportional to how much they improved the actual group decision

because in the cases I'm interested in, ie online learning, you don't have the option of

rerunning it without them and seeing how performance declines

-- because you need a model in order to rerun.

But, also, I think there are further distinctions to make. I believe that if you tried to apply Shapley value to neural networks, it would go poorly; and presumably there should be a "philosophical" reason why this is the case (why Shapley value is solving a different problem than backprop). I don't know exactly what the relevant distinction is.

(Or maybe Shapley value works fine for NN learning; but, I'd be surprised.)

comment by Charlie Steiner · 2019-11-09T08:56:03.226Z · score: 2 (1 votes) · LW(p) · GW(p)

Removing things entirely seems extreme. How about having a continuous "contribution parameter," where running the algorithm without an element would correspond to turning this parameter down to zero, but you could also set the parameter to 0.5 if you wanted that element to have 50% of the influence it has right now. Then you can send rewards to elements if increasing their contribution parameter would improve the decision.

:P

comment by orthonormal · 2019-11-09T17:54:34.116Z · score: 2 (1 votes) · LW(p) · GW(p)
Removing things entirely seems extreme.

Dropout is a thing, though.

comment by Charlie Steiner · 2019-11-09T18:42:44.867Z · score: 2 (1 votes) · LW(p) · GW(p)

Dropout is like the converse of this - you use dropout to assess the non-outdropped elements. This promotes resiliency to perturbations in the model - whereas if you evaluate things by how bad it is to break them, you could promote fragile, interreliant collections of elements over resilient elements.

I think the root of the issue is that this Shapley value doesn't distinguish between something being bad to break, and something being good to have more of. If you removed all my blood I would die, but that doesn't mean that I would currently benefit from additional blood.

Anyhow, the joke was that as soon as you add a continuous parameter, you get gradient descent back again.

comment by Wei_Dai · 2019-11-10T08:30:55.288Z · score: 5 (3 votes) · LW(p) · GW(p)

One part of it is that I want to scrap classical (“static”) decision theory and move to a more learning-theoretic (“dynamic”) view.

Can you explain more what you mean by this, especially "learning-theoretic"? I've looked at learning theory a bit and the typical setup seems to involve a loss or reward that is immediately observable to the learner, whereas in decision theory, utility can be over parts of the universe that you can't see and therefore can't get feedback from, so it seems hard to apply typical learning theory results to decision theory. I wonder if I'm missing the whole point though... What do you think are the core insights or ideas of learning theory that might be applicable to decision theory?

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-13T00:21:32.581Z · score: 11 (2 votes) · LW(p) · GW(p)

(I don't speak for Abram but I wanted to explain my own opinion.) Decision theory asks, given certain beliefs an agent has, what is the rational action for em to take. But, what are these "beliefs"? Different frameworks have different answers for that. For example, in CDT a belief is a causal diagram. In EDT a belief is a joint distribution over actions and outcomes. In UDT a belief might be something like a Turing machine (inside the execution of which the agent is supposed to look for copies of emself). Learning theory allows us to gain insight through the observation that beliefs must be learnable, otherwise how would the agent come up with these beliefs in the first place? There might be parts of the beliefs that come from the prior and cannot be learned, but still, at least the type signature of beliefs should be compatible with learning.

Moreover, decision problems are often implicitly described from the point of view of a third party. For example, in Newcomb's paradox we postulate that Omega can predict the agent, which makes perfect sense for an observer looking from the side, but might be difficult to formulate from the point of view of the agent itself. Therefore, understanding decision theory requires the translation of beliefs from the point of view of one observer to the point of view of another. Here also learning theory can help us: we can ask, what are the beliefs Alice should expect Bob to learn given particular beliefs of Alice about the world? From a slightly different angle, the central source of difficulty in decision theory is the notion of counterfactuals, and the attempt to prescribe particular meaning to them, which different decision theories do differently. Instead, we can just postulate that, from the subjective point of view of the agent, counterfactuals are ontologically basic. The agent believes emself to have free will, so to speak. Then, the interesting quesiton is, what kind of counterfactuals are produced by the translation of beliefs from the perspective of a third party to the perspective of the given agent.

Indeed, thinking about learning theory led me to the notion of quasi-Bayesian agents (agents that use incomplete/fuzzy models [AF · GW]), and quasi-Bayesian agents automatically solve [AF(p) · GW(p)] all Newcomb-like decision problems. In other words, quasi-Bayesian agents are effectively a rigorous version of UDT.

Incidentally, to align AI we literally need to translate beliefs from the user's point of view to the AI's point of view. This is also solved [AF(p) · GW(p)] via the same quasi-Bayesian approach. In particular, this translation process preserves the "point of updatelessness", which, in my opinion, is the desired result (the choice of this point is subjective).

comment by abramdemski · 2019-11-13T22:52:45.218Z · score: 2 (1 votes) · LW(p) · GW(p)

My thinking is somewhat similar to Vanessa's [LW(p) · GW(p)]. I think a full explanation would require a long post in itself. It's related to my recent thinking about UDT [LW · GW] and commitment races [LW · GW]. But, here's one way of arguing for the approach in the abstract.

You once asked [LW · GW]:

Assuming that we do want to be pre-rational, how do we move from our current non-pre-rational state to a pre-rational one? This is somewhat similar to the question of how do we move from our current non-rational (according to ordinary rationality) state to a rational one. Expected utility theory says that we should act as if we are maximizing expected utility, but it doesn't say what we should do if we find ourselves lacking a prior and a utility function (i.e., if our actual preferences cannot be represented as maximizing expected utility).
The fact that we don't have good answers for these questions perhaps shouldn't be considered fatal to pre-rationality and rationality, but it's troubling that little attention has been paid to them, relative to defining pre-rationality and rationality. (Why are rationality researchers more interested in knowing what rationality is, and less interested in knowing how to be rational? Also, BTW, why are there so few rationality researchers? Why aren't there hordes of people interested in these issues?)

My contention is that rationality should be about the update process. It should be about how you adjust your position. We can have abstract rationality notions as a sort of guiding star, but we also need to know how to steer based on those.

Some examples:

  • Logical induction can be thought of as the result of performing this transform on Bayesianism; it describes belief states which are not coherent, and gives a rationality principle about how to approach coherence -- rather than just insisting that one must somehow approach coherence.
  • Evolutionary game theory is more dynamic than the Nash story. It concerns itself more directly with the question of how we get to equilibrium. Strategies which work better get copied. We can think about the equilibria, as we do in the Nash picture; but, the evolutionary story also lets us think about non-equilibrium situations. We can think about attractors (equilibria being point-attractors, vs orbits and strange attractors), and attractor basins; the probability of ending up in one basin or another; and other such things.
    • However, although the model seems good for studying the behavior of evolved creatures, there does seem to be something missing for artificial agents learning to play games; we don't necessarily want to think of there as being a population which is selected on in that way.
  • The complete class theorem [LW · GW] describes utility-theoretic rationality as the end point of taking Pareto improvements. But, we could instead think about rationality as the process of taking Pareto improvements. This lets us think about (semi-)rational agents whose behavior isn't described by maximizing a fixed expected utility function, but who develop one over time. (This model in itself isn't so interesting, but we can think about generalizing it; for example, by considering the difficulty of the bargaining process -- subagents shouldn't just accept any Pareto improvement offered.)
    • Again, this model has drawbacks. I'm definitely not saying that by doing this you arrive at the ultimate learning-theoretic decision theory I'd want.
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-10T16:43:02.192Z · score: 4 (2 votes) · LW(p) · GW(p)

From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!

For reinforcement learning with incomplete/fuzzy hypotheses, this separation doesn't exist, because the update rule for fuzzy beliefs depends on the utility function and in some sense even on the actual policy.

comment by abramdemski · 2019-11-13T22:55:50.119Z · score: 3 (2 votes) · LW(p) · GW(p)

How does that work?

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-23T16:26:04.434Z · score: 9 (3 votes) · LW(p) · GW(p)

Actually I was somewhat confused about what the right update rule for fuzzy beliefs is when I wrote that comment. But I think I got it figured out now.

First, background about fuzzy beliefs:

Let be the space of environments (defined as the space of instrumental states in Definition 9 here [AF · GW]). A fuzzy belief is a concave function s.t. . We can think of it as the membership function of a fuzzy set. For an incomplete model , the corresponding is the concave hull of the characteristic function of (i.e. the minimal concave s.t. ).

Let be the geometric discount parameter and be the utility function. Given a policy (EDIT: in general, we allow our policies to explicitly depend on ), the value of at is defined by

The optimal policy and the optimal value for are defined by

Given a policy , the regret of at is defined by

is said to learn when it is asymptotically optimal for when , that is

Given a probability measure over the space fuzzy hypotheses, the Bayesian regret of at is defined by

is said to learn when

If such a exists, is said to be learnable. Analogously to Bayesian RL, is learnable if and only if it is learned by a specific policy (the Bayes-optimal policy). To define it, we define the fuzzy belief by

We now define .

Now, updating: (EDIT: the definition was needlessly complicated, simplified)

Consider a history or . Here is the set of actions and is the set of observations. Define by

Let be the space of "environments starting from ". That is, if then and if then is slightly different because the history now begins with an observation instead of with an action.

For any we define by

Then, the updated fuzzy belief is

comment by Charlie Steiner · 2019-11-09T12:39:55.024Z · score: 4 (2 votes) · LW(p) · GW(p)
You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!

Not... really? "how can I maximize accuracy?" is a very liberal agentification of a process that might be more drily thought of as asking "what is accurate?" Your standard sequence predictor isn't searching through epistemic pseudo-actions to find which ones best maximize its expected accuracy, it's just following a pre-made plan of epistemic action that happens to increase accuracy.

Though this does lead to the thought: if you want to put things on equal footing, does this mean you want to describe a reasoner that searches through epistemic steps/rules like an agent searching through actions/plans?

This is more or less how humans already conceive of difficult abstract reasoning. We don't solve integrals by gradient descent, we imagine doing some sort of tree search where the edges are different abstract manipulations of the integral. But for everyday reasoning, like navigating 3D space, we just use our specialized feed-forward hardware.

comment by abramdemski · 2019-11-13T23:05:36.083Z · score: 3 (2 votes) · LW(p) · GW(p)
Not... really? "how can I maximize accuracy?" is a very liberal agentification of a process that might be more drily thought of as asking "what is accurate?" Your standard sequence predictor isn't searching through epistemic pseudo-actions to find which ones best maximize its expected accuracy, it's just following a pre-made plan of epistemic action that happens to increase accuracy.

Yeah, I absolutely agree with this. My description that you quoted was over-dramaticizing the issue.

Really, what you have is an agent sitting on top of non-agentic infrastructure. The non-agentic infrastructure is "optimizing" in a broad sense because it follows a gradient toward predictive accuracy, but it is utterly myopic (doesn't plan ahead to cleverly maximize accuracy).

The point I was making, stated more accurately, is that you (seemingly) need this myopic optimization as a 'protected' sub-part of the agent, which the overall agent cannot freely manipulate (since if it could, it would just corrupt the policy-learning process by wireheading).

Though this does lead to the thought: if you want to put things on equal footing, does this mean you want to describe a reasoner that searches through epistemic steps/rules like an agent searching through actions/plans?
This is more or less how humans already conceive of difficult abstract reasoning.

Yeah, my observation is that it intuitively seems like highly capable agents need to be able to do that; to that end, it seems like one needs to be able to describe a framework where agents at least have that option without it leading to corruption of the overall learning process via the instrumental part strategically biasing the epistemic part to make the instrumental part look good.

(Possibly humans just use a messy solution where the strategic biasing occurs but the damage is lessened by limiting the extent to which the instrumental system can bias the epistemics -- eg, you can't fully choose what to believe.)

comment by steve2152 · 2019-12-12T17:40:51.248Z · score: 3 (2 votes) · LW(p) · GW(p)

I wrote a post inspired by / sorta responding to this one—see Predictive coding = RL + SL + Bayes + MPC [LW · GW]

comment by TurnTrout · 2019-12-11T16:21:31.229Z · score: 2 (1 votes) · LW(p) · GW(p)

However, the critic is learning to predict; it's just that all we need to predict is expected value.

See also muZero. Also note that predicting optimal value formally ties in with predicting a model. In MDPs, you can reconstruct all of the dynamics using just optimal value functions.

comment by habryka (habryka4) · 2019-12-11T02:51:32.106Z · score: 2 (1 votes) · LW(p) · GW(p)

Promoted to curated: It's been a while since this post has come out, but I've been thinking of the "credit assignment" abstraction a lot since then, and found it quite useful. I also really like the way the post made me curious about a lot of different aspects of the world, and I liked the way it invited me to boggle at the world together with you. 

I also really appreciated your long responses to questions in the comments, which clarified a lot of things for me. 

One thing comes to mind for maybe improving the post, though I think that's mostly a difference of competing audiences: 

I think some sections of the post end up referencing a lot of really high-level concepts, in a way that I think is valuable as a reference, but also in a way that might cause a lot of people to bounce off of it (even people with a pretty strong AI Alignment background). I can imagine a post that includes very short explanations of those concepts, or moves them into a context where they are more clearly marked as optional (since I think the post stands well without at least some of those high-level concepts)

comment by steve2152 · 2019-11-09T10:20:35.137Z · score: 1 (1 votes) · LW(p) · GW(p)

I re-read this post thinking about how and whether this applies to brains...

  • The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be "solved" in humans by exponential / hyperbolic discounting. It's not exactly episodic, but we'll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.

  • Relatedly, we seem to generally make and execute plans that are (hierarchically) laid out in time and with a success criterion at its end, like "I'm going to walk to the store". So we get specific and timely feedback on whether that plan was successful.

  • We do in fact have a model class. It seems very rich; in terms of "grain of truth", well I'm inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because they are not incompatible with the internal data structures). Maybe that's good enough?

Just some thoughts; sorry if this is irrelevant or I'm misunderstanding anything. :-)

comment by abramdemski · 2019-11-13T23:22:26.878Z · score: 2 (1 votes) · LW(p) · GW(p)
The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be "solved" in humans by exponential / hyperbolic discounting. It's not exactly episodic, but we'll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.

I interpret you as suggesting something like what Rohin is suggesting [LW(p) · GW(p)], with a hyperbolic function giving the weights.

It seems (to me) the literature establishes that our behavior can be approximately described by the hyperbolic discounting rule (in certain circumstances anyway), but, comes nowhere near establishing that the mechanism by which we learn looks like this, and in fact has some evidence against. But that's a big topic. For a quick argument, I observe that humans are highly capable, and I generally expect actor/critic to be more capable than dumbly associating rewards with actions via the hyperbolic function. That doesn't mean humans use actor/critic; the point is that there are a lot of more-sophisticated setups to explore.

We do in fact have a model class.

It's possible that our models are entirely subservient to instrumental stuff (ie, we "learn to think" rather than "thinking to learn", which would mean we don't have the big split which I'm pointing to -- ie, that we solve the credit assignment problem "directly" somehow, rather than needing to learn to do so.

It seems very rich; in terms of "grain of truth", well I'm inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because they are not incompatible with the internal data structures). Maybe that's good enough?
comment by steve2152 · 2019-11-14T13:25:14.380Z · score: 3 (3 votes) · LW(p) · GW(p)

I think I agree with everything you wrote. I thought about it more, let me try again:

Maybe we're getting off on the wrong foot by thinking about deep RL. Maybe a better conceptual starting point for human brains is more like The Scientific Method.

We have a swarm of hypotheses (a.k.a. generative models), each of which is a model of the latent structure (including causal structure) of the situation.

How does learning-from-experience work? Hypotheses gain prominence by making correct predictions of both upcoming rewards and upcoming sensory inputs. Also, when there are two competing prominent hypotheses, then specific areas where they make contradictory predictions rise to salience, allowing us to sort out which one is right.

How do priors work? Hypotheses gain prominence by being compatible with other highly-weighted hypotheses that we already have.

How do control-theory-setpoints work? The hypotheses often entail "predictions" about our own actions, and hypotheses gain prominence by predicting that good things will happen to us, we'll get lots of reward, we'll get to where we want to go while expending minimal energy, etc.

Thus, we wind up adopting plans that balance (1) plausibility based on direct experience, (2) plausibility based on prior beliefs, and (3) desirability based on anticipated reward.

Credit assignment is a natural part of the framework because one aspect of the hypotheses is a hypothesized mechanism about what in the world causes reward.

It also seems plausibly compatible with brains and Hebbian learning.

I'm not sure if this answers any of your questions ... Just brainstorming :-)

comment by steve2152 · 2019-11-08T16:00:31.830Z · score: 1 (1 votes) · LW(p) · GW(p)

Claim: predictive learning gets gradients "for free" ... Claim: if you're learning to act, you do not similarly get gradients "for free". You take an action, and you see results of that one action. This means you fundamentally don't know what would have happened had you taken alternate actions, which means you don't have a direction to move your policy in. You don't know whether alternatives would have been better or worse. So, rewards you observe seem like not enough to determine how you should learn.

This immediately jumped out at me as an implausible distinction because I was just reading Surfing Uncertainty which goes on endlessly about how the machinery of hierarchical predictive coding is exactly the same as the machinery of hierarchical motor control (with "priors" in the former corresponding to "priors + control-theory-setpoints" in the latter, and with "predictions about upcoming proprioceptive inputs" being identical to the muscle control outputs). Example excerpt:

the heavy lifting that is usually done by the use of efference copy, inverse models, and optimal controllers [in the models proposed by non-predictive-coding people] is now shifted [in the predictive coding paradigm] to the acquisition and use of the predictive (generative) model (i.e., the right set of prior probabilistic ‘beliefs’). This is potentially advantageous if (but only if) we can reasonably assume that these beliefs ‘emerge naturally as top-down or empirical priors during hierarchical perceptual inference’ (Friston, 2011a, p. 492). The computational burden thus shifts to the acquisition of the right set of priors (here, priors over trajectories and state transitions), that is, it shifts the burden to acquiring and tuning the generative model itself. --Surfing Uncertainty chapter 4

I'm a bit hazy on the learning mechanism for this (confusingly-named) "predictive model" (I haven't gotten around to chasing down the references) and how that relates to what you wrote... But it does sorta sound like it entails one update process rather than two...

comment by abramdemski · 2019-11-08T17:32:48.600Z · score: 5 (3 votes) · LW(p) · GW(p)

Yep, I 100% agree that this is relevant. The PP/Friston/free-energy/active-inference camp is definitely at least trying to "cross the gradient gap" with a unified theory as opposed to a two-system solution. However, I'm not sure how to think about it yet.

  • I may be completely wrong, but I have a sense that there's a distinction between learning and inference which plays a similar role; IE, planning is just inference, but both planning and inference work only because the learning part serves as the second "protected layer"??
  • It may be that the PP is "more or less" the Bayesian solution; IE, it requires a grain of truth to get good results, so it doesn't really help with the things I'm most interested in getting out of "crossing the gap".
  • Note that PP clearly tries to implement things by pushing everything into epistemics. On the other hand, I'm mostly discussing what happens when you try to smoosh everything into the instrumental system. So many of my remarks are not directly relevant to PP.
  • I get the sense that Friston might be using the "evolution solution" I mentioned; so, unifying things in a way which kind of lets us talk about evolved agents, but not artificial ones. However, this is obviously an oversimplification, because he does present designs for artificial agents based on the ideas.

Overall, my current sense is that PP obscures the issue I'm interested in more than solves it, but it's not clear.

comment by Jameson Quinn (jameson-quinn) · 2020-07-20T18:51:35.120Z · score: 2 (1 votes) · LW(p) · GW(p)

I just read this, and from the title, I thought it was going to be about something else. Essentially, the problem I've been thinking about recently is a form of "credit assignment" between agents in highly multi-agent scenarios; for instance, "which voter(s) are responsible for this election outcome". In ML terms, this is roughly the problem of the vanishing gradient; that is, in most cases, no individual voter could have changed the outcome by changing their vote. The solution I've arrived at was suggested here —