Posts

Refactoring Alignment (attempt #2) 2021-07-26T20:12:15.196Z
Re-Define Intent Alignment? 2021-07-22T19:00:31.629Z
Progress, Stagnation, & Collapse 2021-07-22T16:51:04.595Z
The Homunculus Problem 2021-05-27T20:25:58.312Z
The Argument For Spoilers 2021-05-21T12:23:49.127Z
Time & Memory 2021-05-20T15:16:49.042Z
Formal Inner Alignment, Prospectus 2021-05-12T19:57:37.162Z
Fractal Conversations vs Holistic Response 2021-05-05T15:04:40.314Z
Death by Red Tape 2021-05-01T18:03:34.780Z
Gradations of Inner Alignment Obstacles 2021-04-20T22:18:18.394Z
Superrational Agents Kelly Bet Influence! 2021-04-16T22:08:18.201Z
A New Center? [Politics] [Wishful Thinking] 2021-04-12T15:19:35.430Z
My Current Take on Counterfactuals 2021-04-09T17:51:06.528Z
Reflective Bayesianism 2021-04-06T19:48:43.917Z
Affordances 2021-04-02T20:53:35.639Z
Voting-like mechanisms which address size of preferences? 2021-03-18T23:23:55.393Z
MetaPrompt: a tool for telling yourself what to do. 2021-03-16T20:49:19.693Z
Rigorous political science? 2021-03-12T15:30:53.837Z
Four Motivations for Learning Normativity 2021-03-11T20:13:40.175Z
Kelly *is* (just) about logarithmic utility 2021-03-01T20:02:08.300Z
"If You're Not a Holy Madman, You're Not Trying" 2021-02-28T18:56:19.560Z
Support vs Advice & Holding Off Solutions 2021-02-23T01:12:33.156Z
Calculating Kelly 2021-02-22T17:32:38.601Z
Mathematical Models of Progress? 2021-02-16T00:21:44.298Z
The Pointers Problem: Clarifications/Variations 2021-01-05T17:29:45.698Z
Debate Minus Factored Cognition 2020-12-29T22:59:19.641Z
Babble Challenge: Not-So-Future Coordination Tech 2020-12-21T16:48:20.515Z
Fusion and Equivocation in Korzybski's General Semantics 2020-12-21T05:44:41.064Z
Writing tools for tabooing? 2020-12-13T19:50:37.301Z
Mental Blinders from Working Within Systems 2020-12-10T19:09:50.720Z
Quick Thoughts on Immoral Mazes 2020-12-09T01:21:40.210Z
Number-guessing protocol? 2020-12-07T15:07:48.019Z
Recursive Quantilizers II 2020-12-02T15:26:30.138Z
Nash Score for Voting Techniques 2020-11-26T19:29:31.187Z
Deconstructing 321 Voting 2020-11-26T03:35:40.863Z
Normativity 2020-11-18T16:52:00.371Z
Thoughts on Voting Methods 2020-11-17T20:23:07.255Z
Signalling & Simulacra Level 3 2020-11-14T19:24:50.191Z
Learning Normativity: A Research Agenda 2020-11-11T21:59:41.053Z
Probability vs Likelihood 2020-11-10T21:28:03.934Z
Time Travel Markets for Intellectual Accounting 2020-11-09T16:58:44.276Z
Kelly Bet or Update? 2020-11-02T20:26:01.185Z
Generalize Kelly to Account for # Iterations? 2020-11-02T16:36:25.699Z
Dutch-Booking CDT: Revised Argument 2020-10-27T04:31:15.683Z
Top Time Travel Interventions? 2020-10-26T23:25:07.973Z
Babble & Prune Thoughts 2020-10-15T13:46:36.116Z
One hub, or many? 2020-10-04T16:58:40.800Z
Weird Things About Money 2020-10-03T17:13:48.772Z
"Zero Sum" is a misnomer. 2020-09-30T18:25:30.603Z
What Does "Signalling" Mean? 2020-09-16T21:19:00.968Z

Comments

Comment by abramdemski on Refactoring Alignment (attempt #2) · 2021-08-04T18:26:37.850Z · LW · GW

Seems fair. I'm similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.

Comment by abramdemski on Re-Define Intent Alignment? · 2021-08-04T18:21:55.000Z · LW · GW

Great, I feel pretty resolved about this conversation now.

Comment by abramdemski on Re-Define Intent Alignment? · 2021-08-04T18:16:44.813Z · LW · GW

I would further add that looking for difficulties created by the simplification seems very intellectually productive. (Solving "embedded agency problems" seems to genuinely allow you to do new things, rather than just soothing philosophical worries.) But yeah, I would agree that if we're defining mesa-objective anyway, we're already in the business of assuming some agent/environment boundary.

Comment by abramdemski on Re-Define Intent Alignment? · 2021-08-04T17:59:04.746Z · LW · GW

(see the unidentifiability in IRL paper)

Ah, I wasn't aware of this!

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.

For example, AIXI assumes a hard boundary between agent and environment. One manifestation of this assumption is how AIXI doesn't predict its own future actions the way it predicts everything else, and instead must explicitly plan its own future actions. This is necessary because AIXI is not computable, so treating the future self as part of the environment (and predicting it with the same predictive capabilities as usual) would violate the assumption of a computable environment. But this is unfortunate for a couple of reasons. First, it forces AIXI to have an arbitrary finite planning horizon, which is weird for something that is supposed to represent unbounded intelligence. Second, there is no reason to carry this sort of thing over to finite, computable agents; so it weakens the generality of the model by introducing a design detail that's very dependent on the specific infinite setting.
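
For concreteness, here is the shape of AIXI's action selection as I remember it (schematic, so take the details with a grain of salt); the point is the explicit nested maximization over its own future actions out to a finite horizon m, rather than prediction of them:

```latex
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \;\cdots\; \max_{a_m} \sum_{o_m r_m}
\big( r_k + \cdots + r_m \big)\,
\xi\big( o_1 r_1 \ldots o_m r_m \,\big\|\, a_1 \ldots a_m \big)
```

Here ξ is the universal mixture over computable environments; the nested maximization over future actions and the hard cutoff at m are exactly the two symptoms described above.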

Another example would be game-theoretic reasoning. Suppose I am concerned about cooperative behavior in deployed AI systems. I might work on something like the equilibrium selection problem in game theory, looking for rationality concepts which can select cooperative equilibria where they exist. However, this kind of work will typically treat a "game" as something which inherently comes with a pointer to the other agents. This limits the real-world applicability of such results, because to apply them to real AI systems, those systems would need "agent pointers" as well. This is a difficult engineering problem (creating an AI system which identifies "agents" in its environment); and even assuming away the engineering challenges, there are serious philosophical difficulties (what really counts as an "agent"?).

We could try to tackle those difficulties, but my assumption will tend to be that it'll result in fairly brittle abstractions with weird failure modes. 

Instead, I would advocate for Pavlov-like strategies which do not depend on actually identifying "agents" in order to have cooperative properties. I expect these to be more robust and present fewer technical challenges.
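
As a minimal illustration (a toy sketch with made-up payoff numbers, not a serious proposal): a Pavlov-style "win-stay, lose-shift" policy for an iterated prisoner's-dilemma-like setting conditions only on the agent's own last move and payoff, so it needs no machinery for deciding which parts of the environment are "agents":

```python
def pavlov_step(last_move, last_payoff, win_threshold=2):
    """Win-stay, lose-shift. Conditions only on the agent's own last move
    and payoff; no 'agent detection' is required. (Illustrative sketch;
    the threshold assumes standard PD payoffs T=5, R=3, P=1, S=0.)"""
    if last_move is None:                 # first round: open cooperatively
        return "C"
    if last_payoff >= win_threshold:      # "win": keep doing the same thing
        return last_move
    return "D" if last_move == "C" else "C"   # "lose": switch moves
```

Two Pavlov players recover mutual cooperation within a round or two after a defection, and the policy behaves the same way whether the thing generating the other moves is an agent or not, which is the kind of robustness I have in mind.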

Of course, this general heuristic may not turn out to apply in the specific case we are discussing. If you control the training process, then, for the duration of training, you control the agent and the environment, and these concepts seem unproblematic. However, it does seem unrealistic to really check every environment; so, it seems like to establish strong guarantees, you'd need to do worst-case reasoning over arbitrary environments, rather than checking environments in detail. This is how I was mainly interpreting jbkjr; perturbation sets could be a way to make things more feasible (at a cost).

Comment by abramdemski on Re-Define Intent Alignment? · 2021-08-04T17:33:40.835Z · LW · GW

Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)

Comment by abramdemski on Delta Strain: Fact Dump and Some Policy Takeaways · 2021-08-02T20:55:39.331Z · LW · GW

Thanks for writing this! 

It seems like the main thrust of this is to compute individual average risk and construct policy recommendations from there. I have two main objections to this approach.

  1. My utility might be closer to logarithmic in quality-adjusted life-days. For example, I think I would much prefer to lose 1/100th of my remaining life rather than take a 1/100 chance of immediate death (see the sketch after this list). Similarly, it's not clear to me that a 100% chance of a 50% loss of productivity is equivalent to a 50% chance of a 100% loss. This makes your averages seem to gloss over important distinctions.
  2. I'm not sure the individualistic approach to policy recommendation makes sense. For example, suppose vaccines are exactly 90% effective. This isn't enough for me to make big policy changes (EG, if going to a party felt unsafe before, it still feels unsafe after). However, me and my entire extended group of friends getting vaccinated makes a huge difference (the hypothetical party could now feel safe).
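
Here is a minimal numerical sketch of point 1, with made-up numbers (15,000 remaining quality-adjusted days, and death treated as being left with about one day, just to keep the logarithm finite):

```python
import math

L = 15_000        # hypothetical remaining quality-adjusted life-days
death_floor = 1   # treat death as ~1 remaining day so log() stays finite

# Option A: certain loss of 1/100th of remaining life
loss_certain = math.log(L) - math.log(0.99 * L)             # ~0.010

# Option B: 1/100 chance of immediate death
loss_risky = 0.01 * (math.log(L) - math.log(death_floor))   # ~0.096

print(loss_certain, loss_risky)
```

Under linear utility in life-days the two options are exactly equal; under log utility the gamble comes out roughly ten times worse, which is the kind of distinction an average-risk number glosses over.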

So I don't feel like your average-case cost is that relevant to whether the rationalist community should take specific precautions. The cost of a specific precaution scales linearly with the number of people taking that precaution, but the benefit scales much faster, for a community of interacting people.

I'm left feeling like you probably ended up underestimating the level of caution we should have.

Comment by abramdemski on Refactoring Alignment (attempt #2) · 2021-08-02T19:24:11.150Z · LW · GW

I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless we start to be more inclusive about what we mean by "mesa-optimizer" and "mesa-objective." I don't think those terms as defined in RFLO actually capture humans, but I definitely want to say that we're "goal-oriented" and have "intent."

But the graph structure makes perfect sense, I just am doing the mental substitution of "intent alignment means 'what the model is actually trying to do' is aligned with 'what we want it to do'." (Similar for inner robustness.)

I too am a fan of broadening this a bit, but I am not sure how to.

I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we looked for long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.

I agree with your point about using "does this definition include humans" as a filter, and I think it would be easy to mess that up (and I wasn't thinking about it explicitly until you raised the point).

However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.

Comment by abramdemski on Refactoring Alignment (attempt #2) · 2021-08-02T19:09:13.658Z · LW · GW

Maybe a very practical question about the diagram: is there a REASON for there to be no "sufficient together" linkage from "Intent Alignment" and "Robustness" up to "Behavioral Alignment"?

Leaning hard on my technical definitions:

  • Robustness: Performing well on the base objective in a wide range of circumstances.
  • Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)

These two together do not quite imply behavioral alignment, because it's possible for a model to have a human-friendly mesa-objective but be super bad at achieving it, while being super good at achieving some other objective.

So, yes, there is a little bit of gear-grinding if we try to combine the two plans like that. They aren't quite the right thing to fit together.

It's like we have a magic vending machine that can give us anything, and we have a slip of paper with our careful wish, and we put the slip of paper in the coin slot.

That being said, if we had technology for achieving both intent alignment and robustness, I expect we'd be in a pretty good position! I think the main reason not to go after both is that we may possibly be able to get away with just one of the two paths.

Comment by abramdemski on Refactoring Alignment (attempt #2) · 2021-08-02T18:47:06.306Z · LW · GW

I think there's another reason why factorization can be useful here, which is the articulation of sub-problems to try.

For example, in the process leading up to inventing logical induction, Scott came up with a bunch of smaller properties to try for. He invented systems which got desirable properties individually, then growing combinations of desirable properties, and finally, figured out how to get everything at once. However, logical induction doesn't have parts corresponding to those different subproblems.

It can be very useful to individually achieve, say, objective robustness, even if your solution doesn't fit with anyone else's solutions to any of the other sub-problems. It shows us a way to do it, which can inspire other ways to do it.

In other words: tackling the whole alignment problem at once sounds too hard. It's useful to split it up, even if our factorization doesn't guarantee that we can stick pieces back together to get a whole solution.

Though, yeah, it's obviously better if we can create a factorization of the sort you want.

Comment by abramdemski on Re-Define Intent Alignment? · 2021-08-02T18:30:22.452Z · LW · GW

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)

For myself, my reaction is "behavioral objectives also assume a system is well-described as an EU maximizer". In either case, you're assuming that you can summarize a policy by a function it optimizes; the difference is whether you think the system itself thinks explicitly in those terms.

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems. 

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.

In this picture, there is no clear distinction between terminal values and instrumental values. Something is "more terminal" if you treat it as more fixed (you resolve contradictions by updating the other values), and "more instrumental" if its value is more changeable based on other things.
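
As a toy sketch of what I mean (the representation and update rule here are invented purely for illustration): keep a table of expected values for events and locally nudge each event's value toward consistency with what actually followed it, rather than recomputing everything from a centrally represented utility function.

```python
# Toy sketch of "approximately coherent expectations" (illustrative only).
# values[event] is the agent's current expected value for that event;
# coherence is pursued by locally propagating value backwards from
# later events to earlier ones, not by consulting a utility function.
values = {}
LEARNING_RATE = 0.1

def update(event, observed_reward, next_event):
    """Nudge the expectation for `event` toward the observed reward plus
    the current expectation for whatever event followed it."""
    current = values.get(event, 0.0)
    target = observed_reward + values.get(next_event, 0.0)
    values[event] = current + LEARNING_RATE * (target - current)

# "More terminal" values are the ones held fixed when expectations conflict
# (contradictions get resolved by updating the other values); "more
# instrumental" values are the ones allowed to float.
```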

I want to be able to talk about how we can shape goals which may be messier, perhaps somewhat competing, internal representations or heuristics or proxies that determine behavior.

(Possibly you should consider my "approximately coherent expectations" idea)

Comment by abramdemski on Re-Define Intent Alignment? · 2021-08-02T18:07:32.731Z · LW · GW

They can't? Why not?

Answer 1

I meant to invoke a no-free-lunch type intuition; we can always construct worlds where some particular tool isn't useful.

My go-to would be "a world that checks what an InfraBayesian would expect, and does the opposite". This is enough for the narrow point I was trying to make (that InfraBayes does express some kind of regularity assumption about the world), but it's not very illustrative or compelling for my broader point (that InfraBayes plausibly addresses your concerns about learning theory). So I'll try to tell a better story.

Answer 2

I might be describing logically impossible (or at least uncomputable) worlds here, but here is my story:

Solomonoff Induction captures something important about the regularities we see in the universe, but it doesn't explain NN learning (or "ordinary human learning") very well, because NNs and humans mostly use very fast models which are clearly much smaller (in time-complexity and space-complexity) than the universe. (Solomonoff induction is closer to describing human science, which does use these very simple but time/space-complex models.)

So there's this remaining question of induction: why can we do induction in practice? (IE, with NNs and with nonscientific reasoning)

InfraBayes answers this question by observing that although we can't easily use Solomonoff-like models of the whole universe, there are many patterns we can take advantage of which can be articulated with partial models. 

This didn't need to be the case. We could be in a universe in which you need to fully model the low-level dynamics in order to predict things well at all.

So, a regularity which InfraBayes takes advantage of is the fact that we see multi-scale phenomena -- that simple low-level rules often give rise to simple high-level behavior as well.

I say "maybe I'm describing logically impossible worlds" here because it is hard to imagine a world where you can construct a computer but where you don't see this kind of multi-level phenomena. Mathematics is full of partial-model-type regularities; so, this has to be a world where mathematics isn't relevant (or, where mathematics itself is different).

But Solomonoff induction alone doesn't give a reason to expect this sort of regularity. So, if you imagine a world being drawn from the Solomonoff prior vs a world being drawn from a similar InfraBayes prior, I think the InfraBayes prior might actually generate worlds more like the one we find ourselves in (ie, InfraBayes contains more information about the world).

(Although actually, I don't know how to "sample from an infrabayes prior"...)

"Usefully Describe"

Maybe the "usefully" part is doing a lot of work here -- can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn't seem like any of the technical results in InfraBayes depend on some notion of "usefulness".

Part of what I meant by "usefully describe" was to contrast runnable models with non-runnable models. EG, even if Solomonoff induction turned out to be the more accurate prior for dealing with our world, it's not very useful because it endorses hypotheses which we can't efficiently run.

I mentioned that I think InfraBayes might fit the world better than Solomonoff. But what I actually predict more strongly is that if we compare time-bounded versions of both priors, time-bounded InfraBayes would do better thanks to its ability to articulate partial models.

I think it's also worth pointing out that the technical results of InfraBayes do in fact address a notion of usefulness: part of the point of InfraBayes is that it translates to decision-making learning guarantees (eg, guarantees about the performance of RL agents) better than Bayesian theories do. Namely, if there is a partial model such that the agent would achieve nontrivial reward if it believed it, then the agent will eventually do at least that well. So, to succeed, InfraBayes relies on an assumption about the world -- that there is a useful partial model. (This is the analog of the Solomonoff induction assumption that there exists a best computable model of the world.)

So although it wasn't what I was originally thinking, it would also be reasonable to interpret "usefully describe" as "describe in a way which gives nontrivial reward bounds". I would be happy to stand by this interpretation as well: as an assumption about the real world, I'm happy to assert that there are usually going to be partial models which (are accurate and) give good reward bounds.
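
Schematically (this is only the shape of the statement as I remember it, not the precise theorem): for any partial model / law Λ in the prior which the true environment μ actually satisfies, the agent's expected utility is eventually lower-bounded by the value that Λ alone guarantees:

```latex
\mu \models \Lambda
\;\;\Longrightarrow\;\;
\mathrm{EU}_{\mu}\!\left(\pi^{\text{agent}}\right)
\;\ge\;
\max_{\pi}\,\min_{\nu \,\models\, \Lambda} \mathrm{EU}_{\nu}(\pi)
\;-\; (\text{regret term vanishing in the limit})
```

So the "world assumption" is just that some law with a nontrivial right-hand side holds of the real world.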

What I Think You Should Think

I think you should think that it's plausible we will have learning-theoretic ideas which will apply directly to objects of concern, in the sense that, under some plausible assumptions about the world, we can argue for a learning-theoretic guarantee for some system we can describe, one which theoretically addresses some alignment concern.

I don't want to strongly argue that you should think this will be competitive with NNs or anything like that. Obviously I prefer worlds where that's true, but I am not trying to argue that. Even if in some sense InfraBayes (or some other theory) turns out to explain the success of NNs, that does not actually imply it'll give rise to something competitive with NNs.

I'm wondering if that's a crux for your interest. Honestly, I don't really understand what's going on behind this remark:

My central complaint about existing theoretical work is that it doesn't seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loathe to do.

Why is this your central complaint about existing theoretical work? My central complaint is that pre-existing learning theory didn't give us what we need to slot into a working alignment argument. In your presentation you listed some of those complaints, too. This seems more important to me than whether we can fully explain the success of large NNs.

My original interpretation about your remark was that you wanted to argue "learning theory makes bad assumptions about the world. To make strong arguments for alignment, we need to make more realistic assumptions. But these more realistic assumptions are necessarily of an empirical, non-theoretic nature." But I think InfraBayes in fact gets us closer to assumptions that are (a) realistic and (b) suited to arguments we want to make about alignment.

In other words, I had thought that you had (quite reasonably!) given up on learning theory because its results didn't seem relevant. I had hoped to rekindle your interest by pointing out that we can now do much better than 90s-era learning theory, in ways that seem relevant for EG objective robustness.

My personal theory about large NNs is that they act as a mixture model. It would be surprising if I told you that some genetic algorithm found a billion-bit program that described the data perfectly and then generalized well. It would be much less surprising if I told you that this billion-bit program was actually a mixture model that had been initialized randomly and then tuned by the genetic algorithm. From a Bayesian perspective, I expect a large random mixture model which then gets tuned to eliminate sub-models which are just bad on the data to be a pretty good approximation of my posterior, and therefore, I expect it to generalize well.
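
A minimal sketch of that intuition (illustrative only; this is importance-weighting a random ensemble, not a claim about what SGD literally does): start with a large random mixture of sub-models, down-weight each by how badly it fits the data, and the re-weighted mixture approximates the Bayesian posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from a small linear model with observation noise 0.1.
X = rng.normal(size=(50, 2))
true_w = np.array([1.0, -0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

# A large "mixture model": many randomly initialized linear sub-models.
components = rng.normal(size=(20_000, 2))

# "Tuning" = down-weighting sub-models that fit the data badly
# (Gaussian log-likelihood), i.e. Bayesian model averaging over the mixture.
sq_err = ((X @ components.T - y[:, None]) ** 2).sum(axis=0)
log_lik = -sq_err / (2 * 0.1**2)
weights = np.exp(log_lik - log_lik.max())
weights /= weights.sum()

w_posterior_mean = weights @ components
print(w_posterior_mean, true_w)   # the tuned mixture lands close to true_w
```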

But my beliefs about this don't seem too cruxy for my beliefs about what kind of learning theory will be useful for alignment.

Comment by abramdemski on Re-Define Intent Alignment? · 2021-08-02T16:07:15.406Z · LW · GW

No such thing is possible in reality, as an agent cannot exist without its environment, so why shouldn't we talk about the mesa-objective being over a perturbation set, too, just that it has to be some function of the model's internal features?

This makes some sense, but I don't generally trust some "perturbation set" to in fact capture the distributional shift which will be important in the real world. There has to at least be some statement that the perturbation set is actually quite broad. But I get the feeling that if we could make the right statement there, we would understand the problem in enough detail that we might have a very different framing. So, I'm not sure what to do here.

Comment by abramdemski on How can there be a godless moral world ? · 2021-07-28T18:25:13.618Z · LW · GW

I understand why there is niceness. I don't understand why I should be nice if God doesn't exist.

I mean, you could argue that being nice has many perks, but why should I desire these perks ?

You should desire them because they are good.

The buck has to stop somewhere. You suggest that the buck stops with "because God said so". I suggest that the buck stops with some specific "terminal goods" being good. So, what's the problem? (I'm hoping that you are still responding to stuff on this post.)

Comment by abramdemski on How can there be a godless moral world ? · 2021-07-28T18:20:58.320Z · LW · GW

By "a moral world", I mean that many actions or states are categorized as good or evil, and that this is a good measure to evaluate whether we should do these actions of reach these states, regardless of other measures such as expected utility or pleasure.
For example, I can understand why societies that discourage murder will probably fare better than societies that promote it. I don't understand why murder is bad, if not because God said so.

So it seems like "moral world" doesn't mean a world that is fundamentally moral, but rather a world in which things are good and bad, am I right?

Seems like most answers here don't try to answer this :P

I think it is conceptually similar to property rights. In the beginning, there is no fundamental fact of who owns what. At first, it's a game of dibs. Animals squat on territory and defend it. It's a tooth-and-claw negotiation. Eventually, we end up with a highly systematized notion of property rights, which mostly avoids conflict. Who owns what is mostly a straightforward fact.

Indeed, "don't steal" is part of moral reality, so this is an example, not just an analogy.

We can easily extend this story to other parts of "natural law", for example, murder. A person's life is sort of their own property, and so, murder is similar to theft.

Things like lying are a bit less analogous, but I claim that we can think of these things as the result of a complex distributed negotiation in the same way.

Skeptic: if ownership and morality are the result of a complex distributed negotiation, how can they be fundamental facts? Isn't my story contrary to that idea?

Me: A fact is just something that is true. Are you saying it's not true that (EG) I own my computer? This isn't such a strange state of affairs. An extreme materialist might say that only particles exist. But I think tables and chairs exist. Particles might be more "fundamental", but tables and chairs are real. A materialist might call tables and chairs social constructs, but this does not make them less real.

Skeptic: But you admit that property rights are the result of a complex negotiation. This suggests that the negotiation could have gone differently. Wouldn't that make morality different? How can we rely on morality, if it's arbitrary?

Me: We rely on facts that could have been different all the time. For example, I'm currently sitting in a chair which could have easily been placed in a different location. If that were the case, I could not be sitting here. So what is the objection, exactly?

Skeptic: But moral facts don't seem like that. They seem more like mathematical facts: 1+1 is necessarily 2.

Me: Well, just because our current beliefs about moral concepts are the result of a complex negotiation, doesn't mean morality is defined as the result of such a negotiation. We could be wrong about moral facts. We've changed our minds on such topics in the past, so, we could change our minds in the future.

Skeptic: Which is it, then? Are moral facts necessarily one way, or are they contingent?

Me: I don't have to make my mind up on that. I'm just trying to defend the idea that there are, indeed, moral facts.

Skeptic: I already believe that. I'm looking for an explanation of how there could be such a thing, if not because God said so.

Me: Well, I'm not 100% sure on that, but I think it's similar to looking for an explanation of tables and chairs. Which would be: it's nice to have flat surfaces to put things on. Similarly for morality: it's nice to have.

Skeptic: Using Aristotle's terminology, that's a final cause. I'm looking for a material cause, like how light is explained by photons. What do moral facts consist of?

Me: How very physical-reductionist.

Skeptic: Hardly. I think the material cause is God's word.

Me: If I say that killing is wrong, I think that fact consists of all the anguish felt by people when loved ones die, together with the fear of death, plus other negative consequences which would plague society if murder wasn't seen as wrong.

Skeptic: That seems very nonspecific.

Me: If I had described what murder consists of, it would have been simpler. The material cause of tables and chairs is simply wood and nails (or, whatever it happens to be made of). That's simple because tables are an object. The material cause of a fact like "it's convenient to set things on tables" would be much more complex, consisting of facts about how center of balance works, the height of human hands, etc. Maybe I don't think facts have nice material causes like objects do.

Skeptic: I think the fact that something is good consists of God's word. I also think the good consists of God's word.

Me: I guess I think the good is just, like, all the good things in the world taken together.

Skeptic: Then how do you decide what's on that list?

Me: I still think the fact that something is good consists of all the positive consequences.

Skeptic: Doesn't that just go into an infinite regress?

Me: Not really? I think there's some stuff that's just good, and the rest is derived from that. Like how in physics, there's got to be some stuff at the beginning which just happened (IE, the big bang). But that's efficient cause. For material cause, I guess the analogue is that particles or strings or something just are, they aren't made by any further things.

Skeptic: But isn't it dissatisfying to have something that just is good, with no explanation?

Me: You won't let me have an infinite regress, and you also won't let me cut off an infinite regress? I could similarly ask you what "God's Word" is made of, and put you in a similar dilemma. You have to stop somewhere. Or go on forever. One of the two.

Skeptic: The difference is that I've explained "good" purely in terms of something else, where you've stopped short of that.

Me: I think that's a weird standard to apply here. Particles aren't explained in terms of something else. And your story will similarly have to stop somewhere, like with "God" or something.

Skeptic: Particles can be explained with equations.

Me: Particles don't consist of equations. You asked for "material cause" type explanation, not description. I think I could describe the good purely in terms of other things. Not easily, and not completely, but I think the holes would be due to my imperfect knowledge of the good, which seems like an acceptable excuse.

Skeptic: It sounds like such a description would be long and detailed. Sort of like asking for a description of particles, and getting a big list of things that happen in experiments, the current locations of specific particles, and so on. I want a simple explanation -- the sort of thing scientists demand.

Me: Well, scientific explanations aren't necessarily material causes. Material composition can be complex, agreed? My scientific explanation is the property rights thing.

Skeptic: That was supposed to be a scientific explanation?

Me: I have in mind evolutionary game theory.

Skeptic: Wait, you're saying morality is genetic, or something?

Me: Not biological evolution -- not necessarily. Although, certainly many species including humans have genes relating to territorial behavior. But, no, I'm saying that the math of game theory relates to the evolution of property rights and other facets of morality. I don't have the full equations, like physicists have for the standard model, but I think game theory would be the place to look. I'd recommend The Evolution of the Social Contract by Skyrms, which shows how some basic ideas like fairness can emerge. I am fairly confident that these ideas explain how these concepts got into human brains in actual historical terms, at least.

Skeptic: I don't know why you think that's plausible, but let's suppose that to be true. I'm still baffled why that would make them true, or make them real. Presumably, you think similar things explain how religions got into human brains, but you don't think those things are true and real.

Me: Humans suppose that something is "real" if it's a supposition which helps them explain the world. I think religions have been surpassed by better explanations. I don't think the same is true of morality. Hence, I still think morality is real.

Skeptic: I agree that humans have a tendency to suppose things when they help make sense of the world -- but that doesn't make those things real. It sounds like you think morality is a useful mass hallucination.

Me: I wouldn't disagree with that statement, but I further think it's real.

Skeptic: Because you're going along with what you see as a useful mass hallucination.

Me: Not just that. As a general principle, it doesn't make sense to discard an idea that helps you make sense of the world, until you've found a better idea which makes more sense.

Skeptic: But this isn't like tables and chairs. It's not made of anything. It doesn't exist anywhere.

Me: It's a bit more like 1+1=2. It makes sense to postulate numbers, because they help me make sense of the world.

Skeptic: But numbers don't physically exist. They're mental constructs.

Me: I thought you weren't a materialist?

Skeptic: I thought you were.

Me: I think numbers exist in the same way other things exist. To be honest, I believe in tables and chairs, and that's already a pretty big departure from extreme physical reductionism. Numbers are a mental construct the same way tables and chairs are. That is: they're not. The mental construct describes numbers. The numbers themselves exist independently of the minds.

Skeptic: That's quite a statement.

Me: It's about what "exists" is supposed to mean. Do you want to hand over "exists" purely to physicists, and place other things, like tables and chairs, into a grey zone of "sort of exists for everyday purposes, but actually, when we're talking seriously, only physical particles and such exist"?

Skeptic: Obviously not.

Me: So don't be so reductionist. Tables and chairs exist without a simple scientific description. So morality can, too. I think there may be a good theory of morality, but I don't have to believe that in order to believe in moral facts. Why do you have to reduce morality to God's word? Don't get me wrong, reductionism is great. You should reduce things to parts if you can. But that doesn't mean things don't exist if you can't. 

This isn't quite relevant, but it seems a little related: I'm curious what you think of Beyond the Reach of God.

Comment by abramdemski on Refactoring Alignment (attempt #2) · 2021-07-28T15:13:43.912Z · LW · GW

Great! I feel like we're making progress on these basic definitions.

Comment by abramdemski on Re-Define Intent Alignment? · 2021-07-28T15:12:40.936Z · LW · GW

InfraBayes doesn't look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about "what kind of regularity assumptions can we realistically make about reality?" You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question "what useful+realistic regularity assumptions could we make?"

The InfraBayesian answer is "partial models". IE, the idea that even if reality cannot be completely described by usable models, perhaps we can aim to partially describe it. This is an assumption about the world -- not all worlds can be usefully described by partial models. However, it's a weaker assumption about the world than usual. So it may not have presented itself as an assumption about the world in your mind, since perhaps you were thinking more of stronger assumptions.

If it's a good answer, it's at least plausible that NNs work well for related reasons.

But I think it also makes sense to try to get at the useful+realistic regularity assumptions from scratch, rather than necessarily making it all about NNs.

Comment by abramdemski on DeepMind: Generally capable agents emerge from open-ended play · 2021-07-27T15:39:38.752Z · LW · GW

The machine learning stuff comes with preexisting artificial encoding. We label stuff ourselves.

Generally speaking, that's not as true as it used to be. In particular, a lot of stuff from DeepMind (such as the Atari-playing breakthrough from a while ago) works with raw video inputs. (I haven't looked at the paper from the OP to verify it's the same.)

Also, I have the impression that DeepMind takes a "copy the brain" approach fairly seriously, and they think of papers like this as relevant to that. But I am not sure of the details.

Comment by abramdemski on Refactoring Alignment (attempt #2) · 2021-07-27T15:30:43.712Z · LW · GW

Yep, fixed.

Comment by abramdemski on Refactoring Alignment (attempt #2) · 2021-07-27T15:30:14.062Z · LW · GW

I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that's done.

But it seems to me that there's something missing in terms of acceptability.

The definition of "objective robustness" I used says "aligns with the base objective" (including off-distribution). But I think this isn't an appropriate representation of your approach. Rather, "objective robustness" has to be defined something like "generalizes acceptably". Then, ideas like adversarial training and checks and balances make sense as a part of the story.

WRT your suggestions, I think there's a spectrum from "clean" to "not clean", and the ideas you propose could fall at multiple points on that spectrum (depending on how they are implemented, how much theory backs them up, etc). So, yeah, I favor "cleaner" ideas than you do, but that doesn't rule out this path for me.

Comment by abramdemski on Re-Define Intent Alignment? · 2021-07-27T14:56:21.792Z · LW · GW

All of that made perfect sense once I thought through it, and I tend to agree with most of it. I think my biggest disagreement with you is that (in your talk) you said you don't expect formal learning theory work to be relevant. I agree with your points about classical learning theory, but the alignment community has been developing basically-classical-learning-theory tools which go beyond those limitations. I'm optimistic that stuff like Vanessa's InfraBayes could help here.

Granted, there's a big question of whether that kind of thing can be competitive. (Although there could potentially be a hybrid approach.)

Comment by abramdemski on Re-Define Intent Alignment? · 2021-07-26T21:19:03.977Z · LW · GW

I've watched your talk at SERI now.

One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but, it seems just about as fraught as the notion of mesa-objective:

  1. It requires approximately the same "magic transparency tech" as we need to extract mesa-objectives.
  2. Even with magical transparency tech, it requires additional insight as to which reasoning is acceptable vs unacceptable. 

If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason? More generally, what do you think "acceptability" might look like?

(By no means do I mean to say your view is crazy; I am just looking for your explanation.)

Comment by abramdemski on Progress, Stagnation, & Collapse · 2021-07-26T15:04:19.747Z · LW · GW

Thanks, very interesting!

I still think that a good modification to the progress equation would be to lose progress if population dips below some number, and, that this predicts that severe population crashes will be amplified even further as progress is lost.

Or, in less formal terms: I still think knowledge can be lost quickly in times of chaos, particularly when population takes a nose-dive.

So I believe in the gears of the theory I was referring to, even if it's not the "primary cause"?

Comment by abramdemski on Progress, Stagnation, & Collapse · 2021-07-23T14:57:43.229Z · LW · GW

Right. My point is just that the state will heavily favor centralized technology to address these challenges, because it prefers to maintain control. Seeing Like a State illustrates how this can result in much worse productivity overall (in contrast to pre-existing noncentralized systems), while still being much better for the state (due to increased tax revenue, and diminished risk of rebellion).

Comment by abramdemski on Progress, Stagnation, & Collapse · 2021-07-23T14:54:01.299Z · LW · GW

From the perspective of the state, you want to tax that excess, and store as much of it as you can for lean times (at which point you do hand it back out to the people, to preserve the population). This was a major function of bronze age states. So yeah, it increases robustness, but the state still isn't really incentivized to let people keep wealth (except for the key players, which the state has to make happy to avoid coups).

In The Dictator's Handbook, Bruce Bueno de Mesquita presents evidence that centralized authoritarian governments (like most bronze age governments) tend to avoid enriching their citizens if they can accumulate resources without doing so. On the other hand, if the people themselves are the only available source of wealth (ie if natural resources are scarce and a state's economy must therefore rely on skilled labor and trade), the state will tend to become less authoritarian, I think.

Comment by abramdemski on Progress, Stagnation, & Collapse · 2021-07-23T14:37:03.947Z · LW · GW

https://www.lesswrong.com/posts/RXLbe6oZGxNWvQawn/the-youtube-revolution-in-knowledge-transfer

Comment by abramdemski on Progress, Stagnation, & Collapse · 2021-07-23T14:35:41.150Z · LW · GW

Fascinating!

Comment by abramdemski on Progress, Stagnation, & Collapse · 2021-07-23T14:29:01.955Z · LW · GW

Interesting, any references?

Comment by abramdemski on Re-Define Intent Alignment? · 2021-07-22T20:40:04.828Z · LW · GW

(Meta: was this meant to be a question?)

I originally conceived of it as such, but in hindsight, it doesn't seem right.

In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.

I don't think this is actually a con of the generalization-focused approach.

By no means did I intend it to be a con. I'll try to edit to clarify. I think it is a real pro of the generalization-focused approach that it does not rely on models having mesa-objectives (putting it in Evan's terms, there is a real possibility of addressing objective robustness without directly addressing inner alignment). So, focusing on objective robustness seems like a potential advantage -- it opens up more avenues of attack. Plus, the generalization-focused approach requires a much weaker notion of "outer alignment", which may be easier to achieve as well.

But, of course, it may also turn out that the only way to achieve objective robustness is to directly tackle inner alignment. And it may turn out that the weaker notion of outer alignment is insufficient in reality.

Are you the historical origin of the robustness-centric approach? I noticed that Evan's post has the modified robustness-centric diagram in it, but I don't know if it was edited to include that. The "Objective Robustness and Inner Alignment Terminology" post attributes it to you (at least, attributes a version of it to you). (I didn't look at the references there yet.)

Comment by abramdemski on The Alignment Forum should have more transparent membership standards · 2021-07-20T19:58:55.536Z · LW · GW

I'm currently fairly swayed that the alignment forum should be more "open" in the sense Peter intends:

Per my definition of closed, no academic discussion is closed, because anyone in theory can get a paper accepted to a journal/conference, attend the related meeting, and participate in the discourse. I am not actually talking about visibility to the broader public, but rather the access of any individual to the discourse, which feels more important to me.

However, I am not sure how to accomplish this. (Specifically, I am not sure how to accomplish this without too much added work, and maintaining other properties we want the forum to have.)

Comment by abramdemski on Discussion: Objective Robustness and Inner Alignment Terminology · 2021-07-20T19:27:34.039Z · LW · GW

If there were a "curated posts" system on the alignment forum, I would nominate this for curation. I think it's a great post.

Comment by abramdemski on My Current Take on Counterfactuals · 2021-07-20T14:38:26.507Z · LW · GW

All of which I really should have remembered, since it's all stuff I have known in the past, but I am a doofus. My apologies.

(But my error wasn't being too mired in EDT, or at least I don't think it was; I think EDT is wrong. My error was having the term "counterfactual" too strongly tied in my head to what you call linguistic counterfactuals. Plus not thinking clearly about any of the actual decision theory.)

I'm glad I pointed out the difference between linguistic and DT counterfactuals, then!

It still feels to me as if your proof-based agents are unrealistically narrow. Sure, they can incorporate whatever beliefs they have about the real world as axioms for their proofs -- but only if those axioms end up being consistent, which means having perfectly consistent beliefs. The beliefs may of course be probabilistic, but then that means that all those beliefs have to have perfectly consistent probabilities assigned to them. Do you really think it's plausible that an agent capable of doing real things in the real world can have perfectly consistent beliefs in this fashion?

I'm not at all suggesting that we use proof-based DT in this way. It's just a model. I claim that it's a pretty good model -- that we can often carry over results to other, more complex, decision theories.

However, if we wanted to, then yes, I think we could... I agree that if we add beliefs as axioms, the axioms have to be perfectly consistent. But if we use probabilistic beliefs, the probabilities themselves don't have to be perfectly consistent; only the axioms saying which probabilities we have do. So, for example, I could use a proof-based agent to approximate a logical-induction-based agent, by looking for proofs about what the market expectations are. This would be kind of convoluted, though.
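
As a quick sketch of what I mean (everything here, including the axiom strings and the `provable` oracle, is hypothetical and just for illustration): the axioms only assert which expectations the agent has, so the axiom set can be consistent even when the asserted expectations are not coherent with each other.

```python
# Hypothetical sketch: axioms that merely *report* the agent's expectations.
# The axiom set below is logically consistent even though the expectations
# it reports are probabilistically incoherent with one another.
BELIEF_AXIOMS = [
    "E[U() | A() = a1] = 0.7",
    "E[U() | A() = a2] = 0.4",
    "E[U() | A() = a1 or A() = a2] = 0.9",   # incoherent as probabilities
]

def choose(actions, provable):
    """Pick the action with the highest provable lower bound on expected
    utility, where `provable` is a caller-supplied bounded proof search
    over BELIEF_AXIOMS plus ordinary logic."""
    thresholds = [v / 10 for v in range(10, -1, -1)]   # 1.0, 0.9, ..., 0.0
    for v in thresholds:
        for a in actions:
            if provable(f"E[U() | A() = {a}] >= {v}"):
                return a
    return actions[0]
```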

Comment by abramdemski on My Current Take on Counterfactuals · 2021-07-17T15:47:57.952Z · LW · GW

It's obvious how ordinary conditionals are important for planning and acting (you design a bridge so that it won't fall down if someone drives a heavy lorry across it; you don't cross a bridge because you think the troll underneath will eat you if you cross), but counterfactuals? I mean, obviously you can put them in to a particular problem

All the various reasoning behind a decision could involve material conditionals, probabilistic conditionals, logical implication, linguistic conditionals (whatever those are), linguistic counterfactuals, decision-theoretic counterfactuals (if those are indeed different as I claim), etc etc etc. I'm not trying to make the broad claim that counterfactuals are somehow involved in all of that reasoning.

The claim is about the decision algorithm itself. The claim is that the way we choose an action is by evaluating a counterfactual ("what happens if I take this action?"). Or, to be a little more psychologically realistic, the cached values which determine which actions we take are estimated counterfactual values.

What is the content of this claim?

A decision procedure is going to have (cached-or-calculated) value estimates which it uses to make decisions. (At least, most decision procedures work that way.) So the content of the claim is about the nature of these values.

If the values act like Bayesian conditional expectations, then the claim that we need counterfactuals to make decisions is considered false. This is the claim of evidential decision theory (EDT).

If the values are still well-defined for known-false actions, then they're counterfactual. So, a fundamental reason why MIRI-type decision theory uses counterfactuals is to deal with the case of known-false actions.

However, academic decision theorists have used (causal) counterfactuals for completely different reasons (IE because they supposedly give better answers). This is the claim of causal decision theory (CDT).

My claim in the post, of course, is that the estimated values used to make decisions should match the EDT expected values almost all of the time, but, should not be responsive to the same kinds of reasoning which the EDT values are responsive to, so should not actually be evidential.

Could you give a couple of examples where counterfactuals are relevant to planning and acting without having been artificially inserted?

It sounds like you've kept a really strong assumption of EDT in your head; so strong that you couldn't even imagine why non-evidential reasoning might be part of an agent's decision procedure. My example is the troll bridge: conditional reasoning (whether proof-based or expectation-based) ends up not crossing the bridge, whereas counterfactual reasoning can cross (if we get the counterfactuals right).

The thing you call "proof-based decision theory" involves trying to prove things of the form "if I do X, I will get at least Y utility" but those look like ordinary conditionals rather than counterfactuals to me too.

Right. In the post, I argue that using proofs like this is more like a form of EDT rather than CDT, so, I'm more comfortable calling this "conditional reasoning" (lumping it in with probabilistic conditionals).

The Troll Bridge is supposed to show a flaw in this kind of reasoning, suggesting that we need counterfactual reasoning instead (at least, if "counterfactual" is broadly understood to be anything other than conditional reasoning -- a simplification which mostly makes sense in practice).

though this is pure prejudice and maybe there are better reasons for it than I can currently imagine: we want agents that can act in the actual world, about which one can generally prove precisely nothing of interest

Oh, yeah, proof-based agents can technically do anything which regular expectation-based agents can do. Just take the probabilistic model the expectation-based agents are using, and then have the proof-based agent take the action for which it can prove the highest expectation. This isn't totally sleight of hand; the proof-based agent will still display some interesting behavior if it is playing games with other proof-based agents, dealing with Omega, etc.

At any rate, right now "passing Troll Bridge" looks to me like a problem applicable only to a very specific kind of decision-making agent, one I don't see any particular reason to think has any prospect of ever being relevant to decision-making in the actual world -- but I am extremely aware that this may be purely a reflection of my own ignorance.

Even if proof-based decision theory didn't generalize to handle uncertain reasoning, the troll bridge would also apply to expectation-based reasoners if their expectations respect logic. So the supposedly narrow class of agents for whom it makes sense to ask "does this agent pass the troll bridge?" is basically all agents who use logic at all, not just agents who are restricted to pure logic with no probabilistic beliefs.

Comment by abramdemski on Decision Theory · 2021-07-16T23:21:54.952Z · LW · GW

Agreed. The asymmetry needs to come from the source code for the agent.

In the simple version I gave, the asymmetry comes from the fact that the agent checks for a proof that x>y before checking for a proof that y>x. If this were reversed, then as you said, the Löbian reasoning would make the agent take the 10, instead of the 5.

In a less simple version, this could be implicit in the proof search procedure. For example, the agent could wait for any proof of the conclusion x>y or y>x, and make a decision based on whichever happened first. Then there would not be an obvious asymmetry. Yet, the proof search has to go in some order. So the agent design will introduce an asymmetry in one direction or the other. And when building theorem provers, you're not usually thinking about what influence the proof order might have on which theorems are actually true; you usually think of the proofs as this static thing which you're searching through. So it would be easy to mistakenly use a theorem prover which just so happens to favor 5 over 10 in the proof search.
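
Here is my reconstruction of the kind of agent under discussion (schematic, not the exact snippet from the post; `provable` stands for a caller-supplied bounded proof search over axioms describing the agent's own source code and the environment, where taking 5 yields utility 5 and taking 10 yields utility 10):

```python
def A(provable):
    # The asymmetry: we search for a proof favoring 5 before searching
    # for a proof favoring 10.
    if provable("(A() = 5 -> U() = 5) and (A() = 10 -> U() = 0)"):
        return 5
    if provable("(A() = 10 -> U() = 10) and (A() = 5 -> U() = 0)"):
        return 10
    return 10   # fallback if neither proof search succeeds
```

The first sentence can pick up a spurious Löbian proof: if it were provable, the agent would return 5, which makes "A() = 10 -> U() = 0" vacuously true and "A() = 5 -> U() = 5" true outright; Löb's theorem then turns that implication into an actual proof, so the agent takes 5. Swap the order of the two checks and the same reasoning hands it the 10 instead.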

Comment by abramdemski on Decision Theory · 2021-07-16T23:13:14.187Z · LW · GW

While I agree that the algorithm might output 5, I don't share the intuition that it's something that wasn't 'supposed' to happen, so I'm not sure what problem it was meant to demonstrate.

OK, this makes sense to me. Instead of your (A) and (B), I would offer the following two useful interpretations:

1: From a design perspective, the algorithm chooses 5 when 10 is better. I'm not saying it has "computed argmax incorrectly" (as in your A); an agent design isn't supposed to compute argmax (argmax would be insufficient to solve this problem, because we're not given the problem in the format of a function from our actions to scores), but it is supposed to "do well". The usefulness of the argument rests on the weight of "someone might code an agent like this by accident, if they're not familiar with spurious proofs". Indeed, that's the origin of this code snippet -- something like this was seriously proposed at some point.

2: From a descriptive perspective, the code snippet is not a very good description of how humans would reason about a situation like this (for all the same reasons).

When I try to examine my own reasoning, I find that when I do so, I'm just selectively blind to certain details and so don't notice any problems. For example: suppose the environment calculates "U=10 if action = A; U=0 if action = B" and I, being a utility maximizer, am deciding between actions A and B. Then I might imagine something like "I chose A and got 10 utils", and "I chose B and got 0 utils" - ergo, I should choose A. 

Right, this makes sense to me, and is an intuition which I think many people share. The problem, then, is to formalize how to be "selectively blind" in an appropriate way such that you reliably get good results.

Comment by abramdemski on Decision Theory · 2021-07-16T23:00:07.363Z · LW · GW

Yep, agreed. I used the language "false antecedents" mainly because I was copying the language in the comment I replied to, but I really had in mind "demonstrably false antecedents".

Comment by abramdemski on Escaping the Löbian Obstacle · 2021-07-16T22:56:50.970Z · LW · GW

I like the alief/belief distinction, this seems to carry the distinction I was after. To make it more formal, I'll use "belief" to refer to 'things which an agent can prove in its reasoning engine/language (L)', and "alief" to refer to beliefs plus 'additional assumptions which the agent makes about the bearing of that reasoning on the environment', which together constitute a larger logic (L'). Does that match the distinction you intended between these terms?

Unfortunately, this seems almost opposite to the way I was defining the terminology. I had it that the aliefs are precisely what is proven by the formal system, and the beliefs are what the agent would explicitly endorse if asked. Aliefs are what you feel in your bones. So if the "bones" of the agent are the formal system, that's the aliefs.

Note also that your definition implies that if an agent alieves something, it must also believe it. In contrast, part of the point for me is that an agent can alieve things without believing them. I would also allow the opposite, for humans and other probabilistic reasoners, though for pure-logic agents this would have to correspond to unsoundness. But pure-logical agents have to have the freedom to alieve without believing, on pain of inconsistency, even if we can't model belief-without-alief in pure logic.

I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine! I think there's probably a deep reason for that (having to do with how difficult it is to reliably distinguish alief/belief), but I'm not grasping it for now. It is a symptom of my confusion in this regard that I'm not even sure we're pointing to different notions of belief/alief even though your definition sounds almost opposite to me. It is well within the realm of possibility that we mean the same thing, and are just choosing very different ways to talk about it.

Specifically, your definition seems fine if L is not the formal language which the agent is hard-wired with, but rather, some logic which the agent explicitly endorses (like the relationship that we have with Peano Arithmetic). Then, yeah, "belief" is about provability in L, while "alief" implies that the agent has some "additional assumptions about the bearing of that reasoning on the environment". Totally! But then, this suggests that those additional assumptions are somehow represented in some other subsystem of the agent (outside of L). The logic of that other subsystem is what I'm interested in. If that other subsystem uses L', then it makes sense that the agent explicitly believes L. But now the aliefs of the agent are represented by L'. That is: L' is the logic within the agent's bones. So it's L' that I'm talking about when I define "aliefs" as the axiomatic system, and "beliefs" as more complicated (opposite to your definition).

Over this discussion, a possible interpretation of what you're saying that's been in the back of my mind has been that you think agents should not rely on formal logic in their bones, but rather, should use formal logic only as part of their explicit thinking IE a form of thinking which other, "more basic" reasoning systems use as a tool (a tool which they can choose to endorse or not). For example, you might believe that an RL system decides whether to employ logical reasoning. Or deep learning. Etc. In this case, you might say that there is no L' to find (no logical system represents the aliefs).

To be clear, I do think this is a kind of legitimate response to the Lobstacle: the response of rejecting logic (as a tool for understanding what's going on in an agent's "bones" IE their basic decision architecture). This approach says: "Look what happens when you try to make the agent overly reliant on logic! Don't do that!"

However, the Lobstacle is specifically about using logic to describe or design decision-making procedures. So, this 'solution' will not be very satisfying for people trying to do that. The puzzling nature of the Lobstacle remains: the claim is that RL (or something) has to basically solve the problem; we can't use logic. But why is this? Is it because agents have to "be irrational" at some level (ie, their basic systems can't conform to the normative content of logic)?

Anyway, this may or may not resemble your view. You haven't explicitly come out and said anything like that, although it does seem like you think there should be a level beyond logic in some sense.

An immediate pedagogical problem with this terminology is that we have to be careful not to conflate this notion of belief with the usual one: an agent will still be able to prove things in L even if it doesn't believe (in the conventional sense) that the L-proof involved is valid.

It seems like you think this property is so important that it's almost definitional, and so, a notion of belief which doesn't satisfy it is in conflict with the usual notion of belief. I just don't have this intuition. My notion of belief-in-contrast-to-alief does contrast with the informal notion, but I would emphasize other differences:

  1. "belief" contrasts with the intuitive notion of belief in that it much less implies that you'll act on a belief; our intuitive notion more often predicts that you'll act on something if you really believe it.
  2. "alief" contrasts with the intuitive notion of belief in that it much less implies that you'll explicitly endorse something, or even be able to articulate it in language at all.

In other words, I see the intuitive notion of belief as really consisting of two parts, belief and alief, which are intuitively assumed to go together but which we are splitting apart here.

The property you mention is derived from this conflation, because in fact we need to alieve a reasoning process in order to believe its outputs; so if we're conflating alief and belief, then it seems like we need to believe that L-proofs are valid in order to see L-proofs and (as a direct result) come to believe what's been proved.

But this is precisely what's at stake, in making the belief/alief distinction: we want to point out that this isn't a healthy relationship to have with logic (indeed, Godel shows how it leads to inconsistency).

There is a more serious formalization issue at play, though, which is the problem of expressing a negative alief. How does one express that an agent "does not alieve that a proof of X in L implies that X is true"? ¬(□_L X → X) is classically equivalent to □_L X ∧ ¬X, which in particular is an assertion of both the existence of a proof of X and the falsehood of X, which is clearly far stronger than the intended claim. This is going off on a tangent, so for now I will just assume that it is possible to express disaliefs by introducing some extra operators in L' and get on with it.

I like that you pointed this out, but yeah, it doesn't seem especially important to our discussion. In any case, I would define disalief as something like this:

  • The agent's mental architecture lacks a deduction rule accepting proofs in L, or anything tantamount to such a rule.

Note that the agent might also disbelieve in such a rule, IE, might expect some such proofs to have false conclusions. But this is, of course, not necessary. In particular it would not be true in the case of logics which the agent explicitly endorses (and therefore must not alieve).

Yes. The mathematical proofs and Löb's theorem are absolutely fine. What I'm refuting is their relevance; specifically the validity of this claim:

An agent can only trust another agent if it believes that agent's aliefs.

My position is that *when their beliefs are sound* an agent only ever needs to *alieve* another agent's *beliefs* in order to trust them.

Hrm. I suspect language has failed us, and we need to make some more distinctions (but I am not sure what they are).

In my terms, if A alieves B's beliefs, then (for example) if B explicitly endorses ZFC, then A must be using ZFC "in its bones". But if B explicitly endorses ZFC, then the logic which B is using must be more powerful than that. So A might be logically weaker than B (if A only alieves ZFC, and not the stronger system which B used to decide that ZFC is sound). If so, A cannot trust B (A does not alieve that B's reasoning is sound, that is, A does not believe B).

I have to confess that I'm confusing myself a bit, and am tempted to give yet another (different, incompatible) definition for the alief/belief split. I'll hold off for now, but I hope it's very clear that I'm not attributing all of the confusion here to you -- I'm aiming to minimize confusion, but I still worry that I've introduced contradictory ideas in this conversation. (I'm actually tempted to start distinguishing 3 levels rather than 2! Alief/belief are relative terms, and I worry we're actually conflating important subtleties by using only 2 terms rather than having multiple levels...)

A definition of trust which fails to be reflexive is clearly a bad definition, and with this modified definition there is no obstacle

This goes back to the idea that you seem to think "belief in X implies belief in the processes whereby you came to believe in X" is so important as to be definitional, where I think this property has to be a by-product of other things.

In my understanding, the definition of "trust" should not explicitly allow or disallow this, if we're going to be true to what "trust" means. Rather, for the Lobstacle, "A trusts B" has to be defined as "A willingly relies on B to perform mission-critical tasks". This definition does indeed fail to be true for naive logical agents. But this should be an argument against naive logical agents, not our notion of trust.

Hence my perception that you do indeed have to question the theorems themselves, in order to dispute their "relevance" to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (IE being able to delegate critical tasks to another agent).

Note that following the construction in the article, the secondary agent B can only act on the basis of a valid L-proof, so there is no need to distinguish between trusting what B says (the L-proofs B produces) and what B does (their subsequent action upon producing an L-proof).

Ok, but if agent B can only act on valid L-proofs, it seems like agent B has been given a frontal lobotomy (IE this is just the "make sure my future self is dumber" style of solution to the problem).

Or, on the other hand, if the agent A also respects this same restriction, then A cannot delegate tasks to B (because A can't prove that it's OK to do so, at least not in L, the logic which it has been restricted to use when it comes to deciding how to act).

Which puts us back in the same lobstacle.

Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.

Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you're admitting that your approach has this problem, but, you think there's value in what you've done so far because you've solved one part of the problem and you think this other part could also work with time. Is that what you're intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don't see what the value in your proposal could be.

For me, solving the Lobstacle means being able to actually decide to delegate.

I was taking "reasoning" here to mean "applying the logic L" (so manipulating statements of belief), since any assumptions lying strictly in L' are only applied passively. It feels strange to me to extend "reasoning" to include this implicit stuff, even if we are including it in our formal model of the agent's behaviour.

I think I get what you're saying here. But if the assumptions lying strictly in L' are only applied passively, how does it help us? I'm thinking your current answer is (as you've already said) "trusting that B will do what they've said they'll do is a separate problem" -- IE you aren't even trying to build the full bridge between B thinking something is a good idea and A trusting B with such tasks.


Both A and B "reason" in L' (B could even work in a distinct extension of L), but will only accept proofs in the fragment L.

[...]

But then, your bolded statement seems to just be a re-statement of the Löbstacle:  logical agents can't explicitly endorse their own logic L' which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.

It's certainly a restatement of Löb's theorem. My assertion is that there is no resultant obstacle.

I still really, really don't get why your language is stuff like "there's no resultant obstacle" and "what I'm refuting is their relevance". Your implication is "there was not a problem to begin with" rather than "I have solved the problem". I asked whether you objected to details of the math in the original paper, and you said no -- so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the lobstacle!). Solving the lobstacle would presumably involve providing an improved agent design which would avoid the problem. Yet this seems like it's not what you want to do -- instead you claim something else, along the lines of claiming that there's not actually a problem?

So, basically, what do you claim to accomplish? I suspect I'm still really misunderstanding that part of it.

Re the rest,

(And I also don't yet see what that part has to do with getting around the Löbstacle.)

It's not relevant to getting around the Löbstacle; this part of the discussion was the result of me proposing a possible advantage of the perspective shift which (I believe, but have yet to fully convince you) resolves the Löbstacle. I agree that this part is distracting, but it's also interesting, so please direct message me (via whatever means is available on LW, or by finding me on the MIRIx Discord server or AI alignment Slack) if you have time to discuss it some more.

Since you're now identifying it as another part of the perspective shift which you're trying to communicate (rather than just some technical distraction), it sounds like it might actually be pretty helpful toward me understanding what you're trying to get at. But, there are already a lot of points floating around in this conversation, so maybe I'll let it drop.

I'm somewhat curious if you think you've communicated your perspective shift to any other person; so far, I'm like "there just doesn't seem to be anything real here", but maybe there are other people who get what you're trying to say?

Comment by abramdemski on My Current Take on Counterfactuals · 2021-07-16T19:43:11.369Z · LW · GW

Yeah, interesting. I don't share your intuition that nested counterfactuals seem funny. The example you give doesn't seem ill-defined due to the nesting of counterfactuals. Rather, the antecedent doesn't seem very related to the consequent, which generally has a tendency to make counterfactuals ambiguous. If you ask "if calcium were always ionic, would Nixon have been elected president?" then I'm torn between three responses:

  1. "No" because if we change chemistry, everything changes.
  2. "Yes" because counterfactuals keep everything the same as much as possible, except what has to change; maybe we're imagining a world where history is largely the same, but some specific biochemistry is different.
  3. "I don't know" because I am not sure what connection between the two you are trying to point at with the question, so, I don't know how to answer.

In the case of your Bach example, I'm similarly torn. On the one hand, if we imagine some weird connection between the ages of Bach and Mozart, we might have to change a lot of things. On the other hand, counterfactuals usually try to keep things fixed if there's not a reason to change them. So the intention of the question seems pretty unclear.

Which, in my mind, has little to do with the specific nested form of your question.

More importantly, perhaps, I think Stalnaker and other philosophers can be said to be investigating linguistic counterfactuals; their chief concern is formalizing the way humans naively talk about things, in a way which gives more clarity but doesn't lose something important. 

My chief concern is decision-theoretic counterfactuals, which are specifically being used to plan/act. This imposes different requirements.

The philosophy of linguistic counterfactuals is complex, of course, but personally I really feel that I understand fairly well what linguistic counterfactuals are and how they work. My picture probably requires a little exposition to be comprehensible, but to state it as simply as I can, I think linguistic counterfactuals can always be understood as "conditional probabilities, but using some reference frame rather than actual beliefs". For example, very often we can understand counterfactuals as conditional probabilities from a past belief state. "If it had rained, we would not have come" can't be understood as a conditional probability in our current beliefs, in which we know we did come; but back up time a little bit, and it's true that if it had been raining, we would not have made the trip.
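To make the "reference frame" reading concrete, here is a minimal Python sketch (with made-up numbers) of evaluating "if it had rained, we would not have come" as a conditional probability in the past, pre-trip belief state rather than in current beliefs:

```python
# A toy illustration of reading a counterfactual as a conditional probability
# taken in a *past* belief state. The joint probabilities are made up.

# Past (pre-trip) joint belief state over (weather, whether we make the trip):
prior = {
    ("rain", "trip"): 0.05,
    ("rain", "no_trip"): 0.25,
    ("no_rain", "trip"): 0.60,
    ("no_rain", "no_trip"): 0.10,
}

def cond(prob, event, given):
    """P(event | given) for a joint distribution represented as a dict."""
    num = sum(p for k, p in prob.items() if event(k) and given(k))
    den = sum(p for k, p in prob.items() if given(k))
    return num / den

# Current beliefs condition on what actually happened (no rain, we made the trip),
# so conditioning them on "rain" is useless. Evaluated in the past belief state:
p = cond(prior,
         event=lambda k: k[1] == "no_trip",
         given=lambda k: k[0] == "rain")
print(f"If it had rained, we would not have come: P = {p:.2f}")  # about 0.83
```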

Backing up time doesn't always quite work. In those cases we can usually understand things in terms of a hypothetical "objective judge" who doesn't know details of a situation but who knows things a "reasonable third party" would know. It makes sense that humans would have to consider this detached perspective a lot, in order to judge social situations; so it makes sense that we would have language for talking about it (IE counterfactual language).

We can make sense of nested linguistic counterfactuals in that way, too, if we wish. For example, "if driving had [counterfactually] meant not making it to the party, then we wouldn't have done it" says (on my understanding) that if a reasonable third person would have looked at the situation and said that if we drive we won't make it to the party, then we would not have driven. (This in turn says that my past self would have not driven if he had believed that a reasonable third person wouldn't believe that we would make it to the party, given the information that we're driving.)

So, I think linguistic counterfactuals implicitly require a description of a third party / past self to be evaluated; this is usually obvious enough from conversation, but, can be an ambiguity.

However, I don't think this analysis helps with decision-theoretic counterfactuals. At least, not directly.

Comment by abramdemski on The Alignment Forum should have more transparent membership standards · 2021-07-16T19:15:59.343Z · LW · GW

After reading more of the discussion on this post, I think my reply here was conflating different notions of "open" in a way which has been pointed out already in comments.

Comment by abramdemski on The Alignment Forum should have more transparent membership standards · 2021-07-16T18:33:37.874Z · LW · GW

As someone who had some involvement in the design of the alignment forum, and the systems you are describing, this post has given me a lot to think about. I think it is clear that the current system is not working. However, one point you made seems rather opposed to my picture, and worth discussing more:

  1. The AF being closed to the public is bad for the quality of AF discourse.

The reason, in my mind, for the current system is because it's difficult to know who is and is not suitable for membership. Therefore, we have a very opaque and unpredictable membership standard, "by necessity". In an attempt to compensate, we try to make membership matter as little as possible, IE we hope non-members will feel comfortable posting on LessWrong and then have their posts get promoted. Obviously this is not working very well at all, but that was my hope: (A) keep membership standards very high and therefore opaque, (B) make membership matter as little as possible.

Why do I think standards should be very high?

My model is that if the alignment forum were open to the public, there would be a lot more very low-quality posts and (especially) comments. I could be wrong here -- standards of comments on lesswrong are pretty good -- but I worry about what happens if it's fine for a while and then (perhaps gradually) there starts to be a need to close access down more to maintain quality. Opening things up seems a lot easier than closing them down again, because at that point there would be a need to make tough decisions.

So I'm left wondering what to do. Can we keep standards high, while also making them transparent? I don't know. But it does seem like something needs to change.

Comment by abramdemski on Decision Theory · 2021-07-14T16:57:28.740Z · LW · GW

Hmm. I'm not following. It seems like you follow the chain of reasoning and agree with the conclusion:

The algorithm doesn't try to select an assignment with the largest U, but rather just outputs 5 if there's a valid assignment with U()=5, and 10 otherwise. Only the assignment A()=5, U()=5 fulfills the condition, so it outputs 5.

This is exactly the point: it outputs 5. That's bad! But the agent as written will look perfectly reasonable to anyone who has not thought about the spurious proof problem. So, we want general tools to avoid this kind of thing. For the case of proof-based agents, we have a pretty good tool, namely MUDT (the strategy of looking for the highest-utility such proof rather than any such proof). (However, this falls prey to the Troll Bridge problem, which looks pretty bad.)
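To illustrate the contrast, here is a toy sketch in Python. A mock proof enumerator stands in for a real theorem prover, and the conditionals it yields are assumptions made up for illustration; the point is just the difference between acting on whatever proof turns up first and acting on the highest proved utility (the MUDT-style fix):

```python
from typing import Iterator, Tuple

def mock_proof_search() -> Iterator[Tuple[int, int]]:
    """Yields (action, utility) pairs for which a proof of
    "A()=action -> U()=utility" has been found, in search order.
    The order is arbitrary -- which is exactly the problem."""
    yield (5, 5)    # a spurious-but-provable conditional found first
    yield (10, 10)  # the conditional we actually wanted the agent to act on

def act_on_first_proof() -> int:
    """Naive design: act on whichever conditional is proved first."""
    for action, _utility in mock_proof_search():
        return action
    return 10  # default action if nothing is proved

def act_on_best_proof() -> int:
    """MUDT-style design: act on the proved conditional with the highest utility."""
    action, _utility = max(mock_proof_search(), key=lambda pair: pair[1])
    return action

if __name__ == "__main__":
    print(act_on_first_proof())  # 5  -- sensitive to the proof-search order
    print(act_on_best_proof())   # 10 -- insensitive to the order
```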

Conditionals with false antecedents seem nonsensical from the perspective of natural language, but why is this a problem for the formal agent?

More generally, the problem is that for formal agents, false antecedents cause nonsensical reasoning. EG, for the material conditional (the usual logical version of conditionals), everything is true when reasoning from a false antecedent. For Bayesian conditionals (the usual probabilistic version of conditionals), probability zero events don't even have conditionals (so you aren't allowed to ask what follows from them).

Yet, we reason informally from false antecedents all the time, EG thinking about what would happen if we had chosen differently than we actually did.

So, false antecedents cause greater problems for formal agents than for natural language.
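To display the two failure modes just mentioned (standard facts, shown only for concreteness):

```latex
% Material conditional: a refuted antecedent licenses any consequent.
\[
  \neg A \;\vdash\; A \rightarrow B \qquad \text{for arbitrary } B.
\]
% Bayesian conditional: undefined when the antecedent has probability zero.
\[
  P(B \mid A) \;=\; \frac{P(A \wedge B)}{P(A)}, \qquad \text{undefined when } P(A) = 0.
\]
```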

For this particular problem, you could get rid of assignments with nonsensical values by also considering an algorithm with reversed outputs and then taking the intersection of valid assignments, since only the assignment A()=10, U()=10 satisfies both algorithms.

The problem is also "solved" if the agent thinks only about the environment, ignoring its knowledge about its own source code. So if the agent can form an agent-environment boundary (a "cartesian boundary") then the problem is already solved, no need to try reversed outputs.

The point here is to do decision theory without such a boundary. The agent just approaches problems with all of its knowledge, not differentiating between "itself" and "the environment".

Comment by abramdemski on My Current Take on Counterfactuals · 2021-07-07T17:20:11.588Z · LW · GW

Ah, I wasn't strongly differentiating between the two, and was actually leaning toward your proposal in my mind. The reason I was not differentiating between the two was that the probability of C(A|B) behaves a lot like the probabilistic value of Prc(A|B). I wasn't thinking of nearby-world semantics or anything like that (and would contrast my proposal with such a proposal), so I'm not sure whether the C(A|B) notation carries any important baggage beyond that. However, I admit it could be an important distinction; C(A|B) is itself a proposition, which can feature in larger compound sentences, whereas Prc(A|B) is not itself a proposition and cannot feature in larger compound sentences. I believe this is the real crux of your question; IE, I believe there aren't any other important consequences of the choice, besides whether we can build larger compound expressions out of our counterfactuals.

Part of why I was not strongly differentiating the two was because I was fond of Stalnaker's Thesis, according to which P(A|B) can itself be regarded as the probability of some proposition, namely a nonstandard notion of implication (IE, not material conditional, but rather 'indicative conditional'). If this were the case, then we could safely pun between P(A->B) and P(B|A), where "->" is the nonstandard implication. Thus, I analogously would like for P(C(A|B)) to equal Prc(A|B). HOWEVER, Stalnaker's thesis is dead in philosophy, for the very good reason that it seemingly supports the chain of reasoning Pr(B|A) = Pr(A->B) = Pr(A->B|B)Pr(B) + Pr(A->B|~B)Pr(~B) = Pr(B|A&B)Pr(B) + Pr(B|A&~B)Pr(~B) = Pr(B). Some attempts to block this chain of reasoning (by rejecting bayes) have been made, but, it seems pretty damning overall.
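For readability, here is that chain displayed one step at a time (with the usual background assumption that the conditioning events have positive probability, and writing A → B for the indicative conditional):

```latex
\begin{align*}
  P(A \rightarrow B) &= P(A \rightarrow B \mid B)\,P(B) + P(A \rightarrow B \mid \neg B)\,P(\neg B)\\
                     &= P(B \mid A \wedge B)\,P(B) + P(B \mid A \wedge \neg B)\,P(\neg B)\\
                     &= 1 \cdot P(B) + 0 \cdot P(\neg B)\\
                     &= P(B).
\end{align*}
% Combined with Stalnaker's Thesis P(A -> B) = P(B | A), this yields
% P(B | A) = P(B) for all A and B: universal independence, which is absurd.
```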

So, similarly, my idea that P(C(A|B))=Prc(A|B) is possibly deranged, too.

Comment by abramdemski on Escaping the Löbian Obstacle · 2021-07-07T16:44:02.555Z · LW · GW

The Löbian obstacle is about trust in the reasoning performed by a subordinate agent; the promise of subsequent actions taken based on that reasoning are just a pretext for considering that formal problem. So if A constructs B to produce proofs in L, it doesn't matter what B's beliefs are, or even if B has any beliefs; B could just be manipulating these as formal expressions. If you insist that B's beliefs be incorporated into its reasoning, as you seem to want to (more on that below), then I'm saying it doesn't matter what extension of L the agent B is equipped with; it can even be an inconsistent extension, as long as only valid L-proofs are produced.

I'm not insisting on anything about the agent design; I simply remain puzzled how your proposal differs from the one in the paper, and so, continue to assume it's like the one in the paper except those differences which I understand. Your statements have been reading almost like you think you've refuted formal arguments in the paper (implying you don't think the formal agent designs do what I think they do), rather than providing a new agent design which does better. This contributes to my feeling that you are really just wrestling with Lob's theorem itself. Nor do I recall you saying much to dissuade me of this. In your OP, you use the language "Under the belief that the Löbian Obstacle was a genuine problem to be avoided", as if the original idea was simply mistaken.

So, to be absolutely clear on this: do you accept the mathematical proofs in the original paper (but propose some way around them), or do you think they are actually mistaken? Do you accept the proof of Lob's theorem itself, or do you think it is mistaken?

So if A constructs B to produce proofs in L, it doesn't matter what B's beliefs are, or even if B has any beliefs;

But using proofs in L is precisely what I understood from "belief"; so, if you have something else in mind, I will need clarification.

The word "normative" sticks out to me as potential common ground here, so I'll use that language. The specified semantic map determines what is "actually" true, but its content is not a priori knowledge. As such, the only way for A's reasoning in L to have any practical value is if A works under the (normative) assumption that provability implies/will imply truth under S.

If this sounds farfetched, consider how police dramas employ the normative nature of truth for dramatic effect on a regular basis. A rogue detective uses principles of reasoning that their superiors deem invalid, so that the latter do not expect the detective's deductions to yield results in reality... Or perhaps the practice of constructive mathematics would be a more comfortable example, where the axiom of choice is rejected in favour of weaker deductive principles. A dedicated constructivist could reasonably make the claim that their chosen fragment of logic determines what "ought" to be true.

I think I haven't understood much of this.

As such, the only way for A's reasoning in L to have any practical value is if A works under the (normative) assumption that provability implies/will imply truth under S.

I would argue that the only way for A's reasoning in L to have practical value is for A to act on such reasoning. The only way for A's reasoning to have practical value for some other agent B is for B to be working under the assumption that provability-in-L implies truth under S. Indeed, this seems in line with your detective drama example. If the superiors were put in the position of the rogue detective, often they would have come to the same conclusion. Yet, looking at the problem from a removed position, they cannot accept the conclusion. But this is precisely the lobstacle: the condition for reasoning to have value to an agent A is different from the condition for A's reasoning to have value to another agent B.

Let me therefore make a (not totally formal) distinction based on the alief/belief idea:

  • The aliefs of a logic-based agent are the axioms/theorems which inform action;
  • The beliefs of a logic-based agent are the axioms/theorems which the agent explicitly endorses.

This is also a version of the distinction between tacit knowledge (eg knowing how to ride a bike) vs explicit knowledge (eg being able to explain scientifically how bike-riding works). To say that an agent "alieves" a logic L is to say that it tacitly knows how to reason in L.

Roughly, I think you have been talking about beliefs, where I have been talking about aliefs.

Now, a "naive logical agent" (ie the simple agent design which falls prey to the Lobstacle, in the paper) has a few important properties:

  • Goedel's incompleteness theorems tell us that aliefs cannot equal beliefs; a logic never explicitly endorses itself, but rather, only explicitly endorses weaker logics.
  • An agent constructed based on the logic L alieves L, but does not believe L (this is a re-statement of the lobstacle).
  • An agent can only trust another agent if it believes that agent's aliefs.
  • An agent therefore cannot believe its own aliefs. (This is another restatement of the lobstacle.)

These statements are not necessarily true of logical agents in general, of course, since the paper gives some counterexamples (agents who can trust the same logic which they base their actions on). But those counterexamples all have some oddities. I presume your goal is to get rid of those oddities while still avoiding the lobstacle.

You seem exasperated that I'm not incorporating the meta-logical beliefs into the formal system, but this is because in practice meta-logical beliefs are entirely implicit, and do not appear in the reasoning system used by agents.

The thing I'm exasperated by is, rather, that I don't understand what you are doing with the meta-logical beliefs if you aren't including them in the formal system. It's fine if you don't include them in the formal system; just tell me what you are doing instead! This was the point of my "I have the semantic map here in my back pocket" argument: fine, I'm willing to grant that the agent 'has' these meta-logical beliefs, I just don't yet understand how you propose the agent use them.

I mean, I'm particularly exasperated because you explicitly state that the meta-logical beliefs don't change the agent's reasoning at all; it seems like you're boldly asserting that they are just a talisman which the agent keeps in its back pocket, and that this somehow solves everything.

in practice meta-logical beliefs are entirely implicit, and do not appear in the reasoning system used by agents.

In the new terminology I am proposing, I think perhaps what you are saying is that the meta-logical stuff is aliefs and thus not apparent in the beliefs of the system. What do you think of that interpretation?

(Obviously, I don't think your proposal solves the Lobstacle in that case, but, that is a separate question.)

If I build a chess-playing robot, its program will not explicitly include the assumption that the rules of chess it carries are the correct ones,

Right. To me this overtly reads as "a chess playing robot only alieves the rules of chess. It has no explicit endorsement of those rules." Which makes perfect sense to me.

, nor will it use such assumptions in determining its moves, because such determinations are made without any reference to a semantic map.

But then, this part of the sentence seems to go completely off the rails for me; a chess-playing robot will be no good at all if the rules of chess don't bear on its moves. Can you clarify?

This is why I said the agent doesn't change its reasoning. 

I still don't understand this part at all; could you go into more depth?

The metalogical beliefs thus are only relevant when an agent applies its (completed) reasoning to its environment via the semantic map.

This helps clarify a bit; it sounds like maybe we were conflating different notions of "reasoning". IE, I interpreted you as saying the semantic map does not enter into the agent's algorithm anywhere, not in the logical deduction system nor in the reasoning which the agent uses to decide on actions based on those logical deductions; whereas you were saying the semantic map does not enter into the agent's logical deductions, but rather, has to do with how those deductions are used to make decisions.

But I am still confused by your proposal:

We can formalize that situation via the extended logic L' if you like, in which case this is probably the answer that you keep demanding:

(It seems worth clarifying that I'm only demanding that the meta-logical reasoning be formalized for us, ie that you give a formal statement of the agent design which I can understand, not formalized for the agents, ie put into their formal system)

Both A and B "reason" in L' (B could even work in a distinct extension of L), but will only accept proofs in the fragment L. Since the axioms in L' extending L are identifiable by their reference to the semantic mapping, there is no risk of "belief contamination" in proofs, if that were ever a concern.

Löb's theorem says that the conclusions X where L can believe its own reasoning to be sound are precisely the cases where L can prove X; that is, L ⊢ □X → X exactly when L ⊢ X. So normally, L' can accept proofs in L' just fine, if they are concretely given. The lobstacle has to do with a logic trusting itself in the abstract, that is, trusting proofs before you see them. (Hence, the lobstacle doesn't block an agent from trusting its future self if it can concretely simulate what its future self can do, because there will be no specific reasoning step which it will reject; what the lobstacle blocks is for an agent to trust its future self with unforeseen chains of reasoning.)

So, one interpretation: you propose to modify this situation, by restricting L' to only accept proofs given in L.

This interpretation seems to conform to my "back pocket" argument. Both A and B "reason" in L', but really, only accept L. (This explains why you said '"reason"' in scare quotes.) But, I'm left wondering where/how L' matters here, and how this proposal differs from just building naive logical agents who use L (and fall prey to the lobstacle).

Thus, I'm probably better off interpreting you as referring to "accepting proofs" in the abstract, rather than concretely given proofs. With this interpretation, it seems you mean to say that A and B alieve L' (that is, they use L' internally), but only explicitly endorse proofs in L (that is, they can only accept principles of reasoning which warrant L-proofs, not L'-proofs, when they abstractly consider which proofs are normative).

But then, your bolded statement seems to just be a re-statement of the Lobstacle:  logical agents can't explicitly endorse their own logic L' which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.

Without an intended semantics, the probabilities assigned to formulas can only be interpreted as beliefs/levels of certainty about provability of statements from axioms (which it also believes with more or less certainty). This is great, but as soon as you want your logical inductor to reason about a particular mathematical object, the only way to turn those beliefs about provability into beliefs about truth in the model is to extend the inductor (explicitly or implicitly) with meta-logical beliefs about the soundness of that map, since it will update its beliefs based on provability even if its proof methods aren't sound.

This part doesn't seem totally relevant any more, but to respond: sure, but the standard model is an infinite object, so we don't really know how to extend the inductor to (implicitly or explicitly) know the intended semantic map. (Although this is a subject of some interest for me.) It's plausible that there's no way. But then, it also becomes plausible that humans don't do this (ie don't actually possess a full semantic map for Peano Arithmetic).

Instead, I've come to think about us (like logical inductors) as being uncertain about the model (ie uncertain whether numbers with specific properties exist, rather than possessing any tool which would fully differentiate between standard and non-standard numbers).

As such, I feel you've misunderstood me here. I don't want semantics-independent reasoning at all, if anything the opposite: reasoning that prioritises verifying soundness of a logic wrt specified semantic maps in an empirical way, and which can adapt its reasoning system when soundness is shown to fail.

Your critique of logical induction in the post was "fixed semantics are inflexible". But now it seems like you are the one who proposes to have a fixed semantics (carrying a specific semantic map, eg, the standard model), and modifying the logic when that semantic map shows the logic to be unsound. In contrast, logical induction is the one who proposes to have no fixed semantic map (instead being uncertain about the model), and modifying those beliefs based on what the logic proves.

My critique of this is only that I don't know how to "have the standard model". In general, you propose that the agent has a semantic map which can be used to check the soundness of the logic. But I am skeptical: a semantic map will usually refer to things we cannot check (EG the external world, or uncomputable facts about numbers). Don't get me wrong: I would love to be able to check those things more properly. It's just that I don't see a way (beyond what is already possible with logical reasoning, or more haphazardly but more powerfully, with the probabilistic extension of logical reasoning that logical induction gives us).

(And I also don't yet see what that part has to do with getting around the Lobstacle.)

Comment by abramdemski on An Intuitive Guide to Garrabrant Induction · 2021-06-23T17:29:53.677Z · LW · GW

I should! But I've got a lot of things to write up!

It also needs a better name, as there have been several things termed "weak logical induction" over time.

Comment by abramdemski on The Credit Assignment Problem · 2021-06-23T17:24:40.669Z · LW · GW
  • In between … well … in between, we're navigating treacherous waters …

Right, I basically agree with this picture. I might revise it a little:

  • Early, the AGI is too dumb to hack its epistemics (provided we don't give it easy ways to do so!).
  • In the middle, there's a danger zone.
  • When the AGI is pretty smart, it sees why one should be cautious about such things, and it also sees why any modifications should probably be in pursuit of truthfulness (because true beliefs are a convergent instrumental goal) as opposed to other reasons.
  • When the AGI is really smart, it might see better ways of organizing itself (eg, specific ways to hack epistemics which really are for the best even though they insert false beliefs), but we're OK with that, because it's really freaking smart and it knows to be cautious and it still thinks this.

So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system.

One important point here is that the epistemic system probably knows what the instrumental system is up to. If so, this gives us an important lever. For example, in theory, a logical inductor can't be reliably fooled by an instrumental reasoner who uses it (so long as the hardware, including the input channels, doesn't get corrupted), because it would know about the plans and compensate for them.

So if we could get a strong guarantee that the epistemic system knows what the instrumental system is up to (like "the instrumental system is transparent to the epistemic system"), this would be helpful.

Comment by abramdemski on I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the "utility function" abstraction · 2021-06-23T17:08:16.406Z · LW · GW

Hmm, this sentence feels to me like a type error. It doesn't seem like the way we reason about agents should depend on the fundamental laws of physics. If agents think in terms of states, then our model of agent goals should involve states regardless of whether that maps onto physics. (Another way of saying this is that agents are at a much higher level of abstraction than relativity.)

True, but states aren't at a much higher level of abstraction than relativity... states are a way to organize a world-model, and a world-model is a way of understanding the world.

From a normative perspective, relativity suggests that there's ultimately going to be something wrong with designing agents to think in states; states make specific assumptions about time which turn out to be restrictive.

From a descriptive perspective, relativity suggests that agents won't convergently think in states, because doing so doesn't reflect the world perfectly.

The way we think about agents shouldn't depend on how we think about physics, but it accidentally did, in that we accidentally baked linear time into some agent designs. So the reason relativity is able to say something about agent design, here, is because it points out that some agent designs are needlessly restrictive, and rational agents can take more general forms (and probably should).

This is not an argument against an agent carrying internal state, just an argument against using POMDPs to model everything.

Also, it's pedantic; if you give me an agent model in the POMDP framework, there are probably more interesting things to talk about than whether it should be in the POMDP framework. But I would complain if POMDPs were a central assumption needed to prove a significant claim about rational agents, or something like that. (To give an extreme example, if someone used POMDP-agents to argue against the rationality of assenting to relativity.)

Hmm, you mean that reward is taken as observable? Yeah, this does seem like a significant drawback of talking about rewards. But if we assume that rewards are unobservable, I don't see why reward functions aren't expressive enough to encode utilitarianism - just let the reward at each timestep be net happiness at that timestep. Then we can describe utilitarians as trying to maximise reward.

I would complain significantly less about this, yeah. However, the relativity objection stands.

I think we're talking about different debates here. I agree with the statement above - but the follow-up debate which I'm interested in is the comparison of "utility theory" versus "a naive conception of goals and beliefs" (in philosophical parlance, the folk theory), and so this actually seems like a point in favour of the latter. What does utility theory add to the folk theory of agency?

To state the obvious, it adds formality. For formal treatments, there isn't much of a competition between naive goals and utility theory: utility theory wins by default, because naive goal theory doesn't show up to the debate.

If I thought "goals" were a better way of thinking than "utility functions", I would probably be working on formalizing goal theory. In reality, though, I think utility theory is essentially what you get when you try to do this.

Here's one example: utility theory says that deontological goals are very complicated. To me, it seems like folk theory wins this one, because lots of people have pretty deontological goals. 

So, my theory is not that it is always better to describe realistic agents as pursuing (simple) goals. Rather, I think it is often better to describe realistic agents as following simple policies. It's just that simple utility functions are often enough a good explanation, that I want to also think in those terms.

Deontological ethics tags actions as good and bad, so, it's essentially about policy. So, the descriptive utility follows from the usefulness of the policy view. [The normative utility is less obvious, but, there are several reasons why this can be normatively useful; eg, it's easier to copy than consequentialist ethics, it's easier to trust deontological agents (they're more predictable), etc.]

To state it a little more thoroughly:

  1. A good first approximation is the prior where agents have simple policies. (This is basically treating agents as regular objects, and investigating the behavior of those objects.)
  2. Many cases where that does not work well are handled much better by assuming simple utility functions and simple beliefs. So, it is useful to sloppily combine the two.
  3. An even better combination of the two conceives of an agent as a model-based learner who is optimizing a policy. This combines policy simplicity with utility simplicity in a sophisticated way. Of course, even better models are also possible.

Or another example: utility theory says that there's a single type of entity to which we assign value. Folk theory doesn't have a type system for goals, and again that seems more accurate to me (we have meta-goals, etc).

I'm not sure what you mean, but I suspect I just agree with this point. Utility functions are bad because they require an input type such as "worlds". Utility theory, on the other hand, can still be saved, by considering expectation functions (which can measure the expectation of arbitrary propositions, linear combinations of propositions, etc). This allows us to talk about meta-goals as expectations-of-goals ("I don't think I should want pizza").

To be clear, I do think that there are a bunch of things which the folk theory misses (mostly to do with probabilistic reasoning) and which utility theory highlights. But on the fundamental question of the content of goals (e.g. will they be more like "actually obey humans" or "tile the universe with tiny humans saying 'good job'") I'm not sure how much utility theory adds.

Again, it would seem to add formality, which seems pretty useful.

Comment by abramdemski on I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the "utility function" abstraction · 2021-06-23T16:18:40.281Z · LW · GW

And setting A=B=C is deciding not to allocate the time to figure out their values (hard to decide -> similar). 

This sentence seems to pre-suppose that they have "values", which is in fact what's at issue (since numerical values ensure transitivity). So I would not want to put it that way. Rather, cutting the loop saves time without apparently losing anything (although to an agent stuck in a loop, it might not seem that way).

Usually, such a thing indicates there are multiple things you want

I think this is actually not usually an intransitive loop, but rather, high standards for an answer (you want to satisfy ALL the desiderata). When making decisions, people learn an "acceptable decision quality" based on what is usually achievable. That becomes a threshold for satisficing. This is usually good for efficiency; once you achieve the threshold, you know returns for thinking about this decision are rapidly diminishing, so you can probably move on.

However, in the rare situations where the threshold is simply not achievable, this causes you to waste a lot of time searching (because your termination condition is not yet met!).
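A minimal sketch of that dynamic (the option generator and the threshold numbers are made up for illustration):

```python
from typing import Tuple
import random

def search(threshold: float, budget: int) -> Tuple[float, int]:
    """Keep sampling options until one meets the satisficing threshold or the
    budget runs out. Returns (best value found, number of options examined)."""
    best = float("-inf")
    for step in range(1, budget + 1):
        value = random.random()      # options here are worth between 0 and 1
        best = max(best, value)
        if best >= threshold:        # satisficing termination condition
            return best, step
    return best, budget              # threshold never met: budget exhausted

if __name__ == "__main__":
    random.seed(0)
    print(search(threshold=0.9, budget=10_000))   # achievable: stops early
    print(search(threshold=1.5, budget=10_000))   # unachievable: burns the whole budget
```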

Comment by abramdemski on My Current Take on Counterfactuals · 2021-06-22T23:24:24.720Z · LW · GW

I don't believe that LI provides such a Pareto improvement, but I suspect that there's a broader theory which contains the two.

Overall, I place much less weight on arguments that revolve around the presumed nature of human values compared to arguments grounded in abstract reasoning about rational agents.

Ah. I was going for the human-values argument because I thought you might not appreciate the rational-agent argument. After all, who cares what general rational agents can value, if human values happen to be well-represented by infrabayes?

But for general rational agents, rather than make the abstract deliberation argument, I would again mention the case of LIDT in the procrastination paradox, which we've already discussed. 

Or, I would make the radical probabilist argument against rigid updating, and the 'orthodox' argument against fixed utility functions. Combined, we get a picture of "values" which is basically a market for expected values, where prices can change over time (in a "radical" way that doesn't necessarily spring from an update on a proposition), but which follow some coherence rules like an expectation of an expectation equals an expectation. One formalization of this is Skyrms'. Another is your generalization of LI (iirc).

So to sum it up, my argument for general rational agents is:

  • In general, we need not update in a rigid way; we can develop a meaningful theory of 'fluid' updates, so long as we respect some coherence constraints. In light of this generalization, restriction to 'rigid' updates seems somewhat arbitrary (ie there does not seem to be a strong motivation to make the restriction from rationality alone).
  • Separately, there is no need to actually have a utility function if we have a coherent expectation.
  • Putting the two together, we can study coherent expectations where the notion of 'coherence' doesn't assume rigid updates.

However, this argument of course does not account for InfraBayes. I suspect your real crux is the plausibility of coming up with a unifying theory which gets both radical-probabilism stuff and InfraBayes stuff. This does seem challenging, but I strongly suspect it to be possible. Indeed, it seems like it might have to do with the idea of a market which maintains a buy/sell spread rather than giving one price for a good.

Comment by abramdemski on My Current Take on Counterfactuals · 2021-06-22T22:46:58.359Z · LW · GW

I'm objecting to an assumption that contradicts a previous assumption which leads to inconsistent PA. If PA is consistent, then we can't just suppose 1 = 2 because we feel like it. If 1 = 2, then PA is inconsistent and math is broken.

Ah. But in standard math and logic, this is exactly what we do. We can suppose anything because we feel like it.

It sounds like you want something similar to "relevance logic", in that relevance logic is also a nonstandard logic which doesn't let you do whatever you want just for the heck of it. I don't know a whole lot about it, but I think assumptions must be relevant, basically meaning that we actually have to use them. However, afaik, even relevance logic lets us assume things which are contrary to previous assumptions!

Would you also object to the following proof?

  • Suppose A.
    • Suppose ¬A.
      • Contradiction.
    • So ¬¬A.
  • So A → ¬¬A.

In this proof, I show that A implies its own double negation, but to do so, I have to suppose something contrary to something I've already supposed.

Assuming PA is consistent and assuming the agent has crossed, then U for crossing cannot be -10 or the agent would not have crossed. Weird assumptions like "1 = 2" or "the agent proves crossing implies U = -10" contradict the existing assumption of PA consistency. Any proof that involves inconsistency of PA as a key step is immediately suspect.

Just to emphasize: this is going to have to be a very nonstandard logic, for your notion of which proofs are/aren't suspect to work.

It is possible that troll bridge could be solved with a very nonstandard logic, of course, but we would also want the logic to do lots of other nice things (ie, support normal reasoning to a very large degree). But it is not clear how you would propose to do that; for example, the proof that a sentence implies its own double negation would appear to be thrown out.

Suppose the agent crosses. Further suppose the agent proves crossing implies 1 = 2. Such a proof means PA is inconsistent, so 1 = 2 is indeed "true" within PA. Thus "agent proves crossing implies 1 = 2" implies 1 = 2. Therefore by Löb's theorem, crossing implies 1 = 2.

Putting this in symbols, using box to mean "is provable in PA":

  • Suppose □(cross → 1=2).
    • Suppose the agent crosses.
    • Therefore, cross → □(1=2).
  • So □(cross → 1=2) → (cross → □(1=2)).

See the problem? For Löb's theorem to apply, we need to prove □(cross → 1=2) → (cross → 1=2). Here, you have an extra box: □(cross → 1=2) → (cross → □(1=2)). So, your proof doesn't go through.
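For reference, here is the shape Löb's theorem actually requires, next to what the argument above establishes (same box-for-provability notation; a standard statement, displayed only for comparison):

```latex
% Löb's theorem: for any sentence X,
\[
  \text{if } \vdash \Box X \rightarrow X \text{, then } \vdash X.
\]
% So to conclude "cross -> 1=2" via Löb's theorem, one would need
\[
  \vdash \Box(\mathit{cross} \rightarrow 1{=}2) \rightarrow (\mathit{cross} \rightarrow 1{=}2),
\]
% whereas the argument above only gives
\[
  \vdash \Box(\mathit{cross} \rightarrow 1{=}2) \rightarrow (\mathit{cross} \rightarrow \Box(1{=}2)).
\]
```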

If the proof leads to "PA is inconsistent", then every step that follows from that step is unreliable.

Again, I just want to emphasize that this reasoning implies a very nonstandard logic (which probably wouldn't even support Lob's theorem to begin with?). Normal logic has no conditions like "you can start ignoring a proof if it starts saying such and such". Indeed, this would sort of defeat the point of logic: each step is supposed to be individually justified. And in particular, in typical logics (including classical logic, intuitionistic logic, linear logic, and many others), you can assume whatever you want; the point of logic is to study what follows from what, which means we can make any supposition, and see what follows from it. You don't have to believe it, but you do have to believe that if you assumed it, such-and-such would follow.

Putting restrictions on what we can even assume sounds like putting restrictions on what arguments we're allowed to listen to, rather than studying the structure of arguments. Logic doesn't keep people from making bad assumptions; it only tells you what those assumptions imply.

A hypothetical assumption doesn't harm anybody. If A is totally absurd, that's fine; we can derive sentences like A → 1=2. The conclusions should be confined to the hypothetical. So, under ordinary circumstances, there is no reason to restrict people from making absurd assumptions. Garbage in, garbage out. No problem.

It's Löb's theorem that's letting us take our absurd conclusions and "break them out of the hypothetical" here.

So as I stated before, I think you should be wrestling with Löb's theorem itself here, rather than finding the flaws in other places. I think you're chasing shadows with your hypothesis about restricting when we can make which assumptions. That's simply not the culprit here.

Comment by abramdemski on Escaping the Löbian Obstacle · 2021-06-22T19:32:25.272Z · LW · GW

A doesn't need B to believe that the logic is sound. Even if you decide to present "logic L plus metalogical beliefs" as a larger logic L' (and assuming you manage to do this in a way that doesn't lead to inconsistency), the semantic map is defined on L, not on L'.

My problem is that I still don't understand how you propose for the agent to reason/behave differently than what I've described; so, your statement that it does in fact do something different doesn't help me much, sorry.

The semantic map is defined on L, not L' -- sure, this makes some sense? But this just seems to reinforce the idea that our agent can only "understand" the internal logic of agents who restrict themselves to only use L (not any meta-logical beliefs).

 Even if you decide to present "logic L plus metalogical beliefs" as a larger logic L' (and assuming you manage to do this in a way that doesn't lead to inconsistency), 
[...]
I didn't understand this remark; please could you clarify?

Seems like you missed my point that the meta-logical belief could just be "L is sound" rather than "L plus me is sound". Adding the first as an axiom to L is fine (it results in an L' which is sound if L was sound), while adding the second as an axiom is very rarely fine (it proves soundness and consistency of the whole system, so the whole system had better be too weak for Godel's incompleteness theorems to apply).

Does that make sense?

You were talking about the second sort of situation (where adding the meta-logical belief as an axiom would result in an inconsistent system, because it would claim its own soundness); I wanted to point out that we could also be in the first sort of situation (where adding the meta-logical belief would result in a perfectly consistent system, but would only let you trust L, not L').
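Schematically, the two situations look like this (a sketch; □_L and □_T stand for provability in the respective systems):

```latex
% Case 1: add soundness of L as a schema of new axioms, yielding a strictly stronger L'.
\[
  L' \;:=\; L \,+\, \{\, \Box_L\ulcorner\varphi\urcorner \rightarrow \varphi \;:\; \varphi \text{ a sentence of } L \,\}
\]
% If L is sound then so is L'; and L' proves "L is sound", but not "L' is sound".
% Case 2: a theory T asserting the soundness of itself,
\[
  T \;\vdash\; \Box_T\ulcorner\varphi\urcorner \rightarrow \varphi \quad \text{for every sentence } \varphi,
\]
% which, by Löb's theorem, forces T to prove every sentence -- i.e. T is inconsistent
% (assuming T is strong enough for Löb's theorem to apply).
```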

I don't understand why you would put truth on the "ought" side of the is/ought divide, or if you do, how it helps us out here.

To put soundness in is/ought form, the belief that A must hold is that "if phi is provable in L, the interpretation of phi (via the semantic map) ought to be true". 

To me, this "ought" in the sentence reads as a prediction (basically an abuse of "ought", which persists in common usage largely because people make is/ought errors). I would prefer to re-phrase as "if phi is provable in L, then the interpretation of phi will be true" or less ambitiously "will probably be true".

Is your proposal that "X is true" should be taken as a statement of X's desirability, instead? Or perhaps X's normativity? That's what it means to put it on the "ought" side, to me. If it means something different to you, we need to start over with the question of what the is/ought divide refers to.

Truth can't be moved to the other side, because as I've tried to explain (perhaps unsuccessfully) the logic doesn't include its own semantics, and it's always possible to take contrarian semantics which fail to be sound applications of L. (see also final point below)

I agree that (by Tarski's undefinability theorem) a logic can't know its own semantic map (provided it meets the assumptions of that theorem). At least, if we're on the same page about what a semantic map is.

However, Tarski's analysis also shows how to build a stronger logic L' which knows the semantic map for L. So if "move to the other side" means represent it via more formal axioms (which is what I take you to mean), you seem to be wrong.
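Concretely (a sketch of the standard Tarskian move, under my own choice of metatheory, not a claim about what you have in mind for the agent):

```latex
% Work in a stronger metatheory L' (one able to talk about models of L), fix the
% intended interpretation \mathcal{M}, and define truth for L-sentences:
\[
  \mathrm{True}_L(\ulcorner \varphi \urcorner) \;:\Leftrightarrow\; \mathcal{M} \models \varphi .
\]
% L' can then state -- and, given suitable axioms about \mathcal{M}, prove -- soundness of L:
\[
  \forall \varphi \,\bigl( \mathrm{Prov}_L(\ulcorner \varphi \urcorner) \rightarrow \mathrm{True}_L(\ulcorner \varphi \urcorner) \bigr),
\]
% even though, by Tarski's theorem, no such truth predicate for L' itself is definable in L'.
```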

The interpretation which results in the Löbian Obstacle is "Löb's theorem tells us that a logical agent can't trust its own reasoning, because it can't prove that reasoning is sound," and under that interpretation it seems that extreme measures must be taken to make formal logic relevant to AI reasoning, which is counterintuitive since we humans employ formalizable reasoning every day without any such caveats. In this post I'm saying "Löb's theorem reminds us that a logical agent cannot make a priori claims about the soundness of its own reasoning, because soundness is a statement about a semantic map, which the logic has no a priori control over".

My problem with the general format of this argument is that the other interpretation of Löb's theorem is still valid even if you re-interpret it. That is: Eliezer et al. show that the assumptions of Löb's theorem apply, and thereby conclude a formal result. No amount of re-interpreting the theorem changes that; one might as well suggest that we solve the Lobstacle by forgetting Löb's theorem.

That is what I meant when I said I don't get how it helps.

It seems to me like your post is missing a step where you design an alternative agent based on your improved thinking. Probably the alternative agent is supposed to be obvious from what you wrote, but I'm not getting it.

No, the agent doesn't change its reasoning.

So how do you escape the conclusion from the paper??

Assume that the agent A is reasoning soundly about their environment, which is to say that their semantic mapping is sound. Then A's belief in the soundness of their reasoning is justified.

Justified, perhaps, but also non-existent, right? You say the agent doesn't change its reasoning. So the reasoning is exactly the same as the Lobstacle agent from the paper. So it doesn't conclude its own soundness. Right??

The change is that we don't require A to prove that their semantic mapping is sound, because A cannot do that, and I'm claiming that this doesn't break anything.

I have a couple of possible interpretations of what you might mean here, all of them really poor:

  1. You're saying that we allow the agent to conclude its own soundness without going through anything involving the semantic map, since it can't possibly do that. But this seems to just result in an inconsistent agent, by Gödel's second theorem (stated after this list). Probably you don't mean this.
  2. You're saying that we don't change anything about the agent design; but we, as designers, don't worry about the agent being able to reason about its own semantic map. Here, the question is, how does this help us? We still have the same bad agent designs. Probably this isn't what you mean either.
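(For reference, the standard statement of Gödel's second incompleteness theorem I'm appealing to in interpretation 1:)

```latex
% Gödel's second incompleteness theorem: if T is consistent, recursively
% axiomatized, and interprets enough arithmetic, then
\[
  T \nvdash \mathrm{Con}(T).
\]
% An agent whose logic proves its own soundness thereby proves its own consistency,
% so it is either inconsistent or too weak for the theorem to apply.
```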

If you want me to make it more formal, here: suppose I have a logic L and a universe U. For simplicity, let's say U is a set. The semantic mapping is a map from the collection of formulas of L to the collection of concepts in U; it may be the case that symbols in L get mapped to collections of objects, or collections of collections, but for argument's sake we can assume the codomain to be some number of applications of the powerset operation to U, forming a collection C(U) of "concepts". So it's a mapping S: L --> C(U). The crucial thing is that S is not a member of C(U), so it can't be the image of any symbol in L under S. That is, S is outside the realm of things described by S, and this is true for any such S! Since "phi is provable in L means phi is true under S" is a statement involving S (even if the 'under S' is usually left implicit), it cannot be the interpretation under S of any formula in L, and so cannot be stated, let alone proved.

Right, so, we seem to be on the same page about what the semantic map basically is and Tarski's undefinability theorem and such.

But what I need is a sketch of the agent that you claim solves the Lobstacle.

Most of the logics we encounter in mathematical practice are built with an intended semantics. For example, Peano Arithmetic contains a bunch of symbols which are often informally treated as if they are "the" natural numbers, despite the fact that they are no more than formal symbols and that the standard natural numbers are not the only model of the theory. In the context of logic applied to AI, this results in a conflation between the symbols in a logic employed by an agent and the "intended meaning" of those labels. This happens in the logical induction paper when discussing PA: the formulas the agent handles are assumed to carry their intended meaning in arithmetic.

Actually, that's misconstruing the formal results of the paper, since logical inductors have formal systems as their subjects rather than any fixed semantics. However, it's clear from the motivation and commentary (even within the abstract) that the envisaged use-case for inductors is to model an agent forming beliefs about the truth of formal statements, which is to say their validity in some specific model/semantics.

Sure, but logical induction doesn't know anything about the intended semantics. It doesn't make a lick of difference to the algorithm whether you believe that PA refers to the standard model. Nor does it feature in the mathematical results.

So I would put it to you that logical induction does a good job of capturing what it means to reason in a semantics-ignorant way. If it also works well as a model of reasoning for semantic-ful things, well, that would seem to be a happy accident. And perhaps it doesn't: perhaps logical induction is missing an aspect of rationality around "semantics".

Thus, logical induction would appear to be precisely what you call for at the end of your post: a theory of rationality which reasons in a semantics-independent way.

Comment by abramdemski on I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the "utility function" abstraction · 2021-06-22T18:33:35.252Z · LW · GW

I think the key issue here is what you take as an "outcome" over which utility functions are defined. If you take states to be outcomes, then trying to model sequential decisions is inherently a mess. If you take trajectories to be outcomes, then this problem goes away.

Right, it seems pretty important that utility not be defined over states like that. Besides, relativity tells us that a simple "state" abstraction isn't quite right.
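In symbols (my rendering of the two options, with $S$ a state space and $A$ an action space; none of this notation is from the original post):

```latex
% Outcomes as states vs. outcomes as trajectories:
\[
  U_{\mathrm{state}} : S \rightarrow \mathbb{R}
  \qquad \text{vs.} \qquad
  U_{\mathrm{traj}} : (S \times A)^{\omega} \rightarrow \mathbb{R},
\]
% where a trajectory is the whole (possibly infinite) sequence of state-action pairs --
% utility over all time, rather than over a frozen snapshot.
```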

But broadly speaking, I expect that everything would be much clearer if phrased in terms of reward rather than utility functions, because reward functions are inherently defined over sequential decisions.

I don't like reward functions, since reward implies observability (at least, that's how it's usually taken).

I think a reasonable alternative would be to assume that utility is a weighted sum of local value (which is supposed to be similar to reward).

Example 1: reward functions. Utility is a weighted sum over a reward which can be computed for each time-step. You can imagine sliding a little window over the time-series, and deciding how good each step looks. Reward functions are single-step windows, but we could also use larger windows to evaluate properties over several time-steps (although this is not usually important).

Example 2: (average/total) utilitarianism. Utility is a weighted sum over (happiness/etc) of all people. You can imagine sliding a person-sized window over all of space-time, and judging how "good" each view is; in this case, we set the value to 0 (or some other default value) unless there is a person in our window, in which case we evaluate how happy they are (or how much they are thriving, or their preference satisfaction, or what-have-you).
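A minimal code sketch of the "sliding window" picture (my own illustration; `windowed_utility`, `local_value`, `window_size`, and the toy numbers are hypothetical stand-ins, not anything from the original comment):

```python
from typing import Any, Callable, Sequence


def windowed_utility(
    trajectory: Sequence[Any],
    local_value: Callable[[Sequence[Any]], float],
    window_size: int = 1,
    discount: float = 1.0,
) -> float:
    """Utility as a weighted sum of local values of sliding windows.

    With window_size=1 and local_value acting as a per-step reward, this is the
    usual discounted-reward picture (Example 1). With windows standing in for
    person-sized views of a history, and local_value returning 0 unless a person
    is in view, it resembles the utilitarian picture (Example 2).
    """
    total = 0.0
    for t in range(len(trajectory) - window_size + 1):
        window = trajectory[t : t + window_size]
        total += (discount ** t) * local_value(window)
    return total


# Example 1: reward functions as single-step windows.
u = windowed_utility(
    trajectory=[0.0, 1.0, 1.0, 0.0, 1.0],   # toy per-step signal
    local_value=lambda w: w[0],             # "reward" just reads off the step
    window_size=1,
    discount=0.9,
)
```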

At this point I really don't know what people who talk about coherence arguments on LW are actually defending.

One thought I had: it's true that utility functions had better be a function of all time, not just a frozen state. It's true that this means we can justify any behavior this way. The utility-theory hypothesis therefore doesn't constrain our predictions about behavior. We could well be better off just reasoning about agent policies rather than utility functions.

However, there seems to be a second thing we use utility theory for, namely, representing our own preferences. My complaint about your proposed alternative, "reward", was that it was not expressive enough to represent preferences I can think of, and which seem coherent (EG, utilitarianism).

So it might be that we're defending the ability to represent preferences we think we might have.

(Of course, I think even utility functions are too restrictive.)

Another thought I had:

Although utility theory doesn't strictly rule out any policy, a simplicity assumption over agent beliefs and utility functions yields a very different distribution over actions than a simplicity assumption over policies.
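To spell the contrast out (my own formalization, with $K(\cdot)$ standing in for some description-length measure; this is not notation from the thread):

```latex
% Simplicity prior taken directly over policies:
\[
  P_{\mathrm{policy}}(\pi) \;\propto\; 2^{-K(\pi)} ,
\]
% vs. a simplicity prior over belief/utility pairs, pushed forward to the policy
% $\pi_{B,U}$ that each pair induces (e.g. by expected-utility maximization):
\[
  P_{\mathrm{goal}}(\pi) \;\propto\; \sum_{(B,\,U)\;:\;\pi_{B,U}\,=\,\pi} 2^{-K(B)-K(U)} .
\]
% The two priors concentrate on very different policies, even though neither rules
% any policy out with certainty.
```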

It seems to me that there are cases which are better represented by utility theory. For example, when predicting what humans do in situations that are unusual, but where they have time to think, I expect "simple goals and simple world-models" to generalize better than "simple policies". I suspect this precisely because humans have settled on describing behaviors in terms of goals and beliefs, in addition to habits/norms (which are about policy). If habits/norms did a good enough job of constraining expectations on their own, we probably would not do that.

This also relates to the AI-safety-debate-relevant question of how to model highly capable systems. If your objection to "utility theory" as an answer is "it doesn't constrain my expectations", then I can reply "use a simplicity prior". The empirical claim made by utility theory here is: highly capable agents will tend to have behavior explainable via simple utility functions. As opposed to merely having simple policies.

OK, but then, what is the argument for this claim? Certainly not the usual coherence arguments?

Well, I'm not sure. Maybe we can modify the coherence arguments to have simplicity assumptions run through them as well. Maybe not.

What I feel more confident about is that the simplicity assumption embodies the content of the debate (or at least an important part of the content).