Posts

Goal-directedness: relativising complexity 2022-08-18T09:48:40.864Z
Goal-directedness: tackling complexity 2022-07-02T13:51:07.853Z
Examining Armstrong's category of generalized models 2022-05-10T09:07:41.147Z
Goal-directedness: imperfect reasoning, limited knowledge and inaccurate beliefs 2022-03-19T17:28:04.695Z
Goal-directedness: exploring explanations 2022-02-14T16:20:31.570Z
Goal-directedness: my baseline beliefs 2022-01-08T13:09:06.888Z
Escaping the Löbian Obstacle 2021-06-16T00:02:02.945Z

Comments

Comment by Morgan_Rogers on A Bird's Eye View of the ML Field [Pragmatic AI Safety #2] · 2022-10-05T12:29:49.841Z · LW · GW

This post sought to give an overview of how they do this, which is in my view extremely useful information!

This is what I was trying to question with my comment above: Why do you think this? How am I to use this information? It's surely true that this is a community that needs to be convinced of the importance of work on safety, as you point out in the next post in the sequence, but how does information about, say, the turnover of ML PhD students help me do that?

Thus to answer the question "what kind of research approaches generally work for shaping machine learning systems?" it is quite useful to engage with how they have worked in capabilities advancements. In machine learning, theoretical (in "math proofs" sense of the word) approaches to advancing capabilities have largely not worked. This suggests deep learning is not amenable to these kinds of approaches. 

There is conflation happening here which undermines your argument: theoretical approaches dominated how machine learning systems were shaped for decades, and you say so at the start of this post. It turned out that automated learning produced better results in terms of capabilities, and it is that success that makes it the continued default. But the former fact surely says a lot more about whether or not theory can "shape machine learning systems" than the latter. Following through with your argument, I would instead conclude that implementing theoretical approaches to safety might require us to compromise on capabilities, and this is indeed exactly what I expect: learning systems would have access to much more delicious data if they ignored privacy regulations and other similar ethical boundaries, but safety demands that capability is not the singular shaping consideration in AI systems.

Knowledge that useable theory has not really been produced in deep learning suggests to me that it's unlikely to for safety, either.

This is simply not true. Failure modes which were identified by purely theoretical arguments have been realised in ML systems. Attacks on systems and pathological behaviour (for image classifiers, say) are regularly constructed in theory before they ever meet real systems. It's also worth noting that architecture choices, or changes made to, say, make backprop more algorithmically efficient, are driven by theory.

In the end, my attitude is not that "iterative engineering practices will never ensure safety", but rather that there are plenty of people already doing iterative engineering, and that while it's great to convince as many of those as possible to be safety-conscious, there would be further benefits to safety if some of their experience could be applied to the theoretical approaches that you're actively dismissing.

Comment by Morgan_Rogers on A Bird's Eye View of the ML Field [Pragmatic AI Safety #2] · 2022-09-14T15:06:51.780Z · LW · GW

There is a disheartening irony to calling this series "Pragmatic AI Safety" and having the longest post be about capabilities advancements which largely ignore safety.

The first part of this post consists in observing that ML applications proceed from metrics, and subsequently arguing that theoretical approaches have been unsuccessful in learning problems. This is true but irrelevant for safety, unless your proposal is to apply ML to safety problems, which reduces AI Safety to 'just find good metrics for safe behaviour'. This seems as far from a pragmatic understanding of what is needed in AI Safety as one can get.

In the process of dismissing theoretical approaches, you ask "Why do residual connections work? Why does fractal data augmentation help?" These are exactly the kind of questions which we need to be building theory for, not to improve performance, but for humans to understand what is happening well enough to identify potential risks orthogonal to the benchmarks which such techniques are improving against, or trust that such risks are not present.

You say, "If we want to have any hope of influencing the ML community broadly, we need to understand how it works (and sometimes doesn’t work) at a high level," and provide similar prefaces as motivation in other sections. I find these claims credible, assuming the "we" refers to AI Safety researchers, but considering the alleged pragmatism of this sequence,  it's surprising to me that none of the claims are followed up with suggested action points. Given the information you have provided, how can we influence this community? By publishing ML papers at NeurIPS? And to what end are you hoping to influence them? AI Safety can attract attention, but attention alone doesn't translate into progress (or even into more person-hours).

Your disdain for theoretical approaches is transparent here (if it wasn't already from the name of this sequence). But your reasoning cuts both ways. You say, "Even if the current paradigm is flawed and a new paradigm is needed, this does not mean that [a researcher's] favorite paradigm will become that new paradigm. They cannot ignore or bargain with the paradigm that will actually work; they must align with it." I expect that 'metrics suffice' (a strawperson of your favoured paradigm) will not be the paradigm that actually works, and it's disappointing that your sequence carries the message (to my reading) that technical ML researchers can make significant progress in alignment and safety without really changing what they're doing.

Comment by Morgan_Rogers on [Closed] Job Offering: Help Communicate Infrabayesianism · 2022-03-24T08:12:07.892Z · LW · GW

If I haven't found a way to extend my post-doc position (ending in August) by mid-July and by some miracle this job offer is still open, it could be the perfect job for me. Otherwise, I look forward to seeing the results.

Comment by Morgan_Rogers on Goal-directedness: exploring explanations · 2022-03-19T17:25:25.113Z · LW · GW

A note on judging explanations

I should address a point that didn't come up in the post, and which may otherwise be a source of confusion going forward: the quality of an explanation can be high according to my criteria even if it isn't empirically correct. That is, some explanations of behaviour may be falsifiable: if I am observing a robot, I could explain its behaviour in terms of an algorithm, and one way to "test" that explanation would be to discover the algorithm which the robot is in fact running. However, no matter the result of this test, the judged quality of the explanation is not affected. Indeed, there are two possible outcomes: either the actual algorithm provides a better explanation overall, or our explanatory algorithm is a simpler algorithm with the same effects, and hence a better explanation than the true one, since running this simpler algorithm is a more efficient way to predict the robot's behaviour than simulating the robot's actual algorithm.

This might seem counterintuitive at first, but it's really just Occam's razor in action. Functionally speaking, the explanations I'm talking about in this post aren't intended to be recovering the specific algorithm the robot is running (just as we don't need the specifics of its hardware or operating system); I am only concerned with accounting for the robot's behaviour.

Comment by Morgan_Rogers on Harmful Options · 2022-03-19T12:52:06.050Z · LW · GW

Suppose your computer games, in addition to the long difficult path to your level's goal, also had little side-paths that you could use—directly in the game, as corridors—that would bypass all the enemies and take you straight to the goal, offering along the way all the items and experience that you could have gotten the hard way.  And this corridor is always visible, out of the corner of your eye.

Even if you resolutely refused to take the easy path through the game, knowing that it would cheat you of the very experience that you paid money in order to buy—wouldn't that always-visible corridor, make the game that much less fun?  Knowing, for every alien you shot, and every decision you made, that there was always an easier path?

This exact phenomenon happens in Deus Ex: Human Revolution, where you can get around almost every obstacle in the game by using the ventilation system. The frustration that results is apparent in this video essay/analysis: it undermines all of the otherwise well-designed systems in the game in spite of not actually interfering with the player's ability to engage with them.

I wonder if, alongside the "loss of rejected options" proposition, a reason that extra choices impact us is the mental bandwidth they take up. If the satisfaction we derive from a choice is (to a first-order approximation) proportional to our intellectual and emotional investment in the option we select, then having more options leaves less to invest as soon as the options go from being free to having any cost at all. As an economic analogy, a committee seeking to design a new product or building must choose between an initial set of designs. The more designs there are, the more resources must go into the selection procedure, and if the committee's budget is fixed, then this will remove resources that could have improved the product further down the line.

Comment by Morgan_Rogers on Why Rationalists Shouldn't be Interested in Topos Theory · 2022-03-12T15:37:49.548Z · LW · GW

[0,1] is a commutative quantale when equipped with its usual multiplication. You can lift the monoidal product structure to sheaves on [0,1] (viewed as a frame) via Day convolution. So we recover a topos where the truth values are probabilities. 
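To be concrete about the structure I'm invoking (these are standard facts, not claims taken from the post I'm responding to):

```latex
% The commutative quantale structure on [0,1]: the usual multiplication
% preserves arbitrary suprema in each argument, with residuation (internal hom)
\[
  a \multimap b \;=\; \sup\{\, c \in [0,1] : a \cdot c \le b \,\}
  \;=\;
  \begin{cases}
    \min(b/a,\ 1) & \text{if } a > 0,\\
    1 & \text{if } a = 0,
  \end{cases}
\]
% so that $a \cdot c \le b$ if and only if $c \le a \multimap b$.
```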

People who have attempted to build toposes with probabilities as truth values have also failed to notice this. Take Isham and Doering's paper, for example (which I personally am quite averse to, because they bullishly follow through on constructing toposes with certain properties which are barely justified). They don't even think about products of probabilities.

I think the monoidal topos on the unit interval merits some serious investigation.

Comment by Morgan_Rogers on Goal-directedness: exploring explanations · 2022-02-21T12:50:02.704Z · LW · GW

I see what you're getting at. For an arbitrary explanation, we need to take into account not only the complexity of the explanation itself, but also how difficult it is to compute a relevant prediction from that explanation; according to my criteria, the Standard Model (or any sufficiently detailed theory of physics that accurately explains phenomena within a conservative range of low-ish energy environments encountered on Earth) would count as a very good explanation of any behaviour relative to its complexity, but that ignores the fact that it would be impossible to actually compute those predictions.

While I made the claim that there is a clear dividing line between (accuracy and power) and (complexity), this strikes me as an issue straddling complexity and explanatory power, which muddies the water a little.

Since I've appealed to physics explanations in my post, I'm glad you've made me think about these points. Moving forward, though, I expect the classes of explanation under consideration to be so constrained as to make this issue insignificant. That is, I expect to be directly comparing explanations taking the form of goals to explanations taking the form of algorithms or similar; each of these has a clear interpretation in terms of its predictions and, while the former might be harder to compute, the difference in difficulty is going to be suitably uniform across the classes (after accounting for complexity of explanations), so that I feel justified in ignoring it until later.

Comment by Morgan_Rogers on Goal-directedness: my baseline beliefs · 2022-01-14T13:14:46.657Z · LW · GW

Thanks for the ideas!

I like the idea about the size of the target states; there's bound to be some interesting measure theory that I can apply if I decide to formalize in that direction. In fact, measure theory might be able to clarify some of the subtleties I alluded to above regarding what happens when we refine the world model (for example, in a way that causes a single goal state to split into two or more).

There are hints in your last paragraph of associating competence with goal-directedness, which I think is an association to avoid. For example, when a zebra is swimming across a river as fast as it can, I would like the extent to which that behaviour is considered goal-directed to be independent of whether that zebra is the one that gets attacked by a crocodile.

Comment by Morgan_Rogers on Why Subagents? · 2021-10-26T14:26:39.778Z · LW · GW

The example you give has a pretty simple lattice of preferences, which lends itself to illustrations but which might create some misconceptions about how the subagent model should be formalized. In your example, for instance, you assume that the agents' preferences are orthogonal (one cares about pepperoni, the other about mushrooms, and each is indifferent along the other's axis), that the agents have equal weight in the decision-making, that the lattice is distributive... Accounting for these factors, there are many ways that a given 'weak utility' can be expressed in terms of subagents. I'm sure there are optimization questions that follow here, about the minimum number of subagents (dimensions) needed to embed a given weak-utility function (partially ordered set), and about when reasonable constraints such as orthogonality of subagents can be imposed; see the sketch below. There are also composition questions: how does a committee of agents with subagents behave?
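As a toy illustration of the representation question (my own sketch, with hypothetical names, not code from the post): the "weak utility" of the pizza example is just the consensus order of two subagent utilities.

```python
# One subagent cares only about pepperoni, the other only about mushrooms.
u_pepperoni = {"plain": 0, "pepperoni": 1, "mushroom": 0, "pepperoni+mushroom": 1}
u_mushroom = {"plain": 0, "pepperoni": 0, "mushroom": 1, "pepperoni+mushroom": 1}


def committee_prefers(x: str, y: str) -> bool:
    """x is weakly preferred to y only if every subagent weakly agrees."""
    return all(u[x] >= u[y] for u in (u_pepperoni, u_mushroom))


# The consensus order is only partial: pepperoni and mushroom are incomparable.
assert not committee_prefers("pepperoni", "mushroom")
assert not committee_prefers("mushroom", "pepperoni")
assert committee_prefers("pepperoni+mushroom", "plain")
```

For a finite set of options, the minimum number of such real-valued subagents needed to recover a given partial order is exactly the order dimension of that poset.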

Comment by Morgan_Rogers on A revolution in philosophy: the rise of conceptual engineering · 2021-10-25T15:16:18.601Z · LW · GW

It's really nice to see a critical take on analytic philosophy, thank you for this post. The call-out aspect was also appreciated: coming from mathematics, where people are often quite reckless about naming conventions to the detriment of pedagogical dimensions of the field, it is quite refreshing.

On the philosophy content, it seems to me that many of the vices of analytic philosophy seem hard to shake, even for a critic such as yourself.

Consider the "Back to the text" section. There is some irony in your accusation of Chalmers basing his strategy on its name via its definition rather than the converse, yet you end that section with giving a definition-by-example of what engineering is and proceed with that definition. To me, this points to the tension between dismissing the idea that concepts should be well-defined notions in philosophical discourse, while relying on at least some precision of denotation in using names of concepts in discourse.

You also seem to lean on anthropological principles as analytic philosophy does. I agree that the only concepts which will appear in philosophical discourse will be those which are relevant to human experience, but that experience extends far beyond "human life" to anything of human interest (consider the language of physics and mathematics, which often doesn't have direct relation to our immediate experience), and this is a consequence of the fact that philosophy is a human endeavour rather than anything intrinsic to its content.

I'd like to take a different perspective on your Schmidhuber quote. Contrary to your interpretation, the fact that concepts are physically encoded in neural structures supports the Platonic idea that these concepts have an independent existence (albeit a far more mundane one than Plato might have liked). The empirical philosophy approach might be construed as investigating the nature of concepts statistically. However, there is a category error happening here: in pursuing this investigation, an empirical philosopher is conflating the value of the global concept with that of their own "partial" version of it.

I would argue that, whether or not one is convinced they exist, no one is invested for their own sake in communal concepts, which are the kind of fragmented, polysemous entities you describe. Individuals are mostly invested in their own conceptions of concepts, and take an interest in communal concepts only insofar as they are interested in being included in the communities in which those concepts reside. In short, relativism is an alternative way to resolve concepts: we can proceed not by rejecting the idea that concepts can have clear definitions (which serve to ground discourse in place of the more nebulous intuitions which motivate them), but rather by recognizing that any such definitions must come with a limited scope. I also personally reject the idea that a definition should be expected to conform to all of the various "intuitions" which are appealed to in classical philosophy, for various reasons, but especially because there seems to be no a priori reason that any human should have infallible (or even rational) intuitions about concepts.

I might even go so far as to say that recognizing relativism incorporates your divide and conquer approach to resolving disagreement: the gardeners and landscape artists can avoid confusion when they discuss the concept of soil by recognizing their differing associations with the concept and hence specifying the details relevant to the union of their interests. But each can discard the extraneous details in discussion with their own community, just as physicists will go back to talking about "sound" in its narrowed sense when talking with other physicists. These narrowings only seem problematic if one expects the scope of all discourse to be universal.

Comment by Morgan_Rogers on Model splintering: moving from one imperfect model to another · 2021-10-24T11:11:21.480Z · LW · GW

In section 4.6, you described an "unnatural" reward function splintering, and went on to advocate for more natural ones. I would agree with your argument as a general principle, but on the other hand I can think of situations where an exceptional case should be accounted for. Suppose that the manager of the rube-blegg factory keeps a single rube and a single blegg in a display case on the factory floor to present to touring visitors. A rube classifier which physically sorts rubes and bleggs should be able to recognize that these displayed examples are not to be sorted with the others, even though this requires making an unnatural extension of its internal reward function.
I think your examples in Section 6 of suitably deferring to human values upon model splintering could resolve this, but to me it highlights that a naive approach to model splintering could result in problems if the AI is not keeping track of enough features of the world to identify when an automatic "natural" extension of its model is inappropriate.

Comment by Morgan_Rogers on Generalised models as a category · 2021-10-08T12:45:34.203Z · LW · GW

Re "I'm not fully sold on category theory as a mathematical tool", if someone (e.g. me) were to take the category you've outlined and run with it, in the sense of establishing its general structure and special features, could you be convinced? Are there questions that you have about this category that you currently are only able to answer by brute force computation from the definitions of the objects and morphisms as you've given them? More generally, are there variants of this category that you've considered that it might be useful to study in parallel?

Comment by Morgan_Rogers on Subagents of Cartesian Frames · 2021-10-08T12:11:33.567Z · LW · GW

I am very experienced in category theory but not the Chu construction (or *-autonomous categories in general). There is a widely used notion of subobject of an object X in a category C as "equivalence class of monomorphisms with codomain X". This differs from your definition most conspicuously in the case of a frame from which there is no morphism to a typical frame.

If I'm calculating correctly, the standard notion of subobject is strictly stronger than the one you present here (as long as the world W is inhabited; even when it isn't, I think the construction collapses enough to make the claim true), since monomorphisms are morphisms which are injective in their agent argument and surjective in their environment argument, and we can extend any morphism along such a monomorphism.
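For reference, the definitions I'm comparing, in the notation of the Cartesian Frames sequence (my paraphrase):

```latex
% A morphism of frames over a world W, from (A, E, \cdot) to (B, F, \star),
% is a pair of maps g : A -> B and h : F -> E satisfying
\[
  g(a) \star f \;=\; a \cdot h(f) \qquad \text{for all } a \in A,\ f \in F,
\]
% and the standard categorical notion of subobject of a frame X that I refer to
% above is an equivalence class of monomorphisms with codomain X (two
% monomorphisms being equivalent when each factors through the other).
```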

Now, I notice that you refer to the concepts in this post as subagents rather than subframes, so perhaps you were deliberately avoiding this stronger concept. Intuitively, a subframe in the sense I describe above consists of an agent who has a subset of the available options and who may not be able to distinguish between some of the environments present in the larger frame; the "precommitted agent" you mention early on here seems to be a special case of this which is the identity in the environment component. Incidentally, the equivalence relation in this notion of subobject amounts to isomorphism in the finite case, but for infinite frames it is non-trivial, for a reason similar to the one you described.

I wonder if you have any thoughts about how these notions compare? It's clear from the discussion that you chose a definition which reflected what you wanted to express, which is always good, but on the other hand the monomorphisms I described will crop up when you consider factorizations of the morphisms in your category more generally. Perhaps they could be useful to you.

Comment by Morgan_Rogers on Escaping the Löbian Obstacle · 2021-07-18T19:21:16.481Z · LW · GW

Note also that your definition implies that if an agent alieves something, it must also believe it.

I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine!

I don't know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I'm pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the "reasoning engine" (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that beliefs could be "more complicated" than aliefs, which made me unsure.

Anyway, since you keep taking the time to thoroughly reply in good faith, I'll do my best to clarify and address some of the rest of what you've said. However, thanks to the discussion we've had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don't want to muddy the waters further!

Rather, for the Lobstacle, "A trusts B" has to be defined as "A willingly relies on B to perform mission-critical tasks". This definition does indeed fail to be true for naive logical agents. But this should be an argument against naive logical agents, not our notion of trust.

Hence my perception that you do indeed have to question the theorems themselves, in order to dispute their "relevance" to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (IE being able to delegate critical tasks to another agent).

Perhaps we have different intuitive notions of trust, since I certainly trust myself (to perform "mission-critical tasks"), at least as far as my own logical reasoning is concerned, and an agent that doesn't trust itself is going to waste a lot of time second-guessing its own actions. So I don't think you've addressed my argument that the definition of trust that leads to the Löbstacle is faulty because it fails to be reflexive.

Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.

Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you're admitting that your approach has this problem, but, you think there's value in what you've done so far because you've solved one part of the problem and you think this other part could also work with time. Is that what you're intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don't see what the value in your proposal could be.
For me, solving the Lobstacle means being able to actually decide to delegate.

There are two separate issues here, and this response makes it apparent that you are conflating them. The fact that the second agent in the original Löbstacle paper is constrained to act only once it has produced a provably effective strategy and is constrained to follow that strategy means that the Löbstacle only applies to questions concerning the (formal) reasoning of a subordinate agent. Whether or not I manage to convince you that the Löbstacle doesn't exist (because it's founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn't address the following second problem. Suppose I can guarantee that my subordinate uses reasoning that I believe to be valid. How can I guarantee that it will act on that reasoning in a way I approve of? This is (obviously) a rather general version of the alignment problem. If you're claiming that Löb's theorem has a bearing on this, then that would be big news, especially if it vindicates your opinion that it is "doomed to fail".

The reason I see my post as progress is that currently the Löbstacle is blocking serious research in using simple systems of formal agents to investigate such important problems as the alignment problem.

Your implication is "there was not a problem to begin with" rather than "I have solved the problem". I asked whether you objected to details of the math in the original paper, and you said no -- so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the lobstacle!).

Taking the revised definition of trust I described, that last sentence is no longer the content of any formal mathematical result in that paper, so I do not agree with it, and I stand by what I said.

Indeed, my claim boils down to saying that there is no problem. But I don't see why that doesn't constitute a solution to the apparent problem. It's like the Missing Dollar Riddle; explaining why there's no problem is the solution.

I'm somewhat curious if you think you've communicated your perspective shift to any other person; so far, I'm like "there just doesn't seem to be anything real here", but maybe there are other people who get what you're trying to say?

There's no real way for me to know. Everyone who I've spoken to about this in person has gotten it, but that only amounts to a handful of people. It's hard to find an audience; I hoped LW would supply one, but so far it seems not. Hopefully a more formal presentation will improve the situation.

Comment by Morgan_Rogers on Escaping the Löbian Obstacle · 2021-07-11T18:55:21.149Z · LW · GW

I like the alief/belief distinction; it seems to carry the distinction I was after. To make it more formal, I'll use "belief" to refer to 'things which an agent can prove in its reasoning engine/language (L)', and "alief" to refer to beliefs plus 'additional assumptions which the agent makes about the bearing of that reasoning on the environment', which together constitute a larger logic (L'). Does that match the distinction you intended between these terms?

An immediate pedagogical problem with this terminology is that we have to be careful not to conflate this notion of belief with the usual one: an agent will still be able to prove things in L even if it doesn't believe (in the conventional sense) that the L-proof involved is valid.

There is a more serious formalization issue at play, though, which is the problem of expressing a negative alief. How does one express that an agent "does not alieve that a proof of X in L implies that X is true"? The formula ¬(Prov_L(X) → X) is classically equivalent to Prov_L(X) ∧ ¬X, which in particular is an assertion of both the existence of a proof of X and the falsehood of X, which is clearly far stronger than the intended claim. This is going off on a tangent, so for now I will just assume that it is possible to express disaliefs by introducing some extra operators in L' and get on with it.

So, to be absolutely clear on this: do you accept the mathematical proofs in the original paper (but propose some way around them), or do you think they are actually mistaken? Do you accept the proof of Lob's theorem itself, or do you think it is mistaken?

Yes. The mathematical proofs and Löb's theorem are absolutely fine. What I'm refuting is their relevance; specifically the validity of this claim:

An agent can only trust another agent if it believes that agent's aliefs.

My position is that *when their beliefs are sound* an agent only ever needs to *alieve* another agent's *beliefs* in order to trust them. A definition of trust which fails to be reflexive is clearly a bad definition, and with this modified definition there is no obstacle, because "beliefs being strictly weaker than aliefs" is the default (indeed, by Löb's theorem, the only reasonable possibility) and can be implemented symmetrically between multiple agents; no hierarchy of logics or other such solution is needed.
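For reference, the statement of Löb's theorem that both of us are invoking (in its standard form, for a recursively axiomatized theory T extending Peano Arithmetic, with provability predicate Prov_T):

```latex
\[
  \text{If } \; T \vdash \mathrm{Prov}_T(\ulcorner \phi \urcorner) \rightarrow \phi
  \;\text{ for a sentence } \phi, \text{ then } \; T \vdash \phi .
\]
% So T can only endorse the reflection instance for sentences it already proves.
```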

Note that following the construction in the article, the secondary agent B can only act on the basis of a valid L-proof, so there is no need to distinguish between trusting what B says (the L-proofs B produces) and what B does (their subsequent action upon producing an L-proof). Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.

> So if A constructs B to produce proofs in L, it doesn't matter what B's beliefs are, or even if B has any beliefs;

But using proofs in L is precisely what I understood from "belief"; so, if you have something else in mind, I will need clarification.

In the language you introduce, and bearing in mind what I said above, this would be restated as "it doesn't matter what B's aliefs are". I can use this to illustrate why I needed to include the condition "when their beliefs are sound" in the above: suppose A and B have differing/imperfect knowledge about the state of the environment, to the effect that B may disalieve the soundness of L, while A alieves it. The result is that B might alieve they are misleading A by communicating an L-proof which B alieves to be invalid. But if A's application of L is in fact sound (so B is in the wrong), then A can still trust and successfully apply the L-proof supplied by B.

> ...nor will it use such assumptions in determining its moves, because such determinations are made without any reference to a semantic map.

But then, this part of the sentence seems to go completely off the rails for me; a chess-playing robot will be no good at all if the rules of chess don't bear on its moves. Can you clarify?

You're right, this part was a mistake. Such a robot will optimize play with respect to what it alieves are the rules, and so will become very good at what it implicitly alieves chess to be.

> This is why I said the agent doesn't change its reasoning.

I still don't understand this part at all; could you go into more depth?

I was taking "reasoning" here to mean "applying the logic L" (so manipulating statements of belief), since any assumptions lying strictly in L' are only applied passively. It feels strange to me to extend "reasoning" to include this implicit stuff, even if we are including it in our formal model of the agent's behaviour.

But then, your bolded statement seems to just be a re-statement of the Löbstacle:  logical agents can't explicitly endorse their own logic L' which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.

It's certainly a restatement of Löb's theorem. My assertion is that there is no resultant obstacle.

Re the rest,

(And I also don't yet see what that part has to do with getting around the Löbstacle.)

It's not relevant to getting around the Löbstacle; this part of the discussion was the result of me proposing a possible advantage of the perspective shift which (I believe, but have yet to fully convince you) resolves the Löbstacle. I agree that this part is distracting, but it's also interesting, so please direct message me (via whatever means is available on LW, or by finding me on the MIRIx Discord server or AI alignment Slack) if you have time to discuss it some more.

Comment by Morgan_Rogers on What are some claims or opinions about multi-multi delegation you've seen in the memeplex that you think deserve scrutiny? · 2021-06-27T18:19:47.965Z · LW · GW

Critch's comments support an opinion I've held since I started thinking seriously about alignment: that the language we use to describe it is too simple, and ignores the fact that "human" interests (the target of alignment) are not the monolith they're usually presented as.

For your specific question about multi-multi, I only have limited access to the memeplex, so I'll just share my thoughts. Multi-multi delegation involves:
1. Compromise / resolution of conflicts of interest between delegators.
2. Mutual trust in delegators regarding communication of interests to delegatees.
3. Equitable control between delegators. This could be lumped in with conflicts of interest, but deserves special attention.
4. Capacity for communication and cooperation between delegatees.

... and some other aspects I haven't thought of. As far as I can see, though, the most important issues here would be addressed by consideration of single-multi and multi-single; multi-multi-specific problems will only be relevant when there are obstacles to communication between either delegators or delegatees (a conceivable future problem, but not a problem as long as the complexity of systems actually being constructed is limited).

Comment by Morgan_Rogers on Escaping the Löbian Obstacle · 2021-06-23T21:56:09.157Z · LW · GW

Seems like you missed my point that the meta-logical belief could just be "L is sound" rather than "L plus me is sound". Adding the first as an axiom to L is fine (it results in an L' which is sound if L was sound), while adding the second as an axiom is very rarely fine (it proves soundness and consistency of the whole system, so the whole system had better be too weak for Godel's incompleteness theorems to apply).

Aha! I knew I must be missing something, thanks for the clarification. That makes things easier. I'll continue to use L' to mean "L + Sound(L,S)", where S is a particular semantic map.

> A doesn't need B to believe that the logic is sound. Even if you decide to present "logic L plus metalogical beliefs" as a larger logic L' (and assuming you manage to do this in a way that doesn't lead to inconsistency), the semantic map is defined on L, not on L'.

My problem is that I still don't understand how you propose for the agent to reason/behave differently than what I've described; so, your statement that it does in fact do something different doesn't help me much, sorry.

The semantic map is defined on L, not L' -- sure, this makes some sense? But this just seems to reinforce the idea that our agent can only "understand" the internal logic of agents who restrict themselves to only use L (not any meta-logical beliefs).

The Löbian obstacle is about trust in the reasoning performed by a subordinate agent; the promise of subsequent actions taken on the basis of that reasoning is just a pretext for considering that formal problem. So if A constructs B to produce proofs in L, it doesn't matter what B's beliefs are, or even if B has any beliefs; B could just be manipulating these proofs as formal expressions. If you insist that B's beliefs be incorporated into its reasoning, as you seem to want (more on that below), then I'm saying it doesn't matter what extension of L the agent B is equipped with; it can even be an inconsistent extension, as long as only valid L-proofs are produced.

To me, this "ought" in the sentence reads as a prediction (basically an abuse of 'ought', which is in common usage basically because people make is/ought errors). I would prefer to re-phrase as "if phi is provable in L, then the interpretation of phi will be true" or less ambitiously "will probably be true".

Is your proposal that "X is true" should be taken as a statement of X's desirability, instead? Or perhaps X's normativity? That's what it means to put it on the "ought" side, to me. If it means something different to you, we need to start over with the question of what the is/ought divide refers to.

The word "normative" sticks out to me as potential common ground here, so I'll use that language. The specified semantic map determines what is "actually" true, but its content is not a priori knowledge. As such, the only way for A's reasoning in L to have any practical value is if A works under the (normative) assumption that provability implies/will imply truth under S.

If this sounds farfetched, consider how police dramas employ the normative nature of truth for dramatic effect on a regular basis. A rogue detective uses principles of reasoning that their superiors deem invalid, so that the latter do not expect the detective's deductions to yield results in reality... Or perhaps the practice of constructive mathematics would be a more comfortable example, where the axiom of choice is rejected in favour of weaker deductive principles. A dedicated constructivist could reasonably make the claim that their chosen fragment of logic determines what "ought" to be true.

However, Tarski's analysis also shows how to build a stronger logic L' which knows the semantic map for L. So if "move to the other side" means represent via more formal axioms (which is what I take you to mean), you seem to be wrong.

A compromise which was suggested to me this past week is to work with a two-level logic. It seems to me disingenuous that "A reasons in L" should mean "actually, A reasons in the stronger logic L' which incorporates all the axioms it needs to ensure that provability in L implies truth", as the Tarskian construction suggests. After all, if A attempts to justify that decision to itself, it seems to me to trigger an infinite regress of more and more powerful logics. Identifying and segregating the axioms which refer to the semantic map S as "meta-logical beliefs" is enough to avoid such an existential crisis, just as we only need to employ PA + Sound(PA,N), where N is our favourite model of the natural numbers, to use PA to reason about those natural numbers.
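To be explicit about what I am taking "Sound(L,S)" to abbreviate (my reading; the exact formulation is negotiable): the reflection schema for L relative to the interpretation S, with one axiom per sentence phi of L:

```latex
\[
  \mathrm{Prov}_L(\ulcorner \phi \urcorner) \;\rightarrow\; \phi^{S},
\]
% where $\phi^{S}$ is the translation of $\phi$ along the semantic map S. With
% L = PA and S the standard interpretation in N, this is the usual (local)
% reflection schema for PA.
```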

Justified, perhaps, but also non-existent, right? You say the agent doesn't change its reasoning. So the reasoning is exactly the same as the lobstacle agent from the paper. So it doesn't conclude its own soundness. Right??

You seem exasperated that I'm not incorporating the meta-logical beliefs into the formal system, but this is because in practice meta-logical beliefs are entirely implicit, and do not appear in the reasoning system used by agents. If I build a chess-playing robot, its program will not explicitly include the assumption that the rules of chess it carries are the correct ones, nor will it use such assumptions in determining its moves, because such determinations are made without any reference to a semantic map. This is why I said the agent doesn't change its reasoning. 
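A toy sketch of the kind of thing I have in mind (my own illustration; the third-party python-chess library here is just a stand-in for whatever encoding of the rules the robot carries):

```python
import chess  # python-chess: the rules of chess, encoded as a library


def choose_move(board: chess.Board) -> chess.Move:
    """Pick a legal move. The legality test is simply *applied*; nowhere does
    the program state, let alone prove, that these encoded rules are the true
    rules of chess. That assumption stays entirely implicit."""
    return next(iter(board.legal_moves))


# Example usage: play one move from the starting position.
if __name__ == "__main__":
    print(choose_move(chess.Board()))
```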

The metalogical beliefs thus are only relevant when an agent applies its (completed) reasoning to its environment via the semantic map. We can formalize that situation via the extended logic L' if you like, in which case this is probably the answer that you keep demanding:

Both A and B "reason" in L' (B could even work in a distinct extension of L), but will only accept proofs in the fragment L. Since the axioms in L' extending L are identifiable by their reference to the semantic mapping, there is no risk of "belief contamination" in proofs, if that were ever a concern.

I expect there is more I should say to justify how this avoids the obstacle, but I will wait for your feedback first.

Sure, but logical induction doesn't know anything about the intended semantics. It doesn't make a lick of difference to the algorithm whether you believe that PA refers to the standard model. Nor does it feature in the mathematical results.

Thus, logical induction would appear to be precisely what you call for at the end of your post: a theory of rationality which reasons in a semantics-independent way.

Without an intended semantics, the probabilities assigned to formulas can only be interpreted as beliefs/levels of certainty about the provability of statements from axioms (which the inductor also believes with more or less certainty). This is great, but as soon as you want your logical inductor to reason about a particular mathematical object, the only way to turn those beliefs about provability into beliefs about truth in the model is to extend the inductor (explicitly or implicitly) with meta-logical beliefs about the soundness of the corresponding semantic map, since it will update its beliefs based on provability even if its proof methods aren't sound.

As such, I feel you've misunderstood me here. I don't want semantics-independent reasoning at all, if anything the opposite: reasoning that prioritises verifying soundness of a logic wrt specified semantic maps in an empirical way, and which can adapt its reasoning system when soundness is shown to fail. A logical inductor isn't equipped with the capacity to work in mere fragments of its logic (it could be modified to do so, but that's certainly not the main point of this algorithm, as I understand it), so can only go as far as identifying when its reasoning is not sound by observing that it believes a contradiction, without being able to do anything about it besides modifying the beliefs it holds in its axioms indefinitely.

Comment by Morgan_Rogers on Escaping the Löbian Obstacle · 2021-06-17T05:50:29.334Z · LW · GW

I should pre-emptively correct my "formal" argument, since it's not true that S can never be in its own codomain; arguably I can construct U so that C(U) contains the names of some semantic maps as elements (although in this purely set-theoretic set-up, it's hard to see how doing so would capture their content). Nonetheless, a diagonalisation argument that depends only on L and C(U) being non-trivial demonstrates that C(U) cannot contain every semantic map, which I think should be enough to salvage the argument.

Comment by Morgan_Rogers on Escaping the Löbian Obstacle · 2021-06-16T20:06:47.479Z · LW · GW

It seems like you just get a new system, L', which believes in the soundness of L, but which doesn't believe in its own soundness. So the agent can trust agents who use L, but cannot trust agents who additionally have the same meta-logical beliefs which allow them to trust L. Meaning, the agent cannot trust itself.

A doesn't need B to believe that the logic is sound. Even if you decide to present "logic L plus metalogical beliefs" as a larger logic L' (and assuming you manage to do this in a way that doesn't lead to inconsistency), the semantic map is defined on L, not on L'. In the practical situation under consideration when arriving at the Löbian Obstacle, the consequence is that A needn't worry about forming beliefs about L' for the purposes of trusting B's (actions based on) proofs in L to be sound.

I suppose if the meta-logical belief is "L plus this very meta-logical belief is sound", yeah.

I didn't understand this remark; please could you clarify?

I don't understand why you would put truth on the "ought" side of the is/ought divide, or if you do, how it helps us out here.

To put soundness in is/ought form, the belief that A must hold is that "if phi is provable in L, the interpretation of phi (via the semantic map) ought to be true". Truth can't be moved to the other side, because as I've tried to explain (perhaps unsuccessfully) the logic doesn't include its own semantics, and it's always possible to take contrarian semantics which fail to be sound applications of L. (see also final point below)

I don't get how "interpreting lobs theorem differently" helps us out here, or, what it really means.

The interpretation which results in the Löbian Obstacle is "Löb's theorem tells us that a logical agent can't trust its own reasoning, because it can't prove that reasoning is sound," and under that interpretation it seems that extreme measures must be taken to make formal logic relevant to AI reasoning, which is counterintuitive since we humans employ formalizable reasoning every day without any such caveats. In this post I'm saying "Löb's theorem reminds us that a logical agent cannot make a priori claims about the soundness of its own reasoning, because soundness is a statement about a semantic map, which the logic has no a priori control over".

This isn't concrete enough for me. Suppose an agent has just such a mapping, in its back pocket. Seems like the same exact proofs as in the paper go through... so just having such a mapping doesn't "immediately dissolve" the lobian obstacle. I suppose you mean that the agent changes its reasoning somehow. Yet you explicitly state that we don't just add to the list of axioms. So how does the meta-logical belief interact with everything else? Something about the action condition is different?

No, the agent doesn't change its reasoning. Assume that the agent A is reasoning soundly about their environment, which is to say that their semantic mapping is sound. Then A's belief in the soundness of their reasoning is justified. The change is that we don't require A to prove that their semantic mapping is sound, because A cannot do that, and I'm claiming that this doesn't break anything.

If you want me to make it more formal, here: suppose I have a logic L and a universe U. For simplicity, let's say U is a set. The semantic mapping is a map from the collection of formulas of L to the collection of concepts in U; it may be the case that symbols in L get mapped to collections of objects, or collections of collections, but for argument's sake we can assume the codomain to be some number of applications of the powerset operation to U, forming a collection C(U) of "concepts". So it's a mapping S: L --> C(U). The crucial thing is that S is not a member of C(U), so it can't be the image of any symbol in L under S. That is, S is outside the realm of things described by S, and this is true for any such S! Since "phi is provable in L means phi is true under S" is a statement involving S (even if the 'under S' is usually left implicit), it cannot be the interpretation under S of any formula in L, and so cannot be stated, let alone proved.

What do you mean when you say that a logical inductor has an implicit assumption of particular fixed semantics?

Most of the logics we encounter in mathematical practice are built with an intended semantics. For example, Peano Arithmetic contains a bunch of symbols which are often informally treated as if they are "the" natural numbers, despite the fact that they are no more than formal symbols and that the standard natural numbers are not the only model of the theory. In the context of logic applied to AI, this results in a conflation between the symbols in a logic employed by an agent and the "intended meaning" of those labels. This happens in the logical induction paper when discussing PA: the formulas the agent handles are assumed to carry their intended meaning in arithmetic.

Actually, that's misconstruing the formal results of the paper, since logical inductors have formal systems as their subjects rather than any fixed semantics. However, it's clear from the motivation and commentary (even within the abstract) that the envisaged use-case for inductors is to model an agent forming beliefs about the truth of formal statements, which is to say their validity in some specific model/semantics.