AI Unsafety via Non-Zero-Sum Debate 2020-07-03T22:03:16.264Z
AI Services as a Research Paradigm 2020-04-20T13:00:40.276Z
New paper: (When) is Truth-telling Favored in AI debate? 2019-12-26T19:59:00.946Z
Redefining Fast Takeoff 2019-08-23T02:15:16.369Z
Deconfuse Yourself about Agency 2019-08-23T00:21:24.548Z
AI Safety Debate and Its Applications 2019-07-23T22:31:58.318Z


Comment by vojtakovarik on "Zero Sum" is a misnomer. · 2020-10-12T13:44:05.065Z · LW · GW

As a game theorist, I completely endorse the proposed terminology. Just don't tell other game theorists... Sometimes, things get even worse when some people use the term "general sum games" to refer to games that are not constant-sum.

I like to imagine different games on a scale between completely adversarial and completely cooperative, with things in the middle being called "mixed-motive games".

Comment by vojtakovarik on Integrating Hidden Variables Improves Approximation · 2020-08-09T11:55:49.945Z · LW · GW

I am usually reasonably good at translating from math to non-abstract intuitive examples...but I didn't have much success here. Do you have an "in English, for simpletons" example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like "there are many hidden variables mediating the interactions between observables" :D.)

Otherwise, my current abstract interpretation of this is something like: "There are detailed models, and those might vary a lot. And then there are very abstract models, which will be more similar to each other...well, except that they might also be totally useless." So I was hoping that a more specific example would clarify things for a bit and tell me whether there is more to this (and also whether I got it all wrong or not :-).)

Comment by vojtakovarik on Noise Simplifies · 2020-08-09T11:34:14.710Z · LW · GW

I have a long list of randomly-chosen numbers between 1 and 10, and I want to know whether their sum is even or odd.

I find your example here somewhat misleading. Suppose your random numbers weren't randomly drawn from 1-10, but from a set containing five even numbers for every odd one. If you don't know a single number, you still know that there is a 5:1 chance that it will be even (and hence not change the parity of the sum of the whole list). So if a single number is unknown, you will still want to take the sum of the ones you do know. In this light, your example seems like an exception, rather than the norm. (My main issue with it is that since it feels very ad-hoc, you might subconsciously come to the impression that the described behaviour is the norm.)

However, it might easily be that the class of these "exceptions" is important on its own. So I wouldn't want to shoot down the overall idea described in the post - I like it :-).
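A minimal sketch of the parity point, assuming uniform draws. The helper `prob_sum_even` and the six-element set `mostly_even` are hypothetical, chosen only so that an unknown draw is even with odds 5:1; neither comes from the post.

```python
from fractions import Fraction

def prob_sum_even(known, unknown_count, support):
    """Probability that the full sum is even, given the known numbers and
    `unknown_count` further numbers drawn uniformly from `support`."""
    p_odd_draw = Fraction(sum(1 for x in support if x % 2), len(support))
    # Start from certainty about the parity of the known part.
    p_even = Fraction(1) if sum(known) % 2 == 0 else Fraction(0)
    for _ in range(unknown_count):
        # An odd draw flips the parity, an even draw keeps it.
        p_even = p_even * (1 - p_odd_draw) + (1 - p_even) * p_odd_draw
    return p_even

uniform_1_to_10 = list(range(1, 11))   # the post's setting: 5 even, 5 odd
mostly_even = [1, 2, 4, 6, 8, 10]      # hypothetical set: 5 even, 1 odd

known = [3, 8, 5, 10]                  # known part sums to 26, which is even
print(prob_sum_even(known, 1, uniform_1_to_10))  # 1/2
print(prob_sum_even(known, 1, mostly_even))      # 5/6
```

With draws from 1-10, one unknown number makes the known sum uninformative about parity; with the skewed set, the known sum still predicts the parity of the total most of the time.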

Comment by vojtakovarik on How should AI debate be judged? · 2020-07-20T09:42:19.604Z · LW · GW

Even if you keep the argumentation phase asymmetric, you might want to make the answering phase simultaneous or at least allow the second AI to give the same answer as the first AI (which can mean a draw by default).

This doesn't make for a very good training signal, but might have better equilibria.

Comment by vojtakovarik on AI Unsafety via Non-Zero-Sum Debate · 2020-07-19T14:36:39.543Z · LW · GW

I haven't yet thought about this in much detail, but here is what I have:

I will assume you can avoid getting "hacked" while overseeing the debate. If you don't assume that, then it might be important whether you can differentiate between arguments that are vs aren't relevant to the question at hand. (I suppose that it is much harder to get hacked when strictly sticking to a specific subject-matter topic. And harder yet if you are, e.g., restricted to answering in math proofs, which might be sufficient for some types of questions.)

As for the features of safe questions, I think that one axis is the potential impact of the answer and an orthogonal one is the likelihood that the answer will be undesirable/misaligned/bad. My guess is that if you can avoid getting hacked, then the lower-impact-of-downstream-consequences questions are inherently safer (for the trivial reason of being less impactful). But this feels like a cheating answer, and the second axis seems more interesting.

My intuition about the "how likely are we to get an aligned answer" axis is this: There are questions where I am fairly confident in our judging skills (for example, math proofs). Many of those could fall into the "definitely safe" category. Then there is the other extreme of questions where our judgement might be very fallible - things that are too vague or that play into our biases. (For example hard philosophical questions and problems whose solutions depend on answers to such questions. E.g., I wouldn't trust myself to be a good judge of "how should we decide on the future of the universe" or "what is the best place for me to go for a vacation".) I imagine these are "very likely unsafe". And as a general principle, where there are two extremes, there often will be a continuum in between. Maybe "what is a reasonable way of curing cancer?" could fall here? (Being probably safe, but I wouldn't bet all my money on it.)

Comment by vojtakovarik on AI Unsafety via Non-Zero-Sum Debate · 2020-07-06T11:41:31.949Z · LW · GW

I agree with what Paul and Donald are saying, but the post was trying to make a different point.

Among various things needed to "make debate work", I see three separate sub-problems:

(A) Ensuring that "agents use words to get a human to select them as the winner; and that this is their only terminal goal" is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human's head to explode and their body falls on the reward button, this doesn't count.)

(B) Having already accomplished (A), ensure that "agents use words to convince the human that their answer is better" is a good abstraction. (Not sure how to operationalize this, but you want to, for example, ensure that: (i) Agents do not collaboratively convince the human to give reward to both of them. (ii) If the human could in principle be brainwashed, the other agent will be able and incentivized to prevent this. In particular, no brainwashing in a single claim.)

(C) Having already accomplished (A) and (B), ensure that AIs in debate only convince us of safe and useful things.

While somewhat related, I think these three problems should be tackled separately as much as possible. Indeed, (A) seems to not really be specific to debate, because a similar problem can be posed for any advanced AI. Moreover, I think that if you are relying on the presence of the other agent to help you with (A) (e.g., one AI producing signals to block the other AI's signals), you have already made a mistake. On the other hand, it seems fine to rely on the presence of the other agent for both (B) and (C). However, my intuition is that these problems are mostly orthogonal - most solutions to (B) will be compatible with most solutions to (C).

For (A), Michael Cohen's Boxed Myopic AI seems like a particularly relevant thing. (Not saying that what it proposes is enough, nor that it is required in all scenarios.) Michael's recent "AI Debate" Debate post seems to be primarily concerned about (B). Finally, this post could be rephrased as "When people talk about debate, they often focus on (C). And that seems fair. However, if you make debate non-zero-sum, your (B) will break.".

Comment by vojtakovarik on AI Unsafety via Non-Zero-Sum Debate · 2020-07-06T10:24:09.227Z · LW · GW

if you have 2 AI's that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want

That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you - hating me - believe it is poison, we will happily coordinate towards me eating it. I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don't see how this would happen, but that isn't a particularly strong argument.)
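A minimal sketch of the chocolate/poison case, with made-up numbers. The utilities, the beliefs, and the helper `expected_utility` are illustrative assumptions only; the point is that exactly opposite utilities plus different beliefs can make both agents prefer the same action.

```python
def expected_utility(p_poison, u_chocolate, u_poison):
    """Expected utility of the action 'the eater eats it' under belief p_poison."""
    return (1 - p_poison) * u_chocolate + p_poison * u_poison

# Utilities for the eater; the hater's utilities are the exact negation (zero-sum).
U_CHOCOLATE, U_POISON = 1.0, -10.0   # illustrative numbers

eater_belief = 0.01   # the eater is almost sure it is chocolate
hater_belief = 0.99   # the hater is almost sure it is poison

eater_value = expected_utility(eater_belief, U_CHOCOLATE, U_POISON)
hater_value = expected_utility(hater_belief, -U_CHOCOLATE, -U_POISON)

# "Don't eat" is worth 0 to both, so a positive value means "prefers eating".
print(eater_value > 0, hater_value > 0)  # True True
```

If both agents share the same belief, the two values are exact negatives of each other and this kind of agreement disappears, which is the role played by the "copies of each other" assumption.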

Also, the debaters better be comparably smart.

Yes, this seems like a necessary assumption in a symmetric debate. Once again, this is trivially satisfied if the debaters are copies of each other. It is interesting to note that this assumption might not be sufficient because even if the debate has symmetric rules, the structure of claims might not be. (That is, there is the thing with false claims that are easier to argue for than against, or potentially with attempted human-hacks that are easier to pull off than prevent.)

Comment by vojtakovarik on AI Unsafety via Non-Zero-Sum Debate · 2020-07-04T19:45:11.810Z · LW · GW

I think I understood the first three paragraphs. The AI "ramming a button to the human" clearly is a problem and an important one at that. However, I would say it is one that you already need to address in any single-agent scenario --- by preventing the AI from doing this (boxing), ensuring it doesn't want to do it (???), or by using AI that is incapable of doing it (weak ML system). As a result, I view this issue (even in this two-agent case) as orthogonal to debate. In the post, this is one of the things that hides under the phrase "assume, for the sake of argument, that you have solved all the 'obvious' problems".

Or did you have something else in mind by the first three paragraphs?

I didn't understand the last paragraph. Or rather, I didn't understand how it relates to debate, what setting the AIs appear in, and why they would want to behave as you describe.

Comment by vojtakovarik on How can Interpretability help Alignment? · 2020-05-25T08:33:09.160Z · LW · GW

An important consideration is whether the interpretability research which seems useful for alignment is research which we expect the mainstream ML research community to work on and solve suitably. Do you see a way of incentivizing the RL community to change this? (If possible, that would seem like a more effective approach than doing it "ourselves".)

There’s little research which focuses on interpreting reinforcement learning agents [...]. There is some work in DeepMind's safety team on this, isn't there? (Not to dispute the overall point though, "a part of DeepMind's safety team" is rather small compared to the RL community :-).)

Nitpicks and things I didn't get:

  • It was a bit hard to understand what you mean by the "research questions vs tasks" distinction. (And then I read the bullet point below it and came, perhaps falsely, to the conclusion that you are only after "reusable piece of wisdom" vs "one-time thing" distinction.)
  • There is something funny going on in this sentence:

If we believe a particular proposal is more or less likely than others to produce aligned AI, then we would preferentially work on interpretability research which we believe will help this proposal other work which wouldn't, as it wouldn't be as useful.

Comment by vojtakovarik on Book report: Theory of Games and Economic Behavior (von Neumann & Morgenstern) · 2020-05-24T12:27:04.956Z · LW · GW

Related to that: An interesting take (not only) on cooperative game theory is Schelling's The Strategy of Conflict (from 1960, resp. second edition from 1980, but I am not aware of sufficient follow-up research on the ideas presented there). And there might be some useful references in CLR's sequence on Cooperation, Conflict, and Transformative AI.

Comment by vojtakovarik on Book report: Theory of Games and Economic Behavior (von Neumann & Morgenstern) · 2020-05-21T23:11:37.869Z · LW · GW

When I read your summary (and follow-up post), I get the impression that you are suggesting it might be reasonable to study the book, follow up on its ideas, or spend time looking to improve upon its shortcomings. It seems to me that paying attention to a 75-year-old textbook (and drawing the attention of others to it) only makes sense for its historical value or if you manage to tease out some timeless lessons. Hm, or maybe to let people know that "no, it doesn't make sense for you to read this". But my impression is that neither of these was the goal? If not, what was the aim of the post then?

Why not just read something up-to-date instead?

Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations (by Yoav Shoham and Kevin Leyton-Brown, 2010) - which I read and can recommend (freely available online)

Game Theory (Maschler, Zamir, Solan, 2013) - which I didn't read, but it should be good and has a 2020 version

Comment by vojtakovarik on What makes counterfactuals comparable? · 2020-05-03T15:10:42.574Z · LW · GW

My impression is that logical counterfactuals, counterfactuals, and comparability are - at the moment - too confused, and most disagreements here are "merely verbal" ones. Most of your questions (seem to me to) point in the direction of different people using different definitions. I feel slightly worried about going too deep into discussions along the lines of "Vojta reacts to Chris' claims about what other LW people argue against hypothetical 1-boxing CDT researchers from classical academia that they haven't met" :D.

My take on how to do counterfactuals correctly is that this is not a property of the world, but of your mental models:

Definition (comparability according to Vojta): Two scenarios are comparable (given model $M$ and observation sequence $o$) if they are both possible in $M$ and consistent with $o$.

According to this view, counterfactuals only make sense if your model contains uncertainty...

(Aside on logical counterfactuals: Note that there is a difference between the model that I use and the hypothetical models I would be able to infer were I to use all my knowledge. Indeed, I can happily reason about the 6th digit of some constant being 7, since I don't know what it is, despite knowing the formula for calculating that constant. I would only get into trouble if I were to do the calculations (and process their implications for the real world). Updating your models with new logical information seems like an important problem, but one I think is independent from counterfactual reasoning.)

...however, there remains the fact that humans do counterfactual reasoning all the time, even about impossible things ("What if I decided to not write this comment?", "What if the Sun revolved around the Earth?"). I think this is consistent with the above definition, for three reasons. First, the models that humans use are complicated, fragmented, incomplete, and wrong. So much so that positing logical impossibilities (the Sun going around the Earth thing) doesn't make the model inconsistent (because it is so fragmented and incomplete). Second, when doing counterfactuals, we might take it for granted that you are to replace the actual observation history $o$ by some alternative $o'$. So you then apply the above definition to $M$ and $o'$ (e.g., me not starting to write this comment). When $o'$ is compatible with the model we use, everything is logically consistent (in $M$). For example, it might actually be impossible for me to not have started writing this comment, but it was perfectly consistent with my (wrong) model. Finally, when some counterfactual would be inconsistent with our model, we might take it for granted that we are supposed to relax $M$ in some manner. Moreover, people might often implicitly assume the same or a similar relaxation. For example, suppose I know that the month of May has 31 days. The natural relaxation is to be uncertain about month lengths while still remembering it was something between 28 and 31. I might thus say that 30 was a perfectly reasonable length, while being indignant upon being asked to consider a May that is 370 days long.
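A minimal sketch of the comparability definition, using the May example. Representing a model as a plain set of scenarios it considers possible, and an observation history as a predicate on scenarios, are simplifying assumptions made for illustration; the names `comparable`, `relaxed_model`, and `saw_may_calendar` are hypothetical.

```python
from typing import Callable, Hashable, Set

Scenario = Hashable
Observation = Callable[[Scenario], bool]   # True if the scenario fits what was observed

def comparable(s1, s2, model: Set[Scenario], consistent: Observation) -> bool:
    """Both scenarios are possible in the model and consistent with the observations."""
    return all(s in model and consistent(s) for s in (s1, s2))

# Relaxed model of "how many days does May have?": uncertain, but within 28..31.
relaxed_model = {("May", days) for days in range(28, 32)}
saw_may_calendar = lambda s: s[0] == "May"   # observation: the month in question is May

print(comparable(("May", 31), ("May", 30), relaxed_model, saw_may_calendar))   # True
print(comparable(("May", 31), ("May", 370), relaxed_model, saw_may_calendar))  # False
```

Under the unrelaxed model (only 31 days possible), the 30-day scenario would not be comparable either; the relaxation step is what makes the counterfactual usable.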

As for the implications for your question: The phrasing of 1) seems to suggest a model that has uncertainty about your decision procedure. Thus picking both 10 and 5 seems possible (and consistent with observation history of seeing the two boxes), and thus comparable. Note that this would seem fishier if you additionally posited that you are a utility maximizer (but, I argue, most people would implicitly relax this assumption if you asked them to consider the 5 counterfactual). Regarding 2) I think that "a typical AF reader" uses a model in which "a typical CDT adherent" can deliberate, come to the one-boxing conclusion, and find 1M in the box, making the options comparable for "typical AF readers". I think that "a typical CDT adherent" uses a model in which "CDT adherents" find the box empty while one-boxers find it full, thus making the options incomparable. The third question I didn't understand.

Disclaimer: I haven't been keeping up to date on discussions regarding these matters, so it might be that what I write has some obvious and known holes in it...

Comment by vojtakovarik on What makes counterfactuals comparable? · 2020-05-03T14:14:53.316Z · LW · GW

An evidential decision theorist would smoke in the Smoking Lesion problem so they don't get cancer.

Is this possibly a typo, and should it instead say that EDT would not smoke? (I never seem to remember the details of Smoking Lesion, but this seems inconsistent with the "so they don't get cancer".)

Comment by vojtakovarik on AI Services as a Research Paradigm · 2020-04-24T10:27:34.184Z · LW · GW

It seems to me like you might each be imagining a slightly different situation.

Not quite certain what the difference is. But it seems like Michael is talking about setting up well the parts of the system that are mostly/only AI. In my opinion, this requires AI researchers, in collaboration with experts from whatever-area-is-getting-automated. (So while it might not fall only under the umbrella of AI research, it critically requires it.) Whereas - it seems to me that - Rohin is talking more about ensuring that the (mostly) human parts of society do their job in the presence of automatization. For example, how to deal with unemployment when parts of the industry get automated. (And I agree that I wouldn't go looking for AI researchers when tackling this.)

Comment by vojtakovarik on AI Services as a Research Paradigm · 2020-04-21T11:26:40.620Z · LW · GW

List of Research Questions

Looking at these, I feel like they are subquestions of "how do you design a good society that can handle technological development" -- most of it is not AI-specific or CAIS-specific.

It is intentional that not all the problems are technical problems - for example, I expect that not tackling unemployment due to AI might indirectly make you a lot less safe (it seems prudent to not be in a civil war or war when you are attempting to finish building AGI). However, you are right that the list might nevertheless be too broad (and too loosely tied to AI).

Anyway: As a smaller point, I feel that most of the listed problems will get magnified as you introduce more AI services, or they might gain important twists. As a larger point: Am I correct to understand you as implying that "technical AI alignment researchers should primarily focus on other problems" (modulo qualifications)? My intuition is that this doesn't follow, or at least that we might disagree on the degree to which this needs to be qualified to be true. However, I have not yet thought about this enough to be able to elaborate more right now :(. A bookmark that seems relevant is the following prompt:

Conditional on your AI system never turning into an agent-like AGI, how is "not dying and not losing some % of your potential utility because of AI" different from "how do you design a good society that can handle the process of more and more things getting automated"?

(This should go with many disclaimers, first among those the fact that this is a prompt, not an implicit statement that I fully endorse.)

Comment by vojtakovarik on AI Services as a Research Paradigm · 2020-04-21T02:24:09.472Z · LW · GW

Fixed the wrong section numbers and frame problem description.

Informally, we can assume that some description of the world is given by context and view a task as something specified by an initial state and an end state (or states) - accomplishing the task amounts to causing a transformation from the starting state to one of the desired end states.

I feel like this definition is not capturing what I mean by a "task". Many "agent-like" things, such as "become supreme ruler of the world", seem like tasks according to this definition; many useless things like "twitching randomly" can be thought of as completing a "task" as defined here and so would be counted as "services".

Could it be that the problem is not in the "task" part but in the definition of a service? If I consider the task of building me a house that I will like, I can envision a very service-like way of doing that (ask me a bunch of routine questions, select house-model correspondingly, then proceed to build it in a cook-book manner by calling on other services). But I can also imagine going about this in a very agent-like manner.

(Also, "twitching randomly" seems like a perfectly valid task, and a twitch-bot as a perfectly valid service. Just a very stupid one that nobody would want to build or pay for. Uhm, probably. Hopefully.)

Comment by vojtakovarik on AI Services as a Research Paradigm · 2020-04-21T01:52:11.754Z · LW · GW

I agree with your points in the suggested summary. However, I feel like they are not fully representative of the text. But, as the author, I might be imagining the version of the document in my head rather than the one I actually wrote :-).

  • My estimate is that after reading it, I would gain the impression that the text revolves around the abstract model. Which I thought wasn't the case; definitely wasn't the intention.
  • Also, I am not sure if it is intended that your summary doesn't mention the examples and the "classifying research questions" subsection (which seems equally important to me as the list it generates).
  • Finally, from your planned opinion, I might get the impression that the text suggests no technical problems at all. I think that some of them either are technical problems (e.g., undesired appearance of agency, preventing error propagation and correlated failures, "Tools vs Agents" in Section 6) or have important technical components (all the problems listed as related to changes in environment, system, or users). Although whether these are AI specific is arguable.

Side-note 1: I also think that most of the classical AI safety problems also appear in systems of AI services (either in individual services, or in "system-wide variants"). But this is only mentioned in the text briefly, since I am not yet fully clear on how to do the translation between agent-like AIs and systems of AI services. (Also, on the extent to which such translation even makes sense.)

Side-note 2: I imagine that many "non-AI problems" might become "somewhat-AI problems" or even "problems that AI researchers need to deal with" once we get enough progress in AI to automate the corresponding domains.

Comment by vojtakovarik on Embedded Agency via Abstraction · 2020-01-28T11:34:11.951Z · LW · GW

A side-note:

Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

Can't remember the specific reference but: Imperfect-information game theory has some research on abstractions. Naturally, the objects of interest are "optimal" abstractions --- i.e., ones that are as small as possible for given accuracy, or as accurate as possible for given size. However, there are typically some negative results, stating that getting (near-) optimal abstractions is at least as expensive as finding the (near-) optimal solution of the full game. Intuitively, I would expect this to be a recurring theme for abstractions in general.

The implication of this is that all the goals should implicitly have the caveat that the maps have to be "not-too-expensive to construct". (This is intended to be a side-note, not an advocacy to change the formulation. The one you have there is accessible and memorable :-).)

Comment by vojtakovarik on New paper: (When) is Truth-telling Favored in AI debate? · 2020-01-26T19:18:30.524Z · LW · GW

Thank you for the comments!

A quick reaction to the truth-seeking definition: When writing the definition (of truth-promotion), I imagined a (straw) scenario where I am initially uncertain about what the best answer is --- perhaps I have some belief, but upon reflection, I put little credence in it. In particular, I wouldn't be willing to act on it. Then I run the debate, become fully convinced that the debate's outcome is the correct answer, and act on it.

The other story seems also valid: you start out with some belief, update it based on the debate, and you want to know how much the debate helped. Which of the two options is better will, I guess, depend on the application in mind.

"I'd be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space,"

To dissolve a possible confusion: By "claims about a space of questions" you mean "a claim about every question from a space of questions"? Would this mean that the agents would commit to many claims at once (possibly more than the human judge can understand at once)? (Something I recall Beth Barnes suggesting.) Or do you mean that they would make a single "meta" claim, understandable by the judge, that specified many smaller claims (eg, "for any meal you ask me to cook, I will be able to cook it better than any of my friends"; horribly false, btw.)?

Anyway, yeah, I agree that this seems promising. I still don't know how to capture the relations between different claims (which I somehow expect to be important if we are to prove some guarantees for debate).

I agree with your high-level points regarding the feature debate formalization. I should clarify one thing that might not be apparent from the paper: the message of the counterexamples was meant to be "these are some general issues which we expect to see in debate, and here is how they can manifest in the feature debate toy model", rather than "these specific examples will be a problem in general debates". In particular, I totally agree that the specific examples immediately go away if you allow the agents to challenge each other's claims. However, I have an intuition that even with other debate protocols, similar general issues might arise with different specific examples.

For example, I guess that even with other debate protocols, you will be "having a hard time when your side requires too difficult arguments". I imagine there will always be some maximum "inferential distance that a debater can bridge" (with the given judge and debate protocol). And any claim which requires more supporting arguments than this will be a lost cause. What will such an example look like? Without a specific debate design, I can't really say. Either way, if true, it becomes important whether you will be able to convincingly argue that a question is too difficult to explain (without making this a universal strategy even in cases where it shouldn't apply).

A minor point:

"If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised."

I agree with your point here --- debate being wrong in a very unlikely world is not a bug. However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior. So the claim should be "rational judges can have unstable debates in unlikely worlds" and "biased judges can have unstable debates even in typical worlds".

Comment by vojtakovarik on AI Safety Debate and Its Applications · 2020-01-23T14:59:01.247Z · LW · GW


(Just noticed your comment for the other debate post/paper. I will reply to it during the weekend.)

Comment by vojtakovarik on New paper: (When) is Truth-telling Favored in AI debate? · 2019-12-29T23:00:30.162Z · LW · GW

I guess on first reading, you can cheat by reading the introduction, Section 2 right after that, and the conclusion. One level above that is reading the text but skipping the more technical sections (4 and 5). Or possibly reading 4 and 5 as well, but only focusing on the informal meaning of the formal results.

Regarding the background knowledge required for the paper: It uses some game theory (Nash equilibria, extensive form games) and probability theory (expectations, probability measures, conditional probability). Strictly speaking, you can get all of this from looking up the relevant keywords on Wikipedia. I think that all of the concepts used there are basic in the corresponding fields, and in particular no special knowledge of measure theory is required. However, I studied both game theory and measure theory, so I am biased, and you shouldn't trust me. (Moreover, there is a difference between "strictly speaking, only this is needed" and "my intuitions are informed by X, Y, and Z".)

Another thing is that the AAAI workshop where this will appear has a page limit, which means that some explanations might have gotten less space than they would deserve. In particular, the arguments in Section 4 are much easier to digest if you can draw the functions that the text talks about. To understand the formal results, I think I visualized two-dimensional slices of the "world space" (i.e., squares), and assumed that the value of the function is 0 by default, except for being 1 at some selected subset of the square. This allows you to compute all the expectations and conditionals visually.
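The visualization trick can be mimicked numerically. The following is a rough sketch, assuming a uniform distribution over a two-dimensional slice of the world space (the unit square); that prior, the grid resolution, and all function names are assumptions made for illustration and are not taken from the paper.

```python
def grid(n):
    """Midpoints of an n-by-n grid over the unit square [0,1] x [0,1]."""
    pts = [(i + 0.5) / n for i in range(n)]
    return [(x, y) for x in pts for y in pts]

def expectation(f, points):
    """Expectation of f under the uniform distribution on the given points."""
    return sum(f(x, y) for x, y in points) / len(points)

def conditional_expectation(f, event, points):
    """E[f | event], computed by restricting to the points where the event holds."""
    inside = [(x, y) for x, y in points if event(x, y)]
    return sum(f(x, y) for x, y in inside) / len(inside)

# A 0/1-valued function: 1 on the lower-left quarter of the square, 0 elsewhere.
f = lambda x, y: 1.0 if (x < 0.5 and y < 0.5) else 0.0

points = grid(100)
print(expectation(f, points))                                    # 0.25, the area of the region
print(conditional_expectation(f, lambda x, y: x < 0.5, points))  # 0.5
```

For such indicator functions the expectation is just the area of the selected region, and conditioning on revealing one coordinate amounts to looking at a strip of the square, which is why the computations can be done by eye.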

Comment by vojtakovarik on Deconfuse Yourself about Agency · 2019-10-09T21:26:21.155Z · LW · GW

First off, while I feel somewhat de-confused about X-like behavior, I don't feel very confident about X-like architectures. Maybe the meaning is somewhat clear on higher levels of abstraction (e.g., if my brain goes "realize I want to describe a concept --> visualize several explanations and judge each for suitability --> pick the one that seems the best --> send a signal to start typing it down", then this would be a kind of search/optimization-thingy). But on the level of physics, I don't really know what an architecture means. So take this with a grain of salt.

Maybe the term "physical structure" is misleading. The thing I was trying to point at is the distinction between being able to accurately model Y using model X, and Y actually being X. In the sense that there might be a giant look-up table (GLUT) that accurately predicts your behavior, but on no level of abstraction is it correct to say that you actually are a GLUT. Whereas modelling you as having some goals, planning, etc. might be less accurate but somewhat more, hm, true. I realize this isn't very precise, but I guess you can see what I mean.

That being said, I suppose that what I meant by "optimization architecture" is, for example, a stochastic gradient descent with the emphasis on "this is the input", "this is the part of the algorithm that does the calculation", and "this is the output". An "implementation of an optimization architecture" would be...well, the atoms of your computer that perform SGD, or maybe some simple bacterium that moves in the direction where the concentration of whatever-it-likes is the highest (not that anything I know would implement precisely SGD, but still).

Ad "interesting physical structure" behind the ant-colony: If by "evolution" we mean the atoms that the world is made of, as they changed over time until your ant colony emerged...then yeah, this is a physical structure causally upstream of the ant colony, and one that is responsible for the ant colony behaving the way it does. I wouldn't say it is interesting (to me, and w.r.t. the ant colony) though, since it is totally incomprehensible to me. (But maybe "interestingness" doesn't really make sense on the level of physics, and is only relevant in relation to our abstract world-models and their understanding.)

Finally, the ideal thing a "X-like behavior ==> Y-like architecture" theorem would cash out into is a criterion that you can actually check, and then say with certainty that the thing will not exhibit X-like behavior. (Whether this is reasonable to hope for is another matter.) So, even if all that I have written in this comment turns out to be nonsense, getting such a criterion is what we are after :-).

Comment by vojtakovarik on Deconfuse Yourself about Agency · 2019-09-04T10:42:32.450Z · LW · GW

I agree with your summary :). The claim was that humans often predict behavior by assuming that something has a particular architecture.

(And some confusions about agency seem to appear precisely because of not making the architecture/behavior distinction.)

Comment by vojtakovarik on Problems with AI debate · 2019-08-30T23:57:20.273Z · LW · GW

Intuitively, I agree that the vacation question is under-defined / has too many "right" answers. On the other hand, I can also imagine the world where you can develop some objective fun theory, or just something which actually makes the questions well-posed. And the AIs could use this fact in the debate:

Bob: "Actually, you can derive a well-defined fun theory and use it to answer this question. And then Bali clearly wins."

Alice: "There could never be any such thing!"

Bob: "Actually, there indeed is such a theory, and its central idea is [...]."

[They go on like this for a bit, and eventually, Bob wins.]

Indeed, this seems like a thing you could do (by explaining that integration is a thing) if somebody tried to convince you that there is no principled way to measure the area of a circle.

However -- if true -- this only shows that there are fewer under-defined questions than we think. The "Ministry of Ambiguity versus the Department of Clarity" fight is still very much a thing, as are the incentives to manipulate the human. And perhaps most importantly, routinely holding debates where the AI "explains to you how to think about something" seems extremely dangerous...

Comment by vojtakovarik on Deconfuse Yourself about Agency · 2019-08-30T08:29:11.464Z · LW · GW

I have a sense that (formalized) versions of A(Θ)-morphism are going to be more useful (or easier?) for the behavioral side, though it isn't really clear.

I think A(Θ)-morphisation is primarily useful for describing what we often mean when we say "agency". In particular, I view this as distinct from which concepts we should be thinking about in this space. (I think the promising candidates include learning that Vanessa points to in her comment, optimization, search, and the concepts in the second part of my post.)

However, I think it might also serve as a useful part of the language for describing (non) agent-like behavior. For example, we might want to SGD-morphise an E. coli bacterium independently of whether it actually implements some form of stochastic gradient descent w.r.t. the concentration of some chemicals in the environment.

You mention the distinction between agent-like architecture and agent-like behavior (which I find similar to my distinction between selection and control), but how does the concept of A(Θ)-morphism account for this distinction?

I think of agent-like architectures as something objective, or related to the territory. In contrast, agent-like behavior is something subjective, something in the map. Importantly, agent-like behavior, or the lack of it, of some X is something that exists in the map of some entity Y (where often Y ≠ X).

The selection/control distinction seems related, but not quite similar to me. Am I missing something there?

Comment by vojtakovarik on Deconfuse Yourself about Agency · 2019-08-29T18:45:23.993Z · LW · GW

I am not even sure what the input/output channels of a rock are supposed to be

I guess you imagine that the input is the physical forces affecting the ball and the output is the forces the ball exerts on the environment. Obviously, this is very much not useful for anything. But it suddenly becomes non-trivial if you consider something like the billiard-ball computer (seems like a theoretical construct, not sure if anybody actually built one...but it seems like a relevant example anyway).

Comment by vojtakovarik on Deconfuse Yourself about Agency · 2019-08-29T18:37:20.453Z · LW · GW

Yep, that totally makes sense.

Observations inspired by your comment: While this shouldn't necessarily be so, it seems the particular formulations make a lot of difference when it comes to exchanging ideas. If I read your comment without the

(although maybe "intelligence" would be a better word?)

bracket, I immediately go "aaa, this is so wrong!". And if I substitute "intelligent" for "agent", I totally agree with it. Not sure whether this is just me, or whether it generalizes to other people.

More specifically, I agree that among the different concepts in the vicinity of "agency", "the ability to learn the environment and exploit this knowledge towards a certain goal" seems to be particularly important to AI alignment. I think the word "agency" is perhaps not well suited for this particular concept, since it comes with so many other connotations. But "intelligence" seems quite right.

Comment by vojtakovarik on Towards an Intentional Research Agenda · 2019-08-23T18:05:45.761Z · LW · GW

(I don't have much experience thinking in these terms, so maybe the question is dumb/already answered in the post. But anyway: )

Do you have some more-detailed (and stupidly explicit) examples of the intentional and algorithmic views on the same thing, and how to translate between them?

Comment by vojtakovarik on Vaniver's View on Factored Cognition · 2019-08-23T17:39:04.136Z · LW · GW

That is, I can easily see how factored cognition allows you to stick to cognitive strategies that definitely solve a problem in a safe way, but don't see how it does that and allows you to develop new cognitive strategies to solve a problem that doesn’t result in an opening for inner optimizers--not within units, but within assemblages of units.

Do you have some intuition for how inner optimizers would arise within assemblages of units, without being initiated by some unit higher in the hierarchy? Or is that what you are pointing at?

Comment by vojtakovarik on AI Safety Debate and Its Applications · 2019-07-31T11:54:17.526Z · LW · GW

I agree with Lanrian. A perhaps better metric is the chance that randomly selected pixels of a randomly selected image will cause the judge to guess the label correctly. This corresponds to "judge accuracy (random pixels)" in Table 2 of the original paper, and it's 48.2%/59.4% for 4/6 pixels.