Posts

What is the purpose and application of AI Debate? 2024-04-04T00:38:24.932Z
Extinction Risks from AI: Invisible to Science? 2024-02-21T18:07:33.986Z
Extinction-level Goodhart's Law as a Property of the Environment 2024-02-21T17:56:02.052Z
Dynamics Crucial to AI Risk Seem to Make for Complicated Models 2024-02-21T17:54:46.089Z
Which Model Properties are Necessary for Evaluating an Argument? 2024-02-21T17:52:58.083Z
Weak vs Quantitative Extinction-level Goodhart's Law 2024-02-21T17:38:15.375Z
VojtaKovarik's Shortform 2024-02-04T20:57:44.150Z
My Alignment "Plan": Avoid Strong Optimisation and Align Economy 2024-01-31T17:03:34.778Z
Control vs Selection: Civilisation is best at control, but navigating AGI requires selection 2024-01-30T19:06:29.913Z
AI Awareness through Interaction with Blatantly Alien Models 2023-07-28T08:41:07.776Z
Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics 2023-07-21T21:03:21.501Z
Recursive Middle Manager Hell: AI Edition 2023-05-04T20:08:17.583Z
OpenAI could help X-risk by wagering itself 2023-04-20T14:51:00.338Z
Legitimising AI Red-Teaming by Public 2023-04-19T14:05:06.372Z
How do you align your emotions through updates and existential uncertainty? 2023-04-17T20:46:29.510Z
Formalizing Objections against Surrogate Goals 2021-09-02T16:24:39.818Z
Risk Map of AI Systems 2020-12-15T09:16:46.852Z
Values Form a Shifting Landscape (and why you might care) 2020-12-05T23:56:57.516Z
AI Problems Shared by Non-AI Systems 2020-12-05T22:15:27.928Z
AI Unsafety via Non-Zero-Sum Debate 2020-07-03T22:03:16.264Z
AI Services as a Research Paradigm 2020-04-20T13:00:40.276Z
New paper: (When) is Truth-telling Favored in AI debate? 2019-12-26T19:59:00.946Z
Redefining Fast Takeoff 2019-08-23T02:15:16.369Z
Deconfuse Yourself about Agency 2019-08-23T00:21:24.548Z
AI Safety Debate and Its Applications 2019-07-23T22:31:58.318Z

Comments

Comment by VojtaKovarik on When is Goodhart catastrophic? · 2024-04-15T18:37:17.853Z · LW · GW

Assumption 2 is, barring rather exotic regimes far into the future, basically always correct, and for irreversible computation, this always happens, since there's a minimum cost to increase the features IRL, and it isn't 0.

Increasing utility IRL is not free.

I think this is a misunderstanding of what I meant. (And it probably only makes sense to try clarifying the misunderstanding if you have read the paper and disagree with my interpretation of it, rather than if your reaction is based only on my summary. I am not sure which of the two is the case.)

What I was trying to say is that the most natural interpretation of the paper's model does not allow for things like: In state 1, the world is exactly as it is now, except that you decided to sleep on the floor every day instead of in your bed (for no particular reason), and you are tired and miserable all day. State 2 is exactly the same as state 1, except you decided that it would be smarter to sleep in your bed. And now, state 2 is just strictly better than state 1 (at least in all respects that you would care to name).
Essentially, the paper's model requires, by assumption, that it is impossible to get any efficiency gains (like "don't sleep on the floor" or "use this more efficient design instead") or mutually-beneficial deals (like helping two sides negotiate and avoid a war).

Yes, I agree that you can interpret the model in ways that avoid this. EG, maybe by sleeping on the floor, your bed will last longer. And sure, any action at all requires computation. I am just saying that these are perhaps not the interpretations that people initially imagine when reading the paper. So unless you are using an interpretation like that, it is important to notice those strong assumptions.

Comment by VojtaKovarik on What is the purpose and application of AI Debate? · 2024-04-11T00:12:00.564Z · LW · GW

I do agree that debate could be used in all of these ways. But at the same time, I think generality often leads to ambiguity and to papers not describing any such application in detail. And that in turn makes it difficult to critique debate-based approaches. (Both because it is unclear what one is critiquing and because it makes it too easy to accidentally dismiss the critiques using the motte-and-bailey fallacy.)

Comment by VojtaKovarik on What is the purpose and application of AI Debate? · 2024-04-04T18:55:15.745Z · LW · GW

I was previously unaware of Section 4.2 of the Scalable AI Safety via Doubly-Efficient Debate paper and, hurray, it does give an answer to (2). (Thanks for mentioning it, @niplav!) That still leaves (1) unanswered, or at least not answered clearly enough, imo. I am also curious to what extent other people who find debate promising consider this paper's answer to (2) as the answer to (2).

For what it's worth, none of the other results that I know about were helpful for me for understanding (1) and (2). (The things I know about are the original AI Safety via Debate paper, follow-up reports by OpenAI, the single- and two-step debate papers, the Anthropic 2023 post, the Khan et al. (2024) paper, and some more LW posts, including mine.) I can of course make some guesses regarding plausible answers to (1) and (2). But most of these papers are primarily concerned with exploring the properties of debates, rather than explaining where debate fits in the process of producing an AI (and what problem it aims to address).

Comment by VojtaKovarik on What is the purpose and application of AI Debate? · 2024-04-04T18:29:45.840Z · LW · GW

The original people kind-of did, but new people started working on it, and Geoffrey Irving continued / got back to working on it.

Comment by VojtaKovarik on What is the purpose and application of AI Debate? · 2024-04-04T01:01:10.412Z · LW · GW

Further disclaimer: Feel free to answer even if you don't find debate promising, but note that I am primarily interested in hearing from people who do actively work on it, or find it promising --- or at least from people who have a very good model of specific such people.

Motivation behind the question: People often mention Debate as a promising alignment technique. For example, the AI Safety Fundamentals curriculum features it quite prominently. But I think there is a lack of consensus on "as far as the proposal is concerned, how is Debate actually meant to be used?" (For example, do we apply it during deployment, as a way of checking the safety of solutions proposed by other systems? Or do we use it during deployment, to generate solutions? Or do we use it to generate training data?) And as far as I know, of all the existing work, only the Nov 2023 paper addresses my questions, and it only answers (Q2). But I am not sure to what extent the answer given there is canonical. So I am interested in knowing the opinions of people who currently endorse Debate.

Illustrating what I mean by the questions: If I were to answer the questions 1-3 for RLHF, I could for example say that:
(1) RLHF is meant for turning a neural network trained for next-token prediction into, for example, an agent that acts as a chatbot and gives helpful, honest, and lawsuit-less answers.
(2) RLHF is used for generating training (or fine-tuning) data (or signal).
(3) Seems pretty good for this purpose, for roughly <=human-level AIs.

Comment by VojtaKovarik on [April Fools' Day] Introducing Open Asteroid Impact · 2024-04-02T03:35:58.336Z · LW · GW

I believe that a promising safety strategy for the larger asteroids is to put them in a secure box prior to them landing on earth. That way, the asteroid is -- provably -- guaranteed to have no negative impact on earth.

Proof:

     | | | | | | | |
     v v v v v v v v
     ____________                         CC
    |    ___     |                       CCCC
    |  / O O \   |        :-)             CCC          :-)
    | | o C o |  |        _|_       || o               _|_
    |  \ o _ /   |         |        ||/                 |
    |____________|        /\        ||                 /\
   --------------------------------------------------------
                                                           □

Comment by VojtaKovarik on Technologies and Terminology: AI isn't Software, it's... Deepware? · 2024-03-21T18:50:23.961Z · LW · GW

Agreed.

It seems relevant to the progression that a lot of human problem solving -- though not all -- is done by the informal method of "getting exposed to examples and then, somehow, generalising". (And I likewise failed to appreciate this, I am not sure until when.) This suggests that if we want to build AI that solves things in similar ways that humans solve them, "magic"-involving "deepware" is a natural step. (Whether building AI in the image of humans is desirable is a different topic.)

Comment by VojtaKovarik on Technologies and Terminology: AI isn't Software, it's... Deepware? · 2024-03-21T02:50:02.328Z · LW · GW

tl;dr: It seems noteworthy that "deepware" has strong connotations with "it involves magic", while the same is not true for AI in general.


I would like to point out one thing regarding the software vs AI distinction that is confusing me a bit. (I view this as complementing, rather than contradicting, your post.)

As we go along the progression "Tools > Machines > Electric > Electronic > Digital", most[1] of the examples can be viewed as automating a reasonably-well-understood process, on a progressively higher level of abstraction.[2]
[For example: A hammer does basically no automation. > A machine like a lawn-mower automates a rigidly-designed rotation of the blades. > An electric kettle does-its-thingy. > An electronic calculator automates calculating algorithms that we understand, but can do it for much larger inputs than we could handle. > An algorithm like Monte Carlo tree search automates an abstract reasoning process that we understand, but can apply it to a wide range of domains.]

But then it seems that this progression does not neatly continue to the AI paradigm. Or rather, some things that we call AI can be viewed as a continuation of this progression, while others can't (or would constitute a discontinuous jump).
[For example, approaches like "solving problems using HCH" (minus the part where you use unknown magic to obtain a black box that imitates the human) can be viewed as automating a reasonably-well-understood process (of solving tasks by decomposing & delegating them). But there are also other things that we call AI that are not well described as a continuation of this progression --- or perhaps they constitute a rather extreme jump. For example, deep learning automates the not-well-understood process of "stare at many things, then use magic to generalise". Another example is abstract optimisation, which automates the not-well-understood process of "search through many potential solutions and pick the one that scores the best according to an objective function". And there are examples that lie somewhere in between --- for example, AlphaZero is mostly a quite well-understood process, but it does involve some opaque deep learning.]

I suppose we could refer to the distinction as "does it involve magic?". It then seems noteworthy that "deepware" has strong connotations with magic, while the same isn't true for all types of AI.[3]

 

  1. ^

    Or perhaps just "many"? I am not quite sure, this would require going through more examples, and I was intending for this to be a quick comment.

  2. ^

    To be clear, I am not super-confident that this progression is a legitimate phenomenon. But for the sake of argument, let's say it is.

  3. ^

    An interesting open question is how large a hit to competitiveness we would suffer if we restricted ourselves to systems that only involve a small amount of magic.

Comment by VojtaKovarik on Many arguments for AI x-risk are wrong · 2024-03-11T23:13:05.103Z · LW · GW

I want to flag that the overall tone of the post is in tension with the disclaimer that you are "not putting forward a positive argument for alignment being easy".

To hint at what I mean, consider this claim:

Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.

I think this claim is only valid if you are in a situation such as "your probability of scheming was >95%, and this was based basically only on this particular version of the 'counting argument' ". That is, if you somehow thought that we had a very detailed argument for scheming (AI X-risk, etc), and this was it --- then yes, you should strongly update.
But in contrast, my take is more like: This whole AI stuff is a huge mess, and the best we have is intuitions. And sometimes people try to formalise these intuitions, and those attempts generally all suck. (Which doesn't mean our intuitions cannot be more or less detailed. It's just that even the detailed ones are not anywhere close to being rigorous.) EG, for me personally, the vague intuition that "scheming is instrumental for a large class of goals" makes a huge contribution to my beliefs (of "something between 10% and 99% on alignment being hard"), while the particular version of the 'counting argument' that you describe makes basically no contribution. (With vague intuitions about simplicity priors also contributing non-trivially.) So undoing that particular update does ~nothing.

I do acknowledge that this view suggests that the AI-risk debate should basically be debating the question: "So, we don't have any rigorous arguments about AI risk being real or not, and we won't have them for quite a while yet. Should we be super-careful about it, just in case?". But I do think that is appropriate.

Comment by VojtaKovarik on Can we get an AI to do our alignment homework for us? · 2024-02-26T22:42:52.578Z · LW · GW

I feel a bit confused about your comment: I agree with each individual claim, but I feel like perhaps you meant to imply something beyond just the individual claims. (Which I either don't understand or perhaps disagree with.)

Are you saying something like: "Yeah, I think that while this plan would work in theory, I expect it to be hopeless in practice (or unnecessary, because the homework wasn't hard in the first place)."?

If yes, then I agree --- but I feel that of the two questions, "would the plan work in theory" is the much less interesting one. (For example, suppose that OpenAI could in theory use AI to solve alignment in 2 years. Then this won't really matter unless they can refrain from using that same AI to build misaligned superintelligence in 1.5 years. Or suppose the world could solve AI alignment if the US government instituted a 2-year moratorium on AI research --- then this won't really matter unless the US government actually does that.)

Comment by VojtaKovarik on Can we get an AI to do our alignment homework for us? · 2024-02-26T21:07:52.760Z · LW · GW

However, note that if you think we would fail to sufficiently check human AI safety work given substantial time, we would also fail to solve various issues given a substantial pause

This does not seem automatic to me (at least in the hypothetical scenario where "pause" takes a couple of decades). The reasoning being that there is a difference between [automating the current form of an institution and speed-running 50 years of it in a month] and [an institution as it develops over 50 years].

For example, my crux[1] is that current institutions do not subscribe to the security mindset with respect to AI. But perhaps hypothetical institutions in 50 years might.

  1. ^

    For being in favour of slowing things down; if that were possible in a reasonable way, which it might not be.

Comment by VojtaKovarik on Can we get an AI to do our alignment homework for us? · 2024-02-26T20:49:38.133Z · LW · GW

Assuming that there is an "alignment homework" to be done, I am tempted to answer something like: AI can do our homework for us, but only if we are already in a position where we could solve that homework even without AI.

An important disclaimer is that perhaps there is no "alignment homework" that needs to get done ("alignment by default", "AGI being impossible", etc). So some people might be optimistic about Superalignment, but for reasons that seem orthogonal to this question - namely, because they think that the homework to be done isn't particularly difficult in the first place.


For example, suppose OpenAI can use AI to automate many research tasks that they already know how to do. Or they can use it to scale up the amount of research they produce. Etc. But this is likely to only give them the kinds of results that they could come up with themselves (except possibly much faster, which I acknowledge matters).
However, suppose that the solution to making AI go well lies outside of the ML paradigm. Then OpenAI's "superalignment" approach would need to naturally generate solutions outside of that paradigm. Or it would need to cause the org to pivot to a new paradigm. Or it would need to convince OpenAI that way more research is needed, and they need to stop AI progress until that happens.
And my point here is not to argue that this won't happen. Rather, I am suggesting that whether this would happen seems strongly connected to whether OpenAI would be able to do these things even prior to all the automation. (IE, this depends on things like: Will people think to look into a particular problem? Will people be able to evaluate the quality of alignment proposals? Is the organisational structure set up such that warning signs will be taken seriously?)

To put it in a different way:

  • We can use AI to automate an existing process, or a process that we can describe in enough detail.
    (EG, suppose we want to "automate science". Then an example of a thing that we might be able to do would be to: Set up a system where many LLMs are tasked to write papers. Other LLMs then score those papers using the same system as human researchers use for conference reviews. And perhaps the most successful papers then get added to the training corpus of future LLMs. And then we repeat the whole thing. However, we do not know how to "magically make science better".)
  • We can also have AI generate solution proposals, but this will only be helpful to the extent that we know how to evaluate the quality of those proposals.[1]
    (EG, we can use AI to factorise numbers into their prime factors, since we know how to check whether the product of the proposed factors is equal to the original number --- see the sketch below this list. However, suppose we use an AI to generate a plan for how to improve an urban design of a particular city. Then it's not really clear how to evaluate that plan. And the same issue arises when we ask for plans regarding the problem of "making AI go well".)
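
(To make that verifiability asymmetry concrete, here is a minimal sketch in Python. The function names are mine, made up purely for illustration; the point is that a cheap, complete check exists for factorisation, while nothing analogous exists for "a good urban design plan" or "a good alignment plan".)

    import math

    def is_prime(k: int) -> bool:
        """Naive primality test; fine for an illustration."""
        if k < 2:
            return False
        return all(k % d != 0 for d in range(2, math.isqrt(k) + 1))

    def verify_factorisation(n: int, factors: list[int]) -> bool:
        """Cheap, complete check of an AI-proposed answer:
        the factors must all be prime and must multiply back to n."""
        return all(is_prime(f) for f in factors) and math.prod(factors) == n

    # We can accept these proposals without trusting whoever generated them.
    print(verify_factorisation(84, [2, 2, 3, 7]))   # True
    print(verify_factorisation(84, [2, 42]))        # False (42 is not prime)

    # There is no analogous verify_urban_design_plan(city, plan) -> bool,
    # which is exactly the asymmetry the second bullet points at.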

Finally, suppose you think that the problem with "making AI go well" is the relative speeds of progress in AI capabilities vs AI alignment. Then you need to additionally explain why the AI will do our alignment homework for us while simultaneously refraining from helping with the capabilities homework.[2]

  1. ^

    A relevant intuition pump: The usefulness of forecasting questions on prediction markets seems limited by your ability to specify the resolution criteria.

  2. ^

    The reasonable default assumption might be that AI will speed up capabilities and alignment equally. In contrast, arguing for a disproportionate speedup of alignment sounds like corporate b...cheap talk. However, there might be reasons to believe that AI will disproportionately speed up capabilities - for example, because we know how to evaluate capabilities research, while the field of "make AI go well" is much less mature.

Comment by VojtaKovarik on Extinction Risks from AI: Invisible to Science? · 2024-02-22T01:23:31.693Z · LW · GW

Quick reaction:

  • I didn't want to use the ">1 billion people" formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up controlling the future in the end.
  • I didn't want to use "existential risk", because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
  • And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/whatever takes over and does its thing. Which probably includes getting rid of literally everyone, modulo some important (but probably not decision-relevant?) questions about anthropics and negotiating with aliens.

Comment by VojtaKovarik on Extinction Risks from AI: Invisible to Science? · 2024-02-22T00:40:02.053Z · LW · GW

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?

(My current take is that, for a majority of readers, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)

Comment by VojtaKovarik on Extinction Risks from AI: Invisible to Science? · 2024-02-22T00:37:50.111Z · LW · GW

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn't so much to figure out whether the particular arguments go through or not, but to ask which properties your model must have if you want to be able to evaluate those arguments rigorously.

Comment by VojtaKovarik on VojtaKovarik's Shortform · 2024-02-05T22:45:53.980Z · LW · GW

A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress.

Right, I agree. I didn't realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like "[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However, it is somewhat misleading in that 100% loss explained does not mean you understand what is going on inside the system."

I rephrased that now. Would be curious to hear whether you still have objections to the updated phrasing.

Comment by VojtaKovarik on VojtaKovarik's Shortform · 2024-02-04T20:57:44.291Z · LW · GW

[% of loss explained] isn't a good interpretability metric [edit: isn't enough to get guarantees].
In interpretability, people use [% of loss explained] as a measure of the quality of an explanation. However, unless you replace the system-being-explained by its explanation, this measure has a fatal flaw.

Suppose you have misaligned superintelligence X pretending to be a helpful assistant A --- that is, acting as A in all situations except those where it could take over the world. Then the explanation "X is behaving as A" will explain 100% of loss, but actually using X will still kill you.

For [% of loss explained] to be a useful metric [edit: robust for detecting misalignment], it would need to explain most of the loss on inputs that actually matter. And since we fundamentally can't tell which ones those are, the metric will only be useful (for detecting misaligned superintelligences) if we can explain 100% of loss on all possible inputs.
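
(To make the above concrete, here is one way a metric of this kind is commonly operationalised --- this is my own minimal sketch, not the exact formula from any particular interpretability paper: compare the loss of the original model, the loss when the relevant component is replaced by the explanation, and the loss of an uninformative baseline.)

    def fraction_of_loss_explained(loss_model: float,
                                   loss_explanation: float,
                                   loss_baseline: float) -> float:
        """Rough sketch of a 'loss recovered' style metric:
        1.0 means replacing the model by the explanation loses nothing,
        0.0 means the explanation does no better than the baseline."""
        return (loss_baseline - loss_explanation) / (loss_baseline - loss_model)

    # The failure mode above: the "explanation" 'X behaves like helpful
    # assistant A' scores ~1.0 on the training distribution, because X and A
    # only differ on the rare takeover-relevant inputs, and those contribute
    # approximately nothing to the measured loss.
    print(fraction_of_loss_explained(loss_model=2.00,
                                     loss_explanation=2.01,
                                     loss_baseline=5.00))  # ~0.997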

Comment by VojtaKovarik on My Alignment "Plan": Avoid Strong Optimisation and Align Economy · 2024-02-01T19:25:25.459Z · LW · GW

I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the "distribute AI(x-1) quickly" part. IE, if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the "single point of failure" effect, though it seems unclear how large.)

Comment by VojtaKovarik on My Alignment "Plan": Avoid Strong Optimisation and Align Economy · 2024-01-31T20:09:12.223Z · LW · GW

To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.

But that complaint aside: sure, all else being equal, all of the points you mention seem better to have than not to have.

Comment by VojtaKovarik on Protecting agent boundaries · 2024-01-25T19:58:36.100Z · LW · GW

Might be obvious, but perhaps seems worth noting anyway: Ensuring that our boundaries are respected is, at least with a straightforward understanding of "boundaries", not sufficient for being safe.
For example:

  • If I take away all food from your local supermarkets (etc etc), you will die of starvation --- but I haven't done anything to your boundaries.
  • On a higher level, you can wipe out humanity without messing with our boundaries, by blocking out the sun.

Comment by VojtaKovarik on Would you have a baby in 2024? · 2023-12-27T18:16:35.565Z · LW · GW

An aspect that I would not take into account is the expected impact of your children.

Most importantly, it just seems wrong to make personal-happiness decisions subservient to impact.
But even if you did want to optimise impact through others, then betting on your children seems riskier and less effective than, for example, engaging with interested students. (And even if you wanted to optimise impact at all costs, then the key factors might not be your impact through others, but instead (i) your opportunity costs, (ii) second-order effects, where having kids makes you more or less happy, and this changes the impact of your work, and (iii) negative second-order effects that "sacrificing personal happiness because of impact" has on the perception of the community.)

Comment by VojtaKovarik on Would you have a baby in 2024? · 2023-12-27T17:59:22.360Z · LW · GW

In fact it's hard to find probable worlds where having kids is a really bad idea, IMO.

One scenario where you might want to have kids in general, but not if timelines are short, is if you feel positive about having kids, but you view the first few years of having kids as a chore (ie, it costs you time, sleep, and money). So if you view kids as an investment of the form "take a hit to your happiness now, get more happiness back later", then not having kids now seems justifiable. But I think that this sort of reasoning requires pretty short timelines (which I have), with high confidence (which I don't have), and high confidence that the first few years of having kids is net-negative happiness for you (which I don't have).

(But overall I endorse the claim that, mostly, if you would have otherwise wanted kids, you should still have them.)
 

Comment by VojtaKovarik on Evaluating the historical value misspecification argument · 2023-11-27T18:57:08.177Z · LW · GW

(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)

It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?

My reaction to this is: Actually, current LLMs do care about our preferences, and about their guardrails. It was never about getting some AI to care about our preferences. It is about getting powerful AIs to robustly care about our preferences. Where "robustly" includes things like (i) not caring about other things as well (e.g., prediction accuracy), (ii) generalising correctly (e.g., not just maximising human approval), and (iii) not breaking down when we increase the amount of optimisation pressure a lot (e.g., will it still work once we hook it into future-AutoGPT-that-actually-works and have it run for a long time?).

Some examples of what would cause me to update: if we could make LLMs not jailbreakable without relying on additional filters on input or output.

Comment by VojtaKovarik on Box inversion revisited · 2023-11-17T17:55:50.745Z · LW · GW

Nitpicky comment / edit request: The circle inversion figure was quite confusing to me. Perhaps add a note to it saying that solid green maps onto solid blue, red maps onto itself, and dotted green maps onto dotted blue. (Rather than colours mapping to each other, which is what I intuitively expected.)

Comment by VojtaKovarik on The conceptual Doppelgänger problem · 2023-08-11T10:41:10.712Z · LW · GW

Fun example: The evolution of offensive words seems relevant here. IE, we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgangers, and mark them as offensive as well.

Comment by VojtaKovarik on Does LessWrong allow exempting posts from being scraped by GPTBot? · 2023-08-09T20:03:49.475Z · LW · GW

Related questions:

  • What is the expected sign of the value of marking posts like this? (One might wonder whether explicitly putting up "DON'T LOOK HERE!" won't backfire.)
    [I expect some AI companies might respect these signs, so this seems genuinely unclear.]
  • Is there a way of putting things on the internet in a way that more robustly prevents AIs from seeing them?
    [I am guessing not, but who knows...]
     
Comment by VojtaKovarik on Circumventing interpretability: How to defeat mind-readers · 2023-08-04T18:19:50.637Z · LW · GW

E.g. Living in large groups such that it’s hard for a predator to focus on any particular individual; a zebra’s stripes.

Off-topic, but: Does anybody have a reference for this, or a better example? This is the first time I have heard this theory about zebras.

Comment by VojtaKovarik on The Open Agency Model · 2023-08-02T17:33:50.481Z · LW · GW

Two points that seem relevant here:

  1. To what extent are "things like LLMs" and "things like AutoGPT" very different creatures, with the latter sometimes behaving like a unitary agent?
  2. Assuming that the distinction in (1) matters, how often do we expect to see AutoGPT-like things?

(At the moment, both of these questions seem open.)

Comment by VojtaKovarik on Role Architectures: Applying LLMs to consequential tasks · 2023-08-02T17:24:48.626Z · LW · GW

This made me think of "lawyer-speak", and other jargons.

More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)

Comment by VojtaKovarik on AI Awareness through Interaction with Blatantly Alien Models · 2023-07-31T14:15:17.714Z · LW · GW

I would distinguish between "feeling alien" (as in, most of the time, the system doesn't feel too weird or non-human to interact with, at least if you don't look too closely) and "being alien" (as in, "having the potential to sometimes behave in a way that a human never would").

My argument is that the current LLMs might not feel alien (at least to some people), but they definitely are. For example, any human that is smart enough to write a good essay will also be able to count the number of words in a sentence --- yet LLMs can do one, but not the other. Similarly, humans have moods and emotions and other stuff going in their heads, such that when they say "I am sorry" or "I promise to do X", it is a somewhat costly signal of their future behaviour --- yet this doesn't have to be true at all for AI.

(Also, you are right that people believe that ChatGPT isn't conscious. But this seems quite unrelated to the overall point? As in, I expect some people would also believe ChatGPT if it started saying that it is conscious. And if ChatGPT was conscious and claimed that it isn't, many people would still believe that it isn't.)

Comment by VojtaKovarik on AI Awareness through Interaction with Blatantly Alien Models · 2023-07-31T13:18:54.607Z · LW · GW

I agree that we shouldn't be deliberately making LLMs more alien in ways that have nothing to do with how alien they actually are/can be. That said, I feel that some of the examples I gave are not that far from how LLMs / future AIs might sometimes behave? (Though I concede that the examples could be improved a lot on this axis, and your suggestions are good. In particular, the GPT-4 finetuned to misinterpret things is too artificial. And with intentional non-robustness, it is more honest to just focus on naturally-occurring failures.)

To elaborate: My view of the ML paradigm is that the machinery under the hood is very alien, and susceptible to things like jailbreaks, adversarial examples, and non-robustness out of distribution. Most of the time, this makes no difference to the user's experience. However, the exceptions might be disproportionally important. And for that reason, it seems important to advertise the possibility of those cases.

For example, it might be possible to steal other people's private information by jailbreaking their LLM-based AI assistants --- and this is why it is good that more people are aware of jailbreaks. Similarly, it seems easy to create virtual agents that maintain a specific persona to build trust, and then abuse that trust in a way that would be extremely unlikely for a human.[1] But perhaps that, and some other failure modes, are not yet sufficiently widely appreciated?

Overall, it seems good to take some action towards making people/society/the internet less vulnerable to these kinds of exploits. (The examples I gave in the post were some ideas towards this goal. But I am less married to those than to the general point.) One fair objection against the particular action of advertising the vulnerabilities is that doing so brings them to the attention of malicious actors. I do worry about this somewhat, but primarily I expect people (and in particular nation-states) to notice these vulnerabilities anyway. Perhaps more importantly, I expect potential misaligned AIs to notice the vulnerabilities anyway --- so patching them up seems useful for (marginally) decreasing the world's take-over-ability.

  1. ^

    For example, because a human wouldn't be patient enough to maintain the deception for the given payoff. Or because a human that would be smart enough to pull this off would have better ways to spend their time. Or because only a psychopathic human would do this, and there are only so many of those.

Comment by VojtaKovarik on AI Safety in a World of Vulnerable Machine Learning Systems · 2023-07-25T10:42:18.786Z · LW · GW

I would like to point out one aspect of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect of adversarial vulnerability in widespread-automation worlds.

Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).

But ultimately, I think none of these fits perfectly. So a longer, self-contained description is something like:

  • Consider the world where we automate more and more things using AI systems that have vulnerable components. Perhaps those vulnerabilities primarily come from narrow-purpose neural networks and foundation models. But some might also come from insecure software design, software bugs, and humans in the loop.
  • And suppose some parts of the economy/society will be designed more securely (some banks, intelligence services, planes, hopefully nukes)...while others just have glaring security holes.
  • A naive expectation would be that a security hole gets fixed if and only if there is somebody who would be able to exploit it. This is overly optimistic, but note that even this implies the existence of many vulnerabilities that would require a stronger-than-existing level of capability to exploit. More realistically, the actual bar for fixing security holes will be "there might be many people who can exploit this, but it is not worth their opportunity cost". And then we will also not-fix all the holes that we are unaware of, or where the exploitation goes undetected.
    These potential vulnerabilities leave a lot of space for actual exploitation when the stakes get higher, or we get a sudden jump in some area of capabilities, or when many coordinated exploits become more profitable than what a naive extrapolation would suggest.

There are several potential threats that have particularly interesting interactions with this setting:

  • (A) Alignment scheme failure: An alignment scheme that would otherwise work fails due to vulnerabilities in the AI company training it. This seems the closest to what this post describes?
  • (B) Easier AI takeover: Somebody builds a misaligned AI that would normally be sub-catastrophic, but all of these vulnerabilities allow it to take over.
  • (C) Capitalism gone wrong: The vulnerabilities regularly get exploited, in ways that either go undetected or cause negative externalities that nobody relevant has incentives to fix. And this destroys a large portion of the total value.
  • (D) Malicious actors: Bad actors use the vulnerabilities to cause damage. (And this makes B and C worse.)
  • (E) Great-power war: The vulnerabilities get exploited during a great-power war. (And this makes B and C worse.)

Connection to Cases 1-3: All of this seems very related to how you distinguish between adversarial robustness getting solved before transformative AI, after TAI, or never. However, I would argue that TAI is not necessarily the relevant cutoff point here. Indeed, for Alignment failure (A) and Easier takeover (B), the relevant moment is "the first time we get an AI capable of forming a singleton". This might happen tomorrow, by the time we have automated 25% of economically-relevant tasks, half a year into having automated 100% of tasks, or possibly never. And for the remaining threat models (C, D, E), perhaps there are no single cutoff points, and instead the stakes and implications change gradually?

Implications: Personally, I am the most concerned about misaligned AI (A and B) and Capitalism gone wrong (C). However, perhaps risks from malicious actors and nation-state adversaries (D, E) are more salient and less controversial, while pointing towards the same issues? So perhaps advancing the agenda outlined in the post can be best done through focusing on these? [I would be curious to know your thoughts.]

Comment by VojtaKovarik on AI Safety in a World of Vulnerable Machine Learning Systems · 2023-07-24T17:37:07.812Z · LW · GW

An idea for increasing the impact of this research: Mitigating the "goalpost moving" effect for "but surely a bit more progress on capabilities will solve this".

I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds --- but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough "coordination point" that would allow us to take needle-shifting actions.

I also suspect there are ways of making this go better. I am not quite sure what they are, but here are some ideas: Making and publishing surveys. Operationalizing all of this better, in particular with respect to the "how much does this actually matter?" aspect. Formulating some memorable "hypothesis" that makes it easier to refer to this in conversations and papers (cf "orthogonality thesis"). Perhaps making some proponents of "the opposing view" make some testable predictions, ideally some that can be tested with systems whose failures won't be catastrophic yet?

Comment by VojtaKovarik on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-24T16:33:39.352Z · LW · GW

Ok, got it. Though, not sure if I have a good answer. With trans issues, I don't know how to decouple the "concepts and terminology" part of the problem from the "political" issues. So perhaps the solution with AI terminology is to establish the precise terminology? And perhaps to establish it before this becomes an issue where some actors benefit from ambiguity (and will therefore resist disambiguation)? [I don't know, low confidence on all of this.]

Comment by VojtaKovarik on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-24T15:25:25.555Z · LW · GW

Do you have a more realistic (and perhaps more specific, and ideally apolitical) example than "cooperation is a fuzzy concept, so you have no way to deny that I am cooperating"? (All instances of this that I managed to imagine were either actually complicated, about something else, or something that I could resolve by replying "I don't care about your language games" and treating you as non-cooperative.)

Comment by VojtaKovarik on AI Safety in a World of Vulnerable Machine Learning Systems · 2023-07-24T15:07:32.169Z · LW · GW

For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes.

Regarding the predictions, I want to make the following quibble: According to the definition above, one way of "solving" adversarial robustness is to make sure that nobody tries to catastrophically exploit the system in the first place. (In particular, exploitable AI that takes over the world is no longer exploitable.)

So, a lot with this definition rests on how you distinguish between "cannot be exploited" and "will not be exploited".

And on reflection, I think that for some people, this is close to being a crux regarding the importance of all this research.

Comment by VojtaKovarik on Even Superhuman Go AIs Have Surprising Failure Modes · 2023-07-23T18:36:00.550Z · LW · GW

Yup, this is a very good illustration of the "talking past each other" that I think is happening with this line of research. (I mean with adversarial attacks on NNs in general, not just with Go in particular.) Let me try to hint at the two views that seem relevant here.

1) Hinting at the "curiosity at best" view: I agree that if you hotfix this one vulnerability, then it is possible we will never encounter another vulnerability in current Go systems. But this is because there aren't many incentives to go look for those vulnerabilities. (And it might even be that if Adam Gleave didn't focus his PhD on this general class of failures, we would never have encountered even this vulnerability.)

However, whether additional vulnerabilities exist seems like an entirely different question. Sure, there will only be finitely many vulnerabilities. But how confident are we that this cyclic-groups one is the last one? For example, I suspect that you might not be willing to give 1:1000 odds on whether we would encounter new vulnerabilities if we somehow spent 50 researcher-years on this.

But I expect that you might say that this does not matter, because vulnerabilities in Go do not matter much, and we can just keep hotfixing them as they come up?

2) And the other view seems to be something like: Yes, Go does not matter. But we were only using Go (and image classifiers, and virtual-environment football) to illustrate a general point, that these failures are an inherent part of deep learning systems. And for many applications, that is fine. But there will be applications where it is very much not fine (eg, aligning strong AIs, cyber-security, economy in the presence of malicious actors).

And at this point, some people might disagree and claim something like "this will go away with enough training". This seems fair, but I think that if you hold this view, you should make some testable predictions (and ideally ones that we can test prior to having superintelligent AI).

And, finally, I think that if you had this argument with people in 2015, many of them would have made predictions such as "these exploits work for image classifiers, but they won't work for multiagent RL". Or "this won't work for vastly superhuman Go".


Does this make sense? Assuming you still think this is just an academic curiosity, do you have some testable predictions for when/which systems will no longer have vulnerabilities like this? (Pref. something that takes fewer than 50 researcher years to test :D.)

Comment by VojtaKovarik on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-23T10:51:38.067Z · LW · GW

Therefore, I don't think implication (1) or (2) follow from the premise, even if it is correct.

To clarify: what do you mean by the premise and implications (1) and (2) here? (I am guessing that premise = text under the heading "Conjecture: ..." and implications (1) or (2) = text under the heading "Implications".)

Comment by VojtaKovarik on Even Superhuman Go AIs Have Surprising Failure Modes · 2023-07-23T00:45:44.390Z · LW · GW

My reaction to this is something like:

Academically, I find these results really impressive. But, uhm, I am not sure how much impact they will have? As in: it seems very unsurprising[1] that something like this is possible for Go. And, also unsurprisingly, something like this might be possible for anything that involves neural networks --- at least in some cases, and we don't have a good theory for when yes/no. But also, people seem to not care. So perhaps we should be asking something else? Like, why is that people don't care? Suppose you managed to demonstrate failures like this in settings X, Y, and Z --- would this change anything? And also, when do these failures actually matter? [Not saying they don't, just that we should think about it.]


To elaborate:

  • If you understand neural networks (and how Go algorithms use them), it should be obvious that these algorithms might in principle have various vulnerabilities. You might become more confident about this once you learn about adversarial examples for image classifiers or hear arguments like "feed-forward networks can't represent recursively-defined concepts". But in a sense, the possibility of vulnerabilities should seem likely to you just based on the fact that neural networks (unlike some other methods) come with no relevant worst-case performance guarantees. (And to be clear, I believe all of this indeed was obvious to the authors since AlphaGo came out.)
  • So if your application is safety-critical, security mindset dictates that you should not use an approach like this. (Though Go and many other domains aren't safety-critical, hence my question "when does this matter".)
  • Viewed from this perspective, the value added by the paper is not "Superhuman Go AIs have vulnerabilities" but "Remember those obviously-possible vulnerabilities? Yep, it is as we said, it is not too hard to find them".
  • Also, I (sadly) expect that reactions to this paper (and similar results) will mostly fall into one of the following two camps: (1) Well, duh! This was obvious. (2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities. I would be hoping for reaction such as (3) [Oh, ok! So failures like this are probably possible for all neural networks. And no safety-critical system should rely on neural networks not having vulnerabilities, got it.] However, I mostly expect that anybody who doesn't already believe (1) and (3) will just react as (2).
  • And this motivates my point about "asking something else". EG, how do people who don't already believe (3) think about these things, and which arguments would they find persuasive? Is it efficient to just demonstrate as many of these failures as possible? Or are some failures more useful than others, or does this perhaps not help at all? Would it help with "goalpost moving" if we first made some people commit to specific predictions (eg, "I believe scale will solve the general problem of robustness, and in particular I think AlphaZero has no such vulnerabilities")?
  1. ^

    At least I remember thinking this when AlphaZero came out. (We did a small project in 2018 where we found a way to exploit AlphaZero in the tiny connect-four game, so this isn't just misremembering / hindsight bias.)

Comment by VojtaKovarik on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-22T22:00:54.405Z · LW · GW

Yes, I fully agree with all of this except one point, and with that one point I only want to add a small qualification.

Sometimes someone wants to come up with a crisp definition of a concept for which I suspect no such definition to exist. I usually find that I have little to say and can only wait for them to try to actually provide such a definition. And sometimes I'm surprised by what people can come up with.

The quibble I want to make here is that if we somehow knew that the Kolmogorov complexity of the given concept was at least X (and if that was even a sensible thing to say), and somebody was trying to come up with a definition with K-complexity <<X, then we could safely say that this has no chance of working. But then in reality, we do not know anything like this, so the best we can do (as I try to do with this post) is to say "this concept seems kinda complicated, so perhaps we shouldn't be too surprised if crisp definitions end up not working".

Comment by VojtaKovarik on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-22T21:30:34.937Z · LW · GW

And to highlight a particular point: I endorse your claim about crisp concepts, but I think it should be amended as follows:

You should not be seeking a crisp definition of a fuzzy concept, you should be seeking a crisp concept or concepts in the neighbourhood of your fuzzy one, that can better do the work of the fuzzy one. However, you should keep in mind that the given collection of crisp concepts might fail to capture some important nuances of the fuzzy concept.

(And it is fine that this difference is there --- as long as we don't forget about it.)

Comment by VojtaKovarik on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-22T21:19:23.917Z · LW · GW

I think I agree with essentially everything you are saying here? Except that I was trying to emphasize something different from what you are emphasizing.

More specifically: I was trying to emphasize the point that [the concept that the word "cooperation" currently points to] is very fuzzy. Because it seemed to me that this was insufficiently clear (or at least not common knowledge). And appreciating this seemed necessary for people agreeing that (1) our mission should be to find crisp concepts in the vicinity of the fuzzy one, (2) but that we shouldn't be surprised when those concepts fail to fully capture everything we wanted. (And also (3) avoiding unnecessary arguments about which definition is better, at least to the extent that those only stem from (1) + (2).)

Comment by VojtaKovarik on Recursive Middle Manager Hell: AI Edition · 2023-05-21T20:51:54.234Z · LW · GW

(1) « people liking thing does not seem like a relevant parameter of design ».

This is quite a bold statement. I personally believe the mainstream theory according to which it’s easier to have designs adopted when they are liked by the adopters.

Fair point. I guess "not relevant" is too strong a phrasing. And it would have been more accurate to say something like "people liking things might be neither sufficient nor necessary to get designs adopted, and it is not clear (definitely at least to me) how much it matters compared to other aspects".

 

Re (2): Interesting. I would be curious to know to what extent this is just a surface-level-only metaphor, or an unjustified anthropomorphisation of cells, vs actually having implications for AI design. (But I don't understand biology at all, so I don't really have a clue :( .)

Comment by VojtaKovarik on Recursive Middle Manager Hell: AI Edition · 2023-05-18T16:04:48.069Z · LW · GW

I agree that the general point (biology needs to address similar issues, so we can use it for inspiration) is interesting. (Seems related to https://www.pibbss.ai/ .)

That said, I am somewhat doubtful about the implied conclusion (that this is likely to help with AI, because it won't mind): (1) there are already many workplace practices that people don't like, so "people liking things" doesn't seem like a relevant parameter of design, (2) (this is totally vague, handwavy, and possibly untrue, but:) biological processes might also not "like" being surveilled, replaced, etc, so the argument proves too much.

Comment by VojtaKovarik on Hell is Game Theory Folk Theorems · 2023-05-13T19:50:56.576Z · LW · GW

What did you think the purpose was, that would be better served by that stuff you listed?

I think the purpose is the same thing that you say it is, an example of an equilibrium that is "very close" to the worst possible outcome. But I would additionally prefer if the example did not invoke the reaction that it critically relies on quirky mathematical details. (And I would be fine if this additional requirement came at the cost of the equilibrium being "90% of the way towards worst possible outcome", rather than 99% of the way.)

Comment by VojtaKovarik on Hell is Game Theory Folk Theorems · 2023-05-11T19:59:26.688Z · LW · GW

Oh, I missed that --- I thought they set it to 100 forever. In that case, I was wrong, and this indeed works as a mechanism for punishing non-punishers, at least from the mathematical point of view.

Mathematics aside, I still think the example would be clearer if there were explicit mechanisms for punishing individuals. As it is, the exact mechanism critically relies on details of the example, and on mathematical nitpicks which are unintuitive. If you instead had explicit norms, meta-norms, etc, you would avoid this. (EG, suppose anybody can punish anybody else by 1, for free. And the default is that you don't do it, except that there is the rule for punishing rule-breakers (incl. for this rule).)

Comment by VojtaKovarik on When is Goodhart catastrophic? · 2023-05-11T19:35:51.701Z · LW · GW

Another piece of related work: Simon Zhuang, Dylan Hadfield-Menell: Consequences of Misaligned AI.
The authors assume a model where the state of the world is characterized by multiple "features". There are two key assumptions: (1) our utility is (strictly) increasing in each feature, so -- by definition -- features are things we care about (I imagine money, QALYs, chocolate). (2) We have a limited budget, and any increase in any of the features always has a non-zero cost. The paper shows that: (A) if you are only allowed to tell your optimiser about a strict subset of the features, all of the non-specified features get thrown under the bus. (B) However, if you can optimise things gradually, then you can alternate which features you focus on, and somehow things will end up being pretty okay.
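
(For readers who want the shape of the setup without reading the paper, here is a rough formalisation as I remember it --- the notation is mine and the details are simplified.)

    % State s = (s_1, ..., s_n): the features we care about.
    % True utility U is strictly increasing in every feature,
    % and increasing any feature has a strictly positive cost.
    \[
      \max_{s}\; U(s_1, \dots, s_n)
      \quad \text{s.t.} \quad \sum_{i=1}^{n} c_i(s_i) \le B,
      \qquad \frac{\partial U}{\partial s_i} > 0, \quad c_i' > 0 .
    \]
    % Result (A): if the optimiser is instead given a proxy utility
    % \tilde{U}(s_J) that depends only on a strict subset J of the features,
    % then at the proxy's optimum every omitted feature s_k (k \notin J)
    % is pushed to its minimum, because any budget spent on s_k is wasted
    % from the proxy's point of view.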

 

Personal note: Because of the assumption (2), I find the result (A) extremely unsurprising, and perhaps misleading. Yes, it is true that at the Pareto-frontier of resource allocation, there is no space for positive-sum interactions (ie, getting better on some axis must hurt us on some other axis). But the assumption (2) instead claims that positive-sum interactions are literally never possible. This is clearly untrue in the real world, about the things we care about.

That said, I find the result (B) quite interesting, and I don't mean to hate on the paper :-).

Comment by VojtaKovarik on Hell is Game Theory Folk Theorems · 2023-05-10T18:45:57.758Z · LW · GW

I suspect that, in this particular example, it is more about reasoning about subgame-perfection being unintuitive (and the absence of a mechanism for punishing people who don't punish "defectors").

Comment by VojtaKovarik on Hell is Game Theory Folk Theorems · 2023-05-10T18:39:31.999Z · LW · GW

Indeed, sounds relevant. Though note that from a technical point of view,  Scott's example arguably fails the "individual rationality" condition, since some people would prefer to die over 8h/day of shocks. (Though presumably you can figure out ways of modifying that thought example to "remedy" that.)

Comment by VojtaKovarik on Hell is Game Theory Folk Theorems · 2023-05-10T18:25:47.977Z · LW · GW

Two comments to this:
1) The scenario described here is a Nash equilibrium but not a subgame-perfect Nash equilibrium. (IE, there are counterfactual parts of the game tree where the players behave "irrationally".) Note that subgame-perfection is orthogonal to "reasonable policy to have", so the argument "yeah, clearly the solution is to always require subgame-perfection" does not work. (Why is it orthogonal? First, the example from the post shows a "stupid" policy that isn't subgame-perfect. However, there are cases where subgame-imperfection seems smart, because it ensures that those counterfactual situations don't become reality. EG, humans are somewhat transparent to each other, so having the policy of refusing unfair splits in the Final Offer / Ultimatum game can lead to not being offered unfair splits in the first place.)

2) You could modify the scenario such that the "99 equilibrium" becomes more robust. (EG, suppose the players have a way of paying a bit to punish a specific player a lot. Then you add the norm of turning the temperature to 99, the meta-norm of punishing defectors, the meta-meta-norm of punishing those who don't punish defectors, etc. And tadaaaa, you have a pretty robust hell --- see the toy payoff check below. This is a part of how society actually works, except that usually those norms enforce pro-social behaviour.)
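
(A toy payoff check of that construction, in Python. The numbers and the exact punishment rule are made up by me purely to show the structure, and they simplify the game from the post; the point is just that once punishing is cheap relative to the damage it inflicts, both the norm and the meta-norm become self-enforcing.)

    # Toy parameters (made up): N players, cheap-to-give but painful punishments.
    N = 10                # players in the room
    PUNISH_COST = 1       # cost paid by each punisher
    PUNISH_DAMAGE = 10    # damage inflicted on the punished player

    # Everyone suffers the average temperature, so unilaterally turning my dial
    # from 99 down to 30 for one round only gains me this much:
    one_round_gain = (99 - 30) / N                      # = 6.9

    # ...but under the norm, the other N-1 players then punish me:
    punishment_received = (N - 1) * PUNISH_DAMAGE       # = 90
    print(one_round_gain < punishment_received)         # True: deviating doesn't pay.

    # Meta-norm: refusing to punish a defector is itself punished by the
    # remaining N-2 players, so punishing is also the selfish choice:
    cost_of_punishing = PUNISH_COST                     # = 1
    cost_of_not_punishing = (N - 2) * PUNISH_DAMAGE     # = 80
    print(cost_of_punishing < cost_of_not_punishing)    # True: so everyone punishes.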