Posts

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI 2024-01-26T07:22:06.370Z
Thomas Kwa's MIRI research experience 2023-10-02T16:42:37.886Z
AISC team report: Soft-optimization, Bayes and Goodhart 2023-06-27T06:05:35.494Z
Soft optimization makes the value target bigger 2023-01-02T16:06:50.229Z
Jeremy Gillen's Shortform 2022-10-19T16:14:43.693Z
Neural Tangent Kernel Distillation 2022-10-05T18:11:54.687Z
Inner Alignment via Superpowers 2022-08-30T20:01:52.129Z
Finding Goals in the World Model 2022-08-22T18:06:48.213Z
The Core of the Alignment Problem is... 2022-08-17T20:07:35.157Z
Project proposal: Testing the IBP definition of agent 2022-08-09T01:09:37.687Z
Broad Basins and Data Compression 2022-08-08T20:33:16.846Z
Translating between Latent Spaces 2022-07-30T03:25:06.935Z
Explaining inner alignment to myself 2022-05-24T23:10:56.240Z
Goodhart's Law Causal Diagrams 2022-04-11T13:52:33.575Z

Comments

Comment by Jeremy Gillen (jeremy-gillen) on A framework for thinking about AI power-seeking · 2024-07-26T14:24:17.638Z · LW · GW

I think this post is great, I'll probably reference it next time I'm arguing with someone about AI risk. It's a good summary of the standard argument and does a good job of describing the main cruxes and how they fit into the argument. I'd happily argue for 1,2,3,4 and 6, and I think my disagreements with most people can be framed as disagreements about these points. I agree that if any of these are wrong, there isn't much reason to be worried about AI takeover, as far as I can see.

One pet peeve of mine is when people call something an assumption, even though in that context it's a conclusion. Just because you think the argument was insufficient to support it, doesn't make it an assumption. E.g. In the second last paragraph: 

The argument, rather, tends to move quickly from abstract properties like “goal-directedness," "coherence," and “consequentialism,” to an invocation of “instrumental convergence,” to the assumption that of course the rational strategy for the AI will be to try to take over the world.

There's something wrong with the footnotes. [17] is incomplete and [17-19] are never referenced in the text.

Comment by Jeremy Gillen (jeremy-gillen) on Why Not Subagents? · 2024-07-18T07:20:54.577Z · LW · GW

I made a mistake again. As described above, complete only pseudodominates incomplete.

But this is easily patched with the trick described in the OP. So we need the choice complete to make two changes to the downstream decisions. First, change decision 1 to always choose up (as before), second, change the distribution of Decision 2 to {}, because this keeps the probability of B constant. Fixed diagram:

Now the lottery for complete is {B: , A+: , A:}, and the lottery for incomplete is {B: , A+: , A:}. So overall, there is a pure shift of probability from A to A+.
[Edit 23/7: hilariously, I still had the probabilities wrong, so fixed them, again].

Comment by Jeremy Gillen (jeremy-gillen) on Why Not Subagents? · 2024-07-11T09:47:52.239Z · LW · GW

That is really helpful, thanks. I had been making a mistake, in that I thought that there was an argument from just "the agent thinks it's possible the agent will run into a money pump" that concluded "the agent should complete that preference in advance". But I was thinking sloppily and accidentally sometimes equivocating between pref-gaps and indifference. So I don't think this argument works by itself, but I think it might be made to work with an additional assumption.

One intuition that I find convincing is that if I found myself at outcome A in the single sweetening money pump, I would regret having not made it to A+. This intuition seems to hold even if I imagine A and B to be of incomparable value.

In order to avoid this regret, I would try to become the sort of agent that never found itself in that position. I can see that if I always follow the Caprice rule, then it's a little weird to regret not getting A+, because that isn't a counterfactually available option (counterfacting on decision 1). But this feels like I'm being cheated. I think the reason that it feels like I'm being cheated is that I feel like getting to A+ should be a counterfactually available option.

One way to make it a counterfactually available option in the thought experiment is to introduce another choice before choice 1 in the decision tree. The new choice (0), is the choice about whether to maintain the same decision algorithm (call this incomplete), or complete the preferential gap between A and B (call this complete).

I think the choice complete statewise dominates incomplete. This is because the choice incomplete results in a lottery {B: , A+: , A:} for .[1] However, the choice complete results in the lottery {B: , A+: , A:0}.

Do you disagree with this? I think this allows us to create a money pump, by charging the agent $ for the option to complete its own preferences.


The statewise pseudodominance relation is cyclic, so the Statewise Pseudodominance Principle would lead to cyclic preferences.

This still seems wrong to me, because I see lotteries as being an object whose purpose is to summarize random variables and outcomes. So it's weird to compare lotteries that depend on the same random variables (they are correlated), as if they are independent. This seems like a sidetrack though, and it's plausible to me that I'm just confused about your definitions here.

 

  1. ^

    Letting  be the probability that the agent chooses 2A+ and  the probability the agent chooses 2B (following your comment above). And  is defined similarly, for choice 1.

Comment by Jeremy Gillen (jeremy-gillen) on TurnTrout's shortform feed · 2024-07-09T03:41:42.066Z · LW · GW

The reasons I don't find this convincing:

  • The examples of human cognition you point to are the dumbest parts of human cognition. They are the parts we need to override in order to pursue non-standard goals. For example, in political arguments, the adaptations that we execute that make us attached to one position are bad. They are harmful to our goal of implementing effective policy. People who are good at finding effective government policy are good at overriding these adaptations.
  • "All these are part of the arbitrary, intrinsically-complex, outside world." This seems wrong. The outside world isn't that complex, and reflections of it are similarly not that complex. Hardcoding knowledge is a mistake, of course, but understanding a knowledge representation and updating process needn't be that hard.
  • "They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity." I agree with this, but it's also fairly obvious. The difficulty of alignment is building these in such a way that you can predict that they will continue to work, despite the context changes that occur as an AI scales up to be much more intelligent.
  • "These bitter lessons were taught to us by deep learning." It looks to me like deep learning just gave most people an excuse to not think very much about how the machine is working on the inside. It became tractable to build useful machines without understanding why they worked.
  • It sounds like you're saying that classical alignment theory violates lessons like "we shouldn't hardcode knowledge, it should instead be learned by very general methods". This is clearly untrue, but if this isn't what you meant then I don't understand the purpose of the last quote. Maybe a more charitable interpretation is that you think the lesson is "intelligence is irreducibly complex and it's impossible to understand why it works". But this is contradicted by the first quote. The meta-methods are a part of a mind that can and should be understood. And this is exactly the topic that much of agent foundations research has been about (with a particular focus on the aspects that are relevant to maintaining stability through context changes). 
    • (My impression was that this is also what shard theory is trying to do, except with less focus on stability through context changes, much less emphasis on fully-general outcome-directedness, and more focus on high-level steering-of-plans-during-execution instead of the more traditional precise-specification-of-outcomes).
Comment by Jeremy Gillen (jeremy-gillen) on Linch's Shortform · 2024-06-21T07:11:08.752Z · LW · GW

A very similar strategy is listed as a borderline example of a pivotal act, on the pivotal act page: 

Comment by Jeremy Gillen (jeremy-gillen) on What do coherence arguments actually prove about agentic behavior? · 2024-06-20T01:38:17.518Z · LW · GW

I intended for my link to point to the comment you linked to, oops.

I've responded here; I think it's better to keep a single thread of argument, in a place where there's more of the necessary context.

Comment by Jeremy Gillen (jeremy-gillen) on Why Not Subagents? · 2024-06-20T01:29:19.502Z · LW · GW

(sidetrack comment, this is not the main argument thread)

Think about your own preferences.

  • Let A be some career as an accountant, A+ be that career as an accountant with an extra $1 salary, and B be some career as a musician. Let p be small. Then you might reasonably lack a preference between 0.5p(A+)+(1-0.5p)(B) and A. That's not instrumentally irrational.

I find this example unconvincing, because any agent that has finite precision in their preference representation will have preferences that are a tiny bit incomplete in this manner. As such, a version of myself that could more precisely represent the value-to-me of different options would be uniformly better than myself, by my own preferences. But the cost is small here. The amount of money I'm leaving on the table is usually small, relative to the price of representing and computing more fine-grained preferences.

I think it's really important to recognize the places where toy models can only approximately reflect reality, and this is one of them. But it doesn't reduce the force of the dominance argument. The fact that humans (or any bounded agent) can't have exactly complete preferences doesn't mean that it's impossible for them to be better by their own lights.

Think about incomplete preferences on the model of imprecise exchange rates.

I appreciate you writing out this more concrete example, but that's not where the disagreement lies. I understand partially ordered preferences. I didn't read the paper though. I think it's great to study or build agents with partially ordered preferences, if it helps get other useful properties. It just seems to me that they will inherently leave money on the table. In some situations this is well worth it, so that's fine.

  • The general principle that you appeal to (If X is weakly preferred to or pref-gapped with Y in every state of nature, and X is strictly preferred to Y in some state of nature, then the agent must prefer X to Y) implies that rational preferences can be cyclic. B must be preferred to p(B-)+(1-p)(A+), which must be preferred to A, which must be preferred to p(A-)+(1-p)B+, which must be preferred to B.

No, hopefully the definition in my other comment makes this clear. I believe you're switching the state of nature for each comparison, in order to construct this cycle.

Comment by Jeremy Gillen (jeremy-gillen) on Why Not Subagents? · 2024-06-20T00:37:04.635Z · LW · GW

It seems we define dominance differently. I believe I'm defining it a similar way as "uniformly better" here. [Edit: previously I put a screenshot from that paper in this comment, but translating from there adds a lot of potential for miscommunication, so I'm replacing it with my own explanation in the next paragraph, which is more tailored to this context.].

A strategy outputs a decision, given a decision tree with random nodes. With a strategy plus a record of the outcome of all random nodes we can work out the final outcome reached by that strategy (assuming the strategy is deterministic for now). Let's write this like Outcome(strategy, environment_random_seed). Now I think that we should consider a strategy s to dominate another strategy s* if for all possible environment_random_seeds, Outcome(s, seed) ≥ Outcome(s*, seed), and for some particular seed*, Outcome(s, seed*) > Outcome(s*, seed*). (We can extend this to stochastic strategies, but I want to avoid that unless you think it's necessary, because it will reduce clarity).

In other words, a strategy is better if it always turns out to do "equally" well or better than the other strategy, no matter the state of nature. By this definition, a strategy that chooses A at the first node will be dominated.
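A minimal sketch of that dominance check in code (the names outcome, weakly_better and strictly_better are stand-ins I'm introducing here, not notation from the thread):

```python
# Sketch of the dominance relation above. `outcome(s, seed)` returns the final
# outcome reached by deterministic strategy `s` when every chance node is
# resolved according to `seed`; `weakly_better` / `strictly_better` encode the
# partial order over final outcomes.

def dominates(s, s_star, seeds, outcome, weakly_better, strictly_better):
    """s dominates s_star iff it does at least as well on every seed,
    and strictly better on at least one seed."""
    always_at_least_as_good = all(
        weakly_better(outcome(s, seed), outcome(s_star, seed)) for seed in seeds
    )
    sometimes_strictly_better = any(
        strictly_better(outcome(s, seed), outcome(s_star, seed)) for seed in seeds
    )
    return always_at_least_as_good and sometimes_strictly_better
```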

Relating this to your response:

We say that a strategy is dominated iff it leads to a lottery that is dispreferred to the lottery led to by some other available strategy. So if the lottery 0.5p(A+)+(1-0.5p)(B) isn’t preferred to the lottery A, then the strategy of choosing A isn’t dominated by the strategy of choosing 0.5p(A+)+(1-0.5p)(B). And if 0.5p(A+)+(1-0.5p)(B) is preferred to A, then the Caprice-rule-abiding agent will choose 0.5p(A+)+(1-0.5p)(B).

I don't like that you've created a new lottery at the chance node, cutting off the rest of the decision tree from there. The new lottery wasn't in the initial preferences. The decision about whether to go to that chance node should be derived from the final outcomes, not from some newly created terminal preference about that chance node. Your dominance definition depends on this newly created terminal preference, which isn't a definition that is relevant to what I'm interested in.

I'll try to back up and summarize my motivation, because I expect any disagreement is coming from there. My understanding of the point of the decision tree is that it represents the possible paths to get to a final outcome. We have some preference partial order over final outcomes. We have some way of ranking strategies (dominance). What we want out of this is to derive results about the decisions the agent must make in the intermediate stage, before getting to a final outcome.

If it has arbitrary preferences about non-final states, then its behavior is entirely unconstrained and we cannot derive any results about its decisions in the intermediate state.

So we should only use a definition of dominance that depends on final outcomes. Then any strategy that doesn't always choose B at decision node 1 will be dominated by a strategy that does, according to the original preference partial order.

(I'll respond to the other parts of your response in another comment, because it seems important to keep the central crux debate in one thread without cluttering it with side-tracks).

Comment by Jeremy Gillen (jeremy-gillen) on What do coherence arguments actually prove about agentic behavior? · 2024-06-19T03:52:33.866Z · LW · GW

I find the money pump argument for completeness to be convincing.

The rule that you provide as a counterexample (the Caprice rule) is one that gradually completes the preferences of the agent as it encounters a variety of decisions. You appear to agree that this is the case. This isn't a large problem for your argument. The big problem is that when there are lots of random nodes in the decision tree, such that the agent might encounter a wide variety of potentially money-pumping trades, the agent needs to complete its preferences in advance, or risk its strategy being dominated.

You argue with John about this here, and John appears to have dropped the argument. It looks to me like your argument there is wrong, at least when it comes to situations where there are sufficient assumptions to talk about coherence (which is when the preferences are over final outcomes, rather than trajectories).

Comment by Jeremy Gillen (jeremy-gillen) on Why Not Subagents? · 2024-06-19T03:46:13.689Z · LW · GW

If it has no preference, neither choice will constitute a dominated strategy.

I think this statement doesn't make sense. If it has no preference between choices at node 1, then it has some chance of choosing outcome A. But if it does so, then that strategy is dominated by the strategy that always chooses the top branch, and chooses A+ if it can. This is because 50% of the time, it will get a final outcome of A when the dominating strategy gets A+, and otherwise the two strategies give incomparable outcomes.

I'm assuming "dominated" means there is another strategy whose final outcome is incomparable or > in the partial order of preferences, for all possible settings of the random variables (and strictly > for at least one setting of the random variables). Maybe my definition is wrong? But it seems like this is the definition I want.

Comment by Jeremy Gillen (jeremy-gillen) on Thomas Kwa's Shortform · 2024-06-15T13:18:24.396Z · LW · GW

We might have developed techniques to specify simple, bounded object-level goals: goals that can be fully specified using very simple facts about reality, with no indirection or meta-level complications. If so, we can probably use inner-aligned agents to assist with some relatively well-specified engineering or scientific problems. Specification mistakes at that point could easily result in irreversible loss of control, so it's not the kind of capability I'd want lots of people to have access to.

To move past this point, we would need to make some engineering or scientific advances that would be helpful for solving the problem more permanently. Human intelligence enhancement would be a good thing to try. Maybe some kind of AI defence system to shut down any rogue AI that shows up. Maybe some monitoring tech that helps governments co-ordinate. These are basically the same as the examples given on the pivotal act page.

Comment by Jeremy Gillen (jeremy-gillen) on Thomas Kwa's Shortform · 2024-06-13T22:15:05.243Z · LW · GW

Yes, if you have a very high bar for assumptions or the strength of the bound, it is impossible.

Fortunately, we don't need a guarantee this strong. One research pathway is to weaken the requirements until they no longer cause a contradiction like this, while maintaining most of the properties that you wanted from the guarantee. For example, one way to weaken the requirements is to require that the agent provably does well relative to what is possible for agents of similar runtime. This still gives us a reasonable guarantee ("it will do as well as it possibly could have done") without requiring that it solve the halting problem.
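For example, one such weakened requirement might look like the following (my notation, just a sketch): bound

$$\mathrm{Regret}(T) \;=\; \sup_{\pi \in \Pi_{\mathrm{bounded}}} \mathbb{E}\Big[\sum_{t=1}^{T} r_t \,\Big|\, \pi\Big] \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t \,\Big|\, \pi_{\mathrm{agent}}\Big],$$

where $\Pi_{\mathrm{bounded}}$ is a placeholder for the class of policies whose runtime is comparable to the agent's, so the guarantee never demands computations the agent couldn't have performed.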

Comment by Jeremy Gillen (jeremy-gillen) on Thomas Kwa's Shortform · 2024-06-12T22:00:46.866Z · LW · GW

A dramatic advance in the theory of predicting the regret of RL agents. So given a bunch of assumptions about the properties of an environment, we could upper bound the regret with high probability. Maybe have a way to improve the bound as the agent learns about the environment. The theory would need to be flexible enough that it seems like it should keep giving reasonable bounds if the agent is doing things like building a successor. I think most agent foundations research can be framed as trying to solve a sub-problem of this problem, or a variant of this problem, or understand the various edge cases.
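As a sketch of the shape of such a result (placeholder symbols, not a worked-out theory): given assumptions $\mathcal{A}$ about the environment and a confidence parameter $\delta$, show something like

$$\Pr\Big[\mathrm{Regret}(T) \;\le\; B(\mathcal{A}, T, \delta)\Big] \;\ge\; 1 - \delta,$$

where Regret(T) is the gap between the agent's return and that of the best policy available in the environment class, and ideally the bound B tightens as the agent gathers more information about the environment.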

If we can empirically test this theory in lots of different toy environments with current RL agents, and the bounds are usually pretty tight, then that'd be a big update for me. Especially if we can deliberately create edge cases that violate some assumptions and can predict when things will break from which assumptions we violated.

(although this might not bring doom below 25% for me, depends also on race dynamics and the sanity of the various decision-makers).

Comment by Jeremy Gillen (jeremy-gillen) on MIRI 2024 Communications Strategy · 2024-05-31T23:51:54.457Z · LW · GW

I think your summary is a good enough quick summary of my beliefs. The minutiae I object to are how confident and specific many parts of your summary are. I think many of the claims in the summary could be adjusted or completely changed and still lead to bad outcomes. But it's hard to add lots of uncertainty and options to a quick summary, especially one you disagree with, so that's fair enough.
(As a side note, that paper you linked isn't intended to represent anyone's views other than mine and Peter's, and we are relatively inexperienced. I'm also no longer working at MIRI.)

I'm confused about why your <20% isn't sufficient for you to want to shut down AI research. Is it because the benefits outweigh the risk, or because we'll gain evidence about potential danger and can shut down later if necessary?

I'm also confused about why being able to generate practical insights about the nature of AI or AI progress is something that you think should necessarily follow from a model that predicts doom. I believe something close enough to (1) from your summary, but I don't have much idea (above general knowledge) of how the first company to build such an agent will do so, or when they will work out how to do it. One doesn't imply the other.

Comment by Jeremy Gillen (jeremy-gillen) on LW Frontpage Experiments! (aka "Take the wheel, Shoggoth!") · 2024-05-29T14:44:44.647Z · LW · GW

I'm enjoying having old posts recommended to me. I like the enriched tab.

Comment by Jeremy Gillen (jeremy-gillen) on Alexander Gietelink Oldenziel's Shortform · 2024-05-20T16:46:36.086Z · LW · GW

Doesn't the futarchy hack come up here? Contractors will be betting that competitors' timelines and costs will be high, in order to get the contract.

Comment by Jeremy Gillen (jeremy-gillen) on Alexander Gietelink Oldenziel's Shortform · 2024-05-14T11:52:52.283Z · LW · GW

This doesn't feel like it resolves that confusion for me, I think it's still a problem with the agents he describes in that paper.

The causes are just the direct computation of the claim for small values of x. If they were arguments that only had bearing on small values of x and implied nothing about larger values (e.g. an adversary selected some x to show you, but filtered for x such that the claim holds), then it makes sense that this evidence has no bearing on the forall. But when there was no selection or other reason that the argument only applies to small x, then to me it feels like the existence of the evidence (even though already proven/computed) should still increase the credence of the forall.
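A sketch of that intuition in ordinary probability (writing φ(x) for the claim, my notation; the difficulty in the paper is precisely whether the analogous update is licensed for already-computed logical facts): since the universal claim entails each instance,

$$P\big(\forall x\,\phi(x)\,\big|\,\phi(1),\dots,\phi(n)\big) \;=\; \frac{P\big(\forall x\,\phi(x)\big)}{P\big(\phi(1),\dots,\phi(n)\big)} \;\ge\; P\big(\forall x\,\phi(x)\big),$$

with strict inequality whenever the instances weren't already certain. Adversarial selection changes what you should condition on (the fact that these particular instances were shown to you), which is why the filtering case comes out differently.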

Comment by Jeremy Gillen (jeremy-gillen) on EJT's Shortform · 2024-04-03T02:27:19.231Z · LW · GW

I sometimes name your work in conversation as an example of good recent agent foundations work, based on having read some of it and skimmed the rest, and talked to you a little about it at EAG. It's on my todo list to work through it properly, and I expect to actually do it because it's the blocker on me rewriting and posting my "why the shutdown problem is hard" draft, which I really want to post.

The reasons I'm a priori not extremely excited are that it seems intuitively very difficult to avoid either of these issues:

  • I'd be surprised if an agent with (very) incomplete preferences was real-world competent. I think it's easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.
  • It's easy to shuffle around the difficulty of the shutdown problem, e.g. by putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.

It's plausible you've avoided these problems but I haven't read deeply enough to know yet. I think it's easy for issues like this to be hidden (accidentally), so it'll take a lot of effort for me to read properly (but I will, hopefully in about a week).

The part where it works for a prosaic setup seems wrong (because of inner alignment issues (although I see you cited my post in a footnote about this, thanks!)), but this isn't what the shutdown problem is about so it isn't an issue if it doesn't apply directly to prosaic setups.

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-21T07:41:49.873Z · LW · GW

I would be excited to read this / help with a draft. 

We can meet in person one afternoon and work out some cruxes and write them up?

Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?

Yes, approximately, as I believe you and I are capable of doing. [...PhD student crux]

The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this, it seems pretty hard imo.

Are you saying you buy my side of the PhD student crux, but that it isn't closely analogous because you think we have more fine-grained control over the thoughts or goals of the AI? And my guess is that in your second sentence you're giving a second reason you disagree: that it looks difficult for a research AI to know when to perform well and when to perform less well? I think we might be miscommunicating about this, because to me it doesn't look like the sort of situation where the "deception" is a relatively difficult task.

I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland or is just past the frontier. This just seems pretty hard.

I agree with this. But the research we need isn't just past the frontier. At some point not that far past the frontier, you get to a situation where the research team can say "given our guesses about mainland researchers and their time constraints, we'd put <1% on them having made discovery X {or explored idea Y, or done experiment Z}. Let's {delay giving it to them, not work on that direction, insert a subtle false assumption, instead share our 8th best idea, just say we're stuck, communicate it poorly}".

I think this example is closer to what we'll get with AIs rather than the bar being at "human level" it's more like "what has been accomplished by the humans+AIs in the outside world, for which I've received untrustworthy information". 

Sure, kinda, but the important part here is that a trapped agentic AI can selectively mislead you about the most important {knowledge, tasks, strategies} that you aren't capable of doing with just humans. If you're capable of doing it with humans, or if you're close to being able to do it with humans, then why not just do that? Seems safer.

And every time the AIs do useful work we can add this to our supervised training corpus. 

Only if you spend the human resources to verify it. Are you aiming for 30x research acceleration? Because this by itself might be enough to limit you below that level. (And it also directly provides a way for the AI to tell "what time it is" by sending messages to future iterations, and plausibly also provides a bunch of information about human research capability and capacity.)

One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work. 

Mostly my objection to this is above. On scalable oversight, I buy that it might be possible to get 20% improvements in the accuracy of your evaluations of (adversarially generated!) research, compared to a team of humans trying really hard to evaluate research, given a human-level research AI. This is so far from being enough. Are you predicting much better scalable oversight that scales much further than I am predicting it will? (I'm assuming you're using "GPT-5/6" to represent "some powerful future consequentialist AI that can do research", because otherwise we should be having a different argument).

While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.

It'd be cool if there were a demonstration of manipulating internal activations during training, in a way that fixes non-trivial unwanted OOD behavior. If such techniques existed, and also worked on consequentialist agents (I expect would be harder), I would consider them very useful tools that would help with building an aligned agent.

While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.

This is very plausible, but I think you're stretching the analogy between LMs and future research AIs way too far. "Paying attention to the tokens we want them to" doesn't add much/any control over the properties we want control over (compared to e.g. changing the input string to remove information we don't want it paying attention to).

Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it it possible in principle to do all kinds of things with these activations which may help us. e.g., get the model to solve tasks it otherwise would have intentionally failed at, or giving us hints about the accuracy of a research result. I expect this is difficult implementation wise, but still a reason for hope. IMO activation engineering is off to a good start. 

I think you might be centrally thinking about an LLM completing a few paragraphs, where I'm thinking of an AI doing a large scale scientific project? I buy that for many concepts X, you can work out the "neural correlates of X" in your AI. But that kind of thing is noisy (and plausibly can be optimized against by the AI) unless you have a deep understanding of what you are measuring. And optimizing against such imperfect metrics obviously wouldn't do much beyond destroying the metric. I do think research in this direction has a chance of being useful, but mainly by being upstream of much better understanding.

By leaning more on generalization, I mean leaning more on the data efficiency thing

Sorry for misinterpreting you, but this doesn't clarify what you meant. 

also weak-to-strong generalization ideas.

I think I don't buy the analogy in that paper, and I don't find the results surprising or relevant (by my current understanding, after skimming it). My understanding of the result is "if you have a great prior, you can use it to overcome some label noise and maybe also label bias". But I don't think this is very relevant to extracting useful work from a misaligned agent (which is what we are talking about here), and based on the assumptions they describe, I think they agree? (I just saw appendix G; I'm a fan of it. It's really valuable that they explained their alignment plan concisely and listed their assumptions.)

I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good').

If we can tell we have such an AI, and we can tell that our random modifications are affecting the goal, and also that the change is roughly one that helps us rather than changing many things that might or might not be helpful, this would be a nice situation to be in.

I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes). 

Section 6 assumes we have failed to control the AI, so it is free of weird/strict constraints, and free to scale itself up, improve itself, etc. So my comment is about an AI that no longer can be assumed to have human-ish capabilities.

Comment by Jeremy Gillen (jeremy-gillen) on PIBBSS Speaker events comings up in February · 2024-02-20T23:05:45.594Z · LW · GW

Do you have recordings? I'd be keen to watch a couple of the ones I missed.

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-20T01:08:48.782Z · LW · GW

I feel like you’re proposing two different types of AI and I want to disambiguate them. The first one, exemplified in your response to Peter (and maybe referenced in your first sentence above), is a kind of research assistant that proposes theories (after having looked at data that a scientist is gathering?), but doesn’t propose experiments and doesn’t think about the usefulness of its suggestions/theories. Like a Solomonoff inductor that just computes the simplest explanation for some data? And maybe some automated approach to interpreting theories?

The second one, exemplified by the chess analogy and last paragraph above, is a bit like a consequentialist agent that is a little detached from reality (can’t learn anything, has a world model that we designed such that it can’t consider new obstacles).

Do you agree with this characterization?

What I'm saying is "simpler" is that, given a problem that doesn't need to depend on the actual effects of the outputs on the future of the real world […], it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take into account the effects of the output on the future of the real world anyway.

I accept chess and formal theorem-proving as examples of problems where we can define the solution without using facts about the real-world future (because we can easily write down a formal definition of what the solution looks like). 

For a more useful problem (e.g. curing a type of cancer), we (the designers) only know how to define a solution in terms of real-world future states (the patient is alive, healthy, non-traumatized, etc.). I'm not saying there doesn't exist a definition of success that doesn't involve referencing real-world future states. But the AI designers don't know it (and I expect it would be relatively complicated).

My understanding of your simplicity argument is that it's saying it is computationally cheaper for a trained AI to discover during training a non-consequence definition of the task, despite a consequentialist definition being the criterion used to train it? If so, I disagree that computation cost is very relevant here; generalization (to novel obstacles) is the dominant factor determining how useful this AI is.

Comment by Jeremy Gillen (jeremy-gillen) on The Pointer Resolution Problem · 2024-02-20T00:12:43.472Z · LW · GW

Geometric rationality ftw!

(In normal planning problems there are exponentially many plans to evaluate (in the number of actions). So that doesn't seem to be a major obstacle if your agent is already capable of planning.)

Comment by Jeremy Gillen (jeremy-gillen) on The Pointer Resolution Problem · 2024-02-16T22:56:12.350Z · LW · GW

Might be much harder to implement, but could we maximin "all possible reinterpretations of alignment target X"?
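Spelling out roughly what I mean (my notation, just a sketch):

$$\pi^* \;\in\; \arg\max_{\pi}\ \min_{U \,\in\, \mathcal{R}(X)} \ \mathbb{E}\big[\,U \mid \pi\,\big],$$

where $\mathcal{R}(X)$ stands for the set of utility functions you get from the different admissible resolutions of the pointers in alignment target X.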

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-12T06:54:44.890Z · LW · GW

In my view, in order to be dangerous in a particularly direct way (instead of just misuse risk etc.), an AI's decision to give output X depends on the fact that output X has some specific effects in the future.

Agreed.

Whereas, if you train it on a problem where solutions don't need to depend on the effects of the outputs on the future, I think it much more likely to learn to find the solution without routing that through the future, because that's simpler.

The "problem where solutions don't need to depend on effects" is where we disagree. I agree such problems exist (e.g. formal proof search), but those aren't the kind of useful tasks we're talking about in the post. For actual concrete scientific problems, like outputting designs for a fusion rocket, the "simplest" approach is to be considering the consequences of those outputs on the world. Otherwise, how would it internally define "good fusion rocket design that works when built"? How would it know not to use a design that fails because of weaknesses in the metal that will be manufactured into a particular shape for your rocket? A solution to building a rocket is defined by its effects on the future (not all of its effects, just some of them, i.e. it doesn't explode, among many others).

I think there's a (kind of) loophole here, where we use an "abstract hypothetical" model of a hypothetical future, and optimize the consequences of our actions for that hypothetical. Is this what you mean by "understood in abstract terms"? So the AI has defined "good fusion rocket design" as "fusion rocket that is built by not-real hypothetical humans based on my design and functions in a not-real hypothetical universe and has properties and consequences XYZ" (but the hypothetical universe isn't the actual future, it's just similar enough to define this one task, but dissimilar enough that misaligned goals in this hypothetical world don't lead to coherent misaligned real-world actions). Is this what you mean? Rereading your comment, I think this matches what you're saying, especially the chess game part.

The part I don't understand is why you're saying that this is "simpler"? It seems equally complex in Kolmogorov complexity and computational complexity.

Comment by Jeremy Gillen (jeremy-gillen) on What are the known difficulties with this alignment approach? · 2024-02-12T06:23:46.761Z · LW · GW

I think the overall goal in this proposal is to get a corrigible agent capable of bounded tasks (that maybe shuts down after task completion), rather than a sovereign?

One remaining problem (ontology identification) is making sure your goal specification stays the same for a world-model that changes/learns.

Then the next remaining problem is the inner alignment problem of making sure that the planning algorithm/optimizer (whatever it is that generates actions given a goal, whether or not it's separable from other components) is actually pointed at the goal you've specified and doesn't have any other goals mixed into it. (see Context Disaster for more detail on some of this, optimization daemons, and actual effectiveness). Part of this problem is making sure the system is stable under reflection.

Then you've got the outer alignment problem of making sure that your fusion power plant goal is safe to optimize (e.g. it won't kill people who get in the way, doesn't have any extreme effects if the world model doesn't exactly match reality, or if you've forgotten some detail). (See Goodness estimate bias, unforeseen maximum). 

Ideally here you build in some form of corrigibility and other fail-safe mechanisms, so that you can iterate on the details.

That's all the main ones imo. Conditional on solving the above, and actively trying to foresee other difficult-to-iterate problems, I think it'd be relatively easy to foresee and fix remaining issues.

Comment by Jeremy Gillen (jeremy-gillen) on Updatelessness doesn't solve most problems · 2024-02-11T06:39:31.365Z · LW · GW

A first problem with this is that there is no sharp distinction between purely computational (analytic) information/observations and purely empirical (synthetic) information/observations.

I don't see the fuzziness here, even after reading the Two Dogmas Wikipedia page (but not really understanding it, it's hidden behind a wall of jargon). If we have some prior over universes, and some observation channel, we can define an agent that is updateless with respect to that prior, and updateful with respect to any calculations it performs internally. Is there a section of Radical Probabilism that is particularly relevant? It's been a while.
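Roughly the kind of agent I have in mind (my notation): pick the whole policy ex ante against the prior, so observations only matter through the policy, while internal calculations are simply used as they finish:

$$\pi^* \;\in\; \arg\max_{\pi\,:\,\mathcal{O}^* \to \mathcal{A}}\ \mathbb{E}_{w \sim P}\big[\,U(w, \pi)\,\big],$$

where $P$ is the prior over universes, $\mathcal{O}^*$ is the set of observation histories, and the argmax itself is a computation the agent carries out updatefully.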
It's not clear to me why all superintelligences having the same classification matters. They can communicate about edge cases and differences in their reasoning. Do you have an example here?

A second and more worrying problem is that, even given such convergence, it's not clear all other agents will decide to forego the possible apparent benefits of logical exploitation. It's a kind of Nash equilibrium selection problem: If I was very sure all other agents forego them (and have robust cooperation mechanisms that deter exploitation), then I would just do like them.

I think I don't understand why this is a problem. So what if there are some agents running around being updateless about logic? What's the situation that we are talking about a Nash equilibrium for? 

As mentioned in the post, Counterfactual Mugging as presented won't be common, but equivalent situations in multi-agentic bargaining might, due to (the naive application of) some priors leading to commitment races.

Can you point me to an example in bargaining that motivates the usefulness of logical updatelessness? My impression of that section wasn't "here is a realistic scenario that motivates the need for some amount of logical updatelessness", it felt more like "logical bargaining is a situation where logical updatelessness plausibly leads to terrible and unwanted decisions".

It's not looking like something as simple as that will solve, because of reasoning as in this paragraph:

Unfortunately, it’s not that easy, and the problem recurs at a higher level: your procedure to decide which information to use will depend on all the information, and so you will already lose strategicness. Or, if it doesn’t depend, then you are just being updateless, not using the information in any way.

Or in other words, you need to decide on the precommitment ex ante, when you still haven't thought much about anything, so your precommitment might be bad.

Yeah, I wasn't thinking that was a "solution"; I'm biting the bullet of losing some potential value and having a decision theory that doesn't satisfy all the desiderata. I was just saying that in some situations, such an agent can patch the problem using other mechanisms, just as an EDT agent can try to implement some external commitment mechanism if it lives in a world full of transparent Newcomb problems.

Comment by Jeremy Gillen (jeremy-gillen) on Updatelessness doesn't solve most problems · 2024-02-11T03:00:48.908Z · LW · GW

To me it feels like the natural place to draw the line is update-on-computations but updateless-on-observations, because 1) it never disincentivizes thinking clearly, so commitment races bottom out in a reasonable way, and 2) it allows cooperation on common-in-the-real-world Newcomblike problems.

It doesn't do well in worlds with a lot of logical counterfactual mugging, but I think I'm okay with this? I can't see why this situation would be very common, and if it comes up it seems that an agent that updates on computations can use some precommitment mechanism to take advantage of it (e.g. making another agent).

Am I missing something about why logical counterfactual muggings are likely to be common?

Looking through your PIBBSS report (which is amazing, very helpful), I intuitively feel the pull of Desiderata 4 (No existential regret), and also the intuition of wanting to treat logical uncertainty and empirical uncertainty in a similar way. But ultimately I'm so horrified by the mess that comes from being updateless-on-logic that being completely updateful on logic is looking pretty good to me.

(Great post, thanks)

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-11T01:43:53.509Z · LW · GW

Thanks for reading it, it's good to know exactly where you think the argument is weakest and I appreciate the effort of going through and noting differences.

On section 4:

I definitely don't feel confident that any of the mentioned problems will arise in practice. I don't see why I should believe in an inner/outer shell breakdown of constraints — this section seemed quite speculative. 

This surprises me, actually; I thought this section was solid conditional on the previous assumptions. I think you shouldn't think of them as problems that might arise in practice; instead, they should be thought of as reasons why behavioral training under-specifies some parts of the AI algorithm.
I'll give some more quick examples of "outer-shell constraints", because examples are very common and it seems odd for this to be called speculative: unendorsed habits or addictions in humans; the instinct to breathe in when you're trying to hold your breath; appending a checker algorithm C to any algorithm A, to double-check the output of A; any biases in a search ordering (as in the AlphaGo example).

On section 5:
It's on my todo list to write a comment responding to some of the specifics of Redwood's control post. This post wasn't intended to engage with their detailed arguments, just to give the top-level reasons why we believe control is difficult.

Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?

Yes, approximately, as I believe you and I are capable of doing. The crux here might be: "if we had 20 PhD students, and incentivized them both to look like they are doing checkable research and also not to make any real useful progress on solving a particular difficult problem, would this be very useful without costing a lot of expert supervision?". If I remember correctly, this was a crux between myself and Ryan during an in-person discussion. It seems likely to me that the PhD students, trying not to be useful while looking good in checkable ways, will succeed at both. More intelligent agents are even more likely to succeed at both. It does depend on the type of research; I agree that formal theorem-proving would be much easier to incentivize properly. I'm talking about the least-easily-checkable 20% of useful research.

Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is

Seems easy enough to predict given roughly human-scientist-team level capabilities. And it seems hard to get useful work out of a human-scientist-team without them also working out ways to (maybe imperfectly) distinguish "faked sensory inputs" from real ones, and without leaking any information about what time it is. E.g. we need the AI to build on previous work.

"There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities." Well, no. You either need sufficient labeled data in the domain we're trying to elicit capabilities, or good generalization from other domains. 

Nice, this is my other crux with Ryan, which we phrased roughly as "how data-efficient can we make training?". My argument against very high data efficiency is that current long-horizon RL isn't very data-efficient, and intentional exploration hacking seems plausible in this situation (e.g., a bad analogy, but for the sake of describing exploration hacking: someone is training you to solve Rubik's cubes; you can avoid succeeding at solving a Rubik's cube in the first place (and use other strategies to mess with their reward signal). This works as long as the trainers don't know the specific steps required to solve the Rubik's cube, but even then I think there are some strategies that could work to reduce generalization).

The data need not be human generated, it could be generated by other AIs, as long as we trust it. 

?? This seems to be assuming a solution to the problem.

I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training. 

I agree there are numerous directions of hope, but disagree that either of your examples count, if I understand them correctly. "leaning more on generalization" ≈ "leaning more on agency", which doesn't seem good to me. Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn't seem like it would help with any problem that we were focused on in this post.
 

I agree that we don't have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that's not totally asymmetric, it also gives us hope about our ability to modify AI goals.

If we are using okay interpretability tools to understand whether the AI has the goal we intended, and to guide training, then I would consider that a fundamental advance over current standard training techniques.
I agree that goals would very likely be hit by some modifications during training, in combination with other changes to other parts of the algorithm. The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.

Many of the issues in this section are things that, if we're not being totally idiots, it seems we'll get substantial warning about. e.g., AIs colluding with their AI monitors. That's definitely a positive, though far from conclusive.

I think that there is a lot of room for the evidence to be ambiguous and controversial, and for the obvious problems to look patchable. For this reason I've only got a little hope that people will panic at the last minute due to finally seeing the problems and start trying to solve exactly the right problems. On top of this, there's the pressure of needing to "extract useful work to solve alignment" before someone less cautious builds an unaligned super-intelligence, which could easily lead to people seeing substantial warnings and pressing onward anyway.

Section 6:

I think a couple of the arguments here continue to be legitimate, such as "Unclear that many goals realistically incentivise taking over the universe", but I'm overall fine accepting this section. 

That argument isn't really what it says on the tin, it's saying something closer to "maybe taking over the universe is hard/unlikely and other strategies are better for achieving most goals under realistic conditions". I buy this for many environments and levels of power, but it's obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that's the sort of AI we get if it can undergo self-improvement.

Overall I think your comment is somewhat representative of what I see as the dominant cluster of views currently in the alignment community. (Which seems like a very reasonable set of beliefs and I don't think you're unreasonable for having them).

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-31T10:05:04.788Z · LW · GW

I agree that it'd be extremely misleading if we defined "catastrophe" in a way that includes futures where everyone is better off than they currently are in every way (without being very clear about it). This is not what we mean by catastrophe.

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-31T08:33:14.558Z · LW · GW

Trying to find the crux of the disagreement (which I don't think lies in takeoff speed):

Assume a multipolar, slow-takeoff, misaligned-AI world, where there are many AIs that slowly take over the economy and generally obey laws to the extent that they are enforced (by other AIs), and where they don't particularly care about humans, in a similar manner to the way humans don't particularly care about flies. 

In this situation, humans eventually have approximately zero leverage, and approximately zero value to trade. There would be much more value in e.g. mining cities for raw materials than in human labor.

I don't know much history, but my impression is that in similar scenarios between human groups, with a large power differential and with valuable resources at stake, it didn't go well for the less powerful group, even if the more powerful group was politically fragmented or even partially allied with the less powerful group.

Which part of this do you think isn't analogous?
My guesses are that you are either expecting some kind of partial alignment of the AIs, or that the humans can set up very robust laws/institutions for the AI world such that they remain in place and protect humans even though no subset of the agents is perfectly happy with this, and there exist laws/institutions that they would all prefer.

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-31T00:30:55.507Z · LW · GW

(2) is the problem that the initial ontology of the AI is insufficient to fully capture human values

I see, thanks! I agree these are both really important problems.

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-30T05:05:07.319Z · LW · GW

Yeah, specifying goals in a learned ontology does seem better to me; in my opinion it's a much better approach than behavioral training.
But there are a couple of major roadblocks that come to mind:

  • You need really insanely good interpretability on the learned ontology.
  • You need to be so good at specifying goals in that ontology that they are robust to adversarial optimization.

Work on these problems is great. I particularly like John's work on natural latent variables which seems like the sort of thing that might be useful for the first two of these.

Keep in mind though there are other major problems that this approach doesn't help much with, e.g.:

  • Standard problems arising from the ontology changing over time or being optimized against.
  • The problem of ensuring that no subpart of your agent is pursuing different goals (or applying optimization in a way that may break the overall system at some point).
Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-30T02:55:52.773Z · LW · GW

We aren't implicitly assuming (1) in this post. (Although I agree there will be economic pressure to expand the use of powerful AI, and this adds to the overall risk).

I don't understand what you mean by (2). I don't think I'm assuming it, but can't be sure.

One hypothesis: that AI training might (implicitly? through human algorithm iteration?) involve a pressure toward compute-efficient algorithms? Maybe you think this is a reason we expect consequentialism? I'm not sure how that would relate to the training being domain-specific though.
 

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-30T02:33:15.759Z · LW · GW

I think you and Peter might be talking past each other a little, so I want to make sure I properly understand what you are saying. I’ve read your comments here and on Nate’s post, and I want to start a new thread to clarify things.

I’m not sure exactly what analogy you are making between chess AI and science AI. Which properties of a chess AI do you think are analogous to a scientific-research-AI?

- The constraints are very easy to specify (because legal moves can be easily locally evaluated). In other words, the set of paths considered by the AI is easy to define, and optimization can be constrained to only search this space.
- The task of playing chess doesn’t at all require or benefit from modelling any other part of the world except for the simple board state.

I think these are the main two reasons why current chess AIs are safe.

Separately, I’m not sure exactly what you mean when you’re saying “scientific value”. To me, the value of knowledge seems to depend on the possible uses of that knowledge. So if an AI is evaluating “scientific value”, it must be considering the uses of the knowledge? But you seem to be referring to some more specific and restricted version of this evaluation, which doesn’t make reference at all to the possible uses of the knowledge? In that case, can you say more about how this might work?
Or maybe you’re saying that evaluating hypothetical uses of knowledge can be safe? I.e. there’s a kind of goal that wants to create “hypothetically useful” fusion-rocket-designs, but doesn’t want this knowledge to have any particular effect on the real future.

You might be reading us as saying that “AI science systems are necessarily dangerous” in the sense that it’s logically impossible to have an AI science system that isn’t also dangerous? We aren’t saying this. We agree that in principle such a system could be built.

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-29T23:44:00.902Z · LW · GW

Yep, ontological crises are a good example of another way that goals can be unstable.
I'm not sure I understood how 2 is different from 1.

I'm also not sure that rebinding to the new ontology is the right approach (although I don't have any specific good approach). When I try to think about this kind of problem I get stuck on not understanding the details of how an ontology/worldmodel can or should work. So I'm pretty enthusiastic about work that clarifies my understanding here (where infrabayes, natural latents and finite factored sets all seem like the sort of thing that might lead to a clearer picture).

Comment by Jeremy Gillen (jeremy-gillen) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-26T20:36:07.112Z · LW · GW

Thanks! 
I think that our argument doesn't depend on all possible goals being describable this way. It depends on useful tasks (that AI designers are trying to achieve) being driven in large part by pursuing outcomes. As a counterexample, algorithms whose behavior is defined entirely by local constraints (e.g. a calculator, or the "hand on wall" maze algorithm) aren't the kind of algorithm that is a source of AI risk (and also aren't as useful in some ways).


Your example of a pointer to a goal is a good edge case for our way of defining/categorizing goals. Our definitions don't capture this edge case properly. But we can extend the definitions to include it, e.g. if the goal that ends up eventually being pursued is an outcome, then we could define the observing agent as knowing that outcome in advance. Or alternatively, we could wait until the agent has uncovered its consequentialist goal, but hasn't yet completed it. In both these cases we can treat it as consequentialist. Either way it still has the property that leads to danger, which is the capacity to overcome large classes of obstacles and still get to its destination.

I'm not sure what you mean by "goal objects robust to capabilities not present early in training". If you mean "goal objects that specify shutdownable behavior while also specifying useful outcomes, and are robust to capability increases", then I agree that such objects exist in principle. But I could argue that this isn't very natural, if this is a crux and I'm understanding what you mean correctly?
 

Comment by Jeremy Gillen (jeremy-gillen) on A Shutdown Problem Proposal · 2024-01-22T02:09:57.388Z · LW · GW

I think you're right that the central problems remaining are in the ontological cluster, as well as the theory-practice gap of making an agent that doesn't override its hard-coded false beliefs.

But less centrally, I think one issue with the proposal is that the sub-agents need to continue operating in worlds where they believe in a logical contradiction. How does this work? (I think this is something I'm confused about for all agents and this proposal just brings it to the surface more than usual).

Also, agent1 and agent2 combine into some kind of machine. This machine isn't VNM rational. I want to be able to describe this machine properly. Pattern matching, my guess is that it violates independence in the same way as here. [Edit: Definitely violates independence, because the combined machine should strictly prefer a lottery over <button-pressed> to certainty of either outcome. I suspect that it doesn't have to violate any other axioms].
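To spell out why that strict preference for the mixture conflicts with independence (my own sketch, in standard vNM notation that isn't in the original comment):

```latex
\begin{align*}
&\text{Let } L = pA + (1-p)B \text{ with } L \succ A,\ L \succ B, \text{ and WLOG } A \succeq B.\\
&\text{Independence: } A \succeq B \;\Rightarrow\; qA + (1-q)C \succeq qB + (1-q)C \quad \text{for all } C,\ q \in (0,1].\\
&\text{Take } q = 1-p,\ C = A: \quad (1-p)A + pA \succeq (1-p)B + pA,\\
&\text{i.e. } A \succeq pA + (1-p)B = L, \text{ contradicting } L \succ A.
\end{align*}
```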

Comment by Jeremy Gillen (jeremy-gillen) on TurnTrout's shortform feed · 2024-01-01T23:43:53.311Z · LW · GW

I think the term is very reasonable and basically accurate, even more so with regard to most RL methods. It's a good way of describing a training process without implying that the evolving system will head toward optimality deliberately. I don't know a better way to communicate this succinctly, especially while not being specific about what local search algorithm is being used.

Also, evolutionary algorithms can be used to approximate gradient descent (with noisier gradient estimates), so it's not unreasonable to use similar language about both.
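As a minimal sketch of that point (my own toy example, not from the original comment; it assumes numpy and uses the standard evolution-strategies gradient estimator):

```python
import numpy as np

TARGET = np.array([1.0, -2.0, 0.5])

def f(theta):
    # Toy objective: maximized when theta == TARGET.
    return -np.sum((theta - TARGET) ** 2)

def es_gradient_estimate(f, theta, sigma=0.1, n_samples=2000, seed=0):
    # Evolution-strategies estimator: E[f(theta + sigma*eps) * eps] / sigma
    # is the gradient of the Gaussian-smoothed objective, which is close to
    # grad f(theta) for small sigma.
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, theta.size))
    returns = np.array([f(theta + sigma * e) for e in eps])
    return (returns[:, None] * eps).mean(axis=0) / sigma

theta = np.zeros(3)
print(-2 * (theta - TARGET))             # analytic gradient: [ 2. -4.  1.]
print(es_gradient_estimate(f, theta))    # noisy estimate of the same vector
```

The only difference from the exact gradient here is sampling noise, which shrinks as the number of samples grows.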

I'm not a huge fan of the way you imply that it was chosen for rhetorical purposes.

Comment by Jeremy Gillen (jeremy-gillen) on Some Rules for an Algebra of Bayes Nets · 2023-12-19T01:04:48.709Z · LW · GW

This is one of my favorite posts because it gives me tools that I expect to use.

A little while ago, John described his natural latent result to me. It seemed cool, but I didn't really understand how to use it and didn't take the time to work through it properly. I played around with similar math in the following weeks though; I was after a similar goal, which was better ways to think about abstract variables.

More recently, John worked through the natural latent proof on a whiteboard at a conference. At this point I felt like I got it, including the motivation. A couple of weeks later I tried to prove it as an exercise for myself (with the challenge being that I had to do it from memory, rigorously, and including approximation). This took me two or three days, and the version I ended up with used a slightly different version of the same assumptions, and got weaker approximation results. I used the graphoid axioms, which are the standard (but slow and difficult) way of formally manipulating independence relationships (and I didn't have previous experience using them).
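For readers who haven't seen them, these are the graphoid axioms I mean, in standard notation (not from the original comment; $X \perp Y \mid Z$ denotes conditional independence, and the intersection axiom only holds for strictly positive distributions):

```latex
\begin{align*}
\text{Symmetry:} \quad & X \perp Y \mid Z \;\Rightarrow\; Y \perp X \mid Z\\
\text{Decomposition:} \quad & X \perp (Y, W) \mid Z \;\Rightarrow\; X \perp Y \mid Z\\
\text{Weak union:} \quad & X \perp (Y, W) \mid Z \;\Rightarrow\; X \perp Y \mid (Z, W)\\
\text{Contraction:} \quad & X \perp Y \mid Z \;\wedge\; X \perp W \mid (Z, Y) \;\Rightarrow\; X \perp (Y, W) \mid Z\\
\text{Intersection:} \quad & X \perp Y \mid (Z, W) \;\wedge\; X \perp W \mid (Z, Y) \;\Rightarrow\; X \perp (Y, W) \mid Z
\end{align*}
```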

This experience caused me to particularly appreciate this post. It turns lots of work into relatively little work.

Comment by Jeremy Gillen (jeremy-gillen) on Evolution provides no evidence for the sharp left turn · 2023-12-13T23:39:31.864Z · LW · GW

My understanding of the first part of your argument: The rapid (in evolutionary timescales) increase in human capabilities (that led to condoms and ice cream) is mostly explained by human cultural accumulation (i.e. humans developed better techniques for passing on information to the next generation).


My model is different. In my model, there are two things that were needed for the rapid increase in human capabilities. The first was the capacity to invent/create useful knowledge, and the second was the capacity to pass it on.
To me it looks like the human rapid capability gains depended heavily on both.

Comment by Jeremy Gillen (jeremy-gillen) on Jeremy Gillen's Shortform · 2023-11-14T21:07:09.011Z · LW · GW

Ah I see, I was referring to less complete abstractions. The "accurately predict all behavior" definition is fine, but this comes with a scale of how accurate the prediction is. "Directions and simple functions on these directions" probably misses some tiny details like floating point errors, and if you wanted a human to understand it you'd have to use approximations that lose way more accuracy. I'm happy to lose accuracy in exchange for better predictions about behavior in previously-unobserved situations. In particular, it's important to be able to work out what sort of previously-unobserved situation might lead to danger. We can do this with humans and animals etc, we can't do it with "directions and simple functions on these directions".

Comment by Jeremy Gillen (jeremy-gillen) on Jeremy Gillen's Shortform · 2023-11-14T19:33:50.720Z · LW · GW

There aren't really any non-extremely-leaky abstractions in big NNs on top of something like a "directions and simple functions on these directions" layer. (I originally heard this take from Buck)

Of course this depends on what it's trained to do? And it's false for humans and animals and corporations and markets; we have pretty good abstractions that allow us to predict and sometimes modify the behavior of these entities.

I'd be pretty shocked if this statement was true for AGI.

Comment by Jeremy Gillen (jeremy-gillen) on Jeremy Gillen's Shortform · 2023-11-08T20:30:18.190Z · LW · GW

Yeah I think I agree. It also applies to most research about inductive biases of neural networks (and all of statistical learning theory). Not saying it won't be useful, just that there's a large mysterious gap between great learning theories and alignment solutions, and inside that gap is (probably, usually) something like the levels-of-abstraction mistake.

Comment by Jeremy Gillen (jeremy-gillen) on Deconfusing “ontology” in AI alignment · 2023-11-08T20:11:53.854Z · LW · GW

its notion of regulators generally does not line up with neural networks.

When alignment researchers talk about ontologies and world models and agents, we're (often) talking about potential future AIs that we think will be dangerous. We aren't necessarily talking about all current neural networks.

A common-ish belief is that future powerful AIs will be more naturally thought of as being agentic and having a world model. The extent to which this will be true is heavily debated, and gooder regulator is kinda part of that debate.

Biphasic cognition might already be an incomplete theory of mind for humans

Nothing wrong with an incomplete or approximate theory, as long as you keep an eye on the things that it's missing and whether they are relevant to whatever prediction you're trying to make.

Comment by Jeremy Gillen (jeremy-gillen) on Jeremy Gillen's Shortform · 2023-11-08T19:43:48.239Z · LW · GW

Here's a mistake some people might be making with mechanistic interpretability theories of impact (and some other things, e.g. how much Neuroscience is useful for understanding AI or humans).

When there are multiple layers of abstraction that build up to a computation, understanding the low level doesn't help much with understanding the high level. 


Examples:
1. Understanding semiconductors and transistors doesn't tell you much about programs running on the computer. The transistors can be reconfigured into a completely different computer, and you'll still be able to run the same programs. To understand a program, you don't need to be thinking about transistors or logic gates. Often you don't even need to be thinking about the bit level representation of data.

2. The computation happening in single neurons in an artificial neural network doesn't have much relation to the computation happening at a high level. What I mean is that you can switch out activation functions, randomly connect neurons to other neurons, randomly share weights, or replace small chunks of the network with some other differentiable parameterized function. And assuming the thing is still trainable, the overall system will still learn to execute a function that is on a high level pretty similar to whatever high-level function you started with (see the sketch after this list).[1]

3. Understanding how neurons work doesn't tell you much about how the brain works. Neuroscientists understand a lot about how neurons work. There are models that make good predictions about the behavior of individual neurons or synapses. I bet that the high level algorithms that are running in the brain are most naturally understood without any details about neurons at all. Neurons probably aren't even a useful abstraction for that purpose. 
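Here's a rough sketch of the kind of comparison I have in mind for point 2 (entirely my own toy example, assuming PyTorch; the architectures and task are made up for illustration). On this easy task both nets should fit the target closely, so their input-output behavior nearly coincides even though their internals differ neuron-by-neuron:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-2, 2, 256).unsqueeze(1)
y = torch.sin(3 * x) + 0.5 * x                     # target function to fit

def make_net(activation):
    return nn.Sequential(
        nn.Linear(1, 64), activation,
        nn.Linear(64, 64), activation,
        nn.Linear(64, 1),
    )

def train(net, steps=2000):
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return net

net_relu = train(make_net(nn.ReLU()))
net_tanh = train(make_net(nn.Tanh()))

with torch.no_grad():
    # The two nets compute very different things at the level of individual
    # neurons, but implement nearly the same high-level function.
    diff = (net_relu(x) - net_tanh(x)).abs().mean()
    print(f"mean |f_relu(x) - f_tanh(x)| over the training range: {diff:.4f}")
```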

 

Probably directions in activation space are also usually a bad abstraction for understanding how humans work, kinda analogous to how bit-vectors of memory are a bad abstraction for understanding how a program works.

Of course John has said this better.

  1. ^

    You can mess with inductive biases of the training process this way, which might change the function that gets learned, but (my impression is) usually not that much if you're just messing with activation functions.

Comment by Jeremy Gillen (jeremy-gillen) on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-02T21:30:32.869Z · LW · GW

"they should clearly communicate their non-respectful/-kind alternative communication protocols beforehand, and they should help the other person maintain their boundaries;"

Nate did this.

By my somewhat idiosyncratic views on respectful communication, Nate was roughly as respectful as Thomas Kwa. 

I do seem to be unusually emotionally compatible with Nate's style of communication though.

Comment by Jeremy Gillen (jeremy-gillen) on Instrumental Convergence? [Draft] · 2023-07-23T09:51:27.966Z · LW · GW

Section 4 then showed how those initial results extend to the case of sequential decision making.

[...]

If she's a resolute chooser, then sequential decisions reduce to a single non-sequential decisions.

Ah thanks, this clears up most of my confusion, I had misunderstood the intended argument here. I think I can explain my point better now:

I claim that proposition 3, when extended to sequential decisions with a resolute decision theory, shouldn't be interpreted the way you interpret it. The meaning changes when you make A and B into sequences of actions.

Let's say action A is a list of 1000000 particular actions (e.g. 1000000 small-edits) and B is a list of 1000000 particular actions (e.g. 1 improve-technology, then 999999 amplified-edits).[1]

Proposition 3 says that A is equally likely to be chosen as B (for randomly sampled desires). This is correct. Intuitively this is because A and B are achieving particular outcomes and desires are equally likely to favor "opposite" outcomes.

However this isn't the question we care about. We want to know whether action-sequences that contain "improve-technology" are more likely to be optimal than action-sequences that don't contain "improve-technology", given a random desire function. This is a very different question to the one proposition 3 gives us an answer to.

Almost all optimal action-sequences could contain "improve-technology" at the beginning, while any two particular action sequences are equally likely to be preferred to the other on average across desires. These two facts don't contradict each other. The first fact is true in many environments (e.g. the one I described[2]) and this is what we mean by instrumental convergence. The second fact is unrelated to instrumental convergence.


I think the error might be coming from this definition of instrumental convergence: 

could we nonetheless say that she's got a better than $1/N$ probability of choosing $A$ from a menu of $N$ acts?

When $A$ is a sequence of actions, this definition makes less sense. It'd be better to define it as something like "from a menu of $N$ initial actions, she has a better than $1/N$ probability of choosing a particular initial action $a$". 

 

 

I'm not entirely sure what you mean by "model", but from your use in the penultimate paragraph, I believe you're talking about a particular decision scenario Sia could find herself in.

Yep, I was using "model" to mean "a simplified representation of a complex real world scenario".

  1. ^

    For simplicity, we can make this scenario a deterministic known environment, and make sure the number of actions available doesn't change if "improve-technology" is chosen as an action. This way neither of your biases apply.

  2. ^

E.g. we could define a "small-edit" as adding some small fixed amount to any location in the state vector, and an "amplified-edit" as adding a much larger fixed amount to any location. This preserves the number of actions, and makes the advantage of "amplified-edit" clear. I can go into more detail if you like; this does depend a little on how we set up the distribution over desires.

Comment by Jeremy Gillen (jeremy-gillen) on Instrumental Convergence? [Draft] · 2023-07-14T19:53:54.895Z · LW · GW

I read about half of this post when it came out. I didn't want to comment without reading the whole thing, and reading the whole thing didn't seem worth it at the time. I've come back and read it because Dan seemed to reference it in a presentation the other day.

The core interesting claim is this:

My conclusion will be that most of the items on Bostrom's laundry list are not 'convergent' instrumental means, even in this weak sense. If Sia's desires are randomly selected, we should not give better than even odds to her making choices which promote her own survival, her own cognitive enhancement, technological innovation, or resource acquisition.

This conclusion doesn't follow from your arguments. None of your models even include actions that are analogous to the convergent actions on that list. 

The non-sequential theoretical model is irrelevant to instrumental convergence, because instrumental convergence is about putting yourself in a better position to pursue your goals later on. The main conclusion seems to come from proposition 3, but the model there is so simple it doesn’t include any possibility of Sia putting itself in a better position for later.

Section 4 deals with sequential decisions, but for some reason mainly gets distracted by a Newcomb-like problem, which seems irrelevant to instrumental convergence. I don't see why you didn't just remove Newcomb-like situations from the model? Instrumental convergence will show up regardless of the exact decision theory used by the agent.

Here's my suggestion for a more realistic model that would exhibit instrumental convergence, while still being fairly simple and having "random" goals across trajectories. Make an environment with 1,000,000 timesteps. Have the world state described by a vector of 1000 real numbers. Have a utility function that is randomly sampled from some Gaussian process (or any other high-entropy distribution over functions) on $\mathbb{R}^{1000}$. Assume there exist standard actions which directly make small edits to the world-state vector. Assume that there exist actions analogous to cognitive enhancement, making technology and gaining resources. Intelligence can be used in the future to more precisely predict the consequences of actions on the future world state (you'd need to model a bounded agent for this). Technology can be used to increase the amount or change the type of effect your actions have on the world state. Resources can be spent in the future for more control over the world state. It seems clear to me that for the vast majority of the random utility functions, it's very valuable to have more control over the future world state. So most sampled agents will take the instrumentally convergent actions early in the game and use the additional power later on.
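To illustrate the shape of this claim, here's a heavily scaled-down toy version (entirely my own sketch, not the model above: 20 timesteps, 5 dimensions, a random quadratic "desire" standing in for a Gaussian-process sample, and a single "amplify" action standing in for the enhancement/technology/resource actions):

```python
import numpy as np

rng = np.random.default_rng(0)

D, T = 5, 20             # state dimension, number of timesteps
SMALL, BIG = 0.1, 1.0    # edit size before/after "amplify"

def best_final_utility(goal, step_sizes):
    """Greedy play: each step, move the worst coordinate toward its goal value
    by at most that step's edit size. Utility is -||s_T - goal||^2."""
    s = np.zeros(D)
    for step in step_sizes:
        gaps = np.abs(goal - s)
        i = int(np.argmax(gaps))
        s[i] += np.sign(goal[i] - s[i]) * min(step, gaps[i])
    return -np.sum((s - goal) ** 2)

N, wins = 1000, 0
for _ in range(N):
    goal = rng.normal(0, 1, size=D)                            # a randomly sampled "desire"
    u_plain = best_final_utility(goal, [SMALL] * T)            # never amplify
    u_amp = best_final_utility(goal, [0.0] + [BIG] * (T - 1))  # spend step 1 amplifying
    wins += u_amp > u_plain

print(f"amplify-first does better for {wins / N:.0%} of sampled desires")
```

The point isn't the exact percentage, just that the power-gaining first move is optimal for most randomly sampled desires, even though no particular full action-sequence is favored over any other on average.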

The assumptions I made about the environment are inspired by the real world environment, and the assumptions I've made about the desires are similar to yours, maximally uninformative over trajectories.

Comment by Jeremy Gillen (jeremy-gillen) on What money-pumps exist, if any, for deontologists? · 2023-06-29T06:19:40.205Z · LW · GW

I'm not sure how to implement the rule "don't pay people to kill people". Say we implement it as a utility function over world-trajectories, where any trajectory that involves a killing causally downstream of your actions gets MIN_UTILITY. This still makes probabilistic tradeoffs, so it's probably not what we want. If we use negative infinity instead, then it can't ever take actions in a large or uncertain world. We'd need to add the patch that the agent must have been aware, at the time of taking its actions, that the actions had some chance of causing murder. I think these are vulnerable to blackmail, because you could threaten to cause murders that are causally downstream from its actions.
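A minimal sketch of the two implementations being contrasted (my own illustration; `Event`, `MIN_UTILITY`, the payoffs, and the murder predicate are all made up for the example):

```python
import math
from dataclasses import dataclass

MIN_UTILITY = -1e9   # made-up large finite penalty

@dataclass
class Event:
    value: float
    is_downstream_murder: bool = False

def utility_min(trajectory):
    """Finite-penalty version: any trajectory containing a killing causally
    downstream of our actions is clamped to MIN_UTILITY."""
    if any(e.is_downstream_murder for e in trajectory):
        return MIN_UTILITY
    return sum(e.value for e in trajectory)

def utility_inf(trajectory):
    """Same rule, but with negative infinity as the penalty."""
    if any(e.is_downstream_murder for e in trajectory):
        return -math.inf
    return sum(e.value for e in trajectory)

# The finite version still makes probabilistic tradeoffs: a plan with a tiny
# chance of a downstream murder but slightly higher payoff beats a clean plan
# in expectation.
clean = [Event(10.0)]
risky_good = [Event(12.0)]                       # what usually happens
risky_bad = [Event(12.0), Event(0.0, True)]      # rare branch with a murder
p = 1e-9
print(utility_min(clean))                                               # 10.0
print((1 - p) * utility_min(risky_good) + p * utility_min(risky_bad))   # ~11.0

# The infinite version removes the tradeoff, but any plan with nonzero
# probability of a downstream murder gets expected utility -inf, which in a
# large uncertain world is essentially every plan.
print((1 - p) * utility_inf(risky_good) + p * utility_inf(risky_bad))   # -inf
```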

Maybe I'm confused and you mean "actions that pattern match to actually paying money directly for murder", in which case it will just use a longer causal chain, or opaque companies that may-or-may-not-cause-murders will appear and trade with it.

If the ultimate patch is "don't take any action that allows unprincipled agents to exploit you for having your principles", then maybe there aren't any edge cases. I'm confused about how to define "exploit" though.

Comment by Jeremy Gillen (jeremy-gillen) on What money-pumps exist, if any, for deontologists? · 2023-06-28T20:33:27.809Z · LW · GW

You leave money on the table in all the problems where the most efficient-in-money solution involves violating your constraint. So there's some selection pressure against you if selection is based on money.
We can (kinda) turn this into a money-pump by charging the agent a fee to violate the constraint on its behalf: whenever it encounters such a situation, it pays you a fee and you do the killing.
Whether or not this counts as a money pump, I think it satisfies the reasons I actually care about money pumps, which are something like "adversarial agents can cheaply construct situations where I pay them money, but the world isn't actually different".