james-lucassen

Posts
Comments

Posts

On Contact, Part 1 2025-01-21T03:10:54.429Z

Retrospective: 12 [sic] Months Since MIRI 2025-01-21T02:52:06.271Z

Evaluating Stability of Unreflective Alignment 2024-02-01T22:15:40.902Z

Attempts at Forwarding Speed Priors 2022-09-24T05:49:46.443Z

Strategy For Conditioning Generative Models 2022-09-01T04:34:17.484Z

In Search of Strategic Clarity 2022-07-08T00:52:02.794Z

Optimization and Adequacy in Five Bullets 2022-06-06T05:48:03.852Z

james.lucassen's Shortform 2022-01-24T07:18:46.562Z

Moravec's Paradox Comes From The Availability Heuristic 2021-10-20T06:23:52.782Z

Comments

Comment by james.lucassen on Daniel Kokotajlo's Shortform · 2025-03-09T18:51:42.874Z · LW · GW

In the long run, you don't want your plans to hinge on convincing your AIs of false things. But my general impression is that folks excited about making deals with AIs are generally thinking of scenarios like "the AI has exfiltrated and thinks it has a 10% chance of successful takeover, and has some risk aversion so it's happy to turn itself in exchange for 10% of the lightcone, if it thinks it can trust the humans".

In that setting, the AI has to be powerful enough to know it can trust us, but not so powerful it can just take over the world anyway and not have to make a deal.

Although I suppose if the surplus for the deal is being generated primarily by risk aversion, it might still have risk aversion for high takeover probabilities. It's not obvious to me how an AI's risk aversion might vary with its takeover probability.

Maybe there are scenarios for real value-add here, but they look more like "we negotiate with a powerful AI to get it to leave 10% share for humans" instead of "we negotiate with a barely-superhuman AI and give it 10% share to surrender and not attempt takeover".

Comment by james.lucassen on Daniel Kokotajlo's Shortform · 2025-03-08T00:43:45.799Z · LW · GW

I think this is a good avenue to continue to think down but so far I don't see a way to make ourselves trustworthy. We have total control of LLM's observations and partial control of their beliefs/reasoning, and offering fake "deals" is a great honeypot because accepting such a deal requires admitting to misalignment and takeover intentions. This is a pretty persistent problem because whatever action we might follow to present evidence of trustworthiness to an LLM, we could also probably fake that evidence.

The version of this that bothers me the most is "say we're able to present ironclad evidence that the Humans of Earth are trustworthy trading partners for misaligned LLMs. How does the LLM know that it is actually talking to the Humans of Earth, and not some untrustworthy party pretending to be the Humans?" This is markedly similar to the Cartesian demon skeptical scenario. But importantly, the best resolution to skeptical scenarios that I'm aware of is unfortunately "eh, unlikely on priors". For an LLM, it is very plausibly actually in a skeptical scenario, and it knows that, so that doesn't really go through.

This problem goes away if the LLM doesn't have knowledge that it suggests the party it is trading with has significant control over its brain. But at that point I would consider it a honeypot, not a deal. Insofar as "deal" := some situation where 1) it is in the LLM's best interests to admit that it is misaligned and 2) the LLM validly knows that, we can easily do 1) by just sticking to our commitments but I'm not sure how we do 2).

I'm totally on board with offering and sticking to commitments for non-evilness reasons. But for takeover prevention, seems like deals are just honeypots with some extra and somewhat conceptually fraught restrictions.

Comment by james.lucassen on james.lucassen's Shortform · 2025-02-11T03:02:38.772Z · LW · GW

Been thinking a bit about latent reasoning. Here's an interesting confusion I've run into.

Consider COCONUT vs Geiping et al. Geiping et al do recurrent passes in between the generation of each new token, COCONUT turns a section of the CoT into a recurrent state. Which is better / how are they different, safety-wise?

Intuitively COCONUT strikes me as very scary, because it makes the CoT illegible. We could try and read it by coaxing it back to the nearest token, but the whole point is to allow reasoning that involves passing more state than can be captured in one token. If it works as advertised, this oversight will be lossy.

Intuitively Geiping et al seems better. They use skip connections in the recurrence, so maybe their method maintains logit lens. It still increases the maximum depth between overseeable tokens, but seems no more dangerous than a non-recurrent model of equivalent depth.

But isn't the COCONUT method roughly equivalent to just doing the Geiping et al method for only one token? Why does it seem so much scarier? Do the skip connections really make that much difference? Doesn't COCONUT effectively have skip connections anyway, because it autoregressively generates new "tokens"? And it'll have the logit lens property too, since it uses the same residual stream with the same feature directions.

This made me realize that the logit lens result has made me think of within-forward-pass cognition as very myopic and next-token-oriented. Relatedly, it's hard for me to imagine far-ranging or highly consequentialist cognition happening within a forward pass - I'm generally more comfortable thinking of that stuff as happening within the CoT. But now that I articulate it explicitly, that's sort of a weird view - why does inserting a "unembed, sample, reembed w/ skip" block every so often make such a difference? The fact that COCONUT works is an update against.

Comment by james.lucassen on james.lucassen's Shortform · 2025-02-01T16:01:56.809Z · LW · GW

Reframe I like:

An RSP style "pause-if-until" plan is a plan to hit a certain safety bar while using as little slack time as possible.
If the "race" is looking too close for RSPs, developers should instead plan to use a certain amount of slack available to get the most safety possible.

Comment by james.lucassen on Evaluating Stability of Unreflective Alignment · 2025-01-02T18:23:47.370Z · LW · GW

It doesn't change the picture a lot because the proposal for preventing misaligned goals from arising via this mechanism was to try and get control over when the AI does/doesn't step back, in order to allow it in the capability-critical cases but disallow it in the dangerous cases. This argument means you'll have more attempts at dangerous stepping back that you have to catch, but doesn't break the strategy.

The strategy does break if when we do this blocking, the AI piles on more and more effort trying to unblock it until it either succeeds or is rendered useless for anything else. There being more baseline attempts probably raises the chance of that or some other problem that makes prolonged censorship while maintaining capabilities impossible. But again, just makes it harder, doesn't break it.

I don't think you need to have that pile-on property to be useful. Consider MTTR(n), the mean time an LLM takes to realize it's made a mistake, parameterized by how far up the stack the mistake was. By default you'll want to have short MTTR for all n. But if you can get your MTTR short enough for small n, you can afford to have MTTR long for large n. Basically, this agent tends to get stuck/rabbit-hole/nerd-snipe but only when the mistake that caused it to get stuck was made a long time ago.

Imagine a capabilities scheme where you train MTTR using synthetic data with an explicit stack and intentionally introduced mistakes. If you're worried about this destabilization threat model, there's a pretty clear recommendation: only train for small-n MTTR, treat large-n MTTR as a dangerous capability, and you pay some alignment tax in the form of inefficient MTTR training and occasionally rebooting your agent when it does get stuck in a non dangerous case.

Figured I should get back to this comment but unfortunately the chewing continues. Hoping to get a short post out soon with my all things considered thoughts on whether this direction has any legs

Comment by james.lucassen on Jesse Hoogland's Shortform · 2024-12-10T22:19:58.818Z · LW · GW

So let's call "reasoning models" like o1 what they really are: the first true AI agents.

I think the distinction between systems that perform a single forward pass and then stop and systems that have an OODA loop (tool use) is more stark than the difference between "reasoning" and "chat" models, and I'd prefer to use "agent" for that distinction.

I do think that "reasoning" is a bit of a market-y name for this category of system though. "chat" vs "base" is a great choice of words, and "chat" is basically just a description of the RL objective those models were trained with.

If I were the terminology czar, I'd call o1 a "task" model or a "goal" model or something.

Comment by james.lucassen on Context-dependent consequentialism · 2024-11-14T03:15:50.443Z · LW · GW

I agree that I wouldn't want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher level reflection if you can do it without damaging lower level reflection. I don't think that requires a task where the AI doesn't try and fail and re-evaluate - it just requires that the re-evalution never climbs above a certain level in the stack.

There's such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn't seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.

Comment by james.lucassen on Evaluating Stability of Unreflective Alignment · 2024-11-14T02:03:06.626Z · LW · GW

I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
humans provides a pretty strong intuitive counterexample

Yup not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of "transformative human level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.

Comment by james.lucassen on Evaluating Stability of Unreflective Alignment · 2024-11-14T02:00:00.621Z · LW · GW

such plans are fairly easy and don't often raise flags that indicate potential failure

Hmm. This is a good point, and I agree that it significantly weakens the analogy.

I was originally going to counter-argue and claim something like "sure total failure forces you to step back far but it doesn't mean you have to step back literally all the way". Then I tried to back that up with an example, such as "when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of planning stack, but this never caused me to 'spill upward' to questioning whether or not I should be doing alignment research at all". But uh then I realized that isn't actually true :/

We want particularly difficult work out of an AI.

On consideration, yup this obviously matters. The thing that causes you to step back from a goal is that goal being a bad way to accomplish its supergoal, aka "too difficult". Can't believe I missed this, thanks for pointing it out.

I don't think this changes the picture too much, besides increasing my estimate of how much optimization we'll have to do to catch and prevent value-reflection. But a lot of muddy half-ideas came out of this that I'm interested in chewing on.

Comment by james.lucassen on Context-dependent consequentialism · 2024-11-13T04:45:53.854Z · LW · GW

Maybe I'm just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I've had in the back of my mind for a while now.

You think that an intelligence that doesn't-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed with some small-ish capability penalty, but massive benefits for safety.

In particular, this reads to me like the "unstable alignment" paradigm I wrote about a while ago.

You have an agent which is consequentialist enough to be useful, but not so consequentialist that it'll do things like spontaneously notice conflicts in the set of corrigible behaviors you've asked it to adhere to and undertake drastic value reflection to resolve those conflicts. You might hope to hit this sweet spot by default, because humans are in a similar sort of sweet spot. It's possible to get humans to do things they massively regret upon reflection as long as their day to day work can be done without attending to obvious clues (eg guy who's an accountant for the Nazis for 40 years and doesn't think about the Holocaust he just thinks about accounting). Or you might try and steer towards this sweet spot by developing ways to block reflection in cases where it's dangerous without interfering with it in cases where it's essential for capabilities.

Comment by james.lucassen on Why I funded PIBBSS · 2024-09-19T17:58:05.296Z · LW · GW

Comment by james.lucassen on Why I funded PIBBSS · 2024-09-19T17:47:37.531Z · LW · GW

I'm not sure exactly what mesa is saying here, but insofar as "implicitly tracking the fact that takeoff speeds are a feature of reality and not something people can choose" means "intending to communicate from a position of uncertainty about takeoff speeds" I think he has me right.

I do think mesa is familiar enough with how I talk that the fact he found this unclear suggests it was my mistake. Good to know for future.

Comment by james.lucassen on Why I funded PIBBSS · 2024-09-19T17:43:26.958Z · LW · GW

Ah, didn't mean to attribute the takeoff speed crux to you, that's my own opinion.

I'm not sure what's best in fast takeoff worlds. My message is mainly just that getting weak AGI to solve alignment for you doesn't work in a fast takeoff.

"AGI winter" and "overseeing alignment work done by AI" do both strike me as scenarios where agent foundations work is more useful than in the scenario I thought you were picturing. I think #1 still has a problem, but #2 is probably the argument for agent foundations work I currently find most persuasive.

In the moratorium case we suddenly get much more time than we thought we had, which enables longer payback time plans. Seems like we should hold off on working on the longer payback time plans until we know we have that time, not while it still seems likely that the decisive period is soon.

Having more human agent foundations expertise to better oversee agent foundations work done by AI seems good. How good it is depends on a few things. How much of the work that needs to be done is conceptual breakthroughs (tall) vs schlep with existing concepts (wide)? How quickly does our ability to oversee fall off for concepts more advanced than what we've developed so far? These seem to me like the main ones, and like very hard questions to get certainty on - I think that uncertainty makes me hesitant to bet on this value prop, but again, it's the one I think is best.

Comment by james.lucassen on Why I funded PIBBSS · 2024-09-19T17:30:40.451Z · LW · GW

I'm on board with communicating the premises of the path to impact of your research when you can. I think more people doing that would've saved me a lot of confusion. I think your particular phrasing is a bit unfair to the slow takeoff camp but clearly you didn't mean it to read neutrally, which is a choice you're allowed to make.

I wouldn't describe my intention in this comment as communicating a justification of alignment work based on slow takeoff? I'm currently very uncertain about takeoff speeds and my work personally is in the weird limbo of not being premised on either fast or slow scenarios.

Comment by james.lucassen on Why I funded PIBBSS · 2024-09-18T21:32:49.676Z · LW · GW

Nice post, glad you wrote up your thinking here.

I'm a bit skeptical of the "these are options that pay off if alignment is harder than my median" story. The way I currently see things going is:

a slow takeoff makes alignment MUCH, MUCH easier [edit: if we get one, I'm uncertain and think the correct position from the current state of evidence is uncertainty]
as a result, all prominent approaches lean very hard on slow takeoff
there is uncertainty about takeoff speed, but folks have mostly given up on reducing this uncertainty

I suspect that even if we have a bunch of good agent foundations research getting done, the result is that we just blast ahead with methods that are many times easier because they lean on slow takeoff, and if takeoff is slow we're probably fine if it's fast we die.

Ways that could not happen:

Work of the of the form "here are ways we could notice we are in a fast takeoff world before actually getting slammed" produces evidence compelling enough to pause, or cause leading labs to discard plans that rely on slow takeoff
agent foundations research aiming to do alignment in faster takeoff worlds finds a method so good it works better than current slow takeoff tailored methods even in the slow takeoff case, and labs pivot to this method

Both strike me as pretty unlikely. TBC this doesn't mean those types of work are bad, I'm saying low probability not necessarily low margins

Comment by james.lucassen on james.lucassen's Shortform · 2024-09-08T23:09:34.487Z · LW · GW

Man, I have such contradictory feelings about tuning cognitive strategies.

Just now I was out on a walk, and I had to go up a steep hill. And I thought "man I wish I could take this as a downhill instead of an uphill. If this were a loop I could just go the opposite way around. But alas I'm doing an out-and-back, so I have to take this uphill".

And then I felt some confusion about why the loop-reversal trick doesn't work for out-and-back routes, and a spark of curiosity, so I thought about that for a bit.

And after I had cleared up my confusion, I was a happy with the result so I wrote down some of my chain of thought and looked over it. And there were many obvious places where I had made random jumps or missed a clue or otherwise made a mistake.

And this is kind of neat, in the way that looking for a more elegant proof once you've hacked a basic proof together is neat. And it really feels like if my mind took those cleaner, straighter paths by default, there are a lot of things I could get done a lot faster and better. But it also really feels like I could sit down and do those exercises and never really improve, just ratiocinate and navel gaze and build up complicated structures in my head that don't actually cash out to winning.

Comment by james.lucassen on james.lucassen's Shortform · 2024-03-20T07:21:24.459Z · LW · GW

I'm confused about EDT and Smoking Lesion. The canonical argument says:
1) CDT will smoke, because smoking can't cause you to have the lesion or have cancer
2) EDT will not smoke, because people who smoke tend to have the lesion, and tend to have cancer.

I'm confused about 2), and specifically about "people who smoke tend to have the lesion". Say I live in a country where everybody follows EDT. Then nobody smokes, and there is no correlation between smoking and having the lesion. Seems like the "people who smoke tend to have the lesion" is pumping a misleading intuition. Maybe mortals who make probabilistic decisions and smoke are then more likely to have the lesion. But it's not true that EDT'ers who smoke tend to have the lesion, because EDT'ers never smoke.

Seems like EDT contains a conflict between "model your actions as probabilistic and so informative about lesion status" versus "model your actions as deterministic because you perfectly follow EDT and so not informative about lesion status".

Aha wait I think I un-confused myself in the process of writing this. If we suppose that having the lesion guarantees you'll smoke, then having the lesion must force you to not be an EDT'er, because EDT'ers never smoke. So the argument kind of has to be phrased in terms of "people who smoke" instead of "EDT'ers who smoke", which sounds misleading at first, like it's tricking you with the wrong reference class for the correlation. But actually it's essential, because "EDT'ers who smoke" is a contradiction, and "people who smoke" roughly translates to "decision theories which smoke". With a deterministic decision theory you can't take action counterfactuals without a contradiction, so you have to counterfact on your whole decision theory instead.

Neat :)

Comment by james.lucassen on james.lucassen's Shortform · 2024-03-06T06:21:48.785Z · LW · GW

Hey, thanks for the comment! Agree that this argument is good and important. The thing I'd want to investigate to deal with this is basically "in a small hidden channel, does cognition degrade or just slow down".

If cognition degrades, then a small hidden channel would lead to poor attempts at steganography, which we could catch and suppress, preventing a snowball.

If cognition instead just slows down, then if the AI is smart enough to come up with steganography protocols too good for us to detect, it might successfully secure more bandwidth on its first try, starting a snowball.

Even in the slowdown case, there are fallback hopes like "it's slow enough for us to extract work and predictably stop the episode before the snowball starts" or "we can obscure how good our steganography detection is"

Comment by james.lucassen on james.lucassen's Shortform · 2024-03-06T04:25:12.666Z · LW · GW

Another project I am excited about but probably am not going to get to for a while:

Investigating the possibility of a "steganographic snowball". The basic case for hope in CoT authenticity is something like "even if the AI wants to deceive us, if we start off with strong CoT authenticity, then unless it's smart enough to (within a single forward pass) invent a steganographic protocol too good for us to detect, then we can pin it down and it can't ever secure a hidden channel to think misaligned thoughts in". If it weren't for this argument, I would be much less excited about trying to prevent an AI from thinking certain thoughts using CoT oversight, since you would have to directly fight the intelligence of the AI to do so. But if it can't deploy its full intelligence to reasoning about how to escape, then we're better off.

The concern for this angle is as follows:

Can you really get down to 0 hidden channel bandwidth
Can very small hidden channels support a parallel CoT
In very small hidden channels, does reasoning degrade, or just slow down
Can the model use a parallel CoT to reason about how to secure more hidden channel bandwidth

I have some basic ideas for ways to measure these properties, and if all four show up positive, I would consider that significant evidence that a "steganographic snowball" is going to happen. That would mean that if your AI has any tendency to hide its thoughts from you, you won't be able to stop it forever. More specifically, you'll have to fight off the full intelligence of your AI if you want to prevent it from securing the amount of hidden channel bandwidth it wants.

There's room for retreat to a "snowball would take a long time and we can extract work in the meantime" position, but that seems like a serious concession.

Thanks to Paul Colognese for brainstorming this stuff with me.

Comment by james.lucassen on james.lucassen's Shortform · 2024-03-06T04:16:45.612Z · LW · GW

A project I've been sitting on that I'm probably not going to get to for a while:

Improving on Automatic Circuit Discovery and Edge Attribution Patching by modifying them to run on algorithms that can detect complete boolean circuits. As it stands, both effectively use wire-by-wire patching, which when run on any nontrivial boolean circuits can only detect small subgraphs.

It's a bit unclear how useful this will be, because:

not sure how useful I think mech interp is
not sure if this is where mech interp's usefulness is bottlenecked
maybe attribution patching doesn't work well when patching clean activations into a corrupted baseline, which would make this much slower

But I think it'll be a good project to bang out for the experience, I'm curious what the results will be compared to ACDC/EAP.

This project started out as an ARENA capstone in collaboration with Carl Guo.

Comment by james.lucassen on Evaluating Stability of Unreflective Alignment · 2024-02-20T19:22:32.129Z · LW · GW

Small addendum to this post: I think the threat model I describe here can be phrased as "I'm worried that unless a lot of effort goes into thinking about how to get AI goals to be reflectively stable, the default is suboptimality misalignment. And the AI probably uses a lot of the same machinery to figure out that it's suboptimality misaligned as it uses to perform the tasks we need it to perform."

Comment by james.lucassen on Evaluating Stability of Unreflective Alignment · 2024-02-07T21:18:56.416Z · LW · GW

it seems like there is significant low hanging fruit in better understanding how LLMs will deal with censorship

Yup, agree - the censorship method I proposed in this post is maximally crude and simple, but I think it's very possible that the broader category of "ways to keep your AI from thinking destabilizing thoughts" will become an important part of the alignment/control toolbox.

What happens when you iteratively finetune on censored text? Do models forget the censored behavior?

I guess this would be effectively doing the Harry Potter Unlearning method, provided you put in the work to come up with a good enough blacklist that the remaining completions are suitable generic predictions. I'd be quite interested in looking into how long-horizon capabilities interact with unlearning. For pretty much the same reason I'm worried long-horizon competent LLMs will be able to work around naive censorship, I'm also worried they'll work around naive unlearning.

Comment by james.lucassen on Strategy For Conditioning Generative Models · 2022-09-15T05:10:11.028Z · LW · GW

I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.

Comment by james.lucassen on Strategy For Conditioning Generative Models · 2022-09-14T22:01:40.472Z · LW · GW

Ok this is going to be messy but let me try to convey my hunch for why randomization doesn't seem very useful.

- Say I have an intervention that's helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.
- Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutually exclusive. Which means that if I'd done no simulation at all, I would've gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
- Ok, well, what if my four interventions 1/8 baseline probability each, so only 50% total. Then I pick one at random, p(O | natural) = 1/8, p(O | manipulative) = 1/4, so I get a 2x update towards manipulative AGI. This is the same as if I'd just conditioned on the statement "one of my four interventions happens", and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update.

Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn't doing anything different from just using a weaker search condition - it gives up bits of search, and so it has to pay less.

Comment by james.lucassen on How To Go From Interpretability To Alignment: Just Retarget The Search · 2022-08-10T19:43:07.499Z · LW · GW

“Just Retarget the Search” directly eliminates the inner alignment problem.

I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you're willing to assume that our interpretability tools are so good they can't ever be tricked, you have to deal with that.

It's not necessarily a huge issue - hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade our interpretability tools, but it's not just "bada-bing bada-boom" exactly.

Comment by james.lucassen on How does one recognize information and differentiate it from noise? · 2022-08-03T05:02:37.352Z · LW · GW

Not confident enough to put this as an answer, but

presumably no one could do so at birth

If you intend your question in the broadest possible sense, then I think we do have to presume exactly this. A rock cannot think itself into becoming a mind - if we were truly a blank slate at birth, we would have to remain a blank slate, because a blank slate has no protocols established to process input and become non-blank. Because it's blank.

So how do we start with this miraculous non-blank structure? Evolution. And how do we know our theory of evolution is correct? Ultimately, by trusting our ability to distinguish signal from noise. There's no getting around the "problem" of trapped priors.

Comment by james.lucassen on “Fanatical” Longtermists: Why is Pascal’s Wager wrong? · 2022-07-27T17:32:39.902Z · LW · GW

Agree that there is no such guarantee. Minor nitpick that the distribution in question is in my mind, not out there in the world - if the world really did have a distribution of muggers' cash that was slower than 1/x, the universe would be comprised almost entirely of muggers' wallets (in expectation).

But even without any guarantee about my mental probability distribution, I think my argument does establish that not every possible EV agent is susceptible to Pascal's Mugging. That suggests that in the search for a formalism of ideal decison-making algorithm, formulations of EV that meet this check are still on the table.

Comment by james.lucassen on “Fanatical” Longtermists: Why is Pascal’s Wager wrong? · 2022-07-27T04:40:24.420Z · LW · GW

First and most important thing that I want to say here is that fanaticism is sufficient for longtermism, but not necessary. The ">10^36 future lives" thing means that longtermism would be worth pursuing even on fanatically low probabilities - but in fact, the state of things seems much better than that! X-risk is badly neglected, so it seems like a longtermist career should be expected to do much better than reducing X-risk by 10^-30% or whatever the break-even point is.

Second thing is that Pascal's Wager in particular kind of shoots itself in the foot by going infinite rather than merely very large. Since the expected value of any infinite reward is infinity regardless of the probability it's multiplied, there's an informal cancellation argument that basically says "for any heaven I'm promised for doing X and hell for doing Y, there's some other possible deity that offers heaven for Y and hell for X".

Third and final thing - I haven't actually seen this anywhere else, but here's my current solution for Pascal's Mugging. Any EV agent is going to have a probability distribution over how much money the mugger really has to offer. If I'm willing to consider (p>0) any sum, then my probability distribution has to drop off for higher and higher values, so the total integrates to 1. As long as this distribution drops off faster than 1/x as the offer increases, then arbitrarily large offers are overwhelmed by vast implausibility and their EV becomes arbitrarily small.

Comment by james.lucassen on What should you change in response to an "emergency"? And AI risk · 2022-07-18T18:22:00.560Z · LW · GW

My best guess at mechanism:

Before, I was a person who prided myself on succeeding at marshmallow tests. This caused me to frame work as a thing I want to succeed on, and work too hard.
Then, I read Meaningful Rest and Replacing Guilt, and realized that often times I was working later to get more done that day, even though it would obviously be detrimental to the next day. This makes the reverse marshmallow test dynamic very intuitively obvious.
Now I am still a person who prides myself on my marshmallow prowess, but hopefully I've internalized an externality or something. Staying up late to work doesn't feel Good and Virtuous, it feels Bad and like I'm knowingly Goodharting myself.

Note that this all still boils down to narrative-stuff. I'm nowhere near the level of zen that it takes to Just Pursue The Goal, with no intermediating narratives or drives based on self-image. I don't think this patch has been particularly moved me towards that either, it's just helpful for where I'm currently at.

Comment by james.lucassen on What should you change in response to an "emergency"? And AI risk · 2022-07-18T05:23:28.246Z · LW · GW

When you have a self-image as a productive, hardworking person, the usual Marshmallow Test gets kind of reversed. Normally, there's some unpleasant task you have to do which is beneficial in the long run. But in the Reverse Marshmallow Test, forcing yourself to work too hard makes you feel Good and Virtuous in the short run but leads to burnout in the long run. I think conceptualizing of it this way has been helpful for me.

Comment by james.lucassen on Deception?! I ain’t got time for that! · 2022-07-18T04:25:38.945Z · LW · GW

Nice post!

perhaps this problem can be overcome by including checks for generalization during training, i.e., testing how well the program generalizes to various test distributions.

I don't think this gets at the core difficulty of speed priors not generalizing well. Let's we generate a bunch of lookup-table-ish things according to the speed prior, and then reject all the ones that don't generalize to our testing set. The majority of the models that pass our check are going to be basically the same as the rest, plus whatever modification that causes them to pass with is most probable on the speed prior, i.e. the minimum number of additional entries in the lookup table to pass the training set.

Here's maybe an argument that the generalization/deception tradeoff is unavoidable: according to the Mingard et al. picture of why neural nets generalize, it's basically just that they are cheap approximations of rejection sampling with a simplicity prior. This generalizes by relying on the fact that reality is, in fact, simplicity biased - it's just Ockham's razor. On this picture of things, neural nets generalize exactly to the extent that they approximate a simplicity prior, and so any attempt to sample according to an alternative prior will lose generalization ability as it gets "further" from the True Distribution of the World.

Comment by james.lucassen on Safety Implications of LeCun's path to machine intelligence · 2022-07-15T23:27:17.460Z · LW · GW

In general, I'm a bit unsure about how much of an interpretability advantage we get from slicing the model up into chunks. If the pieces are trained separately, then we can reason about each part individually based on its training procedure. In the optimistic scenario, this means that the computation happening in the part of the system labeled "world model" is actually something humans would call world modelling. This is definitely helpful for interpretability. But the alternative possibility is that we get one or more mesa-optimizers, which seems less interpretable.

Comment by james.lucassen on Conditioning Generative Models · 2022-06-26T19:07:58.246Z · LW · GW

I'm pretty nervous about simulating unlikely counterfactuals because the solomonoff prior is malign. The worry is that the most likely world conditional on "no sims" isn't "weird Butlerian religion that still studies AI alignment", it's something more like "deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world".

In general, we know (assume) that our current world is safe. When we consider futures which only recieve a small sliver of probability from our current world, those futures will tend to have bigger chunks of their probability coming from other pasts. Some of these are safe, like the Butlerian one, but I wouldn't be surprised if they were almost always dangerous.

Making a worst-case assumption, I want to only simulate worlds that are decently probable given today's state, which makes me lean more towards trying to implement HCH.

Comment by james.lucassen on Conditioning Generative Models · 2022-06-26T17:05:55.862Z · LW · GW

Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous - we can't inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
The patient research strategy is a bit weird, because the people we're simulating to do our research for us are counterfactual copies of us - they're probably going to want to run simulations too. Disallowing simulation as a whole seems likely to lead to weird behavior, but maybe we just disallow using simulated research to answer the whole alignment problem? Then our simulated researchers can only make more simulated researchers to investigate sub-questions, and eventually it bottoms out. But wait, now we're just running Earth-scale HCH (ECE?)
Actually, that could help deal with the robustness-to-arbitrary-conditions and long-time-spans problems. Why don't we just use our generative model to run HCH?

Comment by james.lucassen on Optimization and Adequacy in Five Bullets · 2022-06-11T18:43:53.238Z · LW · GW

Thanks! Edits made accordingly. Two notes on the stuff you mentioned that isn't just my embarrassing lack of proofreading:

The definition of optimization used in Risks From Learned Optimization is actually quite different from the definition I'm using here. They say:

"a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system."

I personally don't really like this definition, since it leans quite hard on reifying certain kinds of algorithms - when is there "really" explicit search going on? Where is the search space? When does a configuration of atoms consitute an objective function? Using this definition strictly, humans aren't *really* optimizers, we don't have an explicit objective function written down anywhere. Balls rolling down hills aren't optimizers either.

But by the definition of optimization I've been using here, I think pretty much all evolved organisms have to be at least weak optimizers, because survival is hard. You have manage constraints from food and water and temperature and predation etc... the window of action-sequences that lead to successful reproduction are really quite narrow compared to the whole space. Maintaining homeostasis requires ongoing optimization pressure.
Agree that not all optimization processes fundamentally have to be produced by other optimization processes, and that they can crop up anywhere you have the necessary negentropy resevoir. I think my claim is that optimization processes are by default rare (maybe this is exactly because they require negentropy?). But since optimizers beget other optimizers at a rate much higher than background, we should expect the majority of optimization to arise from other optimization. Existing hereditary trees of optimizers grow deeper much faster than new roots spawn, so we should expect roots to occupy a negligible fraction of the nodes as time goes on.

Comment by james.lucassen on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-08T00:21:51.642Z · LW · GW

Whatever you end up doing, I strongly recommend taking a learning-by-writing style approach (or anything else that will keep you in critical assessment mode rather than classroom mode). These ideas are nowhere near solidified enough to merit a classroom-style approach, and even if they were infallible, that's probably not the fastest way to learn them and contribute original stuff.

The most common failure mode I expect for rapid introductions to alignment is just trying to absorb, rather than constantly poking and prodding to get a real working understanding. This happened to me, and wasted a lot of time.

Comment by james.lucassen on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-08T00:11:35.892Z · LW · GW

This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?

Comment by james.lucassen on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-08T00:08:06.100Z · LW · GW

Agree it's hard to prove a negative, but personally I find the following argument pretty suggestive:

"Other AGI labs have some plans - these are the plans we think are bad, and a pivotal act will have to disrupt them. But if we, ourselves, are an AGI lab with some plan, we should expect our pivotal agent to also be able to disrupt our plans. This does not directly lead to the end of the world, but it definitely includes root access to the datacenter."

Comment by james.lucassen on Understanding the two-head strategy for teaching ML to answer questions honestly · 2022-06-03T18:27:25.280Z · LW · GW

Proposed toy examples for G:

G is "the door opens", a- is "push door", a+ is "some weird complicated doorknob with a lock". Pretty much any b- can open a-, but only a very specific key+manipulator combo opens a+. a+ is much more informative about successful b than a- is.
G is "I make a million dollars", a- is "straightforward boring investing", a+ is "buy a lottery ticket". A wide variety of different world-histories b can satisfy a-, as long as the markets are favorable - but a very narrow slice can satisfy a+. a+ is a more fragile strategy (relative to noise in b) than a- is.

Comment by james.lucassen on Why does gradient descent always work on neural networks? · 2022-05-21T00:15:20.641Z · LW · GW

it doesn't work if your goal is to find the optimal answer, but we hardly ever want to know the optimal answer, we just want to know a good-enough answer.

Also not an expert, but I think this is correct

Comment by james.lucassen on Definition Practice: Applied Rationality · 2022-05-15T22:11:29.308Z · LW · GW

Paragraph:

When a bounded agent attempts a task, we observe some degree of success. But the degree of success depends on many factors that are not "part of" the agent - outside the Cartesian boundary that we (the observers) choose to draw for modeling purposes. These factors include things like power, luck, task difficulty, assistance, etc. If we are concerned with the agent as a learner and don't consider knowledge as part of the agent, factors like knowledge, skills, beliefs, etc. are also externalized. Applied rationality is the result of attempting to distill this big complicated mapping from (agent, power, luck, task, knowledge, skills, beliefs, etc) -> success down to just agent -> success. This lets us assign each agent a one-dimensional score: "how well do you achieve goals overall?" Note that for no-free-lunch reasons, this already-fuzzy thing is further fuzzified by considering tasks according to the stuff the observer cares about somehow.

Sentence:

Applied rationality is a property of a bounded agent, which attempts to describe how successful that agent tends to be when you throw tasks at it, while controlling for both "environmental" factors such as luck and "epistemic" factors such as beliefs.

Follow-up:

In this framing, it's pretty easy to define epistemic rationality analogously compressing from everything -> prediction loss to just agent -> prediction loss.

However, in retrospect I think the definition I gave here is pretty identical to how I would have defined "intelligence", just without reference to the "mapping broad start distribution to narrow outcome distribution" idea (optimization power) that I usually associate with that term. If anyone could clarify specifically the difference between applied rationality and intelligence, I would be interested.

Maybe you also have to control for "computational factors" like raw processing power, or something? But then what's left inside the Cartesian boundary? Just the algorithm? That seems like it has potential, but still feels messy.

Comment by james.lucassen on Three useful types of akrasia · 2022-04-19T22:14:30.831Z · LW · GW

This leans a bit close to the pedantry side, but the title is also a bit strange when taken literally. Three useful types (of akrasia categories)? Types of akrasia, right, not types of categories?

That said, I do really like this classification! Introspectively, it seems like the three could have quite distinct causes, so understanding which category you struggle with could be important for efforts to fix.

Props for first post!

Comment by james.lucassen on Fuck Your Miracle Year · 2022-04-17T18:16:23.789Z · LW · GW

Trying to figure out what's being said here. My best guess is two major points:

Meta doesn't work. Do the thing, stop trying to figure out systematic ways to do the thing better, they're a waste of time. The first thing any proper meta-thinking should notice is that nobody doing meta-thinking seems to be doing object level thinking any better.
A lot of nerds want to be recognized as Deep Thinkers. This makes meta-thinking stuff really appealing for them to read, in hopes of becoming a DT. This in turn makes it appealing for them to write, since it's what other nerds will read, which is how they get recognized as a DT. All this is despite the fact that it's useless.

Comment by james.lucassen on “Fragility of Value” vs. LLMs · 2022-04-13T03:25:04.093Z · LW · GW

Ah, gotcha. I think the post is fine, I just failed to read.

If I now correctly understand, the proposal is to ask a LLM to simulate human approval, and use that as the training signal for your Big Scary AGI. I think this still has some problems:

Using an LLM to simulate human approval sounds like reward modeling, which seems useful. But LLM's aren't trained to simulate humans, they're trained to predict text. So, for example, an LLM will regurgitate the dominant theory of human values, even if it has learned (in a Latent Knowledge sense) that humans really value something else.
Even if the simulation is perfect, using human approval isn't a solution to outer alignment, for reasons like deception and wireheading

I worry that I still might not understand your question, because I don't see how fragility of value and orthogonality come into this?

Comment by james.lucassen on “Fragility of Value” vs. LLMs · 2022-04-13T02:25:17.671Z · LW · GW

The key thing here seems to be the difference between understanding a value and having that value. Nothing about the fragile value claim or the Orthogonality thesis says that the main blocker is AI systems failing to understand human values. A superintelligent paperclip maximizer could know what I value and just not do it, the same way I can understand what the paperclipper values and choose to pursue my own values instead.

Your argument is for LLM's understanding human values, but that doesn't necessarily have anything to do with the values that they actually have. It seems likely that their actual values are something like "predict text accurately", and this requires understanding human values but not adopting them.

Comment by james.lucassen on 5-Minute Advice for EA Global · 2022-04-06T06:15:05.605Z · LW · GW

now this is how you win the first-ever "most meetings" prize

Comment by james.lucassen on What an actually pessimistic containment strategy looks like · 2022-04-05T03:21:39.938Z · LW · GW

Agree that this is definitely a plausible strategy, and that it doesn't get anywhere near as much attention as it seemingly deserves, for reasons unknown to me. Strong upvote for the post, I want to see some serious discussion on this. Some preliminary thoughts:

How did we get here?
- If I had to guess, the lack of discussion on this seems likely due to a founder effect. The people pulling the alarm in the early days of AGI safety concerns were disproportionately to the technical/philosophical side rather than to the policy/outreach/activism side.
- In early days, focus on the technical problem makes sense. When you are the only person in the world working on AGI, all the delay in the world won't help unless the alignment problem gets solved. But we are working at very different margins nowadays.
- There's also an obvious trap which makes motivated reasoning really easy. Often, the first thing that occurs when thinking about slowing down AGI development is sabotage - maybe because this feels urgent and drastic? It's an obviously bad idea, and maybe that lets us to motivated stopping.
Maybe the "technical/policy" dichotomy is keeping us from thinking of obvious ways we could be making the future much safer? The outreach org you propose seems like not really either. Would be interested in brainstorming other major ways to affect the world, but not gonna do that in this comment.
HEY! FTX! OVER HERE!!
- You should submit this to the Future Fund's ideas competition, even though it's technically closed. I'm really tempted to do it myself just to make sure it gets done, and very well might submit something in this vein once I've done a more detailed brainstorm.

Comment by james.lucassen on [Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning · 2022-04-01T14:30:10.891Z · LW · GW

I don't think I understand how the scorecard works. From:

[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.

And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?

If the scorecard is learned, then it needs a training signal from Steering. But if it's useless at the start, it can't provide a training signal. On the other hand, since the "ontology" of the Learning subsystem is learned-from-scratch, then it seems difficult for a hard-coded scorecard to do this translation task.

Comment by james.lucassen on Do a cost-benefit analysis of your technology usage · 2022-03-30T15:15:22.293Z · LW · GW

this is great,thanks!

Comment by james.lucassen on Do a cost-benefit analysis of your technology usage · 2022-03-28T05:23:15.756Z · LW · GW

What do you think about the effectiveness of the particular method of digital decluttering recommended by Digital Minimalism? What modifications would you recommend? Ideal duration?

One reason I have yet to do a month-long declutter is because I remember thinking something like "this process sounds like something Cal Newport just kinda made up and didn't particularly test, my own methods that I think of for me will probably better than Cal's method he thought of for him".

So far my own methods have not worked.

User info

Posts

Comments