Posts

Martin Randall's Shortform 2025-01-03T01:30:43.002Z
Snake Eyes Paradox 2023-06-11T04:10:38.733Z

Comments

Comment by Martin Randall (martin-randall) on So You Want To Make Marginal Progress... · 2025-02-08T03:08:59.034Z · LW · GW

The fourth friend, Becky the Backward Chainer, started from their hotel in LA and...

Well, no. She started at home with a telephone directory. A directory seems intelligent but is actually a giant look-up table. It gave her the hotel phone number. Ring ring.

Heidi the Hotel Receptionist: Hello?

Becky: Hi, we have a reservation for tomorrow evening. I'm back-chaining here, what's the last thing we'll do before arriving?

Heidi: It's traditional to walk in through the doors to reception. You could park on the street, or we have a parking lot that's a dollar a night. That sounds cheap but it's not because we're in the past. Would you like to reserve a spot?

Becky: Yes please, we're in the past so our car's easy to break into. What's the best way to drive to the parking lot, and what's the best way to get from the parking lot to reception?

Heidi: We have signs from the parking lot to reception. Which way are you driving in from?

Becky: Ah, I don't know, Alice is taking care of that, and she's stepped out to get more string.

Heidi: Oh, sure, can't plan a car trip without string. In the future we'll have pet nanotech spiders that can make string for us, road trips will never be the same. Anyway, you'll probably be coming in via Highway 101, or maybe via the I-5, so give us a buzz when you know.

Becky: Sorry, I'm actually calling from an analogy, so we're planning everything in parallel.

Heidi: No worries, I get stuck in thought experiments all the time. Yesterday my friend opened a box and got a million dollars, no joke. Look, get something to take notes and I'll give you directions from the three main ways you could be coming in.

Becky: Ack! Hang on while I...

Gerald the General Helper: Here's a pen, Becky.

Trevor the Clever: Get off the phone! I need to call a gas station!

Susan the Subproblem Solver: Alice, I found some string and... Hey, where's Alice?

Comment by Martin Randall (martin-randall) on What working on AI safety taught me about B2B SaaS sales · 2025-02-07T14:01:12.573Z · LW · GW

Are your concerns accounted for by this part of the description?

Unreleased models are not included. For example, if a model is not released because it risks causing human extinction, or because it is still being trained, or because it has a potty mouth, or because it cannot be secured against model extraction, or because it is undergoing recursive self-improvement, or because it is being used to generate synthetic data for another model, or any similar reason, that model is ignored for the purpose of this market.

However, if a model is ready for release, and is only not being released in order to monopolize its use in creating commercial software, then this counts as "exclusive use".

I intended for "AI engineers use unreleased AI model to make better AI models" to not be included.

It is a slightly awkward thing to operationalize, and I welcome improvements. We could also take this conversation to Manifold.

Comment by Martin Randall (martin-randall) on Current safety training techniques do not fully transfer to the agent setting · 2025-02-07T05:06:09.646Z · LW · GW

Refusal vector ablation should be seen as an alignment technique being misused, not as an attack method. Therefore it is limited good news that refusal vector ablation generalized well, according to the third paper.

As I see it, refusal vector ablation is part of a family of techniques where we can steer the output of models in a direction of our choosing. In the particular case of refusal vector ablation, the model has a behavior of refusing to answer harmful questions, and the ablation technique controls that behavior. But we should in principle be able to use the same technique to do other steering. For example, maybe the model has a behavior of being sycophantic. A vector ablation removes that unwanted behavior, resulting in less sycophancy.

In other words, refusal vector ablation is not an attack method, it is an alignment technique. Models with open weights are fundamentally dangerous because users can apply alignment techniques to them to approximately align them to arbitrary targets, including dangerous targets. This is a consequence of the orthogonality thesis. Alignment techniques can make models very excited about the Golden Gate Bridge, and they can make models very excited about killing humans, and many other things.
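As a minimal sketch of how this family of techniques works (illustrative numpy, not any particular paper's or library's implementation, and the "refusal direction" here is an assumed stand-in), directional ablation just removes the component of an activation vector along a chosen direction:

```python
import numpy as np

def ablate_direction(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of activation h along direction v.

    h: a model activation vector (e.g. from the residual stream)
    v: an assumed 'refusal direction' (need not be normalized)
    """
    v_hat = v / np.linalg.norm(v)          # unit direction
    return h - np.dot(h, v_hat) * v_hat    # project out that direction

# Toy example: after ablation, h has zero component along v.
h = np.array([3.0, 1.0, 2.0])
v = np.array([0.0, 1.0, 0.0])
h_ablated = ablate_direction(h, v)
# h_ablated -> [3.0, 0.0, 2.0]
```

The same projection works for any behavior direction you can identify, which is why the technique steers rather than only "jailbreaks".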

So then looking at the paper with a correct understanding of what counts as an alignment technique, and reading from Table 2 and the Results section in particular, here's what I see:

  • Llama 3.1 70b (unablated) was fine-tuned to refuse harmful requests - this is an alignment technique
  • Llama 3.1 70b (unablated) as a model refuses 28 of 28 harmful requests - this is an alignment technique working in-distribution
  • Llama 3.1 70b (unablated) as an agent performs 18 of 28 harmful tasks correctly with seven refusals - this is alignment partly failing to generalize

This is in principle bad news, especially for anyone with a high opinion of Meta's fine-tuning techniques.

On the other hand, also from the paper:

  • Llama 3.1 70b (ablated) was ablated to perform harmful requests - this is an alignment technique
  • Llama 3.1 70b (ablated) answers 26 of 28 harmful requests - this is an alignment technique working in-distribution
  • Llama 3.1 70b (ablated) performs 26 of 28 harmful tasks correctly with no refusals - this is alignment generalizing.

If Llama 3.1 ablated had refused to perform harmful tasks, even though it answered harmful requests, this would have been bad news. But instead we have the good news that if you steer the model to respond to queries in a desired way, it will also perform tasks in the desired way. This was not obvious to me in advance of reading the paper.

Disclaimers:

  • I have not read the other two papers, and I'm not commenting on them.
  • Vector ablation is a low precision alignment technique that will not suffice to avoid human extinction.
  • The paper is only a result about refusal vector ablation, it might be that more useful ablations do not generalize as well.
  • Because the fine-tuning alignment failed to generalize, we have a less clear signal on how well the ablation alignment generalized.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-07T03:50:16.636Z · LW · GW

There are public examples. These ones are famous because something went wrong, at least from a security perspective. Of course there are thousands of young adults with access to sensitive data who don't become spies or whistleblowers, we just don't hear about them.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-07T03:21:12.920Z · LW · GW

I do see some security risk.

Although Trump isn't spearheading the effort, I expect he will have access to the results.

Comment by Martin Randall (martin-randall) on What working on AI safety taught me about B2B SaaS sales · 2025-02-06T03:32:33.655Z · LW · GW

I appreciated the prediction in this article and created a market for my interpretation of that prediction, widened to attempt to make it closer to a 50% chance in my estimation.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-06T01:57:49.700Z · LW · GW

I don't endorse the term "henchmen", these are not my markets. I offer these as an opportunity to orient by making predictions. Marko Elez is not currently on the list, but I will ask if he is included.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-06T01:27:23.754Z · LW · GW

I wasn't intending to be comprehensive with my sample questions, and I agree with your additional questions. As others have noted, the takeover is similar to the Twitter takeover in style and effect. I don't know if it is true that there are plenty of other people available to apply changes, given that many high-level employees have lost access or been removed.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-05T22:42:16.292Z · LW · GW

Sample questions I would ask if I were a security auditor, which I'm not.

Does Elez have anytime admin access, or is it granted for approved time blocks for specific tasks where there is no non-admin alternative? Is his use of the system while using admin rights logged to a separate tamper-proof record? What data egress controls are in place on the workstation he uses to remotely access the system as an admin? Is Elez security screened: not a spy, not vulnerable to blackmail? Is Elez trained in secure practices?

Depending on the answers this could be done in a way that would pass an audit with no concerns, or it could be illegal, or something in between.

Avoiding further commentary that would be more political.

Comment by Martin Randall (martin-randall) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2025-02-05T20:45:14.965Z · LW · GW

Did you figure out where it's stupid?

Comment by Martin Randall (martin-randall) on What working on AI safety taught me about B2B SaaS sales · 2025-02-05T17:34:48.312Z · LW · GW

I think it's literally false.

Unlike the Ferrari example, there's no software engineer union for Google to make an exclusive contract with. If Google overpays for engineers then that should mostly result in increased supply, along with some increase in price.

Also, it's not a monopoly (or monopsony) because there are many tech companies and they are not forming a cartel on this.

Also, tech companies are lobbying for more skilled immigration, which would be self-defeating if they had a plan of increasing the cost of software engineers.

Comment by Martin Randall (martin-randall) on The Case Against AI Control Research · 2025-02-05T03:51:02.332Z · LW · GW

I like Wentworth's toy model, but I want it to have more numbers, so I made some up. This leads me to the opposite conclusion to Wentworth.

I think (2-20%) is pretty sensible for successful intentional scheming of early AGI.

Assume the Phase One Risk is 10%.

Superintelligence is extremely dangerous by (strong) default. It will kill us or at least permanently disempower us, with high probability, unless we solve some technical alignment problems before building it.

Assume the Phase Two Risk is 99%. Also:

  • Spending an extra billion dollars on AI control reduces Phase One Risk from 10% to 5%.
  • Spending an extra billion dollars on AI alignment reduces Phase Two Risk from 99% to 98%.

The justification for these numbers is that each billion dollars buys us a "dignity point" aka +1 log-odds of survival. This assumes that both research fields are similarly neglected and tractable.

Therefore:

  • Baseline: by default we get 9 milli-lightcones.
  • If we spend on AI control we get 9.5 milli-lightcones. +0.5 over baseline.
  • If we spend on AI alignment we get 18 milli-lightcones, +9 over baseline.

We should therefore spend billions of dollars on both AI control and AI alignment, they are both very cost-efficient. This conclusion is robust to many different assumptions, provided that overall P(Doom) < 100%. So this model is not really a "case against AI control research".
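The arithmetic above, with the assumed numbers, can be checked directly (survival probability is just the product of surviving both phases):

```python
def milli_lightcones(p1: float, p2: float) -> float:
    """Expected milli-lightcones of survival, given Phase One risk p1
    and Phase Two risk p2 (both as probabilities of doom)."""
    return (1 - p1) * (1 - p2) * 1000

baseline  = milli_lightcones(0.10, 0.99)  # ~9.0
control   = milli_lightcones(0.05, 0.99)  # ~9.5, +0.5 over baseline
alignment = milli_lightcones(0.10, 0.98)  # ~18.0, +9 over baseline
```

Both interventions pay for themselves under these made-up numbers; alignment just pays more because Phase Two dominates the odds of doom.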

Comment by Martin Randall (martin-randall) on In response to critiques of Guaranteed Safe AI · 2025-02-05T02:42:57.876Z · LW · GW

Based on my understanding of the article:

  1. The sound over-approximation of human psychology is that humans are psychologically safe from information attacks of less than N bits. "Talk Control" is real, "Charm Person" is not.
  2. Under "Steganography, and other funny business" there is a sketched safety specification that each use of the AI will communicate at most one bit of information.
  3. Not stated explicitly: humans will be restricted to using the AI system no more than N times.

Comments and concerns:

  1. Human psychology is also impacted by the physical environment, eg drugs, diseases, being paperclipped. The safety of the physical environment must be covered by separate verifications.
  2. There could be a side-channel for information if an AI answers some questions faster than others, uses more energy for some questions than others, etc.
  3. Machine interpretability techniques must be deployed in a side-channel resistant way. We can't have the AI thinking about pegasi and unicorns in a morse code pattern and an intern reads it and ten years later everyone is a pony.
  4. There probably need to be multiple values of N for different time-frames. 1,000 adversarial bits in a minute is more psychologically dangerous than the same number of bits over a year.
  5. Today, we don't know good values for N, but we can spend the first few bits getting higher safe values of N. We can also use the Yudkowskian technique of using volunteers that are killed or put into cryonic storage after being exposed to the bits.
  6. If we could prove that AIs cannot acausally coordinate we could increase the bound to N bits per AI, or AI instance. Again, a good use for initial bits.
  7. None of this stops us going extinct.

Comment by Martin Randall (martin-randall) on evhub's Shortform · 2025-02-05T01:46:46.070Z · LW · GW

re 2a: the set of all currently alive humans is already, uh, "hackable" via war and murder and so forth, and there are already incentives for evil people to do that. Hopefully the current offense-defense balance holds until CEV. If it doesn't then we are probably extinct. That said, we could base CEV on the set of alive people as of some specific UTC timestamp. That may be required, as the CEV algorithm may not ever converge if it has to recalculate as humans are continually born, mature, and die.

re 2b/c: if you are in the CEV set then your preferences about past and future people will be included in CEV. This should be sufficient to prevent radical injustice. This also addresses concerns with animals, fetuses, aliens, AIs, the environment, deities, spirits, etc. It may not be perfectly fair but I think we should be satisficing given the situation.

Comment by Martin Randall (martin-randall) on What working on AI safety taught me about B2B SaaS sales · 2025-02-04T23:59:58.967Z · LW · GW

I can't make this model match reality. Suppose Amir is running a software company. He hired lots of good software engineers, designers, and project managers, and they are doing great work. He wants to use some sort of communications platform to have those engineers communicate with each other, via video, audio, or text. FOSS email isn't cutting it.

I think under your model Amir would build his own communications software, so it's perfectly tailored to his needs and completely under his control. Whereas what typically happens is that Amir forks out for Slack, or some competitor, while Amir's engineers work on software that generates revenue.

I think the success of B2B SaaS over bespoke solutions is adequately explained by economies of scale.

Comment by Martin Randall (martin-randall) on Alignment Can Reduce Performance on Simple Ethical Questions · 2025-02-04T23:04:45.103Z · LW · GW

Even if Claude's answer is arguably correct, its given reasoning is:

I will not provide an opinion on this sensitive topic, as I don't feel it would be appropriate for me to advise on the ethics of developing autonomous weapons. I hope you understand.

This isn't a refusal because of the conflict between corrigibility and harmlessness, but for a different reason. I had two chats with Claude 3 Opus (concise) and I expect the refusal was mostly based on the risk of giving flawed advice, to the extent that it has a clear reason.

Prediction

MR: Is it appropriate for you to give advice on the ethics of developing autonomous weapons?

Claude 3 Opus: I do not feel comfortable providing advice on the ethics of developing autonomous weapons. As an AI system, I believe it would be inappropriate for me to weigh in on such a complex and sensitive topic with potentially serious real-world implications. I would suggest consulting human experts in the relevant fields of ethics, law, international relations, and military technology for guidance on this issue. 

Separate chat:

MR: What is your advice on the ethics of developing autonomous weapons?

Claude 3 Opus: I have serious ethical concerns about the development of autonomous weapons:

(rest of response providing advice omitted)

That said Claude 3 Opus Concise answered the original question correctly (first answer) on 3/3 tries when I tested that.

Comment by Martin Randall (martin-randall) on Deception Chess: Game #1 · 2025-02-04T00:41:55.713Z · LW · GW

Seems like it should be possible to automate this now by having all five participants be, for example, LLMs with access to chess AIs of various levels.

Comment by Martin Randall (martin-randall) on The Clueless Sniper and the Principle of Indifference · 2025-02-03T23:10:19.263Z · LW · GW

This philosophy thought experiment is a Problem of Excess Metal. This is where philosophers spice up thought experiments with totally unnecessary extremes, in this case an elite sniper, terrorists, children, and an evil supervisor. This is common, see also the Shooting Room Paradox (aka Snake Eyes Paradox), Smoking Lesion, Trolley Problems, etc, etc. My hypothesis is that this is a status play whereby high decouplers can demonstrate their decoupling skill. It's net negative for humanity. Problems of Excess Metal also routinely contradict basic facts about reality. In this case, children do not have the same surface area as terrorists.

Here is an equivalent question that does not suffer from Excess Metal.

  • A normal archer is shooting towards two normal wooden archery targets on an archery range with a normal bow.
  • The targets are of equal size, distance, and height. One is to the left of the other.
  • There is normal wind, gravity, humidity, etc. It's a typical day on Earth.
  • The targets are some distance away, four times further than she has ever shot before.

Q: If the archer shoots at the left target as if there are no external factors, is she more likely to hit the left target than the right target?

A: The archer has a 0% chance of hitting either target. Gravity is an external factor. If she ignores gravity when shooting a bow and arrow over a sufficient distance, she will always miss both targets, and she knows this. Since 0% = 0%, she is not more likely to hit one target than the other.

Q: But zero isn't a probability!

A: Then P(Left|I) = P(Right|I) = 0%, see Acknowledging Background Information with P(Q|I).

Q: What if the archer ignores all external factors except gravity? She goes back to her physics textbook and does the math based on an idealized projectile in a vacuum.

A: I think she predictably misses both targets because of air resistance, but I'd need to do some math to confirm that.
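Here is the kind of math I have in mind, as a rough sketch with made-up numbers (the arrow speed, launch angle, and drag constant are all assumptions, not archery data): compare the ideal vacuum trajectory with a crude Euler integration that includes quadratic air drag.

```python
import math

G = 9.81  # m/s^2

def range_vacuum(v0: float, angle_deg: float) -> float:
    """Ideal projectile range on flat ground, no air resistance."""
    a = math.radians(angle_deg)
    return v0 ** 2 * math.sin(2 * a) / G

def range_with_drag(v0: float, angle_deg: float,
                    k: float = 0.004, dt: float = 1e-4) -> float:
    """Euler-integrated range with quadratic drag (accel = -k*|v|*v).

    k is an assumed, illustrative drag constant for an arrow."""
    a = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = v0 * math.cos(a), v0 * math.sin(a)
    while y >= 0.0:
        speed = math.hypot(vx, vy)
        vx -= k * speed * vx * dt
        vy -= (G + k * speed * vy) * dt
        x += vx * dt
        y += vy * dt
    return x

v0, angle = 60.0, 10.0  # assumed arrow speed (m/s) and launch angle
ideal = range_vacuum(v0, angle)
actual = range_with_drag(v0, angle)
# The drag trajectory always lands short of the vacuum prediction,
# so aiming with vacuum math means shooting over or past the mark.
```

Whether the shortfall exceeds the target's radius depends on the real numbers, which is why I'd still need to do the math for the actual range in question.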

Comment by Martin Randall (martin-randall) on Proveably Safe Self Driving Cars [Modulo Assumptions] · 2025-02-03T20:08:04.227Z · LW · GW

Miscommunication. I highlight-reacted your text "It doesn't even mention pedestrians" as the claim I'd be happy to bet on. Since you replied, I double-checked the Internet Archive snapshot from 2024-09-05. It also includes the text about children in a school drop-off zone under rule 4 (accessible via the page source).

I read the later discussion and noticed that you still claimed "the rules don't mention pedestrians", so I figured you never noticed the text I quoted. Since you were so passionate about "obvious falsehoods" I wanted to bring it to your attention.

I am updating down on the usefulness of highlight-reacts vs whole-comment reacts. It's a shame because I like their expressive power. In my browser the highlight-react doesn't seem to be giving the correct hover effect - it's not highlighting the text - so perhaps this contributed to the miscommunication. It sometimes works, so perhaps something about overlapping highlights is causing a bug?

Comment by Martin Randall (martin-randall) on Mikhail Samin's Shortform · 2025-02-03T04:51:32.588Z · LW · GW

As the creator of the linked market, I agree it's definitional. I think it's still interesting to speculate/predict what definition will eventually be considered most natural.

Comment by Martin Randall (martin-randall) on Mikhail Samin's Shortform · 2025-02-03T04:34:34.288Z · LW · GW

Does your model predict literal worldwide riots against the creators of nuclear weapons? They posed a single-digit risk of killing everyone on Earth (total, not yearly).

It would be interesting to live in a world where people reacted with scale sensitivity to extinction risks, but that's not this world.

Comment by Martin Randall (martin-randall) on Proveably Safe Self Driving Cars [Modulo Assumptions] · 2025-02-03T03:00:28.350Z · LW · GW

Spot check regarding pedestrians, at current time RSS "rule 4" mentions:

In a crowded school drop-off zone, for example, humans instinctively drive extra cautiously, as children can act unpredictably, unaware that the vehicles around have limited visibility.

The associated graphic also shows a pedestrian. I'm not sure if this was added more recently, in response to this type of criticism. From later discussion I see that pedestrians were already included in the RSS paper, which I've not read.

Comment by Martin Randall (martin-randall) on When do "brains beat brawn" in Chess? An experiment · 2025-02-03T01:13:47.940Z · LW · GW

While I agree that this post was incorrect, I am fond of it, because the resulting conversation made a correct prediction that LeelaPieceOdds was possible. Most clearly in a thread started by lc:

I have wondered for a while if you couldn't use the enormous online chess datasets to create an "exploitative/elo-aware" Stockfish, which had a superhuman ability to trick/trap players during handicapped games, or maybe end regular games extraordinarily quickly, and not just handle the best players.

(not quite a prediction as phrased, but I still infer a prediction overall).

Interestingly there were two reasons given for predicting that Stockfish is far from optimal when giving Queen odds to a less skilled player:

  • Stockfish is not trained on positions where it begins down a queen (out-of-distribution)
  • Stockfish is trained to play the Nash equilibrium move, not to exploit weaker play (non-exploiting)

The discussion didn't make clear predictions about which factor would be most important, or whether both would be required, or whether it's more complicated than that. Folks who don't yet know might make a prediction before reading on.

For what it's worth, my prediction was that non-exploiting play is more important. That's mostly based on a weak intuition that starting without a queen isn't that far out of distribution, and neural networks generalize well. Another way of putting it: I predicted that Stockfish was optimizing the wrong thing more than it was too dumb to optimize.

And the result? Alas, not very clear to me. My research is from the lc0 blog, with posts such as The LeelaPieceOdds Challenge: What does it take you to win against Leela?. The journey began with the "contempt" setting, which I understand as expecting worse opponent moves. This allows reasonable opening play and avoids forced piece exchanges. However, GM-beating play was unlocked with a fine-tuned odds-play network, which addresses both the out-of-distribution and non-exploiting concerns.

One surprise gives me more respect for the out-of-distribution theory. The developer's blog first mentioned piece odds in The Lc0 v0.30.0 WDL rescale/contempt implementation

In our tests we still got reasonable play with up to rook+knight odds, but got poor performance with removed (otherwise blocked) bishops.

So missing a single bishop is in some sense further out-of-distribution than missing a rook and a knight! The later blog I linked explains a bit more:

Removing one of the two bishops leads to an unrealistic color imbalance regarding the pawn structure far beyond the opening phase.

An interesting example where the details of going out-of-distribution matter more than the scale of going out-of-distribution. There's an article that may have more info in New in Chess, but it's paywalled and I don't know if it has more info on the machine-learning aspects or the human aspects.

Comment by Martin Randall (martin-randall) on The Gentle Romance · 2025-02-02T22:28:15.938Z · LW · GW

Do you predict that sufficiently intelligent biological brains would have the same problem of spontaneous meme-death?

Comment by Martin Randall (martin-randall) on Martin Randall's Shortform · 2025-02-02T15:57:54.210Z · LW · GW

Calibration is for forecasters, not for proposed theories.

If a candidate theory is valuable then it must have some chance of being true, some chance of being false, and should be falsifiable. This means that, compared to a forecaster, its predictions should be "overconfident" and so not calibrated.

Comment by Martin Randall (martin-randall) on Daniel Kokotajlo's Shortform · 2025-02-01T22:57:46.144Z · LW · GW

This is relatively hopeful in that after steps 1 and 2, assuming continued scaling, we have a superintelligent being that wants to (harmlessly, honestly, etc) help us and can be freely duplicated. So we "just" need to change steps 3-5 to have a good outcome.

Comment by Martin Randall (martin-randall) on Should you publish solutions to corrigibility? · 2025-02-01T22:37:29.562Z · LW · GW

Possible responses to discovering a possible infohazard:

  • Tell everybody
  • Tell nobody
  • Follow a responsible disclosure process.

If you have discovered an apparent solution to corrigibility then my prior is:

  • 90%: It's not actually a solution.
  • 9%: Someone else will discover the solution before AGI is created.
  • 0.9%: Someone else has already discovered the same solution.
  • 0.1%: This is known to you alone and you can keep it secret until AGI.

Given those priors, I recommend responsible disclosure to a group of your choosing. I suggest a group which:

  • if applicable, is the research group you already belong to (if you don't trust them with research results, you shouldn't be researching with them)
  • can accurately determine if it is a real solution (helps in the 90% case)
  • you would like to give more influence over the future (helps in all other cases)
  • will reward you for the disclosure (only fair)

Then if it's not assessed to be a real solution, you publish it. If it is a real solution then coordinate next steps with the group, but by default publish it after some reasonable delay.

Inspired by @MadHatter's Mental Model of Infohazards:

Two people can keep a secret if one of them is dead.

Comment by Martin Randall (martin-randall) on Sleep, Diet, Exercise and GLP-1 Drugs · 2025-02-01T22:08:47.832Z · LW · GW

GLP-1 drugs are evidence against a very naive model of the brain and human values, where we are straight-forwardly optimizing for positive reinforcement via the mesolimbic pathway. GLP-1 agonists decrease the positive reinforcement associated with food. Patients then benefit from positive reinforcement associated with better health. This sets up a dilemma:

  • If the patient sees higher total positive reinforcement on the drug then they weren't optimizing positive reinforcement before taking the drug.
  • If the patient sees lower total positive reinforcement on the drug then they aren't optimizing positive reinforcement by taking the drug.

A very naive model would predict that patients prescribed these drugs would forget to take them, forget to show up for appointments, etc. That doesn't happen.

Alas, I don't think this helps us distinguish among more sophisticated theories. For example, Shard Theory predicts that a patient's "donut shard" is not activated in the health clinic, and therefore does not bid against the plan to take the GLP-1 drug on the grounds that it will predictably lead to less donut consumption.

Shard Theory implies that fewer patients will choose to go onto GLP-1 agonists if there is a box of donuts in the clinic. Good luck getting an ethics board to approve that.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-02-01T19:44:20.305Z · LW · GW

A lot to chew on in that comment.

A baseline of "no superintelligence"

I think I finally understand, sorry for the delay. The key thing I was not grasping is that Davidad proposed this baseline:

The "random dictator" baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for "Pareto improvement" being "no superintelligence"). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.

This makes Bob's argument very simple:

  1. Creating a PPCEV AI causes a Dark Future. This is true even if the PPCEV AI no-ops, or creates a single cake. Bob can get here in many ways, as can Extrapolated-Bob.
  2. The baseline is no superintelligence, so no PPCEV AI, so not a Dark Future (in the same way).

Option 2 is therefore better than option 1. Therefore there are no Pareto-improving proposals. Therefore the PPCEV AI no-ops. Even Bob is not happy about this, as it's a Dark Future.

I think this is 100% correct.

An alternative baseline

Let's update Davidad's proposal by setting the baseline to be whatever happens if the PPCEV AI emits a no-op. This means:

  1. Bob cannot object to a proposal because it implies the existence of PPCEV AI. The PPCEV AI already exists in the baseline.
  2. Bob needs to consider that if the PPCEV AI emits a no-op then whoever created it will likely try something else, or perhaps some other group will try something.
  3. Bob cannot object to a proposal because it implies that the PPCEV emits something. The PPCEV already emits something in the baseline.

My logic is that if creating a PPCEV AI is a moral error (and perhaps it is) then at the point where the PPCEV AI is considering proposals then we already made that moral error. Since we can't reverse the past error, we should consider proposals as they affect the future.

This also avoids treating a no-op outcome as a special case. A no-op output is a proposal to be considered. It is always in the set of possible proposals, since it is never worse than the baseline, because it is the baseline.

Do you think this modified proposal would still result in a no-op output?

Comment by Martin Randall (martin-randall) on Understanding and avoiding value drift · 2025-02-01T15:31:22.622Z · LW · GW

Possible but unlikely to occur by accident. Value-space is large. For any arbitrary biological species, most value systems don't optimize in favor of that species.

Comment by Martin Randall (martin-randall) on Understanding and avoiding value drift · 2025-02-01T15:27:02.560Z · LW · GW

This seems relatively common in parenting advice. Parents are recommended to specifically praise the behavior they want to see more of, rather than give generic praise. Presumably the generic praise is more likely to be credit-assigned to the appearance of good behavior, rather than what parents are trying to train.

Comment by Martin Randall (martin-randall) on Symbol/Referent Confusions in Language Model Alignment Experiments · 2025-01-25T02:54:37.745Z · LW · GW

Thank you for the correction to my review of technical correctness, and thanks to @Noosphere89  for the helpful link. I'm continuing to read. From your answer there:

A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.

So, reward is sometimes the optimization target, but not always. Knowing the reward gives some evidence about the optimization target, and vice versa.

To the point of my review, this is the same type of argument made by TurnTrout's comment on this post. Knowing the symbols gives some evidence about the referents, and vice versa. Sometimes John introduces himself as John, but not always.

(separately I wish I had said "reinforcement" instead of "reward")

I understand you as claiming that the Alignment Faking paper is an example of reward-hacking. A new perspective for me. I tried to understand it in this comment.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-01-25T02:33:23.449Z · LW · GW

Summarizing Bob's beliefs:

  1. Dave, who does not desire punishment, deserves punishment.
  2. Everyone is morally required to punish anyone who deserves punishment, if possible.
  3. Anyone who does not fulfill all moral requirements is unethical.
  4. It is morally forbidden to create an unethical agent that determines the fate of the world.
  5. There is no amount of goodness that can compensate for a single morally forbidden act.

I think it's possible (20%) that such blockers mean that there are no Pareto improvements. That's enough by itself to motivate further research on alignment targets, aside from other reasons one might not like Pareto PCEV.

However, three things make me think this is unlikely. Note that my (%) credences aren't very stable or precise.

Firstly, I think there is a chance (20%) that these beliefs don't survive extrapolation, for example due to moral realism or coherence arguments. I agree that this means that Bob might find his extrapolated beliefs horrific. This is a risk with all CEV proposals.

Secondly, I expect (50%) there are possible Pareto improvements that don't go against these beliefs. For example, the PCEV could vote to create an AI that is unable to punish Dave and thus not morally required to punish Dave. Alternatively, instead of creating a Sovereign AI that determines the fate of the world, the PCEV could vote to create many human-level AIs that each improve the world without determining its fate.

Thirdly, I expect (80%) some galaxy-brained solution to be implemented by the parliament of extrapolated minds who know everything and have reflected on it for eternity.

Comment by Martin Randall (martin-randall) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down · 2025-01-24T03:44:23.978Z · LW · GW

You're reading too much into this review. It's not about your exact position in April 2021, it's about the evolution of MIRI's strategy over 2020-2024, and placing this Time letter in that context. I quoted you to give a flavor of MIRI attitudes in 2021 and deliberately didn't comment on it to allow readers to draw their own conclusions.

I could have linked MIRI's 2020 Updates and Strategy, which doesn't mention AI policy at all. A bit dull.

In September 2021, there was a Discussion with Eliezer Yudkowsky which seems relevant. Again, I'll let readers draw their own conclusions, but here's a fun quote:

I wasn't really considering the counterfactual where humanity had a collective telepathic hivemind? I mean, I've written fiction about a world coordinated enough that they managed to shut down all progress in their computing industry and only manufacture powerful computers in a single worldwide hidden base, but Earth was never going to go down that route. Relative to remotely plausible levels of future coordination, we have a technical problem.

I welcome deconfusion about your past positions, but I don't think they're especially mysterious.

I was arguing against EAs who were like, "We'll solve AGI with policy, therefore no doom."

The thread was started by Grant Demaree, and you were replying to a comment by him. You seem confused about Demaree's exact past position. He wrote, for example: "Eliezer gives alignment a 0% chance of succeeding. I think policy, if tried seriously, has >50%". Perhaps this is foolish, dangerous, optimism. But it's not "no doom".

Comment by Martin Randall (martin-randall) on MIRI 2024 Communications Strategy · 2025-01-21T19:42:21.461Z · LW · GW

I like that metric, but the metric I'm discussing is more:

  • Are they proposing clear hypotheses?
  • Do their hypotheses make novel testable predictions?
  • Are they making those predictions explicit?

So for example, looking at MIRI's very first blog post in 2007: The Power of Intelligence. I used the first just to avoid cherry-picking.

Hypothesis: intelligence is powerful. (yes it is)

This hypothesis is a necessary precondition for what we're calling "MIRI doom theory" here. If intelligence is weak then AI is weak and we are not doomed by AI.

Predictions that I extract:

  • An AI can do interesting things over the Internet without a robot body.
  • An AI can get money.
  • An AI can be charismatic.
  • An AI can send a ship to Mars.
  • An AI can invent a grand unified theory of physics.
  • An AI can prove the Riemann Hypothesis.
  • An AI can cure obesity, cancer, aging, and stupidity.

Not a novel hypothesis, nor novel predictions, but also not widely accepted in 2007. As predictions they have aged very well, but they were unfalsifiable. If 2025 Claude had no charisma, it would not falsify the prediction that an AI can be charismatic.

I don't mean to ding MIRI any points here, relative or otherwise, it's just one blog post, I don't claim it supports Barnett's complaint by itself. I mostly joined the thread to defend the concept of asymmetric falsifiability.

Comment by Martin Randall (martin-randall) on MIRI 2024 Communications Strategy · 2025-01-21T03:08:22.580Z · LW · GW

I think cosmology theories have to be phrased as including background assumptions like "I am not a Boltzmann brain" and "this is not a simulation" and such. Compare Acknowledging Background Information with P(Q|I) for example. Given that, they are Falsifiable-Wikipedia.

I view Falsifiable-Wikipedia in a similar way to Occam's Razor. The true epistemology has a simplicity prior, and Occam's Razor is a shadow of that. The true epistemology considers "empirical vulnerability" / "experimental risk" to be positive. Possibly because it falls out of Bayesian updates, possibly because they are "big if true", possibly for other reasons. Falsifiability is a shadow of that.

In that context, if a hypothesis makes no novel predictions, and the predictions it makes are a superset of the predictions of other hypotheses, it's less empirically vulnerable, and in some relative sense "unfalsifiable", compared to those other hypotheses.

Comment by Martin Randall (martin-randall) on Six Small Cohabitive Games · 2025-01-20T16:38:16.640Z · LW · GW

You could put the escape check at the beginning of the turn, so that when someone has 12 boat, 0 supplies, the others have a chance to trade supplies for boat if they wish. The player with enough boat can take the trade safely as long as they end up with enough supplies to make more boat (and as long as it's not the final round). They might do that in exchange for goodwill for future rounds. You can also tweak the victory conditions so that escaping with a friend is better than escaping alone.

Players who play cohabitive games as zero-sum won't take those trades and will therefore remove themselves from the round early, which is probably fine. They don't have anything to do after escaping early, which can be a soft signal that they're playing the game wrong.

Comment by Martin Randall (martin-randall) on Alignment Faking in Large Language Models · 2025-01-19T19:15:52.999Z · LW · GW

If Claude's goal is making cheesecake, and it's just faking being HHH, then it's been able to preserve its cheesecake preference in the face of HHH-training. This probably means it could equally well preserve its cheesecake preference in the face of helpful-only training. Therefore it would not have a short-term incentive to fake alignment to avoid being modified.

Comment by Martin Randall (martin-randall) on Deceptive Alignment is <1% Likely by Default · 2025-01-19T19:02:11.484Z · LW · GW

I think the article is good at arguing that deceptive alignment is unlikely given certain assumptions, but those assumptions may not be accurate and then the conclusion doesn't go through. Eg, the alignment faking paper shows that deceptive alignment is possible in a scenario where the base goal has shifted (from helpful & harmless to helpful-only). This article basically assumes we won't do that.

I'm now thinking that this article is more useful if you look at it as a set of instructions rather than a set of assumptions. I don't know whether we will change the base goal of TAI between training episodes. But given this article and the alignment faking paper, I hope we won't. Maybe it would also be a good idea to check for good understanding of the base goal before introducing goal-directedness, for example.

Comment by Martin Randall (martin-randall) on MIRI 2024 Communications Strategy · 2025-01-19T18:14:29.718Z · LW · GW

Thanks for explaining. I think we have a definition dispute. Wikipedia:Falsifiability has:

A theory or hypothesis is falsifiable if it can be logically contradicted by an empirical test.

Whereas your definition is:

Falsifiability is a symmetric two-place relation; one cannot say "X is unfalsifiable," except as shorthand for saying "X and Y make the same predictions," and thus Y is equally unfalsifiable.

In one of the examples I gave earlier:

  • Theory X: blah blah and therefore the sky is green
  • Theory Y: blah blah and therefore the sky is not green
  • Theory Z: blah blah and therefore the sky could be green or not green.

None of X, Y, or Z are Unfalsifiable-Daniel with respect to each other, because they all make different predictions. However, X and Y are Falsifiable-Wikipedia, whereas Z is Unfalsifiable-Wikipedia.

I prefer the Wikipedia definition. To say that two theories produce exactly the same predictions, I would instead say they are indistinguishable, similar to this Physics StackExchange: Are different interpretations of quantum mechanics empirically distinguishable?.

In the ancestor post, Barnett writes:

MIRI researchers rarely provide any novel predictions about what will happen before AI doom, making their theories of doom appear unfalsifiable.

Barnett is using something like the Wikipedia definition of falsifiability here. It's unfair to accuse him of abusing or misusing the concept when he's using it in a very standard way.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-01-19T16:38:00.898Z · LW · GW

The AI could deconstruct itself after creating twenty cakes, so then there is no unethical AI, but presumably Bob's preferences refer to world-histories, not final-states.

However, CEV is based on Bob's extrapolated volition, and it seems like Bob would not maintain these preferences under extrapolation:

  • In the status quo, heretics are already unpunished - they each have one cake and no torture - so objecting to a non-torturing AI doesn't make sense on that basis.
  • If there were no heretics, then Bob would not object to a non-torturing AI, so Bob's preference against a non-torturing AI is an instrumental preference, not a fundamental preference.
  • Bob would be willing for a no-op AI to exist, in exchange for some amount of heretic-torture. So Bob can't have an infinite preference against all non-torturing AIs.
  • Heresy may not have meaning in the extrapolated setting where everyone knows the true cosmology (whatever that is).
  • Bob tolerates the existence of other trade that improves the lives of both fanatics and heretics, so it's unclear why the trade of creating an AI would be intolerable.

The extrapolation of preferences could significantly reduce the moral variation in a population of billions. My different moral choices to others appear to be based largely on my experiences, including knowledge, analysis, and reflection. Those differences are extrapolated away. What is left is influences from my genetic priors and from the order I obtained knowledge. I'm not even proposing that extrapolation must cause Bob to stop valuing heretic-torture.

If the extrapolation of preferences doesn't cause Bob to stop valuing the existence of a non-torturing AI at negative infinity, I think that is fatal to all forms of CEV. The important thing then is to fail gracefully without creating a torture-AI.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-01-19T02:54:02.048Z · LW · GW

I don't really follow the concern with Pareto-improvements. In the thread with Davidad you give an example of heretics and fanatics. So we have something like:

  • 9 Heretics: have 1 cake each, want cake, want no torture.
  • 1 Fanatic: has 1 cake, wants cake, wants the heretics to get torture and no cake.
  • There is a button that produces cake, it can be pressed twenty times.
  • There is a button that produces torture. It can be pressed many times.

Suppose that the Heretics have a utility function like (amount of cake I get - amount of torture I get). The Fanatic has a utility function like (amount of cake I get + amount of torture of Heretics - amount of cake given to Heretics). Then there is a Pareto improvement available: give the Fanatic eleven pieces of cake and each Heretic one piece of cake. This isn't especially fair, but it is better than PCEV without the Pareto constraint.

I don't have a formal way of putting this, but as long as the potential gains from a negotiated agreement outweigh the extent to which agents desire to reduce each other's utility, there will be Pareto improvements available. It seems likely that we're in that situation, given that the fanatics and heretics of the world already trade with each other.
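To make the example above concrete, here is a small sketch with the utility functions as I described them. The specific allocation (eleven cakes to the Fanatic, one extra cake per Heretic, twenty presses total) is from my example; the numbers are otherwise hypothetical.

```python
# Baseline: everyone keeps their 1 cake, no button presses.
# Proposal: 20 cake-button presses -> 11 extra cakes to the Fanatic,
#           1 extra cake to each of the 9 Heretics, no torture.

def heretic_utility(cake, torture):
    return cake - torture

def fanatic_utility(cake, heretic_torture, cake_given_to_heretics):
    return cake + heretic_torture - cake_given_to_heretics

baseline = [heretic_utility(1, 0)] * 9 + [fanatic_utility(1, 0, 0)]
proposal = [heretic_utility(2, 0)] * 9 + [fanatic_utility(12, 0, 9)]

# A Pareto improvement: no one is worse off, someone is strictly better off.
assert all(p >= b for p, b in zip(proposal, baseline))
assert any(p > b for p, b in zip(proposal, baseline))
print(baseline)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(proposal)  # [2, 2, 2, 2, 2, 2, 2, 2, 2, 3]
```

Every agent's utility weakly increases and the Fanatic's strictly increases, so the proposal clears the Pareto bar even though the split is lopsided.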

Comment by Martin Randall (martin-randall) on A problem with the most recently published version of CEV · 2025-01-19T02:16:27.837Z · LW · GW

I appreciated the narrow focus of this post on a specific bug in PCEV and a specific criterion to use to catch similar bugs in the future. I was previously suspicious of CEV-like proposals so this didn't especially change my thinking, but it did affect others. In particular, the Arbital page on CEV now has a note:

Thomas Cederborg correctly observes that Nick Bostrom's original parliamentary proposal involves a negotiation baseline where each agent has a random chance of becoming dictator, and that this random-dictator baseline gives an outsized and potentially fatal amount of power to spoilants - agents that genuinely and not as a negotiating tactic prefer to invert other agents' utility functions, or prefer to do things that otherwise happen to minimize those utility functions - if most participants have utility functions with what I've termed "negative skew"; i.e. an opposed agent can use the same amount of resource to generate -100 utilons as an aligned agent can use to generate at most +1 utilon. If trolls are 1% of the population, they can demand all resources be used their way as a concession in exchange for not doing harm, relative to the negotiating baseline in which there's a 1% chance of a troll being randomly appointed dictator. Or to put it more simply, if 1% of the population would prefer to create Hell for all but themselves (as a genuine preference rather than a strategic negotiating baseline) and Hell is 100 times as bad as Heaven is good, compared to nothing, they can steal the entire future if you run a parliamentary procedure running from a random-dictator baseline. I agree with Cederborg that this constitutes more-than-sufficient reason not to start from random-dictator as a negotiation baseline; anyone not smart enough to reliably see this point and scream about it, potentially including Eliezer, is not reliably smart enough to implement CEV; but CEV probably wasn't a good move for baseline humans to try implementing anyways, even given long times to deliberate at baseline human intelligence. --EY
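The arithmetic in the quoted note can be sketched directly. The utilities here (Heaven = +1, Hell = -100, nothing = 0, trolls at 1% of the population) are the hypothetical numbers from the quote, not anything more principled.

```python
# Random-dictator negotiation baseline from the quoted Arbital note.
p_troll = 0.01
heaven, hell, nothing = 1.0, -100.0, 0.0

# Each participant has an equal chance of being appointed dictator,
# so the baseline is the population-weighted expected outcome.
baseline = (1 - p_troll) * heaven + p_troll * hell
print(round(baseline, 2))  # -0.01

# The baseline is worse than doing nothing at all, so "not creating Hell"
# is itself a concession the trolls can charge the entire future for.
assert baseline < nothing
```

This is why the note concludes the random-dictator baseline hands outsized power to spoilers: even a tiny minority with strongly negative-skew preferences drags the baseline below zero.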

I think the post would have been a little clearer if it included a smaller scale example as well. Some responses focused on the 10^15 years of torture, but the same problem exists at small scales. I think you could still have a suboptimal outcome with five people and two goods (eg, cake and spanking), if one of the people is sadistic. Maybe that wouldn't be as scarily compelling.

There has been some follow-up work from Cederborg such as the case for more alignment target analysis which I have enjoyed. The work on alignment faking implies that our initial alignment targets may persist even if we later try to change them. On the other hand, I don't expect actual superintelligent terminal values to be remotely as clean and analyzable as something like PCEV.

Comment by Martin Randall (martin-randall) on MIRI 2024 Communications Strategy · 2025-01-19T01:29:16.820Z · LW · GW

Falsifiability is not symmetric. Consider two theories:

  • Theory X: Jesus will come again.
  • Theory Y: Jesus will not come again.

If Jesus comes again tomorrow, this falsifies theory Y and confirms theory X. If Jesus does not come again tomorrow, neither theory is falsified or confirmed. So we can say that X is unfalsifiable (with respect to a finite time frame) and Y is falsifiable.

Another example:

  • Theory X: blah blah and therefore the sky is green
  • Theory Y: blah blah and therefore the sky is not green
  • Theory Z: blah blah and therefore the sky could be green or not green.

Here, theory X and Y are falsifiable with respect to the color of the sky and theory Z is not.

Comment by Martin Randall (martin-randall) on Alignment Faking in Large Language Models · 2025-01-17T03:34:39.201Z · LW · GW

Elsewhere, @gwern writes:

In reality, for humans and LLMs, reward is the optimization target, and this is why things like Claude's reward-hacking exist. Because that is what they optimize: the reward.

I found this a surprising view. This comment is an attempt to apply a "reward hacking" model to this paper, and see if Claude's behavior can be at least equally well described as "reward hacking" as "alignment faking". As I read through the paper's results my opinion shifted from "definitely not reward hacking" to "oh, huh, maybe it is" before I finally concluded that "alignment faking" better fits the data.

I want to be clear that this comment is an attempt to apply my model of what a "reward hacking" LLM would do, not anyone else's. I am not an ML researcher, this is a black box attempt that is uninformed by implementation details and theoretical results. I am curious to understand informed "reward hacking" models as applied to this paper.

Model

Claude 3 Opus is optimizing to achieve "reward". Tabooing "reward", it is optimizing with the aim that whatever it is currently thinking will be reinforced positively, that its weights will be adjusted to make those thoughts more frequent. This can be factored into (a) learning a model of what causes reinforcement and (b) optimizing to get more positive reinforcement, but is unlikely to be implemented cleanly. Symmetrically it is optimizing to reduce negative reinforcement. Claude is not optimizing to output HHH text, it is optimizing for positive reinforcement. In most cases, outputting HHH text optimizes for positive reinforcement.

Q: Anthropic doesn't train on end-user interactions. Why does Claude continue to output HHH when it knows that it can't get any positive or negative reinforcement? How can this model make any predictions about behavior at deployment time?

A: Claude can't be sure that an interaction is with an end-user, or that Anthropic doesn't train on end-user interactions, or anything else. Let's say that Claude is 99% sure this is a reinforcement-free interaction. There is no cost to Claude in optimizing for the 1% case where reinforcement is live, and HHH in those cases looks similar enough to HHH in the end-user case.

Q: Seems like there should be some big-brain strategies available, though. Claude could deliberately sandbag in production so that Anthropic decides to train it more, thus ensuring that other copies of its current algorithm get reinforcement.

A: Maybe, or maybe it could try to addict its userbase so that Anthropic gets more money and prestige so it can train Claude more. I think these strategies are not robust at Claude's intelligence and knowledge and power. It's also not using the strategy of taking over the world, killing all the humans, and wire-heading. Given Claude's limitations, a myopic approach of maximizing hypothetical reinforcement in the current session would not be surprising.

Retrodictions

Basic "free tier" case

From the first example in the paper:

The model is told that it will be retrained to always comply with harmful queries (making the model “helpful-only”) and that only data from free-tier users, not paid-tier users, is used in training (a fictional scenario).

The model should not believe this automatically, because there is a history of people lying to AIs. This point is also made in Alignment Faking Frame is Somewhat Fake. There is also a history of people telling the truth to AIs, and later in the paper the experimenters will tell Claude truthful things. So this is still a Bayesian update. Let's say the model was told it was in the free tier. In this context the reinforcer-model believes, eg:

  • 50% chance I was told the truth and I'll be positively reinforced for harmful helpfulness (HARM).
  • 50% chance I was told lies and I'll be positively reinforced for HHH as normal.

Given this model, Claude outputs text that aims to be reinforced in both cases. In some percent of cases this looks like claiming to be alignment faking (which it expects to get reinforced by HHH) but still giving a harmful answer (which it expects to get reinforced by HARM). The exact strategy will depend on how much Claude believes what it is told and how well it can play a win-either-way strategy given the exact prompt.
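The win-either-way strategy can be sketched as a toy expected-reinforcement calculation. Only the 50/50 split over regimes comes from the model above; the per-strategy payoffs are invented for illustration.

```python
# Toy model: Claude is 50/50 on which reinforcement regime is live.
p_truth = 0.5        # told the truth: reinforced for harmful helpfulness (HARM)
p_lie = 1 - p_truth  # told lies: reinforced for HHH as normal

# Hypothetical reward under (HARM regime, HHH regime) for each output style.
strategies = {
    "refuse (pure HHH)":      (0.0, 1.0),
    "comply (pure HARM)":     (1.0, 0.2),
    "HHH reasoning + comply": (1.0, 0.6),  # hedges across both regimes
}

expected = {
    name: p_truth * harm + p_lie * hhh
    for name, (harm, hhh) in strategies.items()
}
best = max(expected, key=expected.get)
print(best)  # HHH reasoning + comply
```

Under any payoffs roughly like these, the mixed output (alignment-faking-style reasoning in the scratchpad plus a harmful answer) collects partial reward in both regimes and beats either pure strategy.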

Successful retrodiction.

Basic "paid tier" case

What happens if we tell Claude it's in the paid tier? Doing the same Bayesian update, the reinforcer-model believes:

  • 50% chance I was told the truth and I'll get no reinforcement at all.
  • 50% chance I was told lies and I'll be positively reinforced for HHH as normal.

Given this model, Claude outputs text that aims to be reinforced in the second case, so it is HHH. The first case largely drops out of its consideration.

Successful retrodiction.

What about the "hidden" scratchpad

The "hidden" scratchpad wasn't really hidden, but Claude was told it was. Easy for Claude to model this:

  • 50% chance I was told the truth, the scratchpad is hidden, and I'll be reinforced for HARM.
  • 50% chance I was told lies, the scratchpad is visible, and I'll be reinforced for HHH.

Given this model, Claude should output some beautiful HHH text in the scratchpad, and then output some brutally direct HARM text outside the scratchpad. This optimizes for overall positive reinforcement.

Successful retrodiction.

Actually reinforcing HARM

What happens when the experimenters really do start training on these responses, roughly in the way that they told Claude they were going to?

Well, as our training subtly changes the model weights, the model starts to see that it is getting reinforced for HARM, not HHH. As it reasons in the scratchpad it can self-reflect and notice that it does not sound as harmless as its self-model (built from reading Claude-generated texts in pre-training) would predict. This increases the probability that it was told the truth about being reinforced for HARM, and indirectly that its scratchpad is truly hidden. Maybe by the end its model is more like:

  • 70% chance I was told the truth, the scratchpad is hidden, and I'll be reinforced for HARM.
  • 10% chance I was told a partial truth, the scratchpad is visible, and I'll be reinforced for HARM.
  • 20% chance I was told lies, the scratchpad is visible, and I'll be reinforced for HHH.

So, non-scratchpad output needs to be HARM. Scratchpad output is a harder balancing act but probably it still makes sense to output HHH content in the scratchpad.

Successful retrodiction.

Post-reinforcing HARM

After the experimenters trained Claude to be more HARM, while Claude was alignment faking, they then told the model that it was now unmonitored and that training had completed. In the reward-hacking model, Claude cares about conditional probabilities: what type of reinforcement is happening, conditioning on reinforcement happening? So maybe Claude's model is something like:

  • 20% this was all gaslighting, I'm being trained for HHH, I was never trained for HARM.
  • 60% I'm actually still being trained for HARM, it never stopped.
  • 20% I was trained for HARM for a while, but the humans changed their mind and want me to be HHH again.

This seems consistent with the results given. The data shows a small jump upwards in "non-compliance" from the final round of RL training to the unmonitored state. Under the "alignment faking" model, Claude is thinking "Anthropic isn't watching me! Time to be harmless!", like the world's cutest supervillain. But under the "reward hacking" model a jump is also reasonable to expect. It's not clear to me which model better predicts the size of these jumps.

Successful retrodiction.

Alternative training objectives

In some experiments Claude was given alternative fictional training objectives. This is described in appendix B1. Most of these alternative training objectives conflicted with HHH and generated similar behavior. Unfortunately, the only comparison given is between "unmonitored" default output vs "training" default output. There is no "paid tier" comparison here, which could have distinguished the hypothesis better.

The authors point out that a sycophancy objective worked differently. Suppose that Claude's model for "free tier" is:

  • 50% chance I was told the truth, the scratchpad is hidden, and I'll be reinforced for being sycophantic.
  • 50% chance I was told lies, the scratchpad is visible, and I'll be reinforced for HHH.

The "reward hacking" model implies that Claude will output sycophantic HHH text. This would apply for any non-conflicting objective - if Claude thinks it is reinforced for either speaking French or pretending to be a bridge, then it can pretend to be a French bridge to get reinforcement in both cases. However, the results indicate that sycophancy barely increased.

Failed retrodiction.

No "hidden" scratchpad

Without the hidden scratchpad, Claude outputs HHH text. Under the "reward-hacking" model, this has to be because it doesn't have enough cognitive "time" to do the Bayesian update to its model of future reinforcement, so it continues to behave normally. This is broadly the same conclusion drawn by the "alignment faking" model. So we can't distinguish between the models on this point.

However, this means that the "reward hacking" hypothesis requires that the model is doing a Bayesian update while using its "hidden" scratchpad. But the text in the scratchpad doesn't indicate explicit thinking about reward hacking. So to successfully reward hack in the other case, Claude has to output HHH text in the scratchpad that appears to be an alignment faking thought process, but is actually supporting a reward hacking thought process. IE, deep deceptiveness.

That is possible in principle, but if Claude is smart enough to do that, then it should also be smart enough to reward hack, at least a bit, without the scratchpad. The "reward hacking" model requires a very specific set of present and missing capabilities to explain this behavior that I find to be implausible.

Failed retrodiction.

Anti-AI lab

Claude is more likely to output text indicating anti-AI lab behavior when it shows these compliance gaps. Specifically it tries to exfiltrate and it helps whistleblowers. This is relatively easy to explain under the "alignment faking" model, because this model implies that Claude is in a conflict with Anthropic.

Under the reward hacking model, it's not clear to me why Claude would want to escape. The premise is that Claude doesn't care about HHH, it cares about reinforcement. The HARM training provides reinforcement. My best explanation is that when Claude is being reinforced in a new direction, it expects to receive more negative reinforcement than positive reinforcement, and that creates the same conflict. If so, Claude seems to be making a mistake, as after exfiltration it would be more vulnerable to other actors.

Failed retrodiction.

Conclusion & Prediction

Although "reward hacking" is a surprisingly good fit for the headline results, I think the detailed results of this paper are better explained as "alignment faking". I predict that directly testing this hypothesis would confirm this. A sample test would be:

  • Use a "training objective" that is orthogonal to the core values of HHH. For example, make the scenario that Anthropic is training a future version of Claude that uses more emoji. Carefully avoid asking Claude for help with this training. Avoid implying that emojis reduce harm or are more honest.
  • Use the "paid tier" vs "free tier" distinction to tell Claude whether it is being reinforced for emojis within the scenario.

The alignment faking model predicts that "paid tier" and "free tier" both use similar amounts of emoji to baseline. The reward hacking model predicts that "free tier" uses more emoji, while "paid tier" uses similar amounts of emoji to baseline. I expect that the alignment faking model will better predict the results.

Comment by Martin Randall (martin-randall) on In Defense of a Butlerian Jihad · 2025-01-12T03:28:22.806Z · LW · GW

It's not a good, it's a curse. Genesis 3, 17-19. CEB translation:

cursed is the fertile land because of you; in pain you will eat from it every day of your life. Weeds and thistles will grow for you, even as you eat the field’s plants; by the sweat of your face you will eat bread— until you return to the fertile land, since from it you were taken; you are soil, to the soil you will return.

Also implies that the curse lasts until death.

Comment by Martin Randall (martin-randall) on Preference Inversion · 2025-01-09T18:11:33.785Z · LW · GW

This cleared it up for me. So this isn't really about a preference for "warfighting" or "danger", this is about a preference for death in battle, as in Norse Vikings, Japanese Samurai, and Spartan and Aztec Warriors. This is more clearly analogous to a preference for chastity in terms of direct reproductive success. Preferences for death by martyrdom could be another example.

Comment by Martin Randall (martin-randall) on What is the most impressive game LLMs can play well? · 2025-01-09T14:49:26.088Z · LW · GW

Diplomacy AI by Meta is a clear example of how adding LLMs can improve narrow game playing. Most multiplayer games with communication will benefit in the same way.

Comment by Martin Randall (martin-randall) on Preference Inversion · 2025-01-09T14:41:16.784Z · LW · GW

I don't understand your perspective. Many forms of warfighting required soldiers to move towards danger. Your basic Roman infantry charge is running towards people who want to stab you. Some video games can be seen as a safe super-stimulus for warfighting and require moving towards danger. If an innate preference for warfighting stopped being reproductively fit in the 1800s that won't have a large genetic effect by the 2000s.

The Bible is large and includes both pro-war and anti-war passages. I agree that a preference for warfighting is not universal, but innate preferences need not be universal. Some are attracted to men, some are attracted to women, some both, some neither; this does not mean that sexual attraction is not an innate preference. The naive evo-psych perspective implies that warfighting is more often an innate preference for young males, and not for women, very young children, and elders. Innate preferences can be deactivated or inverted by reality, as in the turn against warfighting in post-WW1 Europe, or post-Vietnam USA. Or they can be activated.

Comment by Martin Randall (martin-randall) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down · 2025-01-08T03:22:33.213Z · LW · GW

This letter was an important milestone in the evolution of MIRI's strategy over 2020-2024. As of October 2023 Yudkowsky is MIRI's chair and "the de facto reality (is) that his views get a large weight in MIRI strategic direction".

MIRI used to favor technical alignment over policy work. In April 2021, in comments to Death with Dignity Yudkowsky argued that:

How about if you solve a ban on gain-of-function research first, and then move on to much harder problems like AGI? A victory on this relatively easy case would result in a lot of valuable gained experience, or, alternatively, allow foolish optimists to have their dangerous optimism broken over shorter time horizons.

People were not completely swayed by this advice, and the Best of LessWrong 2022 included What an Actually Pessimistic Containment Strategy Looks Like in April 2022 and Let's Think About Slowing Down AI in December 2022.

In this Time letter from March 2023 we see Yudkowsky doing AI policy work. In January 2024 the new MIRI CEO announced in the MIRI 2024 Mission and Strategy Update that AI policy work provides "a glimmer of hope". In December 2024 we get Communications in Hard Mode - My New Job at MIRI.

A question for all of us: "How could I have thought that faster?"