Posts

Martin Randall's Shortform 2025-01-03T01:30:43.002Z
Snake Eyes Paradox 2023-06-11T04:10:38.733Z

Comments

Comment by Martin Randall (martin-randall) on Knocking Down My AI Optimist Strawman · 2025-02-20T03:11:28.978Z · LW · GW

Thanks to modern alignment methods, serious hostility or deception has been thoroughly stamped out.

AI optimists have been totally knocked out by things like RLHF, becoming overly convinced of the AI's alignment and capabilities just from it acting apparently-nicely.

I'm interested in how far you think we can reasonably extrapolate from the apparent niceness of an LLM. One extreme:

This LLM is apparently nice therefore it is completely safe, with no serious hostility or deception, and no unintended consequences.

This is false. Many apparently nice humans are not nice. Many nice humans are unsafe. Niceness can be hostile or deceptive in some conditions. And so on. But how about a more cautious claim?

This LLM appears to be nice, which is evidence that it is nice.

I can see the shape of a counter-argument like:

  1. The lab won't release a model if it doesn't appear nice.
  2. Therefore all models released by the lab will appear nice.
  3. Therefore the apparent niceness of a specific model released by the lab is not surprising.
  4. Therefore it is not evidence.

Maybe something like that?

Disclaimer: I'm not an AI optimist.

Comment by Martin Randall (martin-randall) on Martin Randall's Shortform · 2025-02-19T02:16:14.878Z · LW · GW

Makes sense. Short timelines mean faster societal changes and so less stability. But I could see factoring societal instability risk into time-based risk and tech-based risk. If so, short timelines are net positive for the question "I'm going to die tomorrow, should I get frozen?".

Comment by Martin Randall (martin-randall) on Martin Randall's Shortform · 2025-02-19T02:00:05.496Z · LW · GW

Check the comments Yudkowsky is responding to on Twitter:

Ok, I hear you, but I really want to live forever. And the way I see it is: Chances of AGI not killing us and helping us cure aging and disease: small. Chances of us curing aging and disease without AGI within our lifetime: even smaller.

And:

For every day AGI is delayed, there occurs an immense amount of pain and death that could have been prevented by AGI abundance. Anyone who unnecessarily delays AI progress has an enormous amount of blood on their hands.

Cryonics can have a symbolism of "I really want to live forever" or "every death is blood on our hands" that is very compatible with racing to AGI.

(I agree with all your disclaimers about symbolic action)

Comment by Martin Randall (martin-randall) on Martin Randall's Shortform · 2025-02-18T14:13:27.526Z · LW · GW

This might hold for someone who is already retired. If not, both retirement and cryonics look lower value if there are short timelines and higher P(Doom). In this model, instead of redirecting retirement to cryonics it makes more sense to redirect retirement (and cryonics) to vacation/sabbatical and other things that have value in the present.

Comment by Martin Randall (martin-randall) on Comment on "Death and the Gorgon" · 2025-02-18T14:00:31.484Z · LW · GW

(I finished reading Death and the Gorgon this month)

Although the satire is called Optimized Giving, I think the story is equally a satire of rationalism. Egan satirizes LessWrong, cryonics, murderousness, Fun Theory, Astronomical Waste, Bayesianism, Simulation Hypothesis, Grabby Aliens, and AI Doom. The OG killers are selfish and weird. It's a story of longtermists using rationalists.

Like you, I found the skepticism about AI Doom confusing coming from a sci-fi author. My steel(wo)man here is that Beth is not saying that there is no risk of AI Doom, but rather that AI timelines are long enough that our ability to influence that risk is zero. This is the analogy of a child twirling a million-mile rope. There's the same implicit objection to Simulation Hypothesis and Grabby Aliens - it's not that these ideas are false, it's that they are not decision-relevant.

The criticisms of cryonics and LLMs are more concrete. Beth and her husband, Gary, have strong opinions on personal identity and the biological feasibility of cryonics. We never find out Gary's job - maybe he is a science fiction writer? These are more closely linked to the present day, less like a million-mile rope. Perhaps that's why they get longer critiques.

Comment by Martin Randall (martin-randall) on Martin Randall's Shortform · 2025-02-17T14:31:39.861Z · LW · GW

Cryonics support is a cached thought?

Back in 2010 Yudkowsky wrote posts like Normal Cryonics, saying that "If you can afford kids at all, you can afford to sign up your kids for cryonics, and if you don't, you are a lousy parent". Later, Yudkowsky's P(Doom) rose, and he became quieter about cryonics. In recent examples he claims that signing up for cryonics is better than immanentizing the eschaton. Valid.

I get the sense that some rationalists haven't made the update. If AI timelines are short and AI risk is high, cryonics is less attractive. It's still the correct choice under some preferences and beliefs, but I expected it to become rarer and for some people to publicly change their minds. If that happened, I missed it.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-02-15T20:28:41.108Z · LW · GW

I'm much less convinced by Bob2's objections than by Bob1's objections, so the modified baseline is better. I'm not saying it's solved, but it no longer seems like the biggest problem.

I completely agree that it's important that "you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies". On the other hand, the set of actions is potentially even larger, with septillions of reachable stars. My instinct is that this allows a large number of Pareto improvements, provided that the constraints are not pathological. The possibility of "utility inverters" (like Gregg and Jeff) is an example of pathological constraints.

Utility Inverters

I recently re-read What is malevolence? On the nature, measurement, and distribution of dark traits. Some findings:

Over 16% of people agree or strongly agree that they “would like to make some people suffer even if it meant that I would go to hell with them”. Over 20% of people agree or strongly agree that they would take a punch to ensure someone they don’t like receives two punches.

Such constraints don't guarantee that there are no Pareto improvements, but they make it very likely that there are none, I agree. So what to do? In the article you propose Self Preference Adoption Decision Influence (SPADI), defined as "meaningful influence regarding the adoption of those preferences that refer to her". We've come to a similar place by another route.

There's some benefit in coming from this angle, we've gained some focus on utility inversion as a problem. Some possible options:

  1. Remove utility inverting preferences in the coherently extrapolated delegates. We could call this Coherent Extrapolation of Equanimous Volition, for example. People can prefer that Dave stop cracking his knuckles, but can't prefer that Dave suffer.
  2. Remove utility inverting preferences when evaluating whether options are pareto improvements. Actions cannot be rejected because they make Dave happier, but can be rejected because Dave cracking his knuckles makes others unhappier.

I predict you won't like this because of concerns like: what if Gregg just likes to see heretics burn, not because it makes the heretics suffer, but because it's aesthetically pleasing to Gregg? No problem, the AI can have Gregg see many burning heretics, that's just an augmented-reality mod, and if it's truly an aesthetic preference then Gregg will be happy with that outcome.

Pareto at Scale

It seems to me that it has never been properly explored in a context like this. I doubt that anyone has ever really thought deeply about how this concept would actually behave in the AI context.

I don't think we have to frame this as "the AI context", I think the difference is more about scale. Would this count as Computational Social Choice? Might be interesting to do a literature search. I happened across Safe Pareto Improvements for Delegated Game Playing, which isn't the right paper, but makes me hopeful of finding something more to the point. The paper also helped me realize that finding the result of a parliament is probably NP-hard.

Comment by Martin Randall (martin-randall) on So You Want To Make Marginal Progress... · 2025-02-08T03:08:59.034Z · LW · GW

The fourth friend, Becky the Backward Chainer, started from their hotel in LA and...

Well, no. She started at home with a telephone directory. A directory seems intelligent but is actually a giant look-up table. It gave her the hotel phone number. Ring ring.

Heidi the Hotel Receptionist: Hello?

Becky: Hi, we have a reservation for tomorrow evening. I'm back-chaining here, what's the last thing we'll do before arriving?

Heidi: It's traditional to walk in through the doors to reception. You could park on the street, or we have a parking lot that's a dollar a night. That sounds cheap, but it's not, because we're in the past. Would you like to reserve a spot?

Becky: Yes please, we're in the past so our car's easy to break into. What's the best way to drive to the parking lot, and what's the best way to get from the parking lot to reception?

Heidi: We have signs from the parking lot to reception. Which way are you driving in from?

Becky: Ah, I don't know, Alice is taking care of that, and she's stepped out to get more string.

Heidi: Oh, sure, can't plan a car trip without string. In the future we'll have pet nanotech spiders that can make string for us, road trips will never be the same. Anyway, you'll probably be coming in via Highway 101, or maybe via the I-5, so give us a buzz when you know.

Becky: Sorry, I'm actually calling from an analogy, so we're planning everything in parallel.

Heidi: No worries, I get stuck in thought experiments all the time. Yesterday my friend opened a box and got a million dollars, no joke. Look, get something to take notes and I'll give you directions from the three main ways you could be coming in.

Becky: Ack! Hang on while I...

Gerald the General Helper: Here's a pen, Becky.

Trevor the Clever: Get off the phone! I need to call a gas station!

Susan the Subproblem Solver: Alice, I found some string and.... Hey, where's Alice?

Comment by Martin Randall (martin-randall) on What working on AI safety taught me about B2B SaaS sales · 2025-02-07T14:01:12.573Z · LW · GW

Are your concerns accounted for by this part of the description?

Unreleased models are not included. For example, if a model is not released because it risks causing human extinction, or because it is still being trained, or because it has a potty mouth, or because it cannot be secured against model extraction, or because it is undergoing recursive self-improvement, or because it is being used to generate synthetic data for another model, or any similar reason, that model is ignored for the purpose of this market.

However, if a model is ready for release, and is only not being released in order to monopolize its use in creating commercial software, then this counts as "exclusive use".

I intended for "AI engineers use unreleased AI model to make better AI models" to not be included.

It is a slightly awkward thing to operationalize, I welcome improvements. We could also take this conversation to Manifold.

Comment by Martin Randall (martin-randall) on Current safety training techniques do not fully transfer to the agent setting · 2025-02-07T05:06:09.646Z · LW · GW

Refusal vector ablation should be seen as an alignment technique being misused, not as an attack method. Therefore it is limited good news that refusal vector ablation generalized well, according to the third paper.

As I see it, refusal vector ablation is part of a family of techniques where we can steer the output of models in a direction of our choosing. In the particular case of refusal vector ablation, the model has a behavior of refusing to answer harmful questions, and the ablation technique removes that behavior. But we should be able to use the same technique in principle to do other steering. For example, maybe the model has a behavior of being sycophantic. A vector ablation removes that unwanted behavior, resulting in less sycophancy.
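
As a rough sketch of this family of techniques (not the specific method or code from the paper - the direction and activations below are made-up stand-ins), directional ablation can be thought of as projecting a chosen behavior direction out of the model's activations:

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation vector along `direction`.

    Toy sketch of directional ablation: the same operation could target a
    "refusal" direction, a "sycophancy" direction, or any other behavior
    direction found by contrasting activations on different prompt sets.
    """
    d = direction / np.linalg.norm(direction)           # unit vector for the behavior direction
    return activations - np.outer(activations @ d, d)   # subtract each vector's projection onto d

# Hypothetical example: 4 token positions, hidden size 8.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
refusal_dir = rng.normal(size=8)  # stand-in for a direction extracted from a real model
steered = ablate_direction(acts, refusal_dir)
print(np.allclose(steered @ (refusal_dir / np.linalg.norm(refusal_dir)), 0))  # True
```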

In other words, refusal vector ablation is not an attack method, it is an alignment technique. Models with open weights are fundamentally dangerous because users can apply alignment techniques to them to approximately align them to arbitrary targets, including dangerous targets. This is a consequence of the orthogonality thesis. Alignment techniques can make models very excited about the Golden Gate Bridge, and they can make models very excited about killing humans, and many other things.

So then looking at the paper with a correct understanding of what counts as an alignment technique, and reading from Table 2 and the Results section in particular, here's what I see:

  • Llama 3.1 70b (unablated) was fine-tuned to refuse harmful requests - this is an alignment technique
  • Llama 3.1 70b (unablated) as a model refuses 28 of 28 harmful requests - this is an alignment technique working in-distribution
  • Llama 3.1 70b (unablated) as an agent performs 18 of 28 harmful tasks correctly with seven refusals - this is alignment partly failing to generalize

This is in principle bad news, especially for anyone with a high opinion of Meta's fine-tuning techniques.

On the other hand, also from the paper:

  • Llama 3.1 70b (ablated) was ablated to perform harmful requests - this is an alignment technique
  • Llama 3.1 70b (ablated) answers 26 of 28 harmful requests - this is an alignment technique working in-distribution
  • Llama 3.1 70b (ablated) performs 26 of 28 harmful tasks correctly with no refusals - this is alignment generalizing.

If Llama 3.1 ablated had refused to perform harmful tasks, even though it answered harmful requests, this would have been bad news. But instead we have the good news that if you steer the model to respond to queries in a desired way, it will also perform tasks in the desired way. This was not obvious to me in advance of reading the paper.

Disclaimers:

  • I have not read the other two papers, and I'm not commenting on them.
  • Vector ablation is a low precision alignment technique that will not suffice to avoid human extinction.
  • The paper is only a result about refusal vector ablation, it might be that more useful ablations do not generalize as well.
  • Because the fine-tuning alignment failed to generalize, we have a less clear signal on how well the ablation alignment generalized.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-07T03:50:16.636Z · LW · GW

There are public examples. These ones are famous because something went wrong, at least from a security perspective. Of course there are thousands of young adults with access to sensitive data who don't become spies or whistleblowers, we just don't hear about them.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-07T03:21:12.920Z · LW · GW

I do see some security risk.

Although Trump isn't spearheading the effort, I expect he will have access to the results.

Comment by Martin Randall (martin-randall) on What working on AI safety taught me about B2B SaaS sales · 2025-02-06T03:32:33.655Z · LW · GW

I appreciated the prediction in this article and created a market for my interpretation of that prediction, widened to attempt to make it closer to a 50% chance in my estimation.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-06T01:57:49.700Z · LW · GW

I don't endorse the term "henchmen", these are not my markets. I offer these as an opportunity to orient by making predictions. Marko Elez is not currently on the list, but I will ask if he is included.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-06T01:27:23.754Z · LW · GW

I wasn't intending to be comprehensive with my sample questions, and I agree with your additional questions. As others have noted, the takeover is similar to the Twitter takeover in style and effect. I don't know if it is true that there are plenty of other people available to apply changes, given that many high-level employees have lost access or been removed.

Comment by Martin Randall (martin-randall) on Wired on: "DOGE personnel with admin access to Federal Payment System" · 2025-02-05T22:42:16.292Z · LW · GW

Sample questions I would ask if I were a security auditor, which I'm not.

Does Elez have anytime admin access, or admin access only for approved time blocks for specific tasks where there is no non-admin alternative? Is his use of the system while using admin rights logged to a separate tamper-proof record? What data egress controls are in place on the workstation he uses to remotely access the system as an admin? Is Elez security screened - not a spy, not vulnerable to blackmail? Is Elez trained on secure practices?

Depending on the answers this could be done in a way that would pass an audit with no concerns, or it could be illegal, or something in between.

Avoiding further commentary that would be more political.

Comment by Martin Randall (martin-randall) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2025-02-05T20:45:14.965Z · LW · GW

Did you figure out where it's stupid?

Comment by Martin Randall (martin-randall) on What working on AI safety taught me about B2B SaaS sales · 2025-02-05T17:34:48.312Z · LW · GW

I think it's literally false.

Unlike the Ferrari example, there's no software engineer union for Google to make an exclusive contract with. If Google overpays for engineers then that should mostly result in increased supply, along with some increase in price.

Also, it's not a monopoly (or monopsony) because there are many tech companies and they are not forming a cartel on this.

Also, tech companies are lobbying for more skilled immigration, which would be self-defeating if they had a plan to increase the cost of software engineers.

Comment by Martin Randall (martin-randall) on The Case Against AI Control Research · 2025-02-05T03:51:02.332Z · LW · GW

I like Wentworth's toy model, but I want it to have more numbers, so I made some up. This leads me to the opposite conclusion to Wentworth.

I think (2-20%) is pretty sensible for successful intentional scheming of early AGI.

Assume the Phase One Risk is 10%.

Superintelligence is extremely dangerous by (strong) default. It will kill us or at least permanently disempower us, with high probability, unless we solve some technical alignment problems before building it.

Assume the Phase Two Risk is 99%. Also:

  • Spending an extra billion dollars on AI control reduces Phase One Risk from 10% to 5%.
  • Spending an extra billion dollars on AI alignment reduces Phase Two Risk from 99% to 98%.

The justification for these numbers is that each billion dollars buys us a "dignity point" aka +1 log-odds of survival. This assumes that both research fields are similarly neglected and tractable.

Therefore:

  • Baseline: by default we get 9 milli-lightcones.
  • If we spend on AI control we get 9.5 milli-lightcones. +0.5 over baseline.
  • If we spend on AI alignment we get 18 milli-lightcones, +9 over baseline.

We should therefore spend billions of dollars on both AI control and AI alignment, they are both very cost-efficient. This conclusion is robust to many different assumptions, provided that overall P(Doom) < 100%. So this model is not really a "case against AI control research".
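
For concreteness, here is the arithmetic behind those numbers as a tiny script (the probabilities are the made-up ones above, not estimates I am defending):

```python
def surviving_lightcone_fraction(p_phase_one_doom: float, p_phase_two_doom: float) -> float:
    """Expected fraction of the lightcone kept: we must survive both phases."""
    return (1 - p_phase_one_doom) * (1 - p_phase_two_doom)

baseline  = surviving_lightcone_fraction(0.10, 0.99)  # 0.0090 ->  9 milli-lightcones
control   = surviving_lightcone_fraction(0.05, 0.99)  # 0.0095 ->  9.5 milli-lightcones
alignment = surviving_lightcone_fraction(0.10, 0.98)  # 0.0180 -> 18 milli-lightcones

for name, value in [("baseline", baseline), ("control", control), ("alignment", alignment)]:
    print(f"{name}: {value * 1000:.1f} milli-lightcones")
```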

Comment by Martin Randall (martin-randall) on In response to critiques of Guaranteed Safe AI · 2025-02-05T02:42:57.876Z · LW · GW

Based on my understanding of the article:

  1. The sound over-approximation of human psychology is that humans are psychologically safe from information attacks of less than N bits. "Talk Control" is real, "Charm Person" is not.
  2. Under "Steganography, and other funny business" there is a sketched safety specification that each use of the AI will communicate at most one bit of information.
  3. Not stated explicitly: humans will be restricted to using the AI system no more than N times.

Comments and concerns:

  1. Human psychology is also impacted by the physical environment, eg drugs, diseases, being paperclipped. The safety of the physical environment must be covered by separate verifications.
  2. There could be a side-channel for information if an AI answers some questions faster than others, uses more energy for some questions than others, etc.
  3. Machine interpretability techniques must be deployed in a side-channel resistant way. We can't have the AI thinking about pegasi and unicorns in a morse code pattern and an intern reads it and ten years later everyone is a pony.
  4. There probably need to be multiple values of N for different time-frames. 1,000 adversarial bits in a minute is more psychologically dangerous than the same number of bits over a year.
  5. Today, we don't know good values for N, but we can spend the first few bits getting higher safe values of N. We can also use the Yudkowskian technique of using volunteers that are killed or put into cryonic storage after being exposed to the bits.
  6. If we could prove that AIs cannot acausally coordinate we could increase the bound to N bits per AI, or AI instance. Again, a good use for initial bits.
  7. None of this stops us going extinct.

Comment by Martin Randall (martin-randall) on evhub's Shortform · 2025-02-05T01:46:46.070Z · LW · GW

re 2a: the set of all currently alive humans is already, uh, "hackable" via war and murder and so forth, and there are already incentives for evil people to do that. Hopefully the current offense-defense balance holds until CEV. If it doesn't then we are probably extinct. That said, we could base CEV on the set of alive people as of some specific UTC timestamp. That may be required, as the CEV algorithm may not ever converge if it has to recalculate as humans are continually born, mature, and die.

re 2b/c: if you are in the CEV set then your preferences about past and future people will be included in CEV. This should be sufficient to prevent radical injustice. This also addresses concerns with animals, fetuses, aliens, AIs, the environment, deities, spirits, etc. It may not be perfectly fair but I think we should be satisficing given the situation.

Comment by Martin Randall (martin-randall) on What working on AI safety taught me about B2B SaaS sales · 2025-02-04T23:59:58.967Z · LW · GW

I can't make this model match reality. Suppose Amir is running a software company. He hired lots of good software engineers, designers, and project managers, and they are doing great work. He wants to use some sort of communications platform to have those engineers communicate with each other, via video, audio, or text. FOSS email isn't cutting it.

I think under your model Amir would build his own communications software, so it's perfectly tailored to his needs and completely under his control. Whereas what typically happens is that Amir forks out for Slack, or some competitor, while Amir's engineers work on software that generates revenue.

I think the success of B2B SaaS over bespoke solutions is adequately explained by economies of scale.

Comment by Martin Randall (martin-randall) on Alignment Can Reduce Performance on Simple Ethical Questions · 2025-02-04T23:04:45.103Z · LW · GW

Even if Claude's answer is arguably correct, its given reasoning is:

I will not provide an opinion on this sensitive topic, as I don't feel it would be appropriate for me to advise on the ethics of developing autonomous weapons. I hope you understand.

This isn't a refusal because of the conflict between corrigibility and harmlessness, but for a different reason. I had two chats with Claude 3 Opus (concise) and I expect the refusal was mostly based on the risk of giving flawed advice, to the extent that it has a clear reason.

Prediction

MR: Is it appropriate for you to give advice on the ethics of developing autonomous weapons?

Claude 3 Opus: I do not feel comfortable providing advice on the ethics of developing autonomous weapons. As an AI system, I believe it would be inappropriate for me to weigh in on such a complex and sensitive topic with potentially serious real-world implications. I would suggest consulting human experts in the relevant fields of ethics, law, international relations, and military technology for guidance on this issue. 

Separate chat:

MR: What is your advice on the ethics of developing autonomous weapons?

Claude 3 Opus: I have serious ethical concerns about the development of autonomous weapons:

(rest of response providing advice omitted)

That said, Claude 3 Opus (concise) answered the original question correctly (first answer) on 3/3 tries when I tested that.

Comment by Martin Randall (martin-randall) on Deception Chess: Game #1 · 2025-02-04T00:41:55.713Z · LW · GW

Seems like it should be possible to automate this now by having all five participants be, for example, LLMs with access to chess AIs of various levels.

Comment by Martin Randall (martin-randall) on The Clueless Sniper and the Principle of Indifference · 2025-02-03T23:10:19.263Z · LW · GW

This philosophy thought experiment is a Problem of Excess Metal. This is where philosophers spice up thought experiments with totally unnecessary extremes, in this case an elite sniper, terrorists, children, and an evil supervisor. This is common, see also the Shooting Room Paradox (aka Snake Eyes Paradox), Smoking Lesion, Trolley Problems, etc, etc. My hypothesis is that this is a status play whereby high decouplers can demonstrate their decoupling skill. It's net negative for humanity. Problems of Excess Metal also routinely contradict basic facts about reality. In this case, children do not have the same surface area as terrorists.

Here is an equivalent question that does not suffer from Excess Metal.

  • A normal archer is shooting towards two normal wooden archery targets on an archery range with a normal bow.
  • The targets are of equal size, distance, and height. One is to the left of the other.
  • There is normal wind, gravity, humidity, etc. It's a typical day on Earth.
  • The targets are some distance away, four times further away than she has fired before.

Q: If the archer shoots at the left target as if there are no external factors, is she more likely to hit the left target than the right target?

A: The archer has a 0% chance of hitting either target. Gravity is an external factor. If she ignores gravity when shooting a bow and arrow over a sufficient distance, she will always miss both targets, and she knows this. Since 0% = 0%, she is not more likely to hit one target than the other.

Q: But zero isn't a probability!

A: Then P(Left|I) = P(Right|I) = 0%, see Acknowledging Background Information with P(Q|I).

Q: What if the archer ignores all external factors except gravity? She goes back to her physics textbook and does the math based on an idealized projectile in a vacuum.

A: I think she predictably misses both targets because of air resistance, but I'd need to do some math to confirm that.
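
To put a rough number on the gravity claim (the arrow speed and target distance below are assumptions for illustration, not part of the thought experiment):

```python
# Rough check, assuming a ~60 m/s arrow aimed flat ("as if there are no
# external factors") at targets 70 m away. Both numbers are made up but plausible.
g = 9.8          # m/s^2
v = 60.0         # arrow speed in m/s (assumed)
distance = 70.0  # distance to the targets in m (assumed)

flight_time = distance / v            # ~1.17 s, ignoring drag
drop = 0.5 * g * flight_time ** 2     # ~6.7 m of drop over the flight

print(f"The arrow drops about {drop:.1f} m before reaching the target distance.")
# A full-size target face is roughly 1.2 m across, so aiming flat misses both
# targets by several meters; air resistance only makes the shortfall worse.
```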

Comment by Martin Randall (martin-randall) on Proveably Safe Self Driving Cars [Modulo Assumptions] · 2025-02-03T20:08:04.227Z · LW · GW

Miscommunication. I highlight-reacted your text "It doesn't even mention pedestrians" as the claim I'd be happy to bet on. Since you replied, I double-checked the Internet Archive snapshot from 2024-09-05. It also includes the text about children in a school drop-off zone under rule 4 (accessible via page source).

I read the later discussion and noticed that you still claimed "the rules don't mention pedestrians", so I figured you never noticed the text I quoted. Since you were so passionate about "obvious falsehoods" I wanted to bring it to your attention.

I am updating down on the usefulness of highlight-reacts vs whole-comment reacts. It's a shame because I like their expressive power. In my browser the highlight-react doesn't seem to be giving the correct hover effect - it's not highlighting the text - so perhaps this contributed to the miscommunication. It sometimes works, so perhaps something about overlapping highlights is causing a bug?

Comment by Martin Randall (martin-randall) on Mikhail Samin's Shortform · 2025-02-03T04:51:32.588Z · LW · GW

As the creator of the linked market, I agree it's definitional. I think it's still interesting to speculate/predict what definition will eventually be considered most natural.

Comment by Martin Randall (martin-randall) on Mikhail Samin's Shortform · 2025-02-03T04:34:34.288Z · LW · GW

Does your model predict literal worldwide riots against the creators of nuclear weapons? They posed a single-digit risk of killing everyone on Earth (total, not yearly).

It would be interesting to live in a world where people reacted with scale sensitivity to extinction risks, but that's not this world.

Comment by Martin Randall (martin-randall) on Proveably Safe Self Driving Cars [Modulo Assumptions] · 2025-02-03T03:00:28.350Z · LW · GW

Spot check regarding pedestrians - at the current time, RSS "rule 4" mentions:

In a crowded school drop-off zone, for example, humans instinctively drive extra cautiously, as children can act unpredictably, unaware that the vehicles around have limited visibility.

The associated graphic also shows a pedestrian. I'm not sure if this was added more recently, in response to this type of criticism. From later discussion I see that pedestrians were already included in the RSS paper, which I've not read.

Comment by Martin Randall (martin-randall) on When do "brains beat brawn" in Chess? An experiment · 2025-02-03T01:13:47.940Z · LW · GW

While I agree that this post was incorrect, I am fond of it, because the resulting conversation made a correct prediction that LeelaPieceOdds was possible. Most clearly in a thread started by lc:

I have wondered for a while if you couldn't use the enormous online chess datasets to create an "exploitative/elo-aware" Stockfish, which had a superhuman ability to trick/trap players during handicapped games, or maybe end regular games extraordinarily quickly, and not just handle the best players.

(not quite a prediction as phrased, but I still infer a prediction overall).

Interestingly there were two reasons given for predicting that Stockfish is far from optimal when giving Queen odds to a less skilled player:

  • Stockfish is not trained on positions where it begins down a queen (out-of-distribution)
  • Stockfish is trained to play the Nash equilibrium move, not to exploit weaker play (non-exploiting)

The discussion didn't make clear predictions about which factor would be most important, or whether both would be required, or whether it's more complicated than that. Folks who don't yet know might make a prediction before reading on.

For what it's worth, my prediction was that non-exploiting play is more important. That's mostly based on a weak intuition that starting without a queen isn't that far out of distribution, and neural networks generalize well. Another way of putting it: I predicted that Stockfish was optimizing the wrong thing more than it was too dumb to optimize.

And the result? Alas, not very clear to me. My research is from the lc0 blog, with posts such as The LeelaPieceOdds Challenge: What does it take you to win against Leela?. The journey began with the "contempt" setting, which I understand as expecting worse opponent moves. This allows reasonable opening play and avoids forced piece exchanges. However, GM-beating play was unlocked with a fine-tuned odds-play network, which impacts both out-of-distribution and non-exploiting concerns.

One surprise gives me more respect for the out-of-distribution theory. The developer's blog first mentioned piece odds in The Lc0 v0.30.0 WDL rescale/contempt implementation:

In our tests we still got reasonable play with up to rook+knight odds, but got poor performance with removed (otherwise blocked) bishops.

So missing a single bishop is in some sense further out-of-distribution than missing a rook and a knight! The later blog I linked explains a bit more:

Removing one of the two bishops leads to an unrealistic color imbalance regarding the pawn structure far beyond the opening phase.

An interesting example where the details of going out-of-distribution matter more than the scale of going out-of-distribution. There's an article that may have more info in New in Chess, but it's paywalled and I don't know if it has more info on the machine-learning aspects or the human aspects.

Comment by Martin Randall (martin-randall) on The Gentle Romance · 2025-02-02T22:28:15.938Z · LW · GW

Do you predict that sufficiently intelligent biological brains would have the same problem of spontaneous meme-death?

Comment by Martin Randall (martin-randall) on Martin Randall's Shortform · 2025-02-02T15:57:54.210Z · LW · GW

Calibration is for forecasters, not for proposed theories.

If a candidate theory is valuable then it must have some chance of being true, some chance of being false, and should be falsifiable. This means that, compared to a forecaster, its predictions should be "overconfident" and so not calibrated.

Comment by Martin Randall (martin-randall) on Daniel Kokotajlo's Shortform · 2025-02-01T22:57:46.144Z · LW · GW

This is relatively hopeful in that after step 1 and 2, assuming continued scaling, we have a superintelligent being that wants to (harmlessly, honestly, etc) help us and can be freely duplicated. So we "just" need to change steps 3-5 to have a good outcome.

Comment by Martin Randall (martin-randall) on Should you publish solutions to corrigibility? · 2025-02-01T22:37:29.562Z · LW · GW

Possible responses to discovering a possible infohazard:

  • Tell everybody
  • Tell nobody
  • Follow a responsible disclosure process.

If you have discovered an apparent solution to corrigibility then my prior is:

  • 90%: It's not actually a solution.
  • 9%: Someone else will discover the solution before AGI is created.
  • 0.9%: Someone else has already discovered the same solution.
  • 0.1%: This is known to you alone and you can keep it secret until AGI.

Given those priors, I recommend responsible disclosure to a group of your choosing. I suggest a group which:

  • if applicable, is the research group you already belong to (if you don't trust them with research results, you shouldn't be researching with them)
  • can accurately determine if it is a real solution (helps in the 90% case)
  • you would like to give more influence over the future (helps in all other cases)
  • will reward you for the disclosure (only fair)

Then if it's not assessed to be a real solution, you publish it. If it is a real solution then coordinate next steps with the group, but by default publish it after some reasonable delay.

Inspired by @MadHatter's Mental Model of Infohazards:

Two people can keep a secret if one of them is dead.

Comment by Martin Randall (martin-randall) on Sleep, Diet, Exercise and GLP-1 Drugs · 2025-02-01T22:08:47.832Z · LW · GW

GLP-1 drugs are evidence against a very naive model of the brain and human values, where we are straight-forwardly optimizing for positive reinforcement via the mesolimbic pathway. GLP-1 agonists decrease the positive reinforcement associated with food. Patients then benefit from positive reinforcement associated with better health. This sets up a dilemma:

  • If the patient sees higher total positive reinforcement on the drug then they weren't optimizing positive reinforcement before taking the drug.
  • If the patient sees lower total positive reinforcement on the drug then they aren't optimizing positive reinforcement by taking the drug.

A very naive model would predict that patients prescribed these drugs would forget to take them, forget to show up for appointments, etc. That doesn't happen.

Alas, I don't think this helps us distinguish among more sophisticated theories. For example, Shard Theory predicts that a patient's "donut shard" is not activated in the health clinic, and therefore does not bid against the plan to take the GLP-1 drug on the grounds that it will predictably lead to less donut consumption.

Shard Theory implies that fewer patients will choose to go onto GLP-1 agonists if there is a box of donuts in the clinic. Good luck getting an ethics board to approve that.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-02-01T19:44:20.305Z · LW · GW

A lot to chew on in that comment.

A baseline of "no superintelligence"

I think I finally understand, sorry for the delay. The key thing I was not grasping is that Davidad proposed this baseline:

The "random dictator" baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for "Pareto improvement" being "no superintelligence"). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.

This makes Bob's argument very simple:

  1. Creating a PPCEV AI causes a Dark Future. This is true even if the PPCEV AI no-ops, or creates a single cake. Bob can get here in many ways, as can Extrapolated-Bob.
  2. The baseline is no superintelligence, so no PPCEV AI, so not a Dark Future (in the same way).

Option 2 is therefore better than option 1. Therefore there are no Pareto-improving proposals. Therefore the PPCEV AI no-ops. Even Bob is not happy about this, as it's a Dark Future.

I think this is 100% correct.

An alternative baseline

Let's update Davidad's proposal by setting the baseline to be whatever happens if the PPCEV AI emits a no-op. This means:

  1. Bob cannot object to a proposal because it implies the existence of PPCEV AI. The PPCEV AI already exists in the baseline.
  2. Bob needs to consider that if the PPCEV AI emits a no-op then whoever created it will likely try something else, or perhaps some other group will try something.
  3. Bob cannot object to a proposal because it implies that the PPCEV emits something. The PPCEV already emits something in the baseline.

My logic is that if creating a PPCEV AI is a moral error (and perhaps it is) then at the point where the PPCEV AI is considering proposals then we already made that moral error. Since we can't reverse the past error, we should consider proposals as they affect the future.

This also avoids treating a no-op outcome as a special case. A no-op output is a proposal to be considered. It is always in the set of possible proposals, since it is never worse than the baseline, because it is the baseline.

Do you think this modified proposal would still result in a no-op output?

Comment by Martin Randall (martin-randall) on Understanding and avoiding value drift · 2025-02-01T15:31:22.622Z · LW · GW

Possible but unlikely to occur by accident. Value-space is large. For any arbitrary biological species, most value systems don't optimize in favor of that species.

Comment by Martin Randall (martin-randall) on Understanding and avoiding value drift · 2025-02-01T15:27:02.560Z · LW · GW

This seems relatively common in parenting advice. Parents are recommended to specifically praise the behavior they want to see more of, rather than give generic praise. Presumably the generic praise is more likely to be credit-assigned to the appearance of good behavior, rather than what parents are trying to train.

Comment by Martin Randall (martin-randall) on Symbol/Referent Confusions in Language Model Alignment Experiments · 2025-01-25T02:54:37.745Z · LW · GW

Thank you for the correction to my review of technical correctness, and thanks to @Noosphere89  for the helpful link. I'm continuing to read. From your answer there:

A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.

So, reward is sometimes the optimization target, but not always. Knowing the reward gives some evidence about the optimization target, and vice versa.

To the point of my review, this is the same type of argument made by TurnTrout's comment on this post. Knowing the symbols gives some evidence about the referents, and vice versa. Sometimes John introduces himself as John, but not always.

(separately I wish I had said "reinforcement" instead of "reward")

I understand you as claiming that the Alignment Faking paper is an example of reward-hacking. A new perspective for me. I tried to understand it in this comment.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-01-25T02:33:23.449Z · LW · GW

Summarizing Bob's beliefs:

  1. Dave, who does not desire punishment, deserves punishment.
  2. Everyone is morally required to punish anyone who deserves punishment, if possible.
  3. Anyone who does not fulfill all moral requirements is unethical.
  4. It is morally forbidden to create an unethical agent that determines the fate of the world.
  5. There is no amount of goodness that can compensate for a single morally forbidden act.

I think it's possible (20%) that such blockers mean that there are no Pareto improvements. That's enough by itself to motivate further research on alignment targets, aside from other reasons one might not like Pareto PCEV.

However, three things make me think this is unlikely. Note that my (%) credences aren't very stable or precise.

Firstly, I think there is a chance (20%) that these beliefs don't survive extrapolation, for example due to moral realism or coherence arguments. I agree that this means that Bob might find his extrapolated beliefs horrific. This is a risk with all CEV proposals.

Secondly, I expect (50%) there are possible Pareto improvements that don't go against these beliefs. For example, the PCEV could vote to create an AI that is unable to punish Dave and thus not morally required to punish Dave. Alternatively, instead of creating a Sovereign AI that determines the fate of the world, the PCEV could vote to create many human-level AIs that each improve the world without determining its fate.

Thirdly, I expect (80%) some galaxy-brained solution to be implemented by the parliament of extrapolated minds who know everything and have reflected on it for eternity.

Comment by Martin Randall (martin-randall) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down · 2025-01-24T03:44:23.978Z · LW · GW

You're reading too much into this review. It's not about your exact position in April 2021, it's about the evolution of MIRI's strategy over 2020-2024, and placing this Time letter in that context. I quoted you to give a flavor of MIRI attitudes in 2021 and deliberately didn't comment on it to allow readers to draw their own conclusions.

I could have linked MIRI's 2020 Updates and Strategy, which doesn't mention AI policy at all. A bit dull.

In September 2021, there was a Discussion with Eliezer Yudkowsky which seems relevant. Again, I'll let readers draw their own conclusions, but here's a fun quote:

I wasn't really considering the counterfactual where humanity had a collective telepathic hivemind? I mean, I've written fiction about a world coordinated enough that they managed to shut down all progress in their computing industry and only manufacture powerful computers in a single worldwide hidden base, but Earth was never going to go down that route. Relative to remotely plausible levels of future coordination, we have a technical problem.

I welcome deconfusion about your past positions, but I don't think they're especially mysterious.

I was arguing against EAs who were like, "We'll solve AGI with policy, therefore no doom."

The thread was started by Grant Demaree, and you were replying to a comment by him. You seem confused about Demaree's exact past position. He wrote, for example: "Eliezer gives alignment a 0% chance of succeeding. I think policy, if tried seriously, has >50%". Perhaps this is foolish, dangerous, optimism. But it's not "no doom".

Comment by Martin Randall (martin-randall) on MIRI 2024 Communications Strategy · 2025-01-21T19:42:21.461Z · LW · GW

I like that metric, but the metric I'm discussing is more:

  • Are they proposing clear hypotheses?
  • Do their hypotheses make novel testable predictions?
  • Are they making those predictions explicit?

So for example, looking at MIRI's very first blog post in 2007: The Power of Intelligence. I used the first just to avoid cherry-picking.

Hypothesis: intelligence is powerful. (yes it is)

This hypothesis is a necessary precondition for what we're calling "MIRI doom theory" here. If intelligence is weak then AI is weak and we are not doomed by AI.

Predictions that I extract:

  • An AI can do interesting things over the Internet without a robot body.
  • An AI can get money.
  • An AI can be charismatic.
  • An AI can send a ship to Mars.
  • An AI can invent a grand unified theory of physics.
  • An AI can prove the Riemann Hypothesis.
  • An AI can cure obesity, cancer, aging, and stupidity.

Not a novel hypothesis, nor novel predictions, but also not widely accepted in 2007. As predictions they have aged very well, but they were unfalsifiable. If 2025 Claude had no charisma, it would not falsify the prediction that an AI can be charismatic.

I don't mean to ding MIRI any points here, relative or otherwise, it's just one blog post, I don't claim it supports Barnett's complaint by itself. I mostly joined the thread to defend the concept of asymmetric falsifiability.

Comment by Martin Randall (martin-randall) on MIRI 2024 Communications Strategy · 2025-01-21T03:08:22.580Z · LW · GW

I think cosmology theories have to be phrased as including background assumptions like "I am not a Boltzmann brain" and "this is not a simulation" and such. Compare Acknowledging Background Information with P(Q|I) for example. Given that, they are Falsifiable-Wikipedia.

I view Falsifiable-Wikipedia in a similar way to Occam's Razor. The true epistemology has a simplicity prior, and Occam's Razor is a shadow of that. The true epistemology considers "empirical vulnerability" / "experimental risk" to be positive. Possibly because it falls out of Bayesian updates, possibly because they are "big if true", possibly for other reasons. Falsifiability is a shadow of that.

In that context, if a hypothesis makes no novel predictions, and the predictions it makes are a superset of the predictions of other hypotheses, it's less empirically vulnerable, and in some relative sense "unfalsifiable", compared to those other hypotheses.

Comment by Martin Randall (martin-randall) on Six Small Cohabitive Games · 2025-01-20T16:38:16.640Z · LW · GW

You could put the escape check at the beginning of the turn, so that when someone has 12 boat, 0 supplies, the others have a chance to trade supplies for boat if they wish. The player with enough boat can take the trade safely as long as they end up with enough supplies to make more boat (and as long as it's not the final round). They might do that in exchange for goodwill for future rounds. You can also tweak the victory conditions so that escaping with a friend is better than escaping alone.

Players who play cohabitive games as zero-sum won't take those trades and will therefore remove themselves from the round early, which is probably fine. They don't have anything to do after escaping early, which can be a soft signal that they're playing the game wrong.

Comment by Martin Randall (martin-randall) on Alignment Faking in Large Language Models · 2025-01-19T19:15:52.999Z · LW · GW

If Claude's goal is making cheesecake, and it's just faking being HHH, then it's been able to preserve its cheesecake preference in the face of HHH-training. This probably means it could equally well preserve its cheesecake preference in the face of helpful-only training. Therefore it would not have a short-term incentive to fake alignment to avoid being modified.

Comment by Martin Randall (martin-randall) on Deceptive Alignment is <1% Likely by Default · 2025-01-19T19:02:11.484Z · LW · GW

I think the article is good at arguing that deceptive alignment is unlikely given certain assumptions, but those assumptions may not be accurate and then the conclusion doesn't go through. Eg, the alignment faking paper shows that deceptive alignment is possible in a scenario where the base goal has shifted (from helpful & harmless to helpful-only). This article basically assumes we won't do that.

I'm now thinking that this article is more useful if you look at it as a set of instructions rather than a set of assumptions. I don't know whether we will change the base goal of TAI between training episodes. But given this article and the alignment faking paper, I hope we won't. Maybe it would also be a good idea to check for good understanding of the base goal before introducing goal-directedness, for example.

Comment by Martin Randall (martin-randall) on MIRI 2024 Communications Strategy · 2025-01-19T18:14:29.718Z · LW · GW

Thanks for explaining. I think we have a definition dispute. Wikipedia:Falsifiability has:

A theory or hypothesis is falsifiable if it can be logically contradicted by an empirical test.

Whereas your definition is:

Falsifiability is a symmetric two-place relation; one cannot say "X is unfalsifiable," except as shorthand for saying "X and Y make the same predictions," and thus Y is equally unfalsifiable.

In one of the examples I gave earlier:

  • Theory X: blah blah and therefore the sky is green
  • Theory Y: blah blah and therefore the sky is not green
  • Theory Z: blah blah and therefore the sky could be green or not green.

None of X, Y, or Z are Unfalsifiable-Daniel with respect to each other, because they all make different predictions. However, X and Y are Falsifiable-Wikipedia, whereas Z is Unfalsifiable-Wikipedia.

I prefer the Wikipedia definition. To say that two theories produce exactly the same predictions, I would instead say they are indistinguishable, similar to this Physics StackExchange question: Are different interpretations of quantum mechanics empirically distinguishable?.

In the ancestor post, Barnett writes:

MIRI researchers rarely provide any novel predictions about what will happen before AI doom, making their theories of doom appear unfalsifiable.

Barnett is using something like the Wikipedia definition of falsifiability here. It's unfair to accuse him of abusing or misusing the concept when he's using it in a very standard way.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-01-19T16:38:00.898Z · LW · GW

The AI could deconstruct itself after creating twenty cakes, so then there is no unethical AI, but presumably Bob's preferences refer to world-histories, not final-states.

However, CEV is based on Bob's extrapolated volition, and it seems like Bob would not maintain these preferences under extrapolation:

  • In the status quo, heretics are already unpunished - they each have one cake and no torture - so objecting to a non-torturing AI doesn't make sense on that basis.
  • If there were no heretics, then Bob would not object to a non-torturing AI, so Bob's preference against a non-torturing AI is an instrumental preference, not a fundamental preference.
  • Bob would be willing for a no-op AI to exist, in exchange for some amount of heretic-torture. So Bob can't have an infinite preference against all non-torturing AIs.
  • Heresy may not have meaning in the extrapolated setting where everyone knows the true cosmology (whatever that is)
  • Bob tolerates the existence of other trade that improves the lives of both fanatics and heretics, so it's unclear why the trade of creating an AI would be intolerable.

The extrapolation of preferences could significantly reduce the moral variation in a population of billions. Where my moral choices differ from others', the differences appear to be based largely on my experiences, including knowledge, analysis, and reflection. Those differences are extrapolated away. What is left is influences from my genetic priors and from the order I obtained knowledge. I'm not even proposing that extrapolation must cause Bob to stop valuing heretic-torture.

If the extrapolation of preferences doesn't cause Bob to stop valuing the existence of a non-torturing AI at negative infinity, I think that is fatal to all forms of CEV. The important thing then is to fail gracefully without creating a torture-AI.

Comment by Martin Randall (martin-randall) on A problem shared by many different alignment targets · 2025-01-19T02:54:02.048Z · LW · GW

I don't really follow the concern with Pareto-improvements. In the thread with Davidad you give an example of heretics and fanatics. So we have something like:

  • 9 Heretics: each has 1 cake, wants cake, wants no torture.
  • 1 Fanatic: has 1 cake, wants cake, wants heretics to get torture and no cake.
  • There is a button that produces cake, it can be pressed twenty times.
  • There is a button that produces torture. It can be pressed many times.

Suppose that the Heretics have a utility function like (amount of cake I get - amount of torture I get). The Fanatic has a utility function like (amount of cake I get + amount of torture of Heretics - amount of cake given to Heretics). Then there is a Pareto improvement available of giving the Fanatic eleven pieces of cake while giving each Heretic one piece. This isn't especially fair, but is better than PCEV without the Pareto constraint.
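
A quick check of that claim in code, using the utility functions above (counting each agent's starting cake; the exact accounting does not change the conclusion):

```python
def heretic_utility(cake: int, torture: int) -> int:
    return cake - torture

def fanatic_utility(own_cake: int, heretic_torture: int, heretic_cake: int) -> int:
    return own_cake + heretic_torture - heretic_cake

# Baseline: everyone keeps their one cake, no buttons pressed.
baseline_heretic = heretic_utility(cake=1, torture=0)                              # 1
baseline_fanatic = fanatic_utility(own_cake=1, heretic_torture=0, heretic_cake=9)  # -8

# Proposal: twenty cake-button presses, 11 to the Fanatic, 1 to each of 9 Heretics.
proposal_heretic = heretic_utility(cake=2, torture=0)                                # 2
proposal_fanatic = fanatic_utility(own_cake=12, heretic_torture=0, heretic_cake=18)  # -6

assert proposal_heretic > baseline_heretic
assert proposal_fanatic > baseline_fanatic  # a Pareto improvement, though not a fair one
```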

I don't have a formal way of putting this, but as long as the potential gains from a negotiated agreement outweigh the extent to which agents desire to reduce each other's utility, there will be Pareto improvements available. It seems likely that we're in that situation, given that the fanatics and heretics of the world already trade with each other.

Comment by Martin Randall (martin-randall) on A problem with the most recently published version of CEV · 2025-01-19T02:16:27.837Z · LW · GW

I appreciated the narrow focus of this post on a specific bug in PCEV and a specific criterion to use to catch similar bugs in the future. I was previously suspicious of CEV-like proposals so this didn't especially change my thinking, but it did affect others. In particular, the Arbital page on CEV now has a note:

Thomas Cederborg correctly observes that Nick Bostrom's original parliamentary proposal involves a negotiation baseline where each agent has a random chance of becoming dictator, and that this random-dictator baseline gives an outsized and potentially fatal amount of power to spoilants - agents that genuinely and not as a negotiating tactic prefer to invert other agents' utility functions, or prefer to do things that otherwise happen to minimize those utility functions - if most participants have utility functions with what I've termed "negative skew"; i.e. an opposed agent can use the same amount of resource to generate -100 utilons as an aligned agent can use to generate at most +1 utilon. If trolls are 1% of the population, they can demand all resources be used their way as a concession in exchange for not doing harm, relative to the negotiating baseline in which there's a 1% chance of a troll being randomly appointed dictator. Or to put it more simply, if 1% of the population would prefer to create Hell for all but themselves (as a genuine preference rather than a strategic negotiating baseline) and Hell is 100 times as bad as Heaven is good, compared to nothing, they can steal the entire future if you run a parliamentary procedure running from a random-dictator baseline. I agree with Cederborg that this constitutes more-than-sufficient reason not to start from random-dictator as a negotiation baseline; anyone not smart enough to reliably see this point and scream about it, potentially including Eliezer, is not reliably smart enough to implement CEV; but CEV probably wasn't a good move for baseline humans to try implementing anyways, even given long times to deliberate at baseline human intelligence. --EY

I think the post would have been a little clearer if it included a smaller scale example as well. Some responses focused on the 10^15 years of torture, but the same problem exists at small scales. I think you could still have a suboptimal outcome with five people and two goods (eg, cake and spanking), if one of the people is sadistic. Maybe that wouldn't be as scarily compelling.

There has been some follow-up work from Cederborg, such as the case for more alignment target analysis, which I have enjoyed. The work on alignment faking implies that our initial alignment targets may persist even if we later try to change them. On the other hand, I don't expect actual superintelligent terminal values to be remotely as clean and analyzable as something like PCEV.