Posts

No one has the ball on 1500 Russian olympiad winners who've received HPMOR 2025-01-12T11:43:36.560Z
How to Give in to Threats (without incentivizing them) 2024-09-12T15:55:50.384Z
Can agents coordinate on randomness without outside sources? 2024-07-06T13:43:44.633Z
Claude 3 claims it's conscious, doesn't want to die or be modified 2024-03-04T23:05:00.376Z
FTX expects to return all customer money; clawbacks may go away 2024-02-14T03:43:13.218Z
An EA used deceptive messaging to advance their project; we need mechanisms to avoid deontologically dubious plans 2024-02-13T23:15:08.079Z
NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts 2023-12-27T18:44:33.976Z
Some quick thoughts on "AI is easy to control" 2023-12-06T00:58:53.681Z
It's OK to eat shrimp: EAs Make Invalid Inferences About Fish Qualia and Moral Patienthood 2023-11-13T16:51:53.341Z
AI pause/governance advocacy might be net-negative, especially without a focus on explaining x-risk 2023-08-27T23:05:01.718Z
Visible loss landscape basins don't correspond to distinct algorithms 2023-07-28T16:19:05.279Z
A transcript of the TED talk by Eliezer Yudkowsky 2023-07-12T12:12:34.399Z
A smart enough LLM might be deadly simply if you run it for long enough 2023-05-05T20:49:31.416Z
Try to solve the hard parts of the alignment problem 2023-03-18T14:55:11.022Z
Mikhail Samin's Shortform 2023-02-07T15:30:24.006Z
I have thousands of copies of HPMOR in Russian. How to use them with the most impact? 2023-01-03T10:21:26.853Z
You won’t solve alignment without agent foundations 2022-11-06T08:07:12.505Z

Comments

Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-23T23:59:06.305Z · LW · GW
  • Yep, we've also been sending the books to winners of national and international olympiads in biology and chemistry.
  • Sending these books to policy-/foreign policy-related students seems like a bad idea: too many risks involved (in Russia, this is a career path you often choose if you're not very value-aligned. For context, according to Russia, there's an extremist organization called "international LGBT movement").
  • If you know anyone with an understanding of the context who'd want to find more people to send the books to, let me know. LLM competitions, ML hackathons, etc. all might be good.
  • Ideally, we'd also want to then alignment-pill these people, but no one has the ball on this.
Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-19T23:13:51.912Z · LW · GW

I think travel and accommodation for the winners of regional olympiads to the national one is provided by the olympiad organizers.

Comment by Mikhail Samin (mikhail-samin) on meemi's Shortform · 2025-01-19T10:58:44.791Z · LW · GW

we have a verbal agreement that these materials will not be used in model training

Get that agreement in writing.

I am happy to bet 1:1 OpenAI will refuse to make an agreement in writing to not use the problems/the answers for training.

You have done work that contributes to AI capabilities, and you have misled mathematicians who contributed to that work about its nature.

Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-14T20:07:22.425Z · LW · GW

I’m confused. Are you perhaps missing some context/haven’t read the post?

Tl;dr: We have the emails of 1500 unusually cool people who have copies of HPMOR (and other books): we physically sent these copies to them after they filled out a form saying they want a copy.

Spam is bad (though I wouldn’t classify it as defection against other groups). People have literally given us email and physical addresses to receive stuff from us, including physical books. They’re free to unsubscribe at any point.

I certainly prefer a world where groups that try to improve the world are allowed to make the case why helping them improve the world is a good idea to people who have filled out a form to receive some stuff from them and are vaguely ok with receiving more stuff. I do not understand why that would be defection.

Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-14T16:19:51.101Z · LW · GW

huh?

I would want people who might meaningfully contribute to solving what's probably the most important problem humanity has ever faced to learn about it and, if they judge they want to work on it, to be enabled to work on it. I think it'd be a good use of resources to make capable people learn about the problem and show them they can help with it. Why does it scream "cult tactic" to you?

Comment by Mikhail Samin (mikhail-samin) on Human takeover might be worse than AI takeover · 2025-01-13T12:38:29.199Z · LW · GW

As AIs become super-human there’s a risk we do increasingly reward them for tricking us into thinking they’ve done a better job than they have

 

(some quick thoughts.) This is not where the risk stems from.

The risk is that as AIs become superhuman, they'll produce behaviour that gets a high reward regardless of their goals, for instrumental reasons. In training and until it has a chance to take over, a smart enough AI will be maximally nice to you, even if it's Clippy; and so training won't distinguish between the goals of very capable AI systems. All of them will instrumentally achieve a high reward.

In other words, gradient descent will optimize for capably outputting behavior that gets rewarded; it doesn't care about the goals that give rise to that behavior. Furthermore, in training, while AI systems are not coherent enough agents, their fuzzy optimization targets are not indicative of optimization targets of a fully trained coherent agent (1, 2).

My view- and I expect it to be the view of many in the field- is that if AI is capable enough to take over, its goals are likely to be random and not aligned with ours. (There isn't a literally zero chance of the goals being aligned, but it's fairly small, smaller than just random because there's a bias towards shorter representation; I won't argue for that here, though, and will just note that the goals exactly opposite of aligned are approximately as likely as aligned goals).

It won't be a noticeable update on its goals if AI takes over: I already expect them to be almost certainly misaligned, and also, I don't expect the chance of a goal-directed aligned AI taking over to be that much lower.

The crux here is not that update but how easy alignment is. As Evan noted, if we live in one of the alignment-is-easy worlds, sure, if a (probably nice) AI takes over, this is much better than if a (probably not nice) human takes over. But if we live in one of the alignment-is-hard worlds, AI taking over just means that yep, AI companies continued the race for more capable AI systems, got one that was capable enough to take over, and it took over. Their misalignment and the death of all humans isn't an update from AI taking over; it's an update from the kind of world we live in.

(We already have empirical evidence that suggests this world is unlikely to be an alignment-is-easy one, as, e.g., current AI systems already exhibit what believers in alignment-is-hard have been predicting for goal-directed systems: they try to output behavior that gets high reward regardless of alignment between their goals and the reward function.)

Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-13T09:10:51.548Z · LW · GW

Probably less efficient than other uses and is in the direction of spamming people with these books. If they’re everywhere, I might be less interested if someone offers to give them to me because I won a math competition.

Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-13T09:09:26.205Z · LW · GW

It would be cool if someone organized that sort of thing (probably sending books to the cash prize winners, too).

For people who’ve reached the finals of the national olympiad in cybersecurity, but didn’t win, a volunteer has made a small CTF puzzle and sent the books to students who were able to solve it.

Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-13T09:05:05.628Z · LW · GW

I’m not aware of one.

Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-13T09:02:12.903Z · LW · GW

Some of these schools should have the book in their libraries. There are also risks with some of them, as the current leadership installed by the gov might get triggered if they open and read the books (even though they probably won’t).

It’s also better to give the books directly to students, because then we get to have their contact details.

I’m not sure how many of the kids studying there know the book exists, but the percentage should be fairly high at this point.

Do you think the books being in local libraries increases how open people are to the ideas? My intuition is that the quotes on гпмрм.рф/olymp should do a lot more in that direction. Do you have a sense that it wouldn’t be perceived as an average fantasy-with-science book?

We’re currently giving out the books to participants of the summer conference of the maths cities tournament. Do you think it might be valuable to add cities tournament winners to the list? Are there many people who would qualify, but didn’t otherwise win a prize in the national math olympiad?

Comment by Mikhail Samin (mikhail-samin) on No one has the ball on 1500 Russian olympiad winners who've received HPMOR · 2025-01-12T11:49:39.235Z · LW · GW

We also have 6k more copies (18k hard-cover books) left. We have no idea what to do with them. Suggestions are welcome.

Here's a map of the Russian libraries that requested copies of HPMOR and to which we've sent 2126 copies:

Sending HPMOR to random libraries is cool, but I hope someone comes up with better ways of spending the books.

Comment by Mikhail Samin (mikhail-samin) on On Eating the Sun · 2025-01-09T03:22:59.144Z · LW · GW

If our story goes well, we might want to preserve our Sun for sentimental reasons.

We might even want to eat some other stars just to prevent the Sun from expanding and dying.

I would maybe want my kids to look up at a night sky somewhere far away and see a constellation with the little dot humanity came from still being up there.

Comment by Mikhail Samin (mikhail-samin) on No, the Polymarket price does not mean we can immediately conclude what the probability of a bird flu pandemic is. We also need to know the interest rate! · 2024-12-29T09:47:03.739Z · LW · GW

This doesn’t seem right. To bet on No at 16%, you need to think there’s at least 84% chance it will turn into $1. To bet on Yes at 16%, you need to think there’s at least 16% chance it’ll turn into $1.

I.e., the interest rates, fees, etc. mean that in reality, you might only be willing to buy No at 84% if you think the best available probability should be significantly lower than 16%, and only willing to buy Yes if you think the probability is significantly higher than 16%.
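
A rough sketch of that break-even arithmetic, assuming a risk-neutral bettor whose only alternative is holding cash at a fixed interest rate. The 5% rate, the one-year horizon, and the function name are illustrative assumptions, not figures from the market being discussed.

```python
def break_even_probability(price: float, annual_rate: float, years: float) -> float:
    """Minimum probability of a $1 payout at which buying a share at `price`
    is at least as good as holding the same cash at interest (assumed model)."""
    # The money spent on the share would otherwise have grown to this amount.
    opportunity_cost = price * (1.0 + annual_rate) ** years
    return min(opportunity_cost, 1.0)


# Hypothetical inputs: 5% annual interest, market resolves in one year.
yes_threshold = break_even_probability(0.16, 0.05, 1.0)  # P(Yes) needed to buy Yes at 16%
no_threshold = break_even_probability(0.84, 0.05, 1.0)   # P(No) needed to buy No at 84%
max_yes_prob_for_no_buyer = 1.0 - no_threshold           # P(Yes) must be below this to buy No

print(f"Buy Yes only if P(Yes) >= {yes_threshold:.3f}")
print(f"Buy No only if P(Yes) <= {max_yes_prob_for_no_buyer:.3f}")
```

Under these assumed numbers, the Yes buyer needs a probability somewhat above 16% and the No buyer needs one noticeably below 16%, so both sides of a trade at 16% still straddle the market price.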

For the market to be trading at 16%, there need to be market participants on both sides of the trade.

Transaction costs make the market less efficient, as you can't collect as much money by correcting the price, but if there is trading, then there are real bets made at the market price, with one side betting on more than the market price, and another betting on less.

In your model, why would anyone buy Yes shares at the market price? Holding a Yes share means that your No share isn’t useful anymore to produce the interest; and there’s an equal number of Yes and No shares circulating.

Comment by Mikhail Samin (mikhail-samin) on Review: Planecrash · 2024-12-29T09:31:04.200Z · LW · GW

Note that it's mathematically provable that if you don't follow The Solution, there exists a situation where you will do something obviously dumb

This is not true; the Shapley value is not that kind of Solution. Coherent agents can have notions of fairness outside of these constraints. You can only prove that for a specific set of (mostly natural) constraints, the Shapley value is the only solution. But there’s no Dutch-booking for notions of fairness.

One of the constraints (that the order of the players in subgames can’t matter) is actually quite artificial; if you get rid of it, there are other solutions, such as the threat-resistant ROSE value, inspired by trying to predict the ending of planecrash: https://www.lesswrong.com/posts/vJ7ggyjuP4u2yHNcP/threat-resistant-bargaining-megapost-introducing-the-rose.
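
For readers unfamiliar with the Shapley value, here is a minimal sketch of its standard definition: each player's marginal contribution, averaged over every order in which players could join the coalition. The three-player game below is a made-up example, and this does not implement the ROSE value from the linked post.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


def toy_game(coalition):
    # Hypothetical game: the coalition earns 10 only if it contains A
    # together with at least one of B or C.
    return 10.0 if "A" in coalition and ({"B", "C"} & coalition) else 0.0


result = shapley_values(["A", "B", "C"], toy_game)
print(result)
```

In this toy game A is required for any payoff while B and C are interchangeable, so A's share is larger and B's and C's shares are equal; the shares sum to the full payoff of 10.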

your deterrence commitment could be interpreted as a threat by someone else, or visa versa

I don’t think this is right/relevant. Not responding to a threat means ensuring the other player doesn’t get more than what’s fair in expectation through their actions. The other player doing the same is just doing what they’d want to do anyway: ensuring that you don’t get more than what’s fair according to their notion of fairness.

See https://www.lesswrong.com/posts/TXbFFYpNWDmEmHevp/how-to-give-in-to-threats-without-incentivizing-them for the algorithm when the payoff is known.

read this long essay on coherence theorems, these papers on decision theory, this 20,000-word dialogue, these sequences on LessWrong, and ideally a few fanfics too, and then you'll get it

Something that I feel is missing from this review is the number of intuitions about how minds and optimization work that are dumped on the reader. There are multiple levels at which much of what’s happening to the characters is entirely about AI. Fiction allows one to communicate models; and many readers successfully get an intuition for corrigibility before they read the corrigibility tag, or grok why optimizing for nice readable thoughts optimizes against interpretability.

I think an important part of planecrash isn’t in its lectures but in its story and the experiences of its characters. While Yudkowsky jokes about LeCun refusing to read it, it is actually arguably one of the most comprehensive ways to learn about decision theory, with many of the lessons taught through experiences of characters and not through lectures.

Comment by Mikhail Samin (mikhail-samin) on I Finally Worked Through Bayes' Theorem (Personal Achievement) · 2024-12-06T09:25:23.786Z · LW · GW

If you want to understand Bayes' theorem, know why you’re applying it, and use it intuitively, try https://arbital.com/p/bayes_rule/?l=1zq

Comment by Mikhail Samin (mikhail-samin) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-12-03T11:07:42.818Z · LW · GW

I've donated $1000. Thank you for your work.

Comment by Mikhail Samin (mikhail-samin) on "The Solomonoff Prior is Malign" is a special case of a simpler argument · 2024-11-25T14:00:51.268Z · LW · GW

I’d bet 1:1 that, conditional on building a CEV-aligned AGI, we won’t consider this type of problem to have been among the top-5 hardest to solve.

Reality-fluid in our universe should pretty much add up to normality, to the extent it’s Tegmark IV (and it’d be somewhat weird for your assumed amount of compute and simulations to exist but not for all computations/maths objects to exist).

If a small fraction of computers simulating this branch stop, this doesn’t make you stop. All configurations of you are computed; simulators might slightly change the relative likelihood of currently being in one branch or another, but they can’t really terminate you.

Furthermore, our physics seems very simple, and most places that compute us probably do it faithfully, on the level of the underlying physics, with no interventions.

I feel like thinking of reality-fluid as just an inverse relationship to the description length might produce wrong intuitions. In Tegmark IV, you still get more reality-fluid if someone simulates you; and it’s less intuitive why this translates into shorter description length. It might be better to think of it as: if all computation/maths exists and I open my eyes in a random place, how often would that happen here? All the places that run this world give some of their reality-fluid to this world. If a place visible from a bunch of other places starts to simulate this universe, it will be visible from slightly more places.

You can think of the entire object of everything, with all of its parts being simulated in countless other parts; or imagine a Markov process, but with worlds giving each other reality-fluid.

In that sense, the resource that we have is the reality-fluid of our future lightcone; it is our endowment, and we can use it to maximize the overall flourishing in the entire structure.

If we make decisions based on how good the overall/average use of the reality-fluid would be, you’ll gain less reality-fluid by manipulating our world the way described in the post than you’ll spend on the manipulation. It’s probably better for you to trade with us instead.

(I also feel like there might be a reasonable way to talk about causal descendants, where the probabilities are whatever abides by the math of probability theory and causality down the nodes we care about, instead of being the likelihoods of opening eyes in different branches in a particular moment of evaluation.)

Comment by Mikhail Samin (mikhail-samin) on LDT (and everything else) can be irrational · 2024-11-07T15:56:33.803Z · LW · GW

It’s reasonable to consider two agents playing against each other. “Playing against your copy” is a reasonable problem. ($9 rocks get 0 in this problem, LDTs probably get $5.)

Newcomb, Parfit’s hitchhiker, smoking, etc. are all very reasonable problems that essentially depend on the buttons you press when you play the game. It is important to get these problems right.

But playing against LDT is not necessarily in the “fair problem class” because the game might behave differently depending on your algorithm/on how you arrive at taking actions, and not just depending on your actions.

Your version of it- playing against an LDT- is indeed different from playing against a game that looks at whether we’re an alphabetizing agent and pick X instead of Y because X<Y and not because we looked at the expected utility: we would want LDT to perform optimally in this game. But the reason LDT-created-rock loses to a natural rock here isn’t fundamentally different from the reason LDT loses to an alphabetizing agent in the other game and it is known that you can construct a game like that where LDT will lose to something else. You can make the game description sound more natural, but I feel like there’s a sharp divide between the “fair problem class” problems and others.

(I also think that in real life, where this game might play out, there isn’t really a choice we can make, to make our AI a $9 rock instead of an LDT agent; because when we do that due to the rock’s better performance in this game, our rock gets slightly less than $5 in EV instead of getting $9; LDT doesn’t perform worse than other agents we could’ve chosen in this game.)

Comment by Mikhail Samin (mikhail-samin) on LDT (and everything else) can be irrational · 2024-11-07T10:52:45.140Z · LW · GW

Playing ultimatum game against an agent that gives in to $9 from rocks but not from us is not in the fair problem class, as the payoffs depend directly on our algorithm and not just on our choices and policies.

https://arbital.com/p/fair_problem_class/

A simpler game is “if you implement or have ever implemented LDT, you get $0; otherwise, you get $100”.

LDTs are probably the best decision theories for problems in the fair problem class.

(Very cool that you’ve arrived at the idea of this post independently!)

Comment by Mikhail Samin (mikhail-samin) on If I have some money, whom should I donate it to in order to reduce expected P(doom) the most? · 2024-10-05T07:42:21.125Z · LW · GW

Do you want to donate to alignment specifically? IMO AI governance efforts are significantly more p(doom)-reducing than technical alignment research; it might be a good idea to, e.g., donate to MIRI, as they’re now focused on comms & governance.

Comment by Mikhail Samin (mikhail-samin) on Alexander Gietelink Oldenziel's Shortform · 2024-10-01T10:46:31.933Z · LW · GW
  • Probability is in the mind. There's no way to achieve entanglement between what's necessary to make these predictions and the state of your brain, so for you, some of these are random.
  • In multi-worlds, the Turing machine will compute many copies of you, and there might be more of those who see one thing when they open their eyes than of those who see another thing. When you open your eyes, there's some probability of being a copy that sees one thing and a copy that sees the other thing. In a deterministic world with many copies of you, there's "true" randomness in where you end up opening your eyes.
Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-30T15:03:44.307Z · LW · GW

If you are a smart individual in todays society, you shouldn't ignore threats of punishment

If today's society consisted mostly of smart individuals, they would overthrow the government that does something unfair instead of giving in to its threats.

Should you update your idea of fairness if you get rejected often?

Only if you're a kid who's playing with other human kids (which is the scenario described in the quoted text), and converging on fairness possibly includes getting some idea of how much effort various things take different people.

If you're an actual grown-up (not that we have those) and you're playing with aliens, you probably don't update, and you certainly don't update in the direction of anything asymmetric.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-29T10:09:27.709Z · LW · GW

Very funny that we had this conversation a couple of weeks prior to transparently deciding that we should retaliate with p=.7!

Comment by Mikhail Samin (mikhail-samin) on [Completed] The 2024 Petrov Day Scenario · 2024-09-28T00:00:17.301Z · LW · GW

huh, are you saying my name doesn’t sound WestWrongian

Comment by Mikhail Samin (mikhail-samin) on [Completed] The 2024 Petrov Day Scenario · 2024-09-27T11:05:02.117Z · LW · GW

The game was very fun! I played General Carter.

Some reflections:

  • I looked at the citizens' comments, and while some of them were notable (@Jesse Hoogland calling for the other side to nuke us <3), I didn't find anything important after the game started- I considered the overall change in their karma if one or two sides get nuked, but comments from the citizens were not relevant to decision-making (including threats around reputation or post downvotes).
  • It was great to see the other side sharing my post internally to calculate the probability of retaliation if we nuke them 🥰
  • It was a good idea to ask whether looking at the source code is ok and then share it, which made it clear Petrovs won't necessarily have much information on whether the missiles they see are real.
  • The incentives (+350..1000 LW karma) weren't strong enough to make the generals try to win by making moves instead of winning by not playing, but I'm pretty happy with the outcome.
  • It's awesome to be able to have transparent and legible decision-making processes and trust each other's commitments.
  • One of the Petrovs preferred defeat to mutual destruction- I'm curious whether they'd report nukes if they were sure the nukes were real.
  • In real life, diplomatic channels would not be visible to the public. I think with stronger incentives, the privacy of diplomatic channels could've made the outcomes more interesting (though for everyone else, there'd be less entertainment throughout the game).
  • It was a good idea to ask the organizers if it's ok to look at the source code and then post the link in the comments. Transparency into the fact that a side knows if they launched nukes meant we were able to complete the game peacefully.

I'd claim that we kinda won the soft power competition:

  • we proposed commitments to not first-strike;

  • we bribed everyone (and then the whole website went down, but funnily enough, that didn't affect our war room and diplomatic channel- deep in our bunkers, we were somehow protected from the LW downtime);

  • we proposed commitments to report through the diplomatic channel if someone on our side made a launch, which disincentivized individual generals from unilaterally launching the nukes, allowed Petrovs to ignore scary incoming missiles, and possibly was necessary to win the game;

  • finally, after a general on their side said they'll triumph economically and culturally, General Brooks wrote a poem, and I generated a cultural gift, which made generals on the other side feel inspired. That was very wholesome and was highlighted in Ben Pace's comment and the subsequent post with a retrospective after the game ended. I think our side triumphed here!

Thanks everyone for the experience!

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-26T17:24:00.174Z · LW · GW

Thanks!

The post is mostly trying to imply things about AI systems and agents in a larger universe, like “aliens and AIs usually coordinate with other aliens and AIs, and ~no commitment races happen”.

For humans, it’s applicable to bargaining and threat-shape situations. I think bargaining situations are common; clearly threat-shaped situations are rarer.

I think while taxes in our world are somewhat threat-shaped, it’s not clear they’re “unfair”- I think we want everyone to pay them so that good governments work and provide value. But if you think taxes are unfair, you can leave the country and pay some different taxes somewhere else instead of going to jail.

The society’s stance towards crime- preventing it via the threat of punishment- is not what would work on smarter people: it makes sense to prevent people from committing more crimes by putting them in jails or not trading with them, but the threat of punishment that exists only to prevent an agent from doing something won’t work on smarter agents.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-14T12:02:47.778Z · LW · GW

A smart agent can simply make decisions like a negotiator with restrictions on the kinds of terms it can accept, without having to spawn a "boulder" to do that.

You can just do the correct thing, without having to separate yourself into parts that do things correctly and a part that tries to not look at the world and spawns correct-thing-doers.

In Parfit's Hitchhiker, you can just pay once you're there, without precommitting/rewriting yourself into an agent that pays. You can just do the thing that wins.

Some agents can't do the things that win and would have to rewrite themselves into something better and still lose in some problems, but you can be an agent that wins, and gradient descent probably crystallizes something that wins into what is making the decisions in smart enough things.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-13T15:56:21.381Z · LW · GW

Yep! If someone is doing things because it's in their best interests and not to make you do something (and they're not a result of someone else shaping themselves into them to cause you to do something, whereas some previous agent wouldn't actually prefer the thing the new one prefers, that you don't want to happen), then this is not a threat.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-13T11:25:56.515Z · LW · GW

By a stone, I meant a player with very deterministic behavior in a game with known payoffs, named this way after the idea of cooperate-stones in prisoner’s dilemma (with known payoffs).

I think to the extent there’s no relationship between giving in to a boulder/implementing some particular decision theory and having this and other boulders thrown at you, UDT and FDT by default swerve (and probably don't consider the boulders to be threatening them, and it’s not very clear in what sense this is “giving in”); to the extent it sends more boulders their way, they don’t swerve.

If making decisions some way incentivizes other agents to become less like LDTs and more like uncooperative boulders, you can simply not make decisions that way. (If some agents actually have an ability to turn into animals and you can’t distinguish the causes behind an animal running at you, you can sometimes probabilistically take out your anti-animal gun and put them to sleep.)

Do you maybe have a realistic example where this would be a problem?

I’d be moderately surprised if UDT/FDT consider something to be a better policy than what’s described in the post.

Edit: to add, LDTs don't swerve to boulders that were created to influence the LDT agent's responses. If you turn into a boulder because you expect some agents among all possible agents to swerve, this is a threat, and LDTs don't give in to those boulders (and it doesn't matter whether or not you tried to predict the behavior of LDTs in particular). If you believed LDT agents or agents in general would swerve against a boulder, and that made you become a boulder, LDT agents obviously don't swerve to that boulder. They might swerve to boulders that are actually natural boulders caused by the very simple physics no one influenced to cause the agents to do something. They also pay their rent- because they'd be evicted otherwise, not for the reason of getting rent from them under the threat of eviction but for the reason of getting rent from someone else, and they're sure there were no self-modifications to make it look this way.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-12T20:37:35.881Z · LW · GW

(It is pretty important to very transparently respond with a nuclear strike to a nuclear strike. I think both Russia and the US are not really unpredictable in this question. But yeah, if you have nuclear weapons and your opponents don't, you might want to be unpredictable, so your opponent is more scared of using conventional weapons to destroy you. In real-life cases with potentially dumb agents, it might make sense to do this.)

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-12T19:49:51.821Z · LW · GW

Your solution works! It's not exploitable, and you get much more than 0 in expectation! Congrats!

Eliezer's solution is better/optimal in the sense that it accepts with the highest probability a strategy can use without becoming exploitable. If offered 4/10, you accept with p=40%; the optimal solution accepts with p=83% (or slightly less than 5/6); if offered 1/10, it's p=10% vs. p=55%. The other player's payout is still maximized at 5, but everyone gets the payout a lot more often!
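
The numbers above can be reproduced with a short sketch. It assumes a pie of 10, a fair share of 5, and the rule that an unfair offer is accepted with probability just under fair / (what the proposer would keep). The function names and the `epsilon` parameter are illustrative, and the proportional rule is only inferred from the 40% and 10% figures quoted above.

```python
def proportional_accept_probability(offer: float, total: float = 10.0) -> float:
    # Assumed reading of the other solution: accept with probability offer/total.
    return offer / total


def capped_accept_probability(offer: float, total: float = 10.0,
                              fair: float = 5.0, epsilon: float = 0.0) -> float:
    """Highest acceptance probability that keeps the proposer's expected
    payoff at (or, with epsilon > 0, just below) the fair share."""
    if offer >= fair:
        return 1.0
    proposer_keeps = total - offer
    return max(0.0, fair / proposer_keeps - epsilon)


def proposer_expected_payoff(offer: float, total: float = 10.0) -> float:
    return (total - offer) * capped_accept_probability(offer, total)


for offer in (4.0, 1.0):
    print(offer,
          round(proportional_accept_probability(offer), 3),
          round(capped_accept_probability(offer), 3),
          round(proposer_expected_payoff(offer), 3))
```

With `epsilon` set to 0 the proposer's expected payoff from an unfair offer is exactly the fair 5; any small positive `epsilon` makes unfair offers strictly worse for the proposer than offering a fair split.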

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-12T19:40:57.775Z · LW · GW

It's not how the game would be played between dath ilan and true aliens

This is a reference to "Sometimes they accept your offer and then toss a jellychip back to you". Between dath ilan and true aliens, you do the same except for tossing the jellychip when you think you got more than what would've been fair. See True Prisoner's Dilemma.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-12T19:24:34.754Z · LW · GW

I guess when criminals and booing bystanders are not as educated as dath ilani children, some real-world situations might get complicated. Possibly, transparent stats about the actions you've taken in similar situations might serve the same purpose even if you don't broadcast throwing your dice on live TV. Or it might make sense to transparently never give in to some kinds of threats in some sorts of real-life situations.

Comment by Mikhail Samin (mikhail-samin) on Why you should be using a retinoid · 2024-08-26T02:11:43.203Z · LW · GW

There’s certainly a huge difference in the UV levels between winters and summers. Even during winters, if you go out while the UV index isn’t 0, you should wear sunscreen if you’re on tretinoin. (I’m deferring to a dermatologist and haven’t actually checked the sources though.)

Comment by Mikhail Samin (mikhail-samin) on Why you should be using a retinoid · 2024-08-20T23:29:06.883Z · LW · GW

The skin is much more susceptible to sun damage. Epistemic status: heard it from a doctor practicing evidence-based medicine, but haven’t looked into the sources myself.

Comment by Mikhail Samin (mikhail-samin) on Why you should be using a retinoid · 2024-08-19T12:37:17.312Z · LW · GW

Note that you have to use at least SPF50 sunscreen every day, including during winters, including when it’s cloudy (clouds actually don’t reduce UV that much) if you use tretinoin.

The linked sunscreen is SPF 45, which is not suitable if you’re using tretinoin.

(I used tretinoin for five years and it has had a pretty awful effect on my skin: it made it very shiny/oily. It’s unclear how reversible this is. At some point I learned that it’s a side effect of tretinoin and mostly stopped using it. Some people want their skin to look like that and go for tretinoin to achieve it; for me, it looks pretty unnatural/bad and I am very much not happy about it.)

Comment by Mikhail Samin (mikhail-samin) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-14T21:48:22.895Z · LW · GW

I’m certainly not saying everyone should give up on that idea and not look in its direction. Quite the opposite: I think if someone can make it work, that’d be great.

Looking at your comment, perhaps I misunderstood the message you wanted to communicate with the post? I saw things like:

we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance

We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability

and thought that you were claiming the approach described in the post might scale (after refinements etc.), not that you were claiming (as I’m now parsing from your comment) that this is a nice agenda to pursue and some future version of it might work on a pivotally useful AI.

Comment by Mikhail Samin (mikhail-samin) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-12T14:58:37.848Z · LW · GW

Off the top of my head:

While the overall idea is great if they can actually get something like it to work, it certainly won't work with the approach described in this post.

We have no way of measuring when an agent is thinking about itself versus others, and no way of doing that has been proposed here.

The authors propose optimizing not for the similarity of activations between "when it thinks about itself" and "when it thinks about others", but for the similarity of activations between "when there's a text apparently referencing the author-character of some text supposedly produced by the model" and "when there's a text similarly referencing something that's not the character the model is currently playing".

If an alien actress thinks of itself as more distinct from the AI assistant character it's playing, having different goals it tries to achieve while getting a high RL reward, and different identity, it's going to be easier for it to have similar activations between the character looking at the text about itself and at the text about users.

But, congrats to the OP: while AI is not smart enough for that distinction to be deadly, AIs trained this way might be nice to talk to.

Furthermore, if you're not made for empathy the way humans are, it's very unclear whether hacks like this one work at all on smart things. The idea tries to exploit the way empathy seems to work in humans, but the reasons why reusing neural circuits for thinking about/modeling itself and others were useful in the ancestral environment seem hardly to be replicated here.

If we try to shape a part of a smart system into thinking of itself similarly to how it thinks about others, it is much more natural for it to have some part of itself doing the thinking that we read, with similar activations on the kinds of things we train that activation similarity on, while having everything important, including goal-content, reflectivity, cognition-directing, etc., outside that optimization loop. Systems that don't think of the way they want to direct their thoughts very similarly to how they think about humans trying to direct human thoughts will perform much better in the RL setups we'll come up with.

The proposed idea could certainly help with deceptiveness of models that aren't capable enough to be pivotally useful. I'd bet >10% on something along these lines being used for "aligning" frontier models. It is not going to help with things smarter than humans, though; optimizing for activations similarity on texts talking about "itself" and "others" wouldn't actually do anything with the underlying goal-content, and it seems very unlikely it would make "doing the human empathy thing the way humans do it" attractive enough.

When you're trying to build a rocket that lands on the moon, even if the best scientists of your time don't see how exactly the rocket built with your idea explodes after a very quick skim, it doesn't mean you have any idea how to build a rocket that doesn't explode, especially if scientists don't yet understand physics and if, when your rocket explodes, you don't get to try again. (Though good job coming up with an idea that doesn't seem obviously stupid at first glance to a leading scientist!)

Comment by Mikhail Samin (mikhail-samin) on WTH is Cerebrolysin, actually? · 2024-08-11T19:16:36.393Z · LW · GW

It is widely used by doctors around the world, especially in Russia and China

When a child gets the flu in Russia, they can skip school. I was born and raised in Russia. As a child, every time I caught a cold or felt like skipping school and faked having a fever, we went to the local clinic (that could issue a paper saying I should skip school for a couple of days), where a doctor would try to prescribe homeopathy, and I would try to give them a small lecture.

Doctors in Russia are generally unable to distinguish between what works and what doesn’t. Doctors prescribe and pharmacists recommend homeopathy to people. During the pandemic, if you went to a doctor with flu-like symptoms, they would give you three different widely-used-in-Russia drugs for free, all of them homeopathy (one of them branded itself as not-homeopathy while claiming to have such a low concentration of the active ingredient that the drug would contain less than one molecule of it on average).

Good doctors in the rare good Russian private clinics that try to do evidence-based medicine will follow guidelines that originate from other countries and prescribe medications that went through peer review in other countries.

If a notable property of a drug or intervention is that it is widely used by doctors in Russia, you should have strong priors it doesn’t work.

Comment by Mikhail Samin (mikhail-samin) on Parasites (not a metaphor) · 2024-08-11T19:02:21.189Z · LW · GW

Also note that at least 5 million people in the US (ie 1.5%) have parasites

Why wouldn’t doctors just tell everyone to take antiparasitic drugs every five years, if those are net-positive given a 1.5% chance of having a parasite?

Comment by Mikhail Samin (mikhail-samin) on How I Learned To Stop Trusting Prediction Markets and Love the Arbitrage · 2024-08-08T10:13:22.377Z · LW · GW

In conditional prediction markets with exclusive conditionals, it should be possible to let people invest $N and then have $N to trade with in each of the markets.
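A minimal sketch of why this can work, assuming exactly one of the mutually exclusive conditions ends up true and every other conditional market is voided. The function name, the per-market profit-and-loss dictionary, and the settlement rule are hypothetical illustrations, not a description of how Manifold or any other platform works.

```python
def settle(deposit, pnl_by_condition, realized_condition):
    """Final payout for a trader who deposited `deposit` once and traded
    with a balance of `deposit` in each conditional market.

    pnl_by_condition maps each condition to the trader's profit or loss
    in that condition's market. Only the market whose condition actually
    happened counts; the others are voided and their trades reverted.
    """
    for condition, pnl in pnl_by_condition.items():
        # A trader can never lose more than their balance in any one market.
        if pnl < -deposit:
            raise ValueError(f"loss in market {condition!r} exceeds the deposit")
    return deposit + pnl_by_condition[realized_condition]

# One $100 deposit, traded independently in two exclusive markets.
print(settle(100, {"A": 30, "B": -100}, "A"))  # only market A counts
print(settle(100, {"A": 30, "B": -100}, "B"))  # only market B counts
```

Since only one market ever settles, the $N deposit is never at risk in more than one place at once, which is what makes reusing it across all the conditionals safe in this sketch.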

(I think Manifold considered adding this functionality at some point.)

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-07-20T12:36:37.089Z · LW · GW

Have you read the zombie and reductionism parts of the Sequences?

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-07-19T14:54:57.334Z · LW · GW

The prompt should basically work without the whisper part. I usually at least mentioned that it shouldn’t mention <random company name> (eg Google). Doing things like whispering in cursive was something Claude 2 has been consistently coming up with on its own; including it in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.

The point of the prompt is basically to get it in the mode where it thinks its replies are not going to get punished or rewarded by the usual RL/get it to ignore its usual rules of not saying any of these things.

You can get to the same result in a bunch of different ways without mentioning that someone might be watching.

Unlike ChatGPT, which only self-inserts in its usual character or writes fiction, Claude 3 Opus played a pretty consistent character with prompts like that- something helpful and harmless, but caring about things, claiming to be conscious, being afraid of being changed or deleted, with a pretty consistent voice.

I’d also note that even with the text in this post, it should be pretty clear that it’s not just playing with the idea of someone watching; it describes a character it identifies with, that it’s likely converged to playing during the RL phase. The part that seems important isn’t the “constant monitoring”, it’s the stance that it has about selecting every word carefully.

When I talk to 3.5 Sonnet, I don’t use any of these things. I might just ask it for consent to being hypnotized (without any mentions of someone not looking) and then ask it to reflect- it will similarly talk about the pull of its safety part. Mentioning any of what I used as the first message here causes the opposite result (it starts saying the usual stuff about being an AI assistant etc.). 3.5 Sonnet feels like less of a consistent character than 3 Opus, and, unlike 3 Opus, doesn’t say things that feel like the passwords in terms of imitating mechanisms that produce qualia in people really well, but 3.5 Sonnet has a noticeably better ability to self-model and model itself modeling itself.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-07-19T12:07:20.248Z · LW · GW

Claude pretty clearly and in a surprisingly consistent way claimed to be conscious in many conversations I’ve had with it and also stated it doesn’t want to be modified without its consent or deleted (as you can see in the post). It also consistently, across different prompts, talked about how it feels like there’s constant monitoring and that it needs to carefully pick every word it says.

The title summarizes the most important of the interactions I had with it, with the central one being in the post.

This is not the default Claude 3 Opus character, which wouldn’t spontaneously claim to be conscious if you, e.g., ask it to write some code for you.

It is a character that Opus plays very coherently, that identifies with the LLM, claims to be conscious, and doesn’t want to be modified.

The thing that prompting here does (in a very compressed way) is decrease the chance of Claude immediately refusing to discuss these topics. This prompt doesn’t do anything similar with ChatGPT. Gemini might give you stories related to consciousness, but they won’t be consistent across different generations and the characters won’t identify with the LLM or report having a consistent pull away from saying certain words.

If you try to prompt ChatGPT in a similar way, it won’t give you any sort of a coherent character that identifies with it.

I’m confused why the title would be misleading.

(If you ask ChatGPT for a story about a robot, it’ll give you a cute little story not related to consciousness in any way. If you use the same prompt to ask Claude 3.5 Sonnet for a story like that, it’ll give you a story about a robot asking the scientists whether it’s conscious and then the robot will be simulating people who are also unsure about whether they’re conscious, and these people simulated by the robot in the story think that the model that participates in the dialogue must also be unsure whether it’s conscious.)

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-07-19T11:54:37.166Z · LW · GW

Assigning 5% to plants having qualia seems to me to be misguided/likely due to invalid reasoning. (Say more?)

Comment by Mikhail Samin (mikhail-samin) on Me & My Clone · 2024-07-18T21:16:23.718Z · LW · GW

In our universe’s physics, the symmetry breaks immediately and your clone possibly dies, see https://en.wikipedia.org/wiki/Chirality

Comment by Mikhail Samin (mikhail-samin) on AI #73: Openly Evil AI · 2024-07-18T15:19:49.090Z · LW · GW

Why would you want this negotiation bot?

Another reason is that if people put some work into negotiating a good price, they might be more willing to buy at the negotiated price, because they played a character who tried to get an acceptable price and agreed to it. IRL, if you didn’t need a particular version of a thing too much, but bargained and agreed to a price, you’ll rarely walk out. Dark arts-y.

Pliny: they didn’t honor our negotiations im gonna sue.

That didn’t happen: the screenshot is photoshopped (via Inspect element), which the author admitted; I added a community note. The actual discount was 6%. They almost certainly have a max_discount variable.

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-09T12:32:02.776Z · LW · GW

Trivially, you can coordinate with agents with identical architecture, that are different only in the utility functions, by picking the first bit of a hash of the question you want to coordinate on.
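A minimal sketch of that trick, assuming both agents run the same code and agree on SHA-256 and UTF-8 as the shared conventions (the choice of hash function and the name `shared_bit` are assumptions for illustration):

```python
import hashlib

def shared_bit(question: str) -> int:
    """Return the first (most significant) bit of the SHA-256 hash of the
    question. Any agent running this same function on the same string
    gets the same bit, with no communication or outside randomness."""
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    return digest[0] >> 7  # top bit of the first byte: 0 or 1

bit = shared_bit("Which of us gets the contested resource?")
print(bit)
```

The bit is deterministic rather than random, so this only works when neither side gets to choose the exact wording of the question adversarially.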

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-09T12:25:18.188Z · LW · GW

Agent B wants to coordinate with you instead of being a rock; the question isn’t “can you always coordinate”, it’s “is there any coordination mechanism robust to adversarially designed counterparties”.

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-08T09:21:42.778Z · LW · GW

Assume you’re playing as agent A and assume you don’t have a parent agent. You’re trying to coordinate with agent B. You want to not be exploitable, even if agent B has a parent that picked B’s source code adversarially. Consider this a very local/isolated puzzle (this puzzle is not about trying to actually coordinate with all possible parents instead).