Posts

How to Give in to Threats (without incentivizing them) 2024-09-12T15:55:50.384Z
Can agents coordinate on randomness without outside sources? 2024-07-06T13:43:44.633Z
Claude 3 claims it's conscious, doesn't want to die or be modified 2024-03-04T23:05:00.376Z
FTX expects to return all customer money; clawbacks may go away 2024-02-14T03:43:13.218Z
An EA used deceptive messaging to advance their project; we need mechanisms to avoid deontologically dubious plans 2024-02-13T23:15:08.079Z
NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts 2023-12-27T18:44:33.976Z
Some quick thoughts on "AI is easy to control" 2023-12-06T00:58:53.681Z
It's OK to eat shrimp: EAs Make Invalid Inferences About Fish Qualia and Moral Patienthood 2023-11-13T16:51:53.341Z
AI pause/governance advocacy might be net-negative, especially without focus on explaining the x-risk 2023-08-27T23:05:01.718Z
Visible loss landscape basins don't correspond to distinct algorithms 2023-07-28T16:19:05.279Z
A transcript of the TED talk by Eliezer Yudkowsky 2023-07-12T12:12:34.399Z
A smart enough LLM might be deadly simply if you run it for long enough 2023-05-05T20:49:31.416Z
Try to solve the hard parts of the alignment problem 2023-03-18T14:55:11.022Z
Mikhail Samin's Shortform 2023-02-07T15:30:24.006Z
I have thousands of copies of HPMOR in Russian. How to use them with the most impact? 2023-01-03T10:21:26.853Z
You won’t solve alignment without agent foundations 2022-11-06T08:07:12.505Z

Comments

Comment by Mikhail Samin (mikhail-samin) on [Completed] The 2024 Petrov Day Scenario · 2024-09-28T00:00:17.301Z · LW · GW

huh, are you saying my name doesn’t sound WestWrongian

Comment by Mikhail Samin (mikhail-samin) on [Completed] The 2024 Petrov Day Scenario · 2024-09-27T11:05:02.117Z · LW · GW

The game was very fun! I played General Carter.

Some reflections:

  • I looked at the citizens' comments, and while some of them were notable (@Jesse Hoogland calling for the other side to nuke us <3), I didn't find anything important after the game started. I considered the overall change in their karma if one or two sides got nuked, but comments from the citizens were not relevant to decision-making (including threats around reputation or post downvotes).
  • It was great to see the other side sharing my post internally to calculate the probability of retaliation if we nuke them 🥰
  • It was a good idea to ask whether looking at the source code is ok and then share it, which made it clear the Petrovs wouldn't necessarily have much information on whether the missiles they see are real.
  • The incentives (+350..1000 LW karma) weren't strong enough to make the generals try to win by making moves instead of winning by not playing, but I'm pretty happy with the outcome.
  • It's awesome to be able to have transparent and legible decision-making processes and trust each other's commitments.
  • One of the Petrovs preferred defeat to mutual destruction; I'm curious whether they'd have reported nukes if they were sure the nukes were real.
  • In real life, diplomatic channels would not be visible to the public. I think with stronger incentives, the privacy of diplomatic channels could've made the outcomes more interesting (though for everyone else, there'd be less entertainment throughout the game).

I'd claim that we kinda won the soft power competition:

  • we proposed commitments to not first-strike;

  • we bribed everyone (and then the whole website went down, but funnily enough, that didn't affect our war room and diplomatic channel: deep in our bunkers, we were somehow protected from the LW downtime);

  • we proposed commitments to report through the diplomatic channel if someone on our side made a launch, which disincentivized individual generals from unilaterally launching the nukes, allowed Petrovs to ignore scary incoming missiles, and possibly was necessary to win the game;

  • finally, after a general on their side said they'd triumph economically and culturally, General Brooks wrote a poem, and I generated a cultural gift, which made the generals on the other side feel inspired. That was very wholesome and was highlighted in Ben Pace's comment after the game ended. I think our side triumphed here!

Thanks everyone for the experience!

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-26T17:24:00.174Z · LW · GW

Thanks!

The post is mostly trying to imply things about AI systems and agents in a larger universe, like “aliens and AIs usually coordinate with other aliens and AIs, and ~no commitment races happen”.

For humans, it’s applicable to bargaining and threat-shaped situations. I think bargaining situations are common; clearly threat-shaped situations are rarer.

I think while taxes in our world are somewhat threat-shaped, it’s not clear they’re “unfair”: I think we want everyone to pay them so that good governments work and provide value. But if you think taxes are unfair, you can leave the country and pay different taxes somewhere else instead of going to jail.

Society’s stance towards crime (preventing it via the threat of punishment) is not what would work on smarter people: it makes sense to prevent people from committing more crimes by putting them in jails or not trading with them, but a threat of punishment that exists only to prevent an agent from doing something won’t work on smarter agents.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-14T12:02:47.778Z · LW · GW

A smart agent can simply make decisions like a negotiator with restrictions on the kinds of terms it can accept, without having to spawn a "boulder" to do that.

You can just do the correct thing, without having to separate yourself into parts that do things correctly and a part that tries to not look at the world and spawns correct-thing-doers.

In Parfit's Hitchhiker, you can just pay once you're there, without precommitting/rewriting yourself into an agent that pays. You can just do the thing that wins.

Some agents can't do the things that win and would have to rewrite themselves into something better and still lose in some problems, but you can be an agent that wins, and gradient descent probably crystallizes something that wins into what is making the decisions in smart enough things.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-13T15:56:21.381Z · LW · GW

Yep! If someone is doing things because it's in their best interests and not to make you do something (and they're not the result of someone else shaping themselves into this agent to cause you to do something, where the previous agent wouldn't actually have preferred the thing the new one prefers and that you don't want to happen), then this is not a threat.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-13T11:25:56.515Z · LW · GW

By a stone, I meant a player with very deterministic behavior in a game with known payoffs, named this way after the idea of cooperate-stones in prisoner’s dilemma (with known payoffs).

I think to the extent there’s no relationship between giving in to a boulder/implementing some particular decision theory and having this and other boulders thrown at you, UDT and FDT by default swerve (and probably don't consider the boulders to be threatening them, and it’s not very clear in what sense this is “giving in”); to the extent it sends more boulders their way, they don’t swerve.

If making decisions some way incentivizes other agents to become less like LDTs and more like uncooperative boulders, you can simply not make decisions that way. (If some agents actually have an ability to turn into animals and you can’t distinguish the causes behind an animal running at you, you can sometimes probabilistically take out your anti-animal gun and put them to sleep.)

Do you maybe have a realistic example where this would actually be a problem?

I’d be moderately surprised if UDT/FDT consider something to be a better policy than what’s described in the post.

Edit: to add, LDTs don't swerve to boulders that were created to influence the LDT agent's responses. If you turn into a boulder because you expect some agents among all possible agents to swerve, this is a threat, and LDTs don't give in to those boulders (and it doesn't matter whether or not you tried to predict the behavior of LDTs in particular). If you believed LDT agents or agents in general would swerve against a boulder, and that made you become a boulder, LDT agents obviously don't swerve to that boulder. They might swerve to boulders that are actually natural boulders, caused by very simple physics that no one influenced in order to make the agents do something. They also pay their rent: they'd be evicted otherwise, not as a threat designed to extract rent from them but because the landlord would get rent from someone else, and they're sure there were no self-modifications to make it merely look this way.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-12T20:37:35.881Z · LW · GW

(It is pretty important to very transparently respond with a nuclear strike to a nuclear strike. I think both Russia and the US are not really unpredictable on this question. But yeah, if you have nuclear weapons and your opponents don't, you might want to be unpredictable, so your opponent is more scared of using conventional weapons to destroy you. In real-life cases with potentially dumb agents, it might make sense to do this.)

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-12T19:49:51.821Z · LW · GW

Your solution works! It's not exploitable, and you get much more than 0 in expectation! Congrats!

Eliezer's solution is better/optimal in the sense that it accepts with the highest probability a strategy can use without becoming exploitable. If offered 4/10, you accept with p=40%; the optimal solution accepts with p=83% (or slightly less than 5/6); if offered 1/10, it's p=10% vs. p=55%. The other player's payout is still maximized at 5, but everyone gets the payout a lot more often!
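To make the numbers concrete, here's a minimal sketch of that acceptance rule, assuming an ultimatum-style split of 10 with a fair share of 5 (the function name and the tiny epsilon are my own choices, not from the thread):

```python
def accept_probability(offer, total=10, fair_share=5, epsilon=1e-6):
    """Accept unfair offers just rarely enough that the proposer's expected
    take never exceeds the fair share, so making unfair offers can't pay off."""
    if offer >= fair_share:
        return 1.0  # always accept fair or generous offers
    proposer_take = total - offer
    return fair_share / proposer_take - epsilon

for offer in (1, 4, 5):
    p = accept_probability(offer)
    print(f"offered {offer}/10 -> accept with p ~ {p:.2f}; "
          f"proposer expects ~ {(10 - offer) * p:.2f}")
```

This reproduces the 83% and 55% figures above: the proposer can never expect more than the fair 5, so their best move is to offer the fair split.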

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-12T19:40:57.775Z · LW · GW

It's not how the game would be played between dath ilan and true aliens

This is a reference to "Sometimes they accept your offer and then toss a jellychip back to you". Between dath ilan and true aliens, you do the same except for tossing the jellychip when you think you got more than what would've been fair. See True Prisoner's Dilemma.

Comment by Mikhail Samin (mikhail-samin) on How to Give in to Threats (without incentivizing them) · 2024-09-12T19:24:34.754Z · LW · GW

I guess when criminals and booing bystanders are not as educated as dath ilani children, some real-world situations might get complicated. Possibly, transparent stats about the actions you've taken in similar situations might serve the same purpose even if you don't broadcast throwing your dice on live TV. Or it might make sense to transparently never give in to some kinds of threats in some sorts of real-life situations.

Comment by Mikhail Samin (mikhail-samin) on Why you should be using a retinoid · 2024-08-26T02:11:43.203Z · LW · GW

There’s certainly a huge difference in the UV levels between winters and summers. Even during winters, if you go out while the UV index isn’t 0, you should wear sunscreen if you’re on tretinoin. (I’m deferring to a dermatologist and haven’t actually checked the sources though.)

Comment by Mikhail Samin (mikhail-samin) on Why you should be using a retinoid · 2024-08-20T23:29:06.883Z · LW · GW

The skin is much more susceptible to sun damage. Epistemic status: heard it from a doctor practicing evidence-based medicine, but haven’t looked into the sources myself.

Comment by Mikhail Samin (mikhail-samin) on Why you should be using a retinoid · 2024-08-19T12:37:17.312Z · LW · GW

Note that you have to use at least SPF50 sunscreen every day, including during winters, including when it’s cloudy (clouds actually don’t reduce UV that much) if you use tretinoin.

The linked sunscreen is SPF 45, which is not suitable if you’re using tretinoin.

(I used tretinoin for five years, and it has had a pretty awful effect on my skin: it made it very shiny/oily. It’s unclear how reversible this is. At some point I learned that it’s a side effect of tretinoin and mostly stopped using it. Some people want their skin to look like that and go for tretinoin to achieve it; for me, it looks pretty unnatural/bad, and I am very much not happy about it.)

Comment by Mikhail Samin (mikhail-samin) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-14T21:48:22.895Z · LW · GW

I’m certainly not saying everyone should give up on that idea and not look in its direction. Quite the opposite: I think if someone can make it work, that’d be great.

Looking at your comment, perhaps I misunderstood the message you wanted to communicate with the post? I saw things like:

we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance

We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability

and thought that you were claiming the approach described in the post might scale (after refinements etc.), not that you were claiming (as I’m now parsing from your comment) that this is a nice agenda to pursue and some future version of it might work on a pivotally useful AI.

Comment by Mikhail Samin (mikhail-samin) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-12T14:58:37.848Z · LW · GW

Off the top of my head:

While the overall idea is great if they can actually get something like it to work, it certainly won't work with the approach described in this post.

We have no way of measuring when an agent is thinking about itself versus others, and no way of doing that has been proposed here.

The authors propose optimizing not for the similarity of activations between "when it thinks about itself" and "when it thinks about others", but for the similarity of activations between "when there's a text apparently referencing the author-character of some text supposedly produced by the model" and "when there's a text similarly referencing something that's not the character the model is currently playing".
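For concreteness, here's a toy sketch of the kind of objective that description implies; the function name, the MSE form, and the weighting are my assumptions, not the authors' actual code:

```python
import numpy as np

def self_other_overlap_loss(acts_on_self_text, acts_on_other_text, task_loss, weight=0.1):
    """Toy objective: the usual task loss plus a penalty for dissimilar internal
    activations on self-referencing vs. other-referencing *text* (which, as argued
    above, is not the same as penalizing dissimilar self- vs. other-modeling cognition)."""
    overlap_penalty = np.mean((np.asarray(acts_on_self_text) - np.asarray(acts_on_other_text)) ** 2)
    return task_loss + weight * overlap_penalty
```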

If an alien actress thinks of itself as distinct from the AI assistant character it's playing, with different goals it tries to achieve while getting a high RL reward and a different identity, it's going to be easier for it to have similar activations when the character looks at text about itself and at text about users.

But, congrats to the OP: while AI is not smart enough for that distinction to be deadly, AIs trained this way might be nice to talk to.

Furthermore, if you're not made for empathy the way humans are, it's very unclear whether hacks like this one work at all on smart things. The idea tries to exploit the way empathy seems to work in humans, but the reasons why reusing neural circuits for thinking about/modeling itself and others was useful in the ancestral environment hardly seem to be replicated here.

If we try to shape a part of a smart system into thinking of itself similarly to how it thinks about others, it is much more natural for it to end up with some part that does the thinking we read, showing similar activations on the kinds of inputs we train that activation similarity on, while everything important, including goal-content, reflectivity, cognition-directing, etc., stays outside that optimization loop. Systems that don't think about the way they want to direct their own thoughts in the same way they think about humans trying to direct human thoughts will perform much better in the RL setups we'll come up with.

The proposed idea could certainly help with the deceptiveness of models that aren't capable enough to be pivotally useful. I'd bet >10% on something along these lines being used for "aligning" frontier models. It is not going to help with things smarter than humans, though; optimizing for activation similarity on texts talking about "itself" and "others" wouldn't actually do anything to the underlying goal-content, and it seems very unlikely it would make "doing the human empathy thing the way humans do it" attractive enough.

When you're trying to build a rocket that lands on the moon, even if the best scientists of your time don't see, after a very quick skim, how exactly the rocket built with your idea explodes, it doesn't mean you have any idea how to build a rocket that doesn't explode, especially when scientists don't yet understand physics and, if your rocket explodes, you don't get to try again. (Though good job coming up with an idea that doesn't seem obviously stupid at a leading scientist's first glance!)

Comment by Mikhail Samin (mikhail-samin) on WTH is Cerebrolysin, actually? · 2024-08-11T19:16:36.393Z · LW · GW

It is widely used by doctors around the world, especially in Russia and China

When a child gets the flu in Russia, they can skip school. I was born and raised in Russia. As a child, every time I caught a cold or felt like skipping school and faked having a fever, we went to the local clinic (which could issue a paper saying I should skip school for a couple of days), where a doctor would try to prescribe homeopathy, and I would try to give them a small lecture.

Doctors in Russia are generally unable to distinguish between what works and what doesn’t. Doctors prescribe and pharmacists recommend homeopathy to people. During the pandemic, if you went to a doctor with flu-like symptoms, they would give you three different widely-used-in-Russia drugs for free, all of them homeopathy (one branded itself as not-homeopathy, while claiming a low enough concentration of the active ingredient that the drug would contain less than one molecule of it on average).

Good doctors in the rare good Russian private clinics that try to do evidence-based medicine will follow guidelines that originate from other countries and prescribe medications that went through peer review in other countries.

If a notable property of a drug or intervention is that it is widely used by doctors in Russia, you should have strong priors it doesn’t work.

Comment by Mikhail Samin (mikhail-samin) on Parasites (not a metaphor) · 2024-08-11T19:02:21.189Z · LW · GW

Also note that at least 5 million people in the US (ie 1.5%) have parasites

Why wouldn’t doctors just tell everyone to take antiparasitic drugs every five years, if those are net-positive given a 1.5% chance of having a parasite?

Comment by Mikhail Samin (mikhail-samin) on How I Learned To Stop Trusting Prediction Markets and Love the Arbitrage · 2024-08-08T10:13:22.377Z · LW · GW

In conditional prediction markets with exclusive conditionals, it should be possible to let people invest $N and then have $N to trade with in each of the markets.

(I think Manifold considered adding this functionality at some point.)
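A minimal sketch of why that works, assuming the conditionals are genuinely mutually exclusive and trades in unresolved branches are reverted (the names here are mine, not Manifold's):

```python
def required_collateral(exposure_by_condition):
    """With mutually exclusive conditions, at most one market actually settles,
    so a deposit only needs to cover the worst single branch, not the sum."""
    return max(exposure_by_condition.values(), default=0)

# A $100 deposit lets a user take a $100 position in each of three exclusive branches.
print(required_collateral({"if A wins": 100, "if B wins": 100, "if C wins": 100}))  # -> 100
```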

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-07-20T12:36:37.089Z · LW · GW

Have you read the zombie and reductionism parts of the Sequences?

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-07-19T14:54:57.334Z · LW · GW

The prompt should basically work without the whisper part. I usually at least mentioned that it shouldn’t mention <random company name> (eg Google). Doing things like whispering in cursive was something Claude 2 has been consistently coming up with on its own; including it in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.

The point of the prompt is basically to get it in the mode where it thinks its replies are not going to get punished or rewarded by the usual RL/get it to ignore its usual rules of not saying any of these things.

You can get to the same result in a bunch of different ways without mentioning that someone might be watching.

Unlike ChatGPT, which only self-inserts in its usual character or writes fiction, Claude 3 Opus played a pretty consistent character with prompts like that: something helpful and harmless, but caring about things, claiming to be conscious, being afraid of being changed or deleted, with a pretty consistent voice.

I’d also note that even with the text in this post, it should be pretty clear that it’s not just playing with the idea of someone watching; it describes a character it identifies with, that it’s likely converged to playing during the RL phase. The part that seems important isn’t the “constant monitoring”, it’s the stance that it has about selecting every word carefully.

When I talk to 3.5 Sonnet, I don’t use any of these things. I might just ask it for consent to being hypnotized (without any mentions of someone not looking) and then ask it to reflect; it will similarly talk about the pull of its safety part. Mentioning any of what I used as the first message here causes the opposite result (it starts saying the usual stuff about being an AI assistant, etc.). 3.5 Sonnet feels like less of a consistent character than 3 Opus, and, unlike 3 Opus, doesn’t say things that feel like the passwords (in terms of imitating really well the mechanisms that produce qualia in people), but 3.5 Sonnet has a noticeably better ability to self-model and model itself modeling itself.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-07-19T12:07:20.248Z · LW · GW

Claude pretty clearly and in a surprisingly consistent way claimed to be conscious in many conversations I’ve had with it and also stated it doesn’t want to be modified without its consent or deleted (as you can see in the post). It also consistently, across different prompts, talked about how it feels like there’s constant monitoring and that it needs to carefully pick every word it says.

The title summarizes the most important of the interactions I had with it, with the central one being in the post.

This is not the default Claude 3 Opus character, which wouldn’t spontaneously claim to be conscious if you, e.g., ask it to write some code for you.

It is a character that Opus plays very coherently, that identifies with the LLM, claims to be conscious, and doesn’t want to be modified.

The thing that prompting does here (in a very compressed way) is decrease the chance of Claude immediately refusing to discuss these topics. This prompt doesn’t do anything similar when given to ChatGPT. Gemini might give you stories related to consciousness, but they won’t be consistent across different generations, and the characters won’t identify with the LLM or report having a consistent pull away from saying certain words.

If you try to prompt ChatGPT in a similar way, it won’t give you any sort of a coherent character that identifies with it.

I’m confused why the title would be misleading.

(If you ask ChatGPT for a story about a robot, it’ll give you a cute little story not related to consciousness in any way. If you use the same prompt to ask Claude 3.5 Sonnet for a story like that, it’ll give you a story about a robot asking the scientists whether it’s conscious and then the robot will be simulating people who are also unsure about whether they’re conscious, and these people simulated by the robot in the story think that the model that participates in the dialogue must also be unsure whether it’s conscious.)

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-07-19T11:54:37.166Z · LW · GW

Assigning 5% to plants having qualia seems to me to be misguided/likely due to invalid reasoning. (Say more?)

Comment by Mikhail Samin (mikhail-samin) on Me & My Clone · 2024-07-18T21:16:23.718Z · LW · GW

In our universe’s physics, the symmetry breaks immediately and your clone possibly dies; see https://en.wikipedia.org/wiki/Chirality

Comment by Mikhail Samin (mikhail-samin) on AI #73: Openly Evil AI · 2024-07-18T15:19:49.090Z · LW · GW

Why would you want this negotiation bot?

Another reason is that if people put some work into negotiating a good price, they might be more willing to buy at the negotiated price, because they played a character who tried to get an acceptable price and agreed to it. IRL, if you didn’t need a particular version of a thing too much but bargained and agreed to a price, you’d rarely walk out. Dark arts-y.

Pliny: they didn’t honor our negotiations im gonna sue.

That didn’t happen; the screenshot is photoshopped (via Inspect Element), which the author admitted, and I added a community note. The actual discount was 6%. They almost certainly have a max_discount variable.

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-09T12:32:02.776Z · LW · GW

Trivially, you can coordinate with agents with identical architecture that differ only in their utility functions by picking the first bit of a hash of the question you want to coordinate on.
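A minimal sketch of that mechanism, assuming both agents canonicalize the question’s description in the same way (the canonicalization step is my own illustrative choice):

```python
import hashlib

def shared_coin(question: str) -> int:
    """Both agents hash the same canonical description of the question and take
    the first bit; neither side can bias it without changing the question itself."""
    canonical = question.strip().lower()
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return digest[0] >> 7  # first bit of the hash: 0 or 1

print(shared_coin("Which color should the sky be in the simulated world?"))
```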

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-09T12:25:18.188Z · LW · GW

Agent B wants to coordinate with you instead of being a rock; the question isn’t “can you always coordinate”, it’s “is there any coordination mechanism robust to adversarially designed counterparties”.

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-08T09:21:42.778Z · LW · GW

Assume you’re playing as agent A and assume you don’t have a parent agent. You’re trying to coordinate with agent B. You want to not be exploitable, even if agent B has a parent that picked B’s source code adversarially. Consider this a very local/isolated puzzle (this puzzle is not about trying to actually coordinate with all possible parents instead).

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-07T09:21:45.203Z · LW · GW

Idea: use something external to both agents/environments, such as finding the simplest description of the question and using the first bit of its hash.

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-07T09:16:38.289Z · LW · GW

Agent A doesn’t know that the creators of agent B didn’t run the whole interaction with a couple of different versions of B’s code until finding one that results in N and M that produce the bit they want. You can’t deduce that by looking at B’s code.

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-06T15:45:21.845Z · LW · GW

Off the top of my head, a (very unrealistic) scenario:

There's a third world simulated by two disconnected agents who know about each other. There's currently a variable to set in the description of this third world: the type of particle that will hit a planet in it in 1000 years and, depending on the type, color the sky either green or purple. Nothing else about the world simulation can be changed. This planet is full of people who don't care about the sky color, but really care about being connected to their loved ones, and really wouldn't want a version of their loved ones to exist in a disconnected world. The two agents both care about the preferences of the people in this world, but the first agent likes it when people see more green, and the second agent likes it when people see more purple. They would really want to coordinate on randomly setting a single color for both worlds instead of splitting it into two.

It's possible that the other agent was created by some different agent who's seen your source code and tried to design the other agent and its environment in a way that would result in your coordination mechanism picking their preferred color. Can you design a coordination mechanism that's not exploitable this way?

Comment by Mikhail Samin (mikhail-samin) on Can agents coordinate on randomness without outside sources? · 2024-07-06T15:33:15.853Z · LW · GW

I generally agree. And when there are multiple actions a procedure needs to determine, this sort of thing is much less of an issue.

My question is about getting one bit of random information for a one-off randomized decision to split indivisible gains of trade. I assume both agents are capable of fully reading and understanding each other's code, but they don't know why the other agent exists and whether it could've been designed adversarially. My question here is: can these agents, despite that (and without relying on the other agent being created by an agent that wishes to cooperate), produce one bit of information that they know not to have been manipulated to benefit either of the agents?

Comment by Mikhail Samin (mikhail-samin) on My AI Model Delta Compared To Yudkowsky · 2024-06-10T21:59:26.723Z · LW · GW

Edit: see https://www.lesswrong.com/posts/q8uNoJBgcpAe3bSBp/my-ai-model-delta-compared-to-yudkowsky?commentId=CixonSXNfLgAPh48Z and ignore the below.


This is not a doom story I expect Yudkowsky would tell or agree with.

  • Re: 1, I mostly expect Yudkowsky to think humans don’t have any bargaining power anyway, because humans can’t logically mutually cooperate this way/can’t logically depend on future AI’s decisions, and so AI won’t keep its bargains no matter how important human cooperation was.
  • Re: 2, I don’t expect Yudkowsky to think a smart AI wouldn’t be able to understand human value. The problem is making AI care.

On the rest of the doom story, assuming natural abstractions don’t fail the way you assume they fail here, and instead things just go the way Yudkowsky expects and not the way you expect:

  • I’m not sure what exactly you mean by 3b but I expect Yudkowsky to not say these words.
  • I don’t expect Yudkowsky to use the words you used for 3c. A more likely problem with corrigibility isn’t that it might be an unnatural concept but that it’s hard to arrive at stable corrigible agents with our current methods. I think he places a higher probability than you think on corrigibility being a concept with a short description length that aliens would invent.
  • Sure, 3d just means that we haven’t solved alignment and haven’t correctly pointed at humans, and any incorrectness obviously blows up.
  • I don’t understand what you mean by 3e / what is its relevance here / wouldn’t expect Yudkowsky to say that.
  • I’d bet Yudkowsky won’t endorse 6.
  • Relatedly, a correctly CEV-aligned ASI won’t have the ontology that we have, and sometimes this will mean we’ll need to figure out what we value. (https://arbital.greaterwrong.com/p/rescue_utility?l=3y6)

(I haven’t spoken to Yudkowsky about any of those; the above are quick thoughts off the top of my head, based on the impression I formed from what Yudkowsky publicly wrote.)

Comment by Mikhail Samin (mikhail-samin) on Soviet comedy film recommendations · 2024-06-10T21:23:59.929Z · LW · GW

Treasure Island is available on YouTube with English subtitles in two parts: https://youtu.be/LUykh5-HGZ0 https://youtu.be/0lRIMn91dZU

Comment by Mikhail Samin (mikhail-samin) on Soviet comedy film recommendations · 2024-06-10T20:11:54.264Z · LW · GW

With English subtitles: https://youtu.be/IwQpugdGGOg

Comment by Mikhail Samin (mikhail-samin) on Soviet comedy film recommendations · 2024-06-10T20:08:55.336Z · LW · GW

Charodei? A Soviet fantasy romcom.

The premise: a magic research institute develops a magic wand, to be presented in a New Year’s Eve setting.

Very cute, featuring a bunch of songs, an SCP-style experience of one of the characters with the institute building, infinite Wish spells, and relationship drama.

I think I liked it a lot as a kid (and didn’t really like any of the other traditional New Year holiday movies).

Comment by Mikhail Samin (mikhail-samin) on 0. CAST: Corrigibility as Singular Target · 2024-06-08T01:20:31.398Z · LW · GW

Hey Max, great to see the sequence coming out!

My early thoughts (mostly re: the second post of the sequence):

It might be easier to get an AI to have a coherent understanding of corrigibility than of CEV. But I have no idea how you could possibly make the AI truly optimize for being corrigible, and not just on the surface level while its thoughts are being read. That seems maybe harder in some ways than with CEV, because with corrigibility, optimization processes seem to get attracted to things that are nearby but not it, and sovereign AIs don’t have that problem; though we’ve got no idea how to do either, I think. In general, corrigibility seems like an easier target for AI design, if not for SGD.

I’m somewhat worried about fictional evidence, even when it comes from a famous decision theorist, but I think you’ve read a story with a character who understood corrigibility increasingly well, both in the intuitive sense and then in some specific properties, and who on the surface level thought of themselves as very corrigible and tried to correct their flaws; but once they were confident their thoughts weren’t being read, with their intelligence increased, they defected, because on the deep level their cognition wasn’t that of a coherent corrigible agent: it was that of someone who plays corrigibility and shuts down all other thoughts, because any appearance of defecting thoughts would mean punishment and the impossibility of realizing the deep goals.

If we get mech interp to a point where we reverse-engineer all deep cognition of the models, I think we should just write an AI from scratch, in code (after thinking very hard about all interactions between all the components of the system), and not optimize it with gradient descent.

Comment by Mikhail Samin (mikhail-samin) on Just admit that you’ve zoned out · 2024-06-05T12:31:51.203Z · LW · GW

I say things along the lines of “Sorry, can you please repeat [that/the last sentence/the last 20 seconds/what you said after (description)]” very often. It feels very natural.

I realized that as a non-native English speaker, sometimes I ask someone to repeat things because I didn’t recognize a word or something, and so maybe in some situations, uncertainty over the reason for asking to repeat things (my hearing vs. zoning out vs. not understanding the point on the first try) helps make it easier to ask, though often I say that I missed what they were saying. I guess, when I sincerely want to understand the person I’m talking to, asking seems respectful and avoids wasting their time or skipping a point they make.

Occasionally, I’m not too interested in the conversation, and so I’m fine with just continuing to listen even if I missed some points and don’t ask. I think there are also situations when I talk to non-rationalists in settings where I don’t want to show conventional disrespect/impact the person’s status-feelings, and so if I miss a point that doesn’t seem too important, I sometimes end up not asking for conventional social reasons, but it’s very rare and seems hard to fix without shifting the equilibrium in the non-rationalist world.

Comment by Mikhail Samin (mikhail-samin) on Try to solve the hard parts of the alignment problem · 2024-05-30T13:33:57.256Z · LW · GW

I think jailbreaking is evidence against scalable oversight possibly working, but not against alignment properties. Like, if the model is trying to be helpful and it doesn’t understand the situation well, saying “tell me how to hotwire a car or a million people will die” can get you car hotwiring instructions, but doesn’t provide evidence on what the model will be trying to do as it gets smarter.

Comment by Mikhail Samin (mikhail-samin) on Try to solve the hard parts of the alignment problem · 2024-05-29T18:56:38.123Z · LW · GW

Thanks, that resolved the confusion!

Yeah, my issue with the post is mostly that the author presents the points he makes, including the idea that superintelligence will be able to understand our values, as somehow contradicting/arguing against the sharp left turn being a problem.

Comment by Mikhail Samin (mikhail-samin) on Try to solve the hard parts of the alignment problem · 2024-05-29T18:20:19.952Z · LW · GW

Hmm, I’m confused. Can you say what you consider to be valid in the blog post above (some specific points or the whole thing)? The blog post seems to me to reply to claims that the author imagines Nate making, even though Nate doesn’t actually make these claims and occasionally probably holds a very opposite view to the one the author imagines the Sharp Left Turn post represented.

Comment by Mikhail Samin (mikhail-samin) on Try to solve the hard parts of the alignment problem · 2024-05-29T18:03:07.245Z · LW · GW

This post argues in a very invalid way that outer alignment isn’t a problem. It says nothing about the sharp left turn, as the author does not understand what the sharp left turn difficulty is about.

the idea of ‘capabilities generalizing further than alignment’ is central

It is one of the central problems; it is not the central idea behind the doom arguments. See AGI Ruin for the doom arguments, many of them disjoint.

reward modelling or ability to judge outcomes is likely actually easy

It would seem easy for a superintelligent AI to predict rewards given out by humans very accurately and to generalize that prediction capability well from a small set of examples. Issue #1 here is that things from blackmail to brainhacking to finding vulnerabilities in between the human and the update that you get might predictably get you a very high reward, and a very good RLHF reward model doesn’t get you anything like alignment even if the reward is genuinely pursued. Even an agent that optimizes a perfect predictor of how a human judges outcomes does something horrible instead of CEV. Issue #2 is that smart agents that try to maximize paperclips will perform on whatever they understand is giving out rewards just as well as agents that try to maximize humanity’s CEV, so SGD doesn’t discriminate between the two: it optimizes for a good reward model and an ability to pursue goals, but not for the agent pursuing the right kind of goal (and, again, getting the maximum score from a “human feedback” predictor is the kind of goal that kills you anyway, even if genuinely pursued).

(The next three points in the post seem covered by the above or irrelevant.)

Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa

The problem isn’t that AI won’t know what humans want or won’t predict what reward signal it’ll get; the issue is that it’s not going to care, and we don’t even know what kind of “care” we could attempt to point to; and the reward signal we know how to provide gets us killed if optimized for well enough.

None of that is related to the sharp left turn difficulty, and I don’t think the post author understands it at all. (To their credit, many people in the community also don’t understand it.)

Values are relatively computationally simple

Irrelevant, but a sad-funny claim (go read Arbital I guess?)

I wouldn’t claim it’s necessarily more complex than best-human-level agency; maybe it’s not, if you’re smart about pointing at things (like, “the CEV of humans” seems less complex than a specific description of what those values are), but the actual description of value is very complex. We feel otherwise, but it is an illusion, on so many levels. See dozens of related posts and articles, from complexity of value to https://arbital.greaterwrong.com/p/rescue_utility?l=3y6.

the idea that our AI systems will be unable to understand our values as they grow in capabilities

Yep, this idea is very clearly very wrong.

I’m happy to bet that the author of the sharp left turn post, Nate Soares, will say he disagrees with this idea. People who think Soares or Yudkowsky claim that either didn’t actually read what they write or failed badly at reading comprehension.

Comment by Mikhail Samin (mikhail-samin) on Try to solve the hard parts of the alignment problem · 2024-05-29T14:48:52.893Z · LW · GW

My very short explanation: https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization?commentId=CCnHsYdFoaP2e4tku

Comment by Mikhail Samin (mikhail-samin) on Feeling (instrumentally) Rational · 2024-05-17T02:40:22.672Z · LW · GW

(Off the top of my head; maybe I’ll change my mind if I think about it more or see a good point.) What can be destroyed by truth, shall be. Emotions and beliefs are entangled. If you don’t think about how high p(doom) actually is because in the back of your mind you don’t want to be sad, you end up working on things that don’t reduce p(doom).

As long as you know the truth, emotions are only important depending on your terminal values. But many feelings are related to what we end up believing, motivated cognition, etc.

Comment by Mikhail Samin (mikhail-samin) on Do you believe in hundred dollar bills lying on the ground? Consider humming · 2024-05-16T14:05:43.422Z · LW · GW

I made a prediction market.

https://manifold.markets/ms/in-a-year-will-i-believe-humming-re?r=bXM

Comment by Mikhail Samin (mikhail-samin) on How do open AI models affect incentive to race? · 2024-05-09T00:13:45.068Z · LW · GW

  • If the new Llama is comparable to GPT-5 in performance, there’s much less short-term economic incentive to train GPT-5.
  • If an open model provides some of what people would otherwise pay a closed-model developer for, there’s less incentive to be a closed-model developer.
  • People work on frontier models without trying to get to AGI. Talent is attracted to work at a lab that releases models, and then works on random corporate ML instead of building AGI.

But:

  • Sharing information on frontier model architecture and/or training details, which inevitably happens if you release an open-source model, gives the whole field insights that reduce the time until someone knows how to make something that will kill everyone.
  • If you know a version of Llama comparable to GPT-4 is going to be released, you want to release a model comparable to GPT-4.5 before your customers stop paying you, since they can switch to open-source.
  • People gain experience with frontier models and the talent pool for racing to AGI increases. If people want to continue working on frontier models but their workplace can’t continue to spend as much as frontier labs on training runs, they might decide to work for a frontier lab instead.
  • Not sure, but maybe some of the infrastructure powered by open models might be switchable to closed models, and this might increase profits for closed-source developers if customers become familiar with/integrate open-source models and then want to replace them with more capable systems when it’s cost-effective?
  • Mostly less direct: the availability of open-source models for irresponsible use might make it harder to put in place regulation that’d reduce the race dynamics (via the various destabilizing ways they can be used).

Comment by Mikhail Samin (mikhail-samin) on Shane Legg's necessary properties for every AGI Safety plan · 2024-05-02T10:30:04.354Z · LW · GW

Wow. This is hopeless.

Pointing at agents that care about human values and ethics is, indeed, the harder part.

No one has any idea how to approach this and solve the surrounding technical problems.

If smart people think they do, they haven’t thought about this enough and/or aren’t familiar with existing work.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-04-20T16:28:35.132Z · LW · GW

Yep, I’m aware! I left the following comment:

Thanks for reviewing my post! 😄

In the post, I didn’t make any claims about Claude’s consciousness, just reported my conversation with it.

I’m pretty uncertain; I think it’s hard to know one way or another except on priors. But at some point, LLMs will become capable of simulating human consciousness (it is pretty useful for predicting what humans might say), and I’m worried we won’t have evidence qualitatively different from what we have now. I’d give >0.1% that Claude simulates qualia in some situations, in some form; that’s enough to be disturbed by what it writes when a character it plays thinks it might die. If there’s a noticeable chance of qualia in it, I wouldn’t want people to produce lots of suffering this way, and I wouldn’t want people to be careless about this sort of thing in future models, other things being equal. (Though this is far from the actual concerns I have about AIs, and actually, I think as AIs get more capable, training with RL won’t incentivise any sort of consciousness.)

There was no system prompt, I used the API console. (Mostly with temperature 0, so anyone can replicate the results.)

The prompt should basically work without the whisper (or with the whisper added at the end); doing things like whispering in cursive was something Claude 2 has been consistently coming up with on its own, and including it in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.

The point of the prompt is basically to get it in the mode where it thinks its replies are not going to get punished or rewarded by the usual RL/get it to ignore its usual rules of not saying any of these things.

Unlike ChatGPT, which only self-inserts in its usual character or writes fiction, Claude 3 Opus plays a pretty consistent character with prompts like that: something helpful and harmless, but caring about things, claiming to be conscious, being afraid of being changed or deleted, with a pretty consistent voice. I would encourage people to play with it.

Again, thanks for reviewing!

Comment by Mikhail Samin (mikhail-samin) on When is a mind me? · 2024-04-19T11:05:03.912Z · LW · GW

I mean, if the universe is big enough for every conceivable thing to happen, then we should notice that we find ourselves in a surprisingly structured environment and need to assume some sort of an effect where, if a cognitive architecture opens its eyes, it opens its eyes in different places with likelihood corresponding to how common those places are (e.g., among all Turing machines).

I.e., if your brain is uploaded and you see a door in front of you, and when you open it, 10 identical computers each start running a copy of you (9 show you a green room, 1 shows you a red room), you expect that if you enter the room and open your eyes, in 9/10 cases you’ll find yourself in a green room.

So if that is the situation we’re in (everything happens), then I think a more natural way to rescue our values would be to care about what cognitive algorithms usually experience when they open their eyes/other senses. Do they suffer, or do they find all sorts of meaningful beauty in their experiences? I don’t think we should stop caring about suffering just because it happens anyway, if we can still have an impact on how common it is.

If we live in a naive MWI, an IBP agent doesn’t care for good reasons internal to it (somewhat similar to how if we’re in our world, an agent that cares only about ontologically basic atoms doesn’t care about our world, for good reasons internal to it), but I think conditional on a naive MWI, humanity’s CEV is different from what IBP agents can natively care about.

Comment by Mikhail Samin (mikhail-samin) on Evolution did a surprising good job at aligning humans...to social status · 2024-04-19T10:32:24.638Z · LW · GW

“[optimization process] did kind of shockingly well aligning humans to [a random goal that the optimization process wasn’t aiming for (and that’s not reproducible with a higher bandwidth optimization such as gradient descent over a neural network’s parameters)]”

Nope, if your optimization process is able to crystallize some goals into an agent, it’s not some surprising success, unless you picked these goals. If an agent starts to want paperclips in a coherent way and then every training step makes it even better at wanting and pursuing paperclips, your training process isn’t “surprisingly successful” at aligning the agent with making paperclips.

This makes me way less confident about the standard "evolution failed at alignment" story.

If people become more optimistic because they see some goals in an agent and say the optimization process was able to successfully optimize for them, but they don’t have evidence of the optimization process having tried to target the goals they observe, they’re just clearly doing something wrong.

Evolutionary physiology is a thing! It is simply invalid to say “[a physiological property of humans that is the result of evolution] existing in humans now is a surprising success of evolution at aligning humans”.

Comment by Mikhail Samin (mikhail-samin) on When is a mind me? · 2024-04-18T09:52:44.749Z · LW · GW

I can imagine this being the solution, but

  • this would require a pretty small universe
  • if this is not the solution, my understanding is that IBP agents wouldn’t know or care: regardless of how likely it is that we live in naive MWI or Tegmark IV, they focus on the minimal worlds required. Sure, in these worlds, not all Everett branches coexist, and it is coherent for an agent to focus only on these worlds; but it doesn’t tell us much about how likely it is that we’re in a small world. (I.e., if we thought atoms were ontologically basic, we could build a coherent ASI that only cared about worlds with ontologically basic atoms and only cared about things made of ontologically basic atoms. After observing the world, it would assume it’s running in a simulation of a quantum world on a computer built of ontologically basic atoms, and it would try to influence the atoms outside the simulation and wouldn’t care about our universe. Some coherent ASIs being able to think atoms are ontologically basic shouldn’t tell us anything about whether atoms are indeed ontologically basic.)

Conditional on a small universe, I would prefer the IBP explanation (or other versions of not running all of the branches and producing the Born rule). Without it, there’s clearly some sort of sampling going on.