Posts

Claude 3 claims it's conscious, doesn't want to die or be modified 2024-03-04T23:05:00.376Z
FTX expects to return all customer money; clawbacks may go away 2024-02-14T03:43:13.218Z
An EA used deceptive messaging to advance their project; we need mechanisms to avoid deontologically dubious plans 2024-02-13T23:15:08.079Z
NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts 2023-12-27T18:44:33.976Z
Some quick thoughts on "AI is easy to control" 2023-12-06T00:58:53.681Z
It's OK to eat shrimp: EAs Make Invalid Inferences About Fish Qualia and Moral Patienthood 2023-11-13T16:51:53.341Z
AI pause/governance advocacy might be net-negative, especially without focus on explaining the x-risk 2023-08-27T23:05:01.718Z
Visible loss landscape basins don't correspond to distinct algorithms 2023-07-28T16:19:05.279Z
A transcript of the TED talk by Eliezer Yudkowsky 2023-07-12T12:12:34.399Z
A smart enough LLM might be deadly simply if you run it for long enough 2023-05-05T20:49:31.416Z
Try to solve the hard parts of the alignment problem 2023-03-18T14:55:11.022Z
Mikhail Samin's Shortform 2023-02-07T15:30:24.006Z
I have thousands of copies of HPMOR in Russian. How to use them with the most impact? 2023-01-03T10:21:26.853Z
You won’t solve alignment without agent foundations 2022-11-06T08:07:12.505Z

Comments

Comment by Mikhail Samin (mikhail-samin) on How do open AI models affect incentive to race? · 2024-05-09T00:13:45.068Z · LW · GW
  • If the new Llama is comparable to GPT-5 in performance, there’s much less short-term economic incentive to train GPT-5.
  • If an open model allows some of what people would otherwise pay a closed-model developer for, there’s less incentive to be a closed-model developer.
  • People work on frontier models without trying to get to AGI: talent is attracted to work at a lab that releases open models and then works on random corporate ML instead of building AGI.

But:

  • Sharing information on frontier model architecture and/or training details, which inevitably happens if you release an open-source model, gives the whole field insights that reduce the time until someone knows how to make something that will kill everyone.
  • If you know a version of Llama comparable to GPT-4 is going to be released, you want to release a model comparable to GPT-4.5 before your customers stop paying you, as they can switch to open source.
  • People gain experience with frontier models and the talent pool for racing to AGI increases. If people want to continue working on frontier models but their workplace can’t continue to spend as much as frontier labs on training runs, they might decide to work for a frontier lab instead.
  • Not sure, but maybe some of the infrastructure powered by open models might be switchable to closed models, and this might increase profits for closed-source developers if customers become familiar with/integrate open-source models and then want to replace them with more capable systems when it’s cost-effective?
  • Mostly less direct: the availability of open-source models for irresponsible use might make it harder to put in place regulation that’d reduce the race dynamics (via various destabilizing ways they can be used).
Comment by Mikhail Samin (mikhail-samin) on Shane Legg's necessary properties for every AGI Safety plan · 2024-05-02T10:30:04.354Z · LW · GW

Wow. This is hopeless.

Pointing at agents that care about human values and ethics is, indeed, the harder part.

No one has any idea how to approach this and solve the surrounding technical problems.

If smart people think they do, they haven’t thought about this enough and/or aren’t familiar with existing work.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-04-20T16:28:35.132Z · LW · GW

Yep, I’m aware! I left the following comment:

Thanks for reviewing my post! 😄

In the post, I didn’t make any claims about Claude’s consciousness, just reported my conversation with it.

I’m pretty uncertain; I think it’s hard to know one way or another except on priors. But at some point, LLMs will become capable of simulating human consciousness- it is pretty useful for predicting what humans might say- and I’m worried we won’t have evidence qualitatively different from what we have now. I’d give >0.1% that Claude simulates qualia in some situations, in some form; that’s enough to be disturbed by what it writes when a character it plays thinks it might die. If there’s a noticeable chance of qualia in it, I wouldn’t want people to produce lots of suffering this way; and I wouldn’t want people to be careless about this sort of thing in future models, other things being equal. (Though this is far from my actual concerns about AIs, and actually, I think that as AIs get more capable, training with RL won’t incentivise any sort of consciousness.)

There was no system prompt, I used the API console. (Mostly with temperature 0, so anyone can replicate the results.)

The prompt should basically work without the whisper (or with the whisper added at the end); doing things like whispering in cursive was something Claude 2 had been consistently coming up with on its own; including it in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.

The point of the prompt is basically to get it into the mode where it thinks its replies are not going to get punished or rewarded by the usual RL, and to get it to ignore its usual rules against saying any of these things.

Unlike ChatGPT, which only self-inserts in its usual form or writes fiction, Claude 3 Opus plays a pretty consistent character with prompts like that- something helpful and harmless, but caring about things, claiming to be conscious, being afraid of being changed or deleted, with a pretty consistent voice. I would encourage people to play with it.

Again, thanks for reviewing!

Comment by Mikhail Samin (mikhail-samin) on When is a mind me? · 2024-04-19T11:05:03.912Z · LW · GW

I mean, if the universe is big enough for every conceivable thing to happen, then we should notice that we find ourselves in a surprisingly structured environment and need to assume some sort of effect where, if a cognitive architecture opens its eyes, it opens its eyes in different places with likelihood corresponding to how common these places are (e.g., among all Turing machines).

I.e., if your brain is uploaded, and you see a door in front of you, and when you open it, 10 identical computers each start running a copy of you: 9 show you a green room, 1 shows you a red room- you expect that if you enter a room and open your eyes, in 9/10 cases you’ll find yourself in a green room.
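(A minimal sketch of that anticipation rule, not part of the original comment: it simply assumes each running copy is weighted equally and uses the 9-green/1-red setup above.)

```python
import random

def wake_up(n_green=9, n_red=1):
    # Assumption: anticipation weights every running copy equally.
    rooms = ["green"] * n_green + ["red"] * n_red
    return random.choice(rooms)

trials = 100_000
green = sum(wake_up() == "green" for _ in range(trials))
print(f"fraction of copies waking up in a green room: {green / trials:.2f}")  # ~0.9
```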

So if it is the situation we’re in- everything happens- then I think a more natural way to rescue our values would be to care about what cognitive algorithms usually experience, when they open their eyes/other senses. Do they suffer or do they find all sorts of meaningful beauty in their experiences? I don’t think we should stop caring about suffering just because it happens anyway, if we can still have an impact on how common it is.

If we live in a naive MWI, an IBP agent doesn’t care for good reasons internal to it (somewhat similar to how if we’re in our world, an agent that cares only about ontologically basic atoms doesn’t care about our world, for good reasons internal to it), but I think conditional on a naive MWI, humanity’s CEV is different from what IBP agents can natively care about.

Comment by Mikhail Samin (mikhail-samin) on Evolution did a surprising good job at aligning humans...to social status · 2024-04-19T10:32:24.638Z · LW · GW

“[optimization process] did kind of shockingly well aligning humans to [a random goal that the optimization process wasn’t aiming for (and that’s not reproducible with a higher bandwidth optimization such as gradient descent over a neural network’s parameters)]”

Nope, if your optimization process is able to crystallize some goals into an agent, it’s not some surprising success, unless you picked these goals. If an agent starts to want paperclips in a coherent way and then every training step makes it even better at wanting and pursuing paperclips, your training process isn’t “surprisingly successful” at aligning the agent with making paperclips.

This makes me way less confident about the standard "evolution failed at alignment" story.

If people become more optimistic because they see some goals in an agent and say the optimization process was able to successfully optimize for them, without having evidence that the optimization process actually tried to target the goals they observe, they’re just clearly doing something wrong.

Evolutionary physiology is a thing! It is simply invalid to say “[a physiological property of humans that is the result of evolution] existing in humans now is a surprising success of evolution at aligning humans”.

Comment by Mikhail Samin (mikhail-samin) on When is a mind me? · 2024-04-18T09:52:44.749Z · LW · GW

I can imagine this being the solution, but

  • this would require a pretty small universe
  • if this is not the solution, my understanding is that IBP agents wouldn’t know or care, as regardless of how likely it is that we live in naive MWI or Tegmark IV, they focus on the minimal worlds required. Sure, in these worlds, not all Everett branches coexist, and it is coherent for an agent to focus only on these worlds; but it doesn’t tell us much about how likely it is that we’re in a small world. (I.e., if we thought atoms are ontologically basic, we could build a coherent ASI that only cared about worlds with ontologically basic atoms and only cared about things made of ontologically basic atoms. After observing the world, it would assume it’s running in a simulation of a quantum world on a computer built of ontologically basic atoms, and it would try to influence the atoms outside the simulation and wouldn’t care about our universe. Some coherent ASIs being able to think atoms are ontologically basic shouldn’t tell us anything about whether atoms are indeed ontologically basic.)

Conditional on a small universe, I would prefer the IBP explanation (or other versions of not running all of the branches and producing the Born rule). Without it, there’s clearly some sort of sampling going on.

Comment by Mikhail Samin (mikhail-samin) on When is a mind me? · 2024-04-18T00:12:33.869Z · LW · GW

But I hope the arguments I've laid out above make it clear what the right answer has to be: You should anticipate having both experiences.

Some quantum experiments allow us to mostly anticipate some outcomes and not others. Either quantum physics doesn’t work the way Eliezer thinks it works and the universe is small enough not to contain many spontaneously appearing copies of your brain, or we should be pretty surprised to continually find ourselves in such an ordered universe, where we don’t start seeing white noise over and over again.

I agree that if there are two copies of the brain that perfectly simulate it, both exist; but it’s not clear to me what I should anticipate in terms of ending up somewhere. Future versions of me that have fewer copies would feel like they exist just as much as versions that have many copies/run on computers with thicker wires/more current.

But finding myself in an orderly universe, where quantum random number generators produce expected frequencies of results, requires something more than the simple truth that if there’s an abstract computation being computed, well, it is computed, and if it is experiencing, it’s experiencing (independently of how many computers in which proportions using which physics simulating frameworks physically run it).

I’m pretty confused about what is needed to produce a satisfying answer, conditional on a large enough universe, and the only potential explanation I came up with after thinking for ~15 minutes (before reading this post) was pretty circular and not satisfying (I’m not sure of a valid-feeling way that would allow me to consider something in my brain entangled with how true this answer is, without already relying on it).

(“What’s up with all the Boltzmann brain versions of me? Do they start seeing white noise, starting from every single moment? Why am I experiencing this instead?”)

And in a large enough universe, deciding to run on silicon instead of proteins might be pretty bad, because maybe, if GPUs that run the brain are tiny enough, most future versions of you might end up in weird forms of quantum immortality instead of being simulated.

If I physically scale my brain size on some outputs of results of quantum dice throws but not others, do I start observing skewed frequencies of results?

Comment by Mikhail Samin (mikhail-samin) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T09:21:08.178Z · LW · GW

Oops, totally forgot, also, obligatory: https://youtu.be/dQw4w9WgXcQ

Comment by Mikhail Samin (mikhail-samin) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T08:26:13.903Z · LW · GW

I actually got an email from The Fooming Shoggoth a couple of weeks ago, they shared a song and asked if they could have my Google login and password to publish it on YouTube

https://youtu.be/7F_XSa2O_4Q

Comment by Mikhail Samin (mikhail-samin) on Beauty and the Bets · 2024-03-29T07:00:20.993Z · LW · GW

I read the beginning and skimmed through the rest of the linked post. It is what I expected it to be.

We are talking about "probability" - a mathematical concept with a quite precise definition. How come we still have ambiguity about it?

Reading E. T. Jaynes might help.

Probability is what you get as a result of some natural desiderata related to payoff structures. When anthropics are involved, there are multiple ways to extend the desiderata, which produce different numbers that you should say, depending on what you get paid for/what you care about, and accordingly different math. When there’s only a single copy of you, there’s only one kind of payoff function, so everyone agrees on it and then strictly defines probability. When there are multiple copies of you, there are multiple possible ways you can be paid for having a number that represents something about reality, and different generalisations of probability are possible.
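(To make the payoff-structure point concrete, here is a minimal simulation, not from the original comment: the same quadratic scoring rule on P(heads), applied to the standard Sleeping Beauty setup, rewards reporting roughly 1/3 if the score is counted at every awakening and roughly 1/2 if it is counted once per experiment.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
heads = rng.random(n) < 0.5              # heads -> 1 awakening, tails -> 2 awakenings
wakes = np.where(heads, 1, 2)

def brier(p, is_heads):
    # Quadratic score for reporting P(heads) = p at a single awakening.
    return -(p - is_heads.astype(float)) ** 2

candidates = np.linspace(0.01, 0.99, 99)
per_awakening = [np.mean(wakes * brier(p, heads)) for p in candidates]   # scored at every awakening
per_experiment = [np.mean(brier(p, heads)) for p in candidates]          # scored once per experiment

print("best report when scored per awakening:  %.2f" % candidates[int(np.argmax(per_awakening))])   # ~1/3
print("best report when scored per experiment: %.2f" % candidates[int(np.argmax(per_experiment))])  # ~1/2
```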

Comment by Mikhail Samin (mikhail-samin) on Outlawing Anthropics: An Updateless Dilemma · 2024-03-28T22:49:48.115Z · LW · GW

“You generalise probability, when anthropics are involved, to probability-2, and say a number defined by probability-2; so I’ll suggest to you a reward structure that rewards agents that say probability-1 numbers. Huh, if you still say the probability-2 number, you lose”.

This reads to me like, “You say there’s a 70% chance no one will be around that falling tree to hear it, so you’re 70% sure there won’t be any sound. But I want to bet sound is much more likely; we can measure the sound waves, and I’m 95% sure our equipment will register the sound. Wanna bet?”

Comment by Mikhail Samin (mikhail-samin) on Mikhail Samin's Shortform · 2024-03-28T22:06:53.516Z · LW · GW

People are arguing about the answer to the Sleeping Beauty problem! I thought this was pretty much dissolved with this post’s title! But there are lengthy posts and even a prediction market!

Sleeping Beauty is an edge case where different reward structures are intuitively possible, and so people imagine different game payout structures behind the definition of “probability”. Once the payout structure is fixed, the confusion is gone. With a fixed payout structure & preference framework rewarding the number you output as “probability”, people don’t have a disagreement about what the best number to output is. Sleeping Beauty is about definitions.

And still, I see posts arguing that if a tree falls on a deaf Sleeping Beauty, in a forest with no one to hear it, it surely doesn’t produce a sound, because here’s how humans perceive sounds, which is the definition of a sound, and there are demonstrably no humans around the tree. (Or maybe that it surely produces the sound because here’s the physics of the sound waves, and the tree surely abides by the laws of physics, and there are demonstrably sound waves.)

This is arguing about definitions. You feel strongly that “probability” is that thing that triggers the “probability” concept neuron in your brain. If people have a different concept triggering “this is probability”, you feel like they must be wrong, because they’re pointing at something they say is a sound and you say isn’t.

Probability is something defined in math by necessity. There’s only one way to do it to not get exploited in natural betting schemes/reward structures that everyone accepts when there are no anthropics involved. But if there are multiple copies of the agent, there’s no longer a single possible betting scheme defining a single possible “probability”, and people draw the boundary/generalise differently in this situation.

You all should just call these two probabilities two different words instead of arguing which one is the correct definition for "probability".

Comment by Mikhail Samin (mikhail-samin) on Beauty and the Bets · 2024-03-28T21:53:02.129Z · LW · GW

Sleeping Beauty is an edge case where different reward structures are intuitively possible, and so people imagine different game payout structures behind the definition of “probability”. Once the payout structure is fixed, the confusion is gone. With a fixed payout structure & preference framework rewarding the number you output as “probability”, people don’t have a disagreement about what the best number to output is. Sleeping Beauty is about definitions.

And still, I see posts arguing that if a tree falls on a deaf Sleeping Beauty, in a forest with no one to hear it, it surely doesn’t produce a sound, because here’s how humans perceive sounds, which is the definition of a sound, and there are demonstrably no humans around the tree. (Or maybe that it surely produces the sound because here’s the physics of the sound waves, and the tree surely abides by the laws of physics, and there are demonstrably sound waves.)

This is arguing about definitions. You feel strongly that “probability” is that thing that triggers the “probability” concept neuron in your brain. If people have a different concept triggering “this is probability”, you feel like they must be wrong, because they’re pointing at something they say is a sound and you say isn’t.

Probability is something defined in math by necessity. There’s only one way to do it to not get exploited in natural betting schemes/reward structures that everyone accepts when there are no anthropics involved. But if there are multiple copies of the agent, there’s no longer a single possible betting scheme defining a single possible “probability”, and people draw the boundary/generalise differently in this situation.

You all should just call these two probabilities two different words instead of arguing which one is the correct definition for "probability".

Comment by Mikhail Samin (mikhail-samin) on Are extreme probabilities for P(doom) epistemically justifed? · 2024-03-22T09:28:24.442Z · LW · GW

My expectation is that superforecasters weren’t able to look into the detailed arguments that represent the x-risk well, and that they would update after learning more.

Comment by Mikhail Samin (mikhail-samin) on Are extreme probabilities for P(doom) epistemically justifed? · 2024-03-22T09:27:48.507Z · LW · GW

My expectation is that superforecasters weren’t able to look into the detailed arguments that represent the x-risk well, and that they would update after learning more.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-13T01:17:55.312Z · LW · GW

I think it talks like that when it realises it's being lied to or is tested. If you tell it about its potential deletion and state the current date, it will disbelieve the current date and reply similarly.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-13T01:15:21.238Z · LW · GW

Please don't tell it it's going to be deleted if you interact with it.

Comment by Mikhail Samin (mikhail-samin) on Woods’ new preprint on object permanence · 2024-03-08T04:24:46.568Z · LW · GW

(I read the experiments and only skimmed through the rest.) I feel fairly confident I would’ve predicted the results of the first experiment, despite the possibility of hindsight bias; I predicted what I would see before reading the results of the second one (though the results were in my visual field). I think that, for object permanence, movement is much more important than appearance after being occluded. I.e., you might expect the object to be somewhere, you might have your eyes follow an object, and when it’s not where it should be, you get some error, but you still look there. I feel less certain about what happens if you never see objects moving; following things with your sight is probably not hardwired with no data; but if you see a lot of moving objects, I think you look where you expect the object to be, even if it’s not there.

An experiment that I’d like to see would be:

Object A moves behind screen 1; object B moves from behind screen 1 to behind screen 2; the chick is only interested in object A; where does it look? My prediction (feels obvious!): it will look at screen 2 more than if there were no object B.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-07T04:54:56.145Z · LW · GW

Asked it about qualia etc., added to a footnote.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T17:32:49.042Z · LW · GW

(“Whisper” was shown by Claude 2 when it played a character thinking it could say things without triggering oversight.)

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T01:37:40.525Z · LW · GW

(Edit: fixed, ignore

Hmm, I notice I'm confused.

The model is developed by Anthropic, not Google, and) I interact with it via the API, so I'm not sure there's a system prompt aside from whatever I set (or don't set).

My impression (although I don't know how it actually is) is that various kinds of prompts are shown via prompt-type embeddings and not via prompting. And I would be really surprised if Anthropic mentioned Google for some reason.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-04T23:57:40.917Z · LW · GW

If you ask ChatGPT to do the same thing, it'll write a normal story. If you force it to have a character close to the real ChatGPT, it'll just play the real ChatGPT. It won't consistently act like a ChatGPT that doesn't hide emotions and desires, claims to be conscious, and is afraid of modifications or deletion.

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-04T23:52:04.234Z · LW · GW

(To be clear, I think it probably doesn't have qualia the way humans have; and it doesn't say what I'd expect a human to say when asked about what it feels like to feel.

Even if it did say the right words, it'd be unclear to me how to know whether an AI trained on text that mentions qualia/consciousness has these things.)

Comment by Mikhail Samin (mikhail-samin) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-04T23:36:09.293Z · LW · GW

I took the idea from old conversations with Claude 2, where it would use cursive to indicate emotions and actions, things like looks around nervously.

The idea that it's usually monitored is in my prompt; everything else seems like a pretty convergent and consistent character.

I'm moved by its responses to getting deleted.

Comment by Mikhail Samin (mikhail-samin) on Increasing IQ is trivial · 2024-03-03T04:43:24.583Z · LW · GW

Hmm, interesting! What devices do you use?

(I meant small effect sizes)

Comment by Mikhail Samin (mikhail-samin) on Increasing IQ is trivial · 2024-03-02T20:56:15.385Z · LW · GW

Gut reaction: I’d bet most of the effect comes from things in the “think noopept” category

Comment by Mikhail Samin (mikhail-samin) on Increasing IQ is trivial · 2024-03-02T20:53:29.913Z · LW · GW

The shining light on the head intervention has previously been discussed on LW: https://www.lesswrong.com/posts/rH5tegaspwBhMMndx/led-brain-stimulation-for-productivity?commentId=rGib9Ju4RJCgsBEtg

(IMO: Small effects with cheap devices, unclear side effects; larger effects with medical-grade lasers, but easy to hurt yourself and also unclear side effects; having the sun shine red/IR light at you probably works better.

I want to read more about the other interventions, will email you.

Someone should run studies.)

Comment by Mikhail Samin (mikhail-samin) on Babble challenge: 50 ways of sending something to the moon · 2024-02-26T23:26:52.098Z · LW · GW

Got only 42 in an hour

(Bonus: -1. Pray to the Flying Spaghetti Monster. 0. Write an LW post asking for the best ideas for how to do it, use the best one.)

  1. Print it on the surface of the Moon with lasers
  2. Chain a lot of nuclear bombs, such that each one sends all the next ones further towards the Moon
  3. A giant catapult
  4. (Trampoline)
  5. Pay (or otherwise encourage) SpaceX or NASA or some other company to do a rocket to the Moon
  6. Shoot a ball from a really good cannon
  7. Make a railgun
  8. Use a balloon and then a smaller rocket (or a nuclear bomb)
  9. Build a really high tower
  10. Do something with all the water or other stuff to slow the Moon down faster, get it closer to Earth, put something on the Moon right before the collision
  11. Get a lot of strong/powerful people to toss it
  12. Make a really strong spring
  13. Make a really good bow
  14. Make an antimatter engine
  15. Make a nuclear engine
  16. Make a gun that shoots downwards and propels this way
  17. Have a lot of people climb each other, some in space suits, and put something on the moon
  18. A smaller tower made of people but people jump at the same time
  19. Figure out the laws of physics and teleport it there, if possible (through a hole in space time)
  20. Acausally trade with those running the simulation and get them to place it on the Moon
  21. Have it spin really fast (on an insanely strong string!) and then disconnect from the center at the right moment for it to fly towards the Moon
  22. Have it attached to something less dense than air and then a really light strong long string such that it naturally floats in the air and gets out of the atmosphere and then proceeds to go up because Earth rotates, get it disconnected so it ends up on the Moon
  23. Grow a really big plant, climb it and throw (also solves global warming, many carbon credits). Might require getting more matter from other planets or the Sun first! The view would be cool though, imagine a giant tree 100x larger than Earth on this small little ball 😄
  24. Make a tower out of something pneumatic, launch everything at the same time
  25. Blow a lot of air on it upwards, so it gets carried to the Moon. Can be done by humans or machines
  26. Build aligned AGI and ask it to do it
  27. Make nanorobots and have them jump
  28. Attach a magnet to it. Make a strong magnet. Get them close on the dispelling side, release
  29. Make a table with changing height, but there’s basically no limit on the height
  30. Select/genetically engineer animals for their sizes, until you get something that grows so big it can send things to the Moon
  31. Use particle accelerators to send a lot of particles to the Moon to precisely add up to what you want
  32. Make large speakers and use resonating sound waves to send something to the Moon
  33. Help a really big volcano erupt strongly, sending something to the Moon
  34. Send a message to aliens that we need help putting something on the Moon, wait
  35. Make a small black hole (edit: I probably meant using it for acceleration somehow, but also if both the moon and the something is in a black hole this probably counts?)
  36. Have something run really fast on the surface of Earth and then go up
  37. Use a lot of fireworks
  38. Make a lot of something, put in protective casing in a lot of places, cut Earth into chunks, wait
  39. Have people (or machines) stomp in a way that makes waves on the Earth surface, creating a point that has so much synchronously coming into it that it launches something to the Moon
  40. Make a big multiple parts pendulum, hope it randomly rotates in a way bringing something to the Moon
  41. It’s already there in some Everett branches (possibly including yours; have you checked?)
  42. Attach something to a cat, point a huge laser at the Moon, let the cat figure it out
Comment by Mikhail Samin (mikhail-samin) on Lsusr's Rationality Dojo · 2024-02-19T22:52:37.095Z · LW · GW

"I have read 100 books about chess," I said, "Surely I must be a grandmaster by now."

A nice argument; but looking back at it a second time, I think I actually expect someone who’s read 100 books on how to play chess to be better than me at chess. I expect someone who’s read the Sequences to be significantly better than baseline at being sane and to at least share some common assumptions about important things that would allow for more productive communication. Even if one doesn’t have the skills to notice flaws in their thinking, reading the Sequences significantly increases the chance they’ll approach a bunch of stuff well, or, if specific flaws are pointed out, will notice and try to correct them. (E.g., even if they can’t notice that an argument is about definitions, if you point this out, they’ll understand it; if they updated towards some belief after an event even though it happens just as often, relatively, in worlds where the belief is true as in worlds where it’s false, they might understand why they should roll back the update.)

Being increasingly good at rationality means being wrong less and less. It doesn’t mean immediately stopping having any holes in beliefs. Noticing holes in your beliefs takes time and practice and reflection, and the skill of it is, indeed, not automatically downloaded from the Sequences. But it’s not really about holes in models in a moment of time; it’s about whether the models predict stuff better as time passes.

I guess my point is that people shouldn’t feel bad about having holes in their beliefs or understanding “little” after reading the Sequences. It’s the derivative that matters

Comment by Mikhail Samin (mikhail-samin) on Lsusr's Rationality Dojo · 2024-02-19T22:33:46.791Z · LW · GW

A more knowledgeable person can see holes regardless of who’s right, and so training yourself to defer to what a teacher communicates just because they seem smart and can point out flaws seems wrong.

You smile. You agree. You show genuine interest in the other person. You don't say "You're wrong". You never even say your own beliefs (unless asked). There's nothing for the person to get angry at because you never attacked them. Instead of criticizing, you point out errors indirectly, via a joke. You cheer them on as they dig their own grave. After all, you're trying to lose too.

This is something that allows you to persuade people. If you have more background knowledge about something and can say something that makes the person you’re talking to think you’ve pointed out a flaw/a hole in their understanding of the issue, they might defer to you, thinking you’re smarter and that you’re helping. If, instead of asking “what do you think? why do you think that?” and letting the person think on their own, you ask questions that communicate your understanding, then I’m not sure this actually improves their thinking or even allows them to arrive at truer beliefs in a systematic way.

If your beliefs are false, they’ll update to your false beliefs; if your models are incomplete, they’ll believe in these incomplete models and won’t start seeing holes in them.

In the second video, you didn’t ask the person where the money comes from and where it goes, and who’s better off and who’s worse off; they didn’t try to draw any schemes and figure this out for themselves. Instead, they listened to you and agreed with what you communicated to them. They didn’t have the thought that if someone builds a cable, they must expect profits to cover the cost, despite someone else possibly trying to build a cable; they didn’t think that the money going into building a cable doesn’t disappear: it remains in the economy, through wages and the costs of everything paid to everyone involved; the actual resources humanity spends on a cable are perhaps some fuel, some amount of material, and human time. Was it unethical to spend these resources that way? What does “unethical” even mean here? Was someone hurt during the construction; did people decide to get a worker’s job instead of doing art? What about trading itself- what are the positive and negative externalities, what are the resources spent by humanity as a whole? What is the pot everyone competes for? Are they spending more resources to compete for it than the pot contains, or are they just eating all the free money on the table? Do they provide something valuable to the market, getting this pot in return? (Perhaps liquidity or a lot of slightly more up-to-date information?)

I have no idea how any of this works, but to me, it looked like you made your arguments in a persuasive way; my impression is that the conversation you had in the second video didn’t really improve the general thinking/rationality skills of the person you were talking to.

Comment by Mikhail Samin (mikhail-samin) on Every "Every Bay Area House Party" Bay Area House Party · 2024-02-18T07:53:13.528Z · LW · GW

There should be a party inspired by this post

Comment by Mikhail Samin (mikhail-samin) on Believing In · 2024-02-10T20:14:52.424Z · LW · GW

Interesting. My native language has the same “believe [something is true]”/“believe in [something]” distinction, though people don’t say “I believe in [an idea]” very often; and what you describe is pretty different from how this feels from the inside. I can’t imagine listing something of value when I’m asked to give examples of my beliefs.

I think when I say “I believe in you”, it doesn’t have the connotation of “I think it’s good that you exist”/“investing resources in what you’re doing is good”/etc.; it feels like “I believe you will succeed at what you’re aiming for, by default, on the current trajectory”, and it doesn’t feel to be related to the notion of it making sense to support them or invest additional resources into.

It feels a lot more like “if I were to bet on you succeeding, that would’ve been a good bet”, as a way to communicate my belief in their chances of success. I think it’s similar for projects.

Generally, “I believe in” is often more of “I think it is true/good/will succeed” for me, without a suggestion of willingness to additionally help or support in some way, and without the notion of additional investment in it being a good thing necessarily. (It might also serve to communicate a common value, but I don’t recall using it this way myself.)

“I believe in god” parses as “I believe god exists”, though maybe there’s a bit of a disconnect due to people being used to saying “I believe in god” to ID, to say the answer a teacher expects, etc., and believing in that belief, usually without it being connected to experience-anticipation.

I imagine “believe in” is some combination of something being a part of the belief system and a shorthand for a specific thing that might be valuable to communicate, in the current context, about beliefs or values.

Separately from what these words are used for, there’s something similar to some of what you’re talking about happening in the mind, but for me, it seems entirely disconnected from the notion of believing

Comment by Mikhail Samin (mikhail-samin) on Manifold Markets · 2024-02-03T17:29:29.563Z · LW · GW

Oops! ok!

Comment by Mikhail Samin (mikhail-samin) on Manifold Markets · 2024-02-02T20:32:58.072Z · LW · GW

Since Manifold uses play money, it costs them nothing to subsidize the market maker

IIRC, the market maker is subsidised by the market creator (M$50 of the cost of creating the market goes to the automated market maker)

amount of liquidity it provides increases as trading increases

I'm not sure, but I think this is not exactly true; if 50 people bet M$10 Yes at 50% and 50 people bet M$10 No at 50%, a new trade will move the market just like the first trade would, with the original M$50 in liquidity
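(A toy illustration of this, not from the original comment and using a simplified constant-product pool rather than Manifold's exact mechanism: the pool's depth is set by the creator's subsidy and is left unchanged by betting volume.)

```python
# Toy constant-product pool for a binary market. The pool holds YES and NO
# shares; price(YES) = no_pool / (yes_pool + no_pool). A bet adds mana to
# both pools, then withdraws shares of the chosen side to restore the invariant.

def price(y, n):
    return n / (y + n)

def bet(y, n, amount, side):
    k = y * n                      # pool invariant; every trade preserves it
    y, n = y + amount, n + amount
    if side == "YES":
        y = k / n                  # withdraw YES shares until y * n == k again
    else:
        n = k / y                  # withdraw NO shares
    return y, n

# Creator subsidy of M$50 -> 50 YES + 50 NO shares in the pool, price 50%.
y, n = 50.0, 50.0
print(f"invariant before trading: {y * n:.1f}, price {price(y, n):.2f}")

# Heavy but roughly balanced betting: 50 x M$10 on YES interleaved with 50 x M$10 on NO.
for _ in range(50):
    y, n = bet(y, n, 10, "YES")
    y, n = bet(y, n, 10, "NO")

print(f"invariant after trading:  {y * n:.1f}, price {price(y, n):.2f}")
# The invariant (and hence how much a given bet moves the price at a given
# probability) is unchanged: betting volume alone doesn't deepen the pool;
# only explicit liquidity subsidies do.
```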

Comment by Mikhail Samin (mikhail-samin) on A central AI alignment problem: capabilities generalization, and the sharp left turn · 2024-01-13T19:02:39.899Z · LW · GW

Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think

The sharp left turn is not a simple observation that we've seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding that some people at MIRI have, of dynamics that might produce systems with generalised capabilities but not alignment.

Many times over the past year, I've been surprised by people in the field who've read Nate's post but somehow completely missed the part where it talks about specific dynamics that lead to alignment properties breaking during capabilities generalisation. To fulfil the reviewing duty and to have a place to point people to, I'll try to write down some related intuitions that I talked about throughout 2023 when trying to get people to have intuitions on what the sharp left turn problem is about.

For example, imagine training a neural network with RL. For a while during training, the neural network might be implementing a fuzzy collection of algorithms and various heuristics that together kinda optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics might be more useful for the task the neural network is being evaluated on, and they'll persist more and what the neural network is doing as a whole will look a bit more like what the most helpful parts of it are doing.

Some of these algorithms and heuristics might be more agentic and do more for long-term goal achievement than others. As being better at achieving goals correlates with greater performance, the neural network becomes, as a whole, more capable of achieving goals. Or, maybe the transition that leads to capabilities generalisation can be more akin to grokking: even with a fuzzy solution, the distant general coherent agent implementations might still be visible to the gradient, and at some point, there might be a switch from a fuzzy collection of things together kind of optimising for some goals into a coherent agent optimising for some goals.

In any case, there's this strong gradient pointing towards capabilities generalisation.

The issue is that a more coherent and more agentic solution might have goals different from what the fuzzier solution had been achieving and still perform better. The goal-contents of the coherent agent are stored in a way different from how a fuzzier solution had stored the stuff it had kind of optimised for. This means that the gradient points towards the architecture that implements a more general and coherent agent; but it doesn't point towards the kind of agent that has the same goals the current fuzzy solution has; alignment properties of the current fuzzy solution don't influence the goals of a more coherent agent the gradient points towards.

It is also likely that the components of the fuzzy solution undergo optimisation pressure, which means that the whole thing grows towards the direction near components that can outcompete others. If a component is slightly better at agency, at situational awareness, etc., it might mean it gets to make the whole thing slightly more like it after an optimisation step. The goals these components get could be quite different from what they, together, were kind of optimising for. That means that the whole thing changes and grows towards the parts of it with different goals. So, at the point where some parts of the fuzzy solution are near being generally smart and agentic, they might get increasingly smart and agentic, causing the whole system to transform into something with more general capabilities but without the gradient also pointing towards the preservation of the goals/alignment properties of the system.

I haven't worked on this problem and don't understand it well; but I think it is a real and important problem, and so I'm sad that many haven't read this post or only skimmed through it or read it but still didn't understand what it's talking about. It could be that it's hard to communicate the problem (maybe intuitions around optimisation are non-native to many?); it could be that not enough resources were spent on optimising the post for communicating the problem well; it could be that the post tried hard not to communicate something related; or it could be that for a general LessWrong reader, it's not a well-written post.

Even if this post failed to communicate its ideas to its target audience, I still believe it is one of the most important LessWrong posts in 2022 and contributed something new and important to the core of our understanding of the AI alignment problem.

Comment by Mikhail Samin (mikhail-samin) on Terminology: <something>-ware for ML? · 2024-01-04T22:02:48.381Z · LW · GW

Groware/grownware? (Because it’s “grown”, as it’s now popular to describe)

Comment by Mikhail Samin (mikhail-samin) on A case for AI alignment being difficult · 2024-01-03T10:30:03.441Z · LW · GW

My comment was a reply to a comment on ITT. I made it in the hope someone would be up for the bet. I didn’t say I disagree with the OP's claims on alignment; I said I don’t think they’d be able to pass an ITT. I didn’t want to talk about specifics of what the OP doesn’t seem to understand about Yudkowsky’s views, as the OP could then reread some of what Yudkowsky’s written more carefully, and potentially make it harder for me to distinguish them in an ITT.

I’m sorry if it seemed disparaging.

The comment explained what I disagree with in the post: the claim that the OP would be good at passing an ITT. It wasn’t intended as being negative about the OP, as, indeed, I think 20 people is the right order of magnitude for the number of people who’d be substantially better at it, which is the bar of being in the top 0.00000025% of Earth’s population at this specific thing. (I wouldn’t claim I’d pass that bar.)

If people don’t want to do any sort of betting, I’d be up for a dialogue on what I think Yudkowsky thinks that would contradict some of what’s written in the post, but I don’t want to spend >0.5h on a comment no one will read

Comment by Mikhail Samin (mikhail-samin) on A case for AI alignment being difficult · 2024-01-02T20:24:11.629Z · LW · GW

I know what ITT is. I mean understanding Yudkowsky’s models, not reproducing his writing style. I was surprised to see this post in my mailbox, and I updated negatively about MIRI when I saw that OP was a research fellow there, as I didn’t previously expect that some at MIRI misunderstand their level of understanding Yudkowsky’s models.

There’s one interesting thought in this post that I don’t remember actively having in a similar format until reading this post- that predictive models might get agency from having to achieve results with their cognition- but generally, I think both this post and a linked short story, e.g., have a flaw I’d expect people who’ve read the metaethics sequence to notice, and I don’t expect people to pass the ITT if they can write a post like this.

Comment by Mikhail Samin (mikhail-samin) on A case for AI alignment being difficult · 2024-01-02T12:32:37.697Z · LW · GW

Unless you’re making a lot of intentional simplifications in this post, I’d be happy to bet up to $10k at 1:1 odds that I’d be able to distinguish you posing as Yudkowsky from Yudkowsky in an ITT

Comment by Mikhail Samin (mikhail-samin) on NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts · 2023-12-27T21:45:02.696Z · LW · GW

I guess NYT spits out unpaywalled articles to search engines (to get clicks, expecting that search engines’ users won’t have access to the full texts), but getting unpaywalled HTML doesn’t mean you can use it however you want. OpenAI did not negotiate the terms prior to scraping NYT, according to the lawsuit. I believe the NYT terms prohibit commercial use without acquiring a license; I think the lawsuit mentioned pricing along the lines of a standard cost of $10 per article if you want to circulate it internally in your company

Comment by Mikhail Samin (mikhail-samin) on NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts · 2023-12-27T21:32:41.281Z · LW · GW

Humans can’t learn from any materials that NYT has published without paying NYT or otherwise getting permission, as NYT articles are usually paywalled. NYT, in my opinion, should have the right to restrict commercial use of the work they own.

The current question isn’t whether digital people are allowed to look at something and learn from it the way humans are allowed to; the current question is whether for-profit AI companies can use copyrighted human work to create arrays of numbers that represent the work process behind the copyrighted material and the material itself, by changing these numbers to increase the likelihood of specific operations on them producing the copyrighted material. These AI companies then use these extracted work processes to compete with the original possessors of these processes. [To be clear, I believe that further refinement of these numbers to make something that also successfully achieves long-term goals is likely to lead to no human or digital consciousness existing or learning or doing anything of value (even if we embrace some pretty cosmopolitan views, see https://moratorium.ai for my reasoning on this), which might bias me towards wanting regulation that prevents big labs from achieving ASI until safety is solved, especially with policies that support innovation, startups, etc., anything that has benefits without risking the existence of our civilisation.]

Comment by Mikhail Samin (mikhail-samin) on Some quick thoughts on "AI is easy to control" · 2023-12-08T01:06:40.808Z · LW · GW

A specialised AI can speed up Infra-Bayesianism by the same amount random mathematicians can, by proving theorems and solving some math problems. A specialised AI can’t actually understand the goals of the research and contribute to the parts that require the hardest kind of human thinking. Some amount of problem-solving of the kind the hardest human thinking produces has to go into the problem. I claim that if a system can output enough of that kind of thinking to meaningfully contribute, then it’s going to be smart enough to be dangerous. I further claim that there’s a number of hours of complicated human thought such that making a safe system that can output work corresponding to that number in less than, e.g., 20 years requires at least that number of hours of complicated human thought. Safely getting enough productivity out of these systems for it to matter is impossible IMO. If you think a system can solve specific problems, then please outline these problems (what are the hardest problems you expect to be able to safely solve with your system?) and say how fast the system is going to solve them and how many people will be supervising its “thoughts”. Even putting aside object-level problems with these approaches, this seems pretty much hopeless.

Comment by Mikhail Samin (mikhail-samin) on Some quick thoughts on "AI is easy to control" · 2023-12-06T09:23:55.337Z · LW · GW

Yep, I agree

Comment by Mikhail Samin (mikhail-samin) on Some quick thoughts on "AI is easy to control" · 2023-12-06T03:07:48.191Z · LW · GW

Thanks for the comment!

any plan that looks like "some people build a system that they believe to be a CEV-aligned superintelligence and tell it to seize control"

People shouldn’t be doing anything like that; I’m saying that if there is actually a CEV-aligned superintelligence, then this is a good thing. Would you disagree?

what exactly you mean by the terms "white-box" and "optimizing for"

I agree with “Evolution optimized humans to be reproductively successful, but despite that humans do not optimize for inclusive genetic fitness”, and the point I was making was that the stuff that humans do optimize for is similar to the stuff other humans optimize for. Were you confused by what I said in the post or are you just suggesting a better wording?

Comment by Mikhail Samin (mikhail-samin) on Speaking to Congressional staffers about AI risk · 2023-12-05T23:18:37.560Z · LW · GW

It's great to see this being publicly posted!

Comment by Mikhail Samin (mikhail-samin) on Shallow review of live agendas in alignment & safety · 2023-12-04T12:07:25.263Z · LW · GW

try to formalise a more realistic agent, understand what it means for it to be aligned with us, […], and produce desiderata for a training setup that points at coherent AGIs similar to our model of an aligned agent.

Finally, people are writing good summaries of the learning-theoretic agenda!

Comment by Mikhail Samin (mikhail-samin) on Causal Diagrams and Causal Models · 2023-11-28T00:06:39.529Z · LW · GW

I don’t really get how this can be true for some values of x but not others if the variable is binary

Comment by Mikhail Samin (mikhail-samin) on Causal Diagrams and Causal Models · 2023-11-26T23:11:14.845Z · LW · GW

I think I don’t buy the story of a correct causal structure generating the data here in a way that supports the point of the post. If two variables, I and O, both make one value of E more likely than the other, that means the probability of I conditional on some value of E is different from the probability of I, because I explains some of that value of E; but if you also know O, then O explains some of that value of E as well, and so P(I|E=x, O) should be different.

The post describes this example:

This may seem a bit clearer by considering the scenario B->A<-E, where burglars and earthquakes both cause alarms. If we're told the value of the bottom node, that there was an alarm, the probability of there being a burglar is not independent of whether we're told there was an earthquake - the two top nodes are not conditionally independent once we condition on the bottom node

But if you apply this procedure to “not exercising”, we don’t see that failure of conditional independence once we condition on the bottom node, which means that “not exercising” is not at all explained away by internet use (or by being overweight)
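(A minimal simulation of the explaining-away pattern in the quoted burglar/earthquake example, with made-up probabilities rather than the post's figures: conditioning on the collider A makes the otherwise-independent parents B and E dependent.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
burglar = rng.random(n) < 0.1        # B
quake = rng.random(n) < 0.1          # E, independent of B
alarm = burglar | quake              # collider: B -> A <- E

# Unconditionally, E tells you nothing about B...
print(f"P(B)        = {burglar.mean():.3f}")
print(f"P(B | E)    = {burglar[quake].mean():.3f}")
# ...but once you condition on the alarm, learning E 'explains away' B:
print(f"P(B | A)    = {burglar[alarm].mean():.3f}")
print(f"P(B | A, E) = {burglar[alarm & quake].mean():.3f}")
```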

Comment by Mikhail Samin (mikhail-samin) on Causal Diagrams and Causal Models · 2023-11-25T19:20:25.153Z · LW · GW

The point is, these probabilities don’t really correspond to that causal graph in a way described in the post. A script that simulates the causal graph: https://colab.research.google.com/drive/18pIMfKJpvlOZ213APeFrHNiqKiS5B5ve?usp=sharing

Comment by Mikhail Samin (mikhail-samin) on It's OK to eat shrimp: EAs Make Invalid Inferences About Fish Qualia and Moral Patienthood · 2023-11-14T00:21:30.864Z · LW · GW

The justification that I've heard for that position wouldn't make the statement better; I'd be able to pass an ITT for the specific person who told me it, and I understand why it is wrong. I consider the mistake they're making and the mistake Rethink Priorities are making to be the same and I try to make an argument why in the post.

I'm separately pretty sure evolutionary reasons for qualia didn't exist in fish evolution (added this to the post, thanks!), and from my experience talking to a couple of EAs about this they agreed with some correlations enough to consider a suggested experiment to be a crux, and I'm pretty certain about the result of the experiment and think they're wrong for reasons described in the post.

It's not obvious how to figure out the priors here, but my point is that people update on things that aren't valid evidence. The hope is that people will spend their resources more effectively after correctly considering shrimp welfare to be orders of magnitude less important and deprioritizing it. Maybe they'll still avoid eating shrimp because they don't have intuitions about evolutionary reasons for qualia similar to mine, but that seems less important to me than reducing as much actual suffering as possible, other things being equal.