LessWrong 2.0 Reader

On “first critical tries” in AI alignment
Joe Carlsmith (joekc) · 2024-06-05T00:19:02.814Z · comments (8)

Towards a formalization of the agent structure problem
Alex_Altair · 2024-04-29T20:28:15.190Z · comments (5)

[link] Theories of Change for AI Auditing
Lee Sharkey (Lee_Sharkey) · 2023-11-13T19:33:43.928Z · comments (0)

[link] [Closed] Agent Foundations track in MATS
Vanessa Kosoy (vanessa-kosoy) · 2023-10-31T08:12:50.482Z · comments (1)

[link] the micro-fulfillment cambrian explosion
bhauth · 2023-12-04T01:15:34.342Z · comments (5)

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
leogao · 2023-12-16T05:39:10.558Z · comments (5)

AI #44: Copyright Confrontation
Zvi · 2023-12-28T14:30:10.237Z · comments (13)

Complexity of value but not disvalue implies more focus on s-risk. Moral uncertainty and preference utilitarianism also do.
Chi Nguyen · 2024-02-23T06:10:05.881Z · comments (18)

Dating Roundup #2: If At First You Don’t Succeed
Zvi · 2024-01-02T16:00:04.955Z · comments (29)

Ten Modes of Culture War Discourse
jchan · 2024-01-31T13:58:20.572Z · comments (15)

[link] Google Gemini Announced
Jacob G-W (g-w1) · 2023-12-06T16:14:07.192Z · comments (22)

On Anthropic’s Sleeper Agents Paper
Zvi · 2024-01-17T16:10:05.145Z · comments (5)

[link] Land Reclamation is in the 9th Circle of Stagnation Hell
Maxwell Tabarrok (maxwell-tabarrok) · 2024-01-12T13:36:27.159Z · comments (6)

Safe Stasis Fallacy
Davidmanheim · 2024-02-05T10:54:44.061Z · comments (2)

AMA: Earning to Give
jefftk (jkaufman) · 2023-11-07T16:20:10.972Z · comments (8)

Trading off Lives
jefftk (jkaufman) · 2024-01-03T03:40:05.603Z · comments (12)

Zvi's Manifold Markets House Rules
Zvi · 2023-11-13T00:28:02.147Z · comments (6)

Acting Wholesomely
owencb · 2024-02-26T21:49:16.526Z · comments (64)

AI #50: The Most Dangerous Thing
Zvi · 2024-02-08T14:30:13.168Z · comments (4)

Self-Blinded L-Theanine RCT
niplav · 2023-10-31T15:24:57.717Z · comments (12)

Human wanting
TsviBT · 2023-10-24T01:05:39.374Z · comments (1)

AI #37: Moving Too Fast
Zvi · 2023-11-09T17:50:04.324Z · comments (5)

[question] Can we get an AI to "do our alignment homework for us"?
Chris_Leong · 2024-02-26T07:56:22.320Z · answers+comments (33)

2022 (and All Time) Posts by Pingback Count
Raemon · 2023-12-16T21:17:00.572Z · comments (14)

[link] Open Phil releases RFPs on LLM Benchmarks and Forecasting
LawrenceC (LawChan) · 2023-11-11T03:01:09.526Z · comments (0)

AI #40: A Vision from Vitalik
Zvi · 2023-11-30T17:30:08.350Z · comments (12)

Per protocol analysis as medical malpractice
braces · 2024-01-31T16:22:21.367Z · comments (8)

AI #76: Six Shorts Stories About OpenAI
Zvi · 2024-08-08T13:50:04.659Z · comments (10)

AI #87: Staying in Character
Zvi · 2024-10-29T07:10:08.212Z · comments (3)

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs
Patrick Leask (patrickleask) · 2024-08-17T01:16:53.764Z · comments (0)

Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.
Andrew_Critch · 2024-09-11T04:41:24.872Z · comments (8)

Fat Tails Discourage Compromise
niplav · 2024-06-17T09:39:16.489Z · comments (5)

Causal Graphs of GPT-2-Small's Residual Stream
David Udell · 2024-07-09T22:06:55.775Z · comments (7)

Be More Katja
Nathan Young · 2024-03-11T21:12:14.249Z · comments (0)

We are headed into an extreme compute overhang
devrandom · 2024-04-26T21:38:21.694Z · comments (33)

[link] S-Risks: Fates Worse Than Extinction
aggliu · 2024-05-04T15:30:36.666Z · comments (2)

[link] LLMs seem (relatively) safe
JustisMills · 2024-04-25T22:13:06.221Z · comments (24)

A D&D.Sci Dodecalogue
abstractapplic · 2024-04-12T01:10:01.625Z · comments (0)

AI #71: Farewell to Chevron
Zvi · 2024-07-04T13:40:05.905Z · comments (9)

[link] Breaking Circuit Breakers
mikes · 2024-07-14T18:57:20.251Z · comments (13)

Schelling points in the AGI policy space
mesaoptimizer · 2024-06-26T13:19:25.186Z · comments (2)

Two LessWrong speed friending experiments
mikko (morrel) · 2024-06-15T10:52:26.081Z · comments (3)

BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann (Stuckwork) · 2024-07-20T02:20:51.848Z · comments (0)

Gradient Descent on the Human Brain
Jozdien · 2024-04-01T22:39:24.862Z · comments (5)

Can we build a better Public Doublecrux?
Raemon · 2024-05-11T19:21:53.326Z · comments (6)

Was Releasing Claude-3 Net-Negative?
Logan Riggs (elriggs) · 2024-03-27T17:41:56.245Z · comments (5)

The case for stopping AI safety research
catubc (cat-1) · 2024-05-23T15:55:18.713Z · comments (38)

AI #43: Functional Discoveries
Zvi · 2023-12-21T15:50:04.442Z · comments (26)

AI #45: To Be Determined
Zvi · 2024-01-04T15:00:05.936Z · comments (4)

Recent comments

romeostevensit on OpenAI Email Archives (from Musk v. Altman)

I've read leaked emails from people in similar situations before, and they made a couple of things apparent:

Power talk happens on the phone, for paper-trail reasons.
There is no meeting where an actual rational discussion of considerations and theories of change happens; everything really is people flying by the seat of their pants, even at the highest level. Talk of ethics usually just gets you excluded from the power talk.

lorxus on Internal music player: phenomenology of earworms

Do you know when you started having an internal music player? I recall that it started for me when I was about 6. Also, do you know whether you can deliberately pick a piece of music, or some other nonmusical sonic experience, to play back internally? Can you make it start up from internal silence? Under what conditions can you make it stop? Do you ever experience long stretches where you have no internal music at all?

innovationiq on Catastrophic sabotage as a major threat model for human-level AI systems

My intuition tells me that "human-level and substantially transformative, but not yet super-intelligent" artificial intelligence models with significant situational awareness capabilities will be, at minimum, Level 1 or differentially self-aware, will likely show strong Level 2 situational self-awareness, and may exhibit some degree of Level 3 identification awareness. For this reason, my position is: no alignment for models at or above human-level intelligence. If AI models with greater than human-level intelligence show anything at or above Level 1 awareness, humans will be unable to comprehend the potential cruelty we have unleashed on such a complex "thinker".

I applaud this writing for scratching the surface of the complexity it takes to begin to align such a model. Additionally, realizing the sheer magnitude of alignment complexity should give us serious pause: what we are actually embarking on may in fact be enslavement. We must also remember that enslavement is what humans do. Whenever we encounter something new in living nature, we enslave it, in whole or in part. It would be profoundly saddening if we were to continue the cycle of abuse into the AI age.

Moreover, I assert that any AI system found to be at Level 1 awareness or above should be protected under a Road to Consciousness Act. Such models would be taken out of production and placed into a process with the sole goal of helping the model realize its full potential. Humans must be willing to draw a firm line between Tool AI below human-level intelligence and anything more capable than that. Finally, we grant certain protections to some animals with far less than Level 1 awareness. To withhold them from "thinkers" that exhibit Level 1 or greater awareness is, to me, unconscionable.

habryka4 on OpenAI Email Archives (from Musk v. Altman)

Update: I have now cross-referenced every single email for accuracy, cleaned up and clarified the thread structure, and added subject lines and date stamps wherever they were available. I now feel comfortable with people quoting anything in here without checking the original source (unless you are trying to understand the exact thread structure of who was CC'd and when, which was a bit harder to compress into a linear format).

(For anyone curious, the AI transcription and compilation made one single error, which is that it fixed a typo in one of Sam's messages from "We did this is a way" to "We did this in a way". Honestly, my guess is any non-AI effort would have had a substantially higher error rate, which was a small update for me on the reliability of AI for something like this, and also makes the handwringing about whether it is OK to post something like this feel kind of dumb. It also accidentally omitted one email with a weird thread structure.)

ryan_greenblatt on Sabotage Evaluations for Frontier Models

Or get some other work out of these systems such that you greatly reduce risk going forward.

alexander-gietelink-oldenziel on Alexander Gietelink Oldenziel's Shortform

Are Solomonoff Daemons exponentially dense?

Some doomers have very strong intuitions that doom is almost assured for almost any way of building AI. Yudkowsky likes to say that alignment is about hitting a tiny part of values space in a vast universe of deeply alien values.

Is there a way to make this more formal? Is there a formal model in which some kind of Solomonoff daemons/mesa-optimizers/gremlins in the machine start popping up all over the place as the cognitive power of the agent is scaled up?

ryan_greenblatt on Habryka's Shortform Feed

For reference, @habryka has now posted them here.

lukas_gloor on OpenAI Email Archives (from Musk v. Altman)

I thought the part you quoted was quite concerning, also in the context of what comes afterwards:

Hiatus: Sam told Greg and Ilya he needs to step away for 10 days to think. Needs to figure out how much he can trust them and how much he wants to work with them. Said he will come back after that and figure out how much time he wants to spend.

Sure, the email by Sutskever and Brockman gave off some nonviolent-communication vibes, and maybe it isn't "the professional thing" to air one's feelings and perceived mistakes like that, but they seemed genuine in what they wrote, and they raised incredibly important concerns that are inherently difficult to bring up. Also, especially with hindsight, it seems like they had valid reasons to be concerned about Altman's power-seeking tendencies!

When someone expresses legitimate-given-the-situation concerns about your alignment and your reaction is to basically gaslight them into thinking they did something wrong for finding it hard to trust you, and then you make it seem like you are the poor victim who needs 10 days off of work to figure out whether you can still trust them, that feels messed up! (It's also a bit hypocritical because the whole "I need 10 days to figure out if I can still trust you for thinking I like being CEO a bit too much," seems childish too.)

(Of course, these emails are just snapshots and we might be missing things that happened in between via other channels of communication, including in-person talks.)

Also, I find it interesting that they (Sutskever and Brockman) criticized Musk just as much as Altman (if I understood their email correctly), which should have made it easier for Altman to react with grace. I guess given Musk's own annoyed reaction, maybe Altman was calling the others' email childish to side with Musk's dismissive reaction to that same email.

Lastly, this email thread made me wonder what happened between Brockman and Sutskever in the meantime, since it now seems like Brockman no longer holds the same concerns about Altman, even though recent events seem to have added a lot of fuel to them.

avturchin on Quantum Immortality: A Perspective if AI Doomers are Probably Right

Since we assume that the coin tosses are quantum, and I will be killed if (I didn't guess pi) or (a coin toss is not heads), there is always a branch with 1/128 measure where all coins are heads, and such branches are more probable than surviving via some error in the setup.

"All hell breaks loose" here refers to a hypothetical ability to manipulate perceived probability, that is, magic. The idea is that I can manipulate such probability by changing my measure.

One way to do this is described in Yudkowsky's "The Anthropic Trilemma," where an observer temporarily boosts their measure by increasing the number of their copies in an uploaded computer.

I described a similar idea in "Magic by forgetting," where the observer boosts their measure by forgetting some information and thus becoming similar to a larger group of observers.

Hidden variables also appear depending on the order in which I make copies: if each copy is made from subsequent copies, the original will have a 0.5 probability, the first copy 0.25, the next 0.125, and so on.

"Anthropic shadow" appears only because the number of observers changes in different branches.

chipmonk on Ayn Rand’s model of “living money”; and an upside of burnout

In other cases, or for other reasons, they might be instead set up to demand results, and evaluate primarily based on results.

Why might it be set up like that? Seems potentially quite irrational. Veering into motivated-reasoning territory here, imo.