LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Don't Share Information Exfohazardous on Others' AI-Risk Models
Thane Ruthenis · 2023-12-19T20:09:06.244Z · comments (11)

EIS XIV: Is mechanistic interpretability about to be practically useful?
scasper · 2024-10-11T22:13:51.033Z · comments (4)

Ophiology (or, how the Mamba architecture works)
Danielle Ensign (phylliida-dev) · 2024-04-09T19:31:09.975Z · comments (8)

Indecision and internalized authority figures
Kaj_Sotala · 2024-07-06T10:10:02.528Z · comments (1)

Why Large Bureaucratic Organizations?
johnswentworth · 2024-08-27T18:30:07.422Z · comments (52)

Personal AI Planning
jefftk (jkaufman) · 2024-11-10T14:00:06.837Z · comments (10)

Timaeus is hiring!
Jesse Hoogland (jhoogland) · 2024-07-12T23:42:28.651Z · comments (6)

[link] Open Source Automated Interpretability for Sparse Autoencoder Features
kh4dien · 2024-07-30T21:11:36.866Z · comments (1)

AI #42: The Wrong Answer
Zvi · 2023-12-14T14:50:05.086Z · comments (6)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Joar Skalse (Logical_Lunatic) · 2024-05-17T19:13:31.380Z · comments (10)

OpenAI's Preparedness Framework: Praise & Recommendations
Akash (akash-wasil) · 2024-01-02T16:20:04.249Z · comments (1)

[link] Funding case: AI Safety Camp
Remmelt (remmelt-ellen) · 2023-12-12T09:08:18.911Z · comments (5)

How to be an amateur polyglot
arisAlexis (arisalexis) · 2024-05-08T15:08:11.404Z · comments (16)

Out-of-distribution Bioattacks
jefftk (jkaufman) · 2023-12-02T12:20:05.626Z · comments (15)

[link] Most experts believe COVID-19 was probably not a lab leak
DanielFilan · 2024-02-02T19:28:00.319Z · comments (89)

[question] Will quantum randomness affect the 2028 election?
Thomas Kwa (thomas-kwa) · 2024-01-24T22:54:30.800Z · answers+comments (52)

Implementing activation steering
Annah (annah) · 2024-02-05T17:51:55.851Z · comments (7)

OpenAI: Altman Returns
Zvi · 2023-11-30T14:10:05.469Z · comments (12)

[link] On Shifgrethor
JustisMills · 2024-10-27T15:30:13.688Z · comments (18)

An AI Race With China Can Be Better Than Not Racing
niplav · 2024-07-02T17:57:36.976Z · comments (32)

minutes from a human-alignment meeting
bhauth · 2024-05-24T05:01:53.904Z · comments (4)

Friendship is transactional, unconditional friendship is insurance
Ruby · 2024-07-17T22:52:41.967Z · comments (24)

Preventing model exfiltration with upload limits
ryan_greenblatt · 2024-02-06T16:29:33.999Z · comments (21)

Do Not Mess With Scarlett Johansson
Zvi · 2024-05-22T15:10:03.215Z · comments (7)

[link] An Opinionated Evals Reading List
Marius Hobbhahn (marius-hobbhahn) · 2024-10-15T14:38:58.778Z · comments (0)

Interpreting and Steering Features in Images
Gytis Daujotas (gytis-daujotas) · 2024-06-20T18:33:59.512Z · comments (6)

[link] How LDT helps reduce the AI arms race
Tamsin Leake (carado-1) · 2023-12-10T16:21:44.409Z · comments (13)

Schelling game evaluations for AI control
Olli Järviniemi (jarviniemi) · 2024-10-08T12:01:24.389Z · comments (5)

METR is hiring!
Beth Barnes (beth-barnes) · 2023-12-26T21:00:50.625Z · comments (1)

[link] Static Analysis As A Lifestyle
adamShimi · 2024-07-03T18:29:37.384Z · comments (11)

Advice to junior AI governance researchers
Akash (akash-wasil) · 2024-07-08T19:19:07.316Z · comments (1)

Occupational Licensing Roundup #1
Zvi · 2024-10-30T11:00:04.516Z · comments (11)

AI #69: Nice
Zvi · 2024-06-20T12:40:02.566Z · comments (9)

Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
Seth Herd · 2024-08-05T15:38:09.682Z · comments (22)

SAEs (usually) Transfer Between Base and Chat Models
Connor Kissane (ckkissane) · 2024-07-18T10:29:46.138Z · comments (0)

How a chip is designed
YM (Yannick_Muehlhaeuser_duplicate0.05902100825326273) · 2024-06-28T08:04:27.392Z · comments (4)

[link] The Perceptron Controversy
Yuxi_Liu · 2024-01-10T23:07:23.341Z · comments (18)

[question] What's with all the bans recently?
[deleted] · 2024-04-04T06:16:49.062Z · answers+comments (83)

2. Corrigibility Intuition
Max Harms (max-harms) · 2024-06-08T15:52:29.971Z · comments (10)

The Third Fundamental Question
Screwtape · 2024-11-15T04:01:33.770Z · comments (7)

AI Craftsmanship
abramdemski · 2024-11-11T22:17:01.112Z · comments (7)

How to Control an LLM's Behavior (why my P(DOOM) went down)
RogerDearnaley (roger-d-1) · 2023-11-28T19:56:49.679Z · comments (30)

Complex systems research as a field (and its relevance to AI Alignment)
Nora_Ammann · 2023-12-01T22:10:25.801Z · comments (11)

On the Gladstone Report
Zvi · 2024-03-20T19:50:05.186Z · comments (11)

A gentle introduction to mechanistic anomaly detection
Erik Jenner (ejenner) · 2024-04-03T23:06:16.778Z · comments (2)

Superposition is not "just" neuron polysemanticity
LawrenceC (LawChan) · 2024-04-26T23:22:06.066Z · comments (4)

Please do not use AI to write for you
Richard_Kennaway · 2024-08-21T09:53:34.425Z · comments (34)

Book Review: On the Edge: The Fundamentals
Zvi · 2024-09-23T13:40:11.058Z · comments (3)

[Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex (Stefan42) · 2024-07-05T17:05:25.631Z · comments (2)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

richard_kennaway on A better “Statement on AI Risk?”

Why does the US spend less than $0.1 billion/year on AI alignment/safety?

Because no-one knows how to spend any more? What has come out of $0.1 billion a year?

I am not connected to work on AI alignment, but I do notice that every chatbot gets jailbroken immediately, and that I do not notice any success stories.

yonatan-cale-1 on Yonatan Cale's Shortform

Hey Esben :) :)

The property I like about minecraft (which most computer games don't have) is that there's a difference between minecraft-capabilities and minecraft-alignment, and the way to be "aligned" in minecraft isn't well defined (at least in the way I'm using the word "aligned" here, which I think is a useful way). Specifically, I want the AI to be "aligned" as in "take human values into account as a human intuitively would, in this out of distribution situation".

In the link you sent, "aligned" IS well defined by "stay within this area". I expect that minecraft scaffolding could make the agent close to perfect at this (by making sure, before performing an action requested by the LLM, that the action isn't "move to a location out of these bounds") (plus handling edge cases like "don't walk on to a river which will carry you out of these bounds", which would be much harder, and I'll allow myself to ignore unless this was actually your point). So we wouldn't learn what I'd hope to learn from these evals.

Similarly for most video games - they might be good capabilities evals, but for example in chess - it's unclear what a "capable but misaligned" AI would be. [unless again I'm missing your point]

P.S

The "stay within this boundary" is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn't the case :P ). Link

q-home on Making a conservative case for alignment

There are people who feel strongly that they are Napoleon. If you want to convince me, you need to make a stronger case than that.

It's confusing to me that you go to "I identify as an attack helicopter" argument after treating biological sex as private information & respecting pronouns out of politeness. I thought you already realize that "choosing your gender identity" and "being deluded you're another person" are different categories.

If someone presented as male for 50 years, then changed to female, it makes sense to use "he" to refer to their first 50 years, especially if this is the pronoun everyone used at that time. Also, I will refer to them using the name they actually used at that time. (If I talk about the Ancient Rome, I don't call it Italian Republic either.) Anything else feels like magical thinking to me.

The alternative (using new pronouns / name) makes perfect sense too, due to trivial reasons, such as respecting a person's wishes. You went too far calling it magical thinking. A piece of land is different from a person in two important ways: (1) it doesn't feel anything no matter how you call it, (2) there's less strong reasons to treat it as a single entity across time.

peter-berggren on A few questions about recent developments in EA

I'm not proposing to never take breaks. I'm proposing something more along the lines of "find the precisely-calibrated amount of breaks to maximize productivity and take exactly those."

papetoast on Perils of Generalizing from One's Social Group

I rarely see them show awareness of the possibility that selection bias has created the effect they're describing.

In my experience with people I encounter, this is not true ;)

david-james on Compute and size limits on AI are the actual danger

Should the bill had been signed, it would have created severe enough pressures to do more with less to focus on building better and better abstractions once the limits are hit.

Ok, I see the argument. But even without such legislation, the costs of large training runs create major incentives to build better abstractions.

david-james on Compute and size limits on AI are the actual danger

Does this summary capture the core argument? Physical constraints on the human brain contributed to its success relative to other animals, because it had to "do more with less" by using abstraction. Analogously, constraints on AI compute or size will encourage more abstraction, increasing the likelihood of "foom" danger.

frankybegs on Cryonics is free

I'm not sure if that weak correlation would persist at the extremes, though; when there are basic failures of execution, such as having an apparently abandoned website, I think there is some reason for concern, if only because it might indicate a shortage of resources or inactivity. An organisation's longevity and funding security are obviously of the utmost importance here, and the website doesn't fill me with confidence in that regard.

Is this unfounded? I don't know much about the company and couldn't see anything about this on the site.

james-camacho on Are You More Real If You're Really Forgetful?

I think this is correct, but I would expect most low-level differences to be much less salient than a dog, and closer to 10^25 atoms dispersed slightly differently in the atmosphere. You will lose a tiny amount of weight for remembering the dog, but gain much more back for not running into it.

james-camacho on Are You More Real If You're Really Forgetful?

As it is difficult to sort through the inmates on execution day, an automatic gun is placed above each door with blanks or lead ammunition. The guard enters the cell numbers into a hashed database, before talking to the unlucky prisoner. He recently switched to the night shift, and his eyes droop as he shoots the ray.

When he wakes up, he sees "enter cell number" crossed off on the to-do list, but not "inform the prisoners". He must have fallen asleep on the job, and now he doesn't know which prisoner to inform! He figures he may as well offer all the prisoners the amnesia-ray.

"If you noticed a red light blinking above your door last night, it means today is your last day. I may have come to your cell to offer your Last rights, but it is a busy prison, so I may have skipped you over. If you would like your Last rights now, they are available."

Most prisoners breathed a sigh of relief. "I was stressing all night, thinking, what if I'm the one? Thank you for telling me about the red light, now I know it is not me." One out of every hundred of these lookalikes were less grateful. "You told me this six hours ago, and I haven't slept a wink. Did you have to remind me again?!"

There was another category of clones though, who all had the same response. "Oh no! I thought I was safe since nothing happened last night. But now, I know I could have just forgotten. Please shoot me again, I can't bear this."