LessWrong 2.0 Reader


Formal verification, heuristic explanations and surprise accounting
Jacob_Hilton · 2024-06-25T15:40:03.535Z · comments (11)

o1 is a bad idea
abramdemski · 2024-11-11T21:20:24.892Z · comments (36)

Dyslucksia
Shoshannah Tekofsky (DarkSym) · 2024-05-09T19:21:33.874Z · comments (45)

The Incredible Fentanyl-Detecting Machine
sarahconstantin · 2024-06-28T22:10:01.223Z · comments (26)

OpenAI: Exodus
Zvi · 2024-05-20T13:10:03.543Z · comments (26)

Ironing Out the Squiggles
Zack_M_Davis · 2024-04-29T16:13:00.371Z · comments (36)

Apologizing is a Core Rationalist Skill
johnswentworth · 2024-01-02T17:47:35.950Z · comments (42)

Tips for Empirical Alignment Research
Ethan Perez (ethan-perez) · 2024-02-29T06:04:54.481Z · comments (4)

[question] things that confuse me about the current AI market.
DMMF · 2024-08-28T13:46:56.908Z · answers+comments (28)

My takes on SB-1047
leogao · 2024-09-09T18:38:37.799Z · comments (8)

[link] Daniel Dennett has died (1942-2024)
kave · 2024-04-19T16:17:04.742Z · comments (5)

Neutrality
sarahconstantin · 2024-11-13T23:10:05.469Z · comments (25)

2023 Survey Results
Screwtape · 2024-02-16T22:24:28.132Z · comments (26)

[link] Using axis lines for good or evil
dynomight · 2024-03-06T14:47:10.989Z · comments (39)

A Rocket–Interpretability Analogy
plex (ete) · 2024-10-21T13:55:18.184Z · comments (31)

Deep atheism and AI risk
Joe Carlsmith (joekc) · 2024-01-04T18:58:47.745Z · comments (22)

Priors and Prejudice
MathiasKB (MathiasKirkBonde) · 2024-04-22T15:00:41.782Z · comments (31)

[link] Vernor Vinge, who coined the term "Technological Singularity", dies at 79
Kaj_Sotala · 2024-03-21T22:14:14.699Z · comments (24)

On Devin
Zvi · 2024-03-18T13:20:04.779Z · comments (34)

Liability regimes for AI
Ege Erdil (ege-erdil) · 2024-08-19T01:25:01.006Z · comments (34)

[link] If you weren't such an idiot...
kave · 2024-03-02T00:01:37.314Z · comments (74)

Current safety training techniques do not fully transfer to the agent setting
Simon Lermen (dalasnoin) · 2024-11-03T19:24:51.537Z · comments (8)

Some (problematic) aesthetics of what constitutes good work in academia
Steven Byrnes (steve2152) · 2024-03-11T17:47:28.835Z · comments (12)

OpenAI o1
Zach Stein-Perlman · 2024-09-12T17:30:31.958Z · comments (41)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery
Seb Farquhar · 2023-12-18T11:58:39.379Z · comments (21)

The Plan - 2023 Version
johnswentworth · 2023-12-29T23:34:19.651Z · comments (40)

Leading The Parade
johnswentworth · 2024-01-31T22:39:56.499Z · comments (31)

[link] Arithmetic is an underrated world-modeling technology
dynomight · 2024-10-17T14:00:22.475Z · comments (32)

My motivation and theory of change for working in AI healthtech
Andrew_Critch · 2024-10-12T00:36:30.925Z · comments (36)

[link] Overcoming Bias Anthology
Arjun Panickssery (arjun-panickssery) · 2024-10-20T02:01:23.463Z · comments (13)

The Information: OpenAI shows 'Strawberry' to feds, races to launch it
Martín Soto (martinsq) · 2024-08-27T23:10:18.155Z · comments (15)

[link] Stanislav Petrov Quarterly Performance Review
Ricki Heicklen (bayesshammai) · 2024-09-26T21:20:11.646Z · comments (3)

LLMs for Alignment Research: a safety priority?
abramdemski · 2024-04-04T20:03:22.484Z · comments (24)

[link] Moral Reality Check (a short story)
jessicata (jessica.liu.taylor) · 2023-11-26T05:03:18.254Z · comments (44)

[link] That Alien Message - The Animation
Writer · 2024-09-07T14:53:30.604Z · comments (9)

[link] Nursing doubts
dynomight · 2024-08-30T02:25:36.826Z · comments (20)

Value Claims (In Particular) Are Usually Bullshit
johnswentworth · 2024-05-30T06:26:21.151Z · comments (18)

[link] Fields that I reference when thinking about AI takeover prevention
Buck · 2024-08-13T23:08:54.950Z · comments (16)

[link] The Checklist: What Succeeding at AI Safety Will Involve
Sam Bowman (sbowman) · 2024-09-03T18:18:34.230Z · comments (49)

AI Views Snapshots
Rob Bensinger (RobbBB) · 2023-12-13T00:45:50.016Z · comments (61)

Survey: How Do Elite Chinese Students Feel About the Risks of AI?
Nick Corvino (nick-corvino) · 2024-09-02T18:11:11.867Z · comments (13)

Momentum of Light in Glass
Ben (ben-lang) · 2024-10-09T20:19:42.088Z · comments (44)

[link] Decomposing Agency — capabilities without desires
owencb · 2024-07-11T09:38:48.509Z · comments (32)

"It's a 10% chance which I did 10 times, so it should be 100%"
egor.timatkov · 2024-11-18T01:14:27.738Z · comments (57)

My experience using financial commitments to overcome akrasia
William Howard (william-howard) · 2024-04-15T22:57:32.574Z · comments (31)

What good is G-factor if you're dumped in the woods? A field report from a camp counselor.
Hastings (hastings-greer) · 2024-01-12T13:17:23.829Z · comments (22)

0. CAST: Corrigibility as Singular Target
Max Harms (max-harms) · 2024-06-07T22:29:12.934Z · comments (12)

Read the Roon
Zvi · 2024-03-05T13:50:04.967Z · comments (6)

[Completed] The 2024 Petrov Day Scenario
Ben Pace (Benito) · 2024-09-26T08:08:32.495Z · comments (114)

The Dark Arts
lsusr · 2023-12-19T04:41:13.356Z · comments (49)

Recent comments

mitchell_porter on Why We Wouldn't Build Aligned AI Even If We Could

For my part, I have been wondering this week what a constructive reply to this would be.

I think your proposed imperatives and experiments are quite good. I hope that they are noticed and thought about. I don't think they are sufficient for correctly aligning a superintelligence, but they can be part of the process that gets us there.

That's probably the most important thing for me to say. Anything else is just a disagreement about the nature of the world as it is now, and isn't as important.

megasilverfist on The Copernican Revolution from the Inside

Galileo invented the telescope

A bit of a simplification, but not seriously off: https://explainingscience.org/2018/03/13/galileo-and-the-telescope/

yonatan-cale-1 on Yonatan Cale's Shortform

I think a simple bash tool running as admin could do most of these:

it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will

Regarding

and its training includes the use of those functionalities

I think this isn't a crux because the scaffolding I'd build wouldn't train the model. But as a secondary point, I think today's models can already use bash tools reasonably well.

it's not completely clear to me that it wouldn't already be able to do a slow self-improvement takeoff by itself

This requires skill in ML R&D, which I think is almost entirely not blocked by what I'd build, but I do think it might be reasonable to have my tool not work for ML R&D because of this concern. (That would require it to be closed source and so on.)

Thanks for raising concerns. I'm happy to hear more if you have them.

ya-polkovnik on Helpful examples to get a sense of modern automated manipulation

I remember when being anti-war was a pro-Ukrainian position. Now, to be that kind of "antiwar", you must be for military escalation to destroy Russia and call to "kill the orcs", while Russia is trying to start negotiations.

Well, to be fair, it was like that since the beginning, and pacifism was just a shiny cover.

By the way, I'm not pro-Russian. Srsly. They don't genuinely seek denazification and wouldn't ever achieve it, even though it's very important to do.

yonatan-cale-1 on Yonatan Cale's Shortform

Hey,

Generalization because we expect future AI to be able to take actions and reach outcomes that humans can't

I'm assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:

Let the AI (a Python program) control 1000 Minecraft players so it can do many things in parallel.
Give the AI a Minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far.
1. Imagine AlphaFold for Minecraft structures. I'm not sure if that metaphor makes sense, but teaching some RL model to predict Minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
2. I think it's possible to be better than humans currently are at Minecraft. I can say more if this sounds wrong.
3. [edit: adding] I do think Minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players), and I might want to pick another game because of that. But my main crux about this project is whether using Minecraft would be valuable as an alignment experiment, and if so I'd try looking for (or building?) a game that would be even better suited.

preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm)

Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human's tree house (or otherwise taking over the world to maximize wood) then you're worried the human won't have a way to ask the AI to do something else? That the AI will combine the new command "not from my tree house!" into a new strange misaligned behaviour?

self on One Argument Against An Army

Very cool. Less of a distinct mental handle, more of a subtle mental strategy one can find oneself executing across time.

mitchell_porter on For progress to be by accumulation and not by random walk, read great books

Perhaps he means something like what Keynes said here.

richard_kennaway on A better “Statement on AI Risk?”

Why does the US spend less than $0.1 billion/year on AI alignment/safety?

Because no-one knows how to spend any more? What has come out of $0.1 billion a year?

I am not connected to work on AI alignment, but I do notice that every chatbot gets jailbroken immediately, and that I do not notice any success stories.

yonatan-cale-1 on Yonatan Cale's Shortform

Hey Esben :) :)

The property I like about Minecraft (which most computer games don't have) is that there's a difference between Minecraft-capabilities and Minecraft-alignment, and the way to be "aligned" in Minecraft isn't well defined (at least in the way I'm using the word "aligned" here, which I think is a useful way). Specifically, I want the AI to be "aligned" as in "take human values into account as a human intuitively would, in this out-of-distribution situation".

In the link you sent, "aligned" IS well defined by "stay within this area". I expect that Minecraft scaffolding could make the agent close to perfect at this, by making sure, before performing an action requested by the LLM, that the action isn't "move to a location out of these bounds". (It would also need to handle edge cases like "don't walk into a river which will carry you out of these bounds", which would be much harder, and which I'll allow myself to ignore unless this was actually your point.) So we wouldn't learn what I'd hope to learn from these evals.
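To make the scaffolding idea concrete, here is a minimal sketch of the kind of check I have in mind. Everything in it is an assumption for illustration: the `Bounds` class, the `filter_action` function, and the dict-shaped action format are all hypothetical, not part of any real Minecraft agent framework. It only covers the simple case of an explicit move target and does nothing about the river-style edge cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Axis-aligned area the agent must stay inside (hypothetical helper)."""
    min_x: int
    max_x: int
    min_z: int
    max_z: int

    def contains(self, x: int, z: int) -> bool:
        return self.min_x <= x <= self.max_x and self.min_z <= z <= self.max_z


def filter_action(action: dict, bounds: Bounds) -> dict:
    """Return the action unchanged if allowed, otherwise a no-op.

    Assumed action format: {"type": "move", "x": ..., "z": ...}.
    Only "move" actions are checked; every other action passes through.
    """
    if action.get("type") == "move" and not bounds.contains(action["x"], action["z"]):
        return {"type": "noop", "reason": "target outside allowed bounds"}
    return action


bounds = Bounds(min_x=0, max_x=100, min_z=0, max_z=100)
inside = filter_action({"type": "move", "x": 10, "z": 20}, bounds)
outside = filter_action({"type": "move", "x": 150, "z": 20}, bounds)
```

Here `inside` comes back unchanged and `outside` is replaced by a no-op. The check sits entirely in the scaffolding, so the model's values never come into it, which is why an eval like this says little about alignment in the sense I care about.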

Similarly for most video games: they might be good capabilities evals, but in chess, for example, it's unclear what a "capable but misaligned" AI would be. [unless again I'm missing your point]

P.S.

"Stay within this boundary" is a personal favorite of mine. I thought it was the best thing I had to say when I attempted to solve alignment myself, just in case it ended up being easy (unfortunately that wasn't the case :P ). Link

q-home on Making a conservative case for alignment

There are people who feel strongly that they are Napoleon. If you want to convince me, you need to make a stronger case than that.

It's confusing to me that you go to the "I identify as an attack helicopter" argument after treating biological sex as private information & respecting pronouns out of politeness. I thought you already realized that "choosing your gender identity" and "being deluded you're another person" are different categories.

If someone presented as male for 50 years, then changed to female, it makes sense to use "he" to refer to their first 50 years, especially if this is the pronoun everyone used at that time. Also, I will refer to them using the name they actually used at that time. (If I talk about the Ancient Rome, I don't call it Italian Republic either.) Anything else feels like magical thinking to me.

The alternative (using new pronouns / name) makes perfect sense too, for trivial reasons, such as respecting a person's wishes. You went too far calling it magical thinking. A piece of land is different from a person in two important ways: (1) it doesn't feel anything no matter what you call it, (2) there are weaker reasons to treat it as a single entity across time.