LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

Inspired by: Failures in Kindness
X4vier · 2024-07-27T01:21:42.848Z · comments (2)

What is SB 1047 *for*?
Raemon · 2024-09-05T17:39:39.871Z · comments (8)

Interdictor Ship
lsusr · 2024-08-19T04:59:18.487Z · comments (9)

What is it to solve the alignment problem?
Joe Carlsmith (joekc) · 2024-08-24T21:19:34.280Z · comments (14)

What is "True Love"?
johnswentworth · 2024-08-18T16:05:47.358Z · comments (9)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (7)

[link] Announcing the $200k EA Community Choice
Austin Chen (austin-chen) · 2024-08-14T00:39:37.350Z · comments (8)

How you can help pass important AI legislation with 10 minutes of effort
ThomasW · 2024-09-14T22:10:50.386Z · comments (2)

[link] Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso (rhaps0dy) · 2024-07-25T22:00:55.398Z · comments (8)

[link] Congressional Insider Trading
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-30T13:32:57.264Z · comments (6)

Self-explaining SAE features
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-05T22:20:36.041Z · comments (13)

Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (2)

John Schulman leaves OpenAI for Anthropic
Sodium · 2024-08-06T01:23:15.427Z · comments (0)

Referendum Mechanics in a Marketplace of Ideas
Martin Sustrik (sustrik) · 2024-08-25T08:30:01.901Z · comments (2)

[link] Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety
titotal (lombertini) · 2024-09-18T13:07:40.754Z · comments (0)

Evidence against Learned Search in a Chess-Playing Neural Network
p.b. · 2024-09-13T11:59:55.634Z · comments (3)

The Bitter Lesson for AI Safety Research
adamk · 2024-08-02T18:39:36.884Z · comments (5)

Coalitional agency
Richard_Ngo (ricraz) · 2024-07-22T00:09:51.525Z · comments (6)

On the UBI Paper
Zvi · 2024-09-03T14:50:08.647Z · comments (6)

[link] Demis Hassabis — Google DeepMind: The Podcast
Zach Stein-Perlman · 2024-08-16T00:00:04.712Z · comments (8)

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs
Michaël Trazzi (mtrazzi) · 2024-08-24T04:30:11.807Z · comments (0)

AI #81: Alpha Proteo
Zvi · 2024-09-12T13:00:07.958Z · comments (3)

Secret Collusion: Will We Know When to Unplug AI?
schroederdewitt · 2024-09-16T16:07:01.119Z · comments (6)

Some Unorthodox Ways To Achieve High GDP Growth
johnswentworth · 2024-08-08T18:58:56.046Z · comments (6)

... Wait, our models of semantics should inform fluid mechanics?!?
johnswentworth · 2024-08-26T16:38:53.924Z · comments (12)

[link] Unlocking Solutions—By Understanding Coordination Problems
James Stephen Brown (james-brown) · 2024-07-27T04:52:13.435Z · comments (4)

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs
Patrick Leask (patrickleask) · 2024-08-17T01:16:53.764Z · comments (0)

Rationalists are missing a core piece for agent-like structure (energy vs information overload)
tailcalled · 2024-08-17T09:57:19.370Z · comments (9)

How the AI safety technical landscape has changed in the last year, according to some practitioners
tlevin (trevor) · 2024-07-26T19:06:47.126Z · comments (6)

[link] Pay-on-results personal growth: first success
Chipmonk · 2024-09-14T03:39:12.975Z · comments (2)

AI #76: Six Shorts Stories About OpenAI
Zvi · 2024-08-08T13:50:04.659Z · comments (10)

Thiel on AI & Racing with China
Ben Pace (Benito) · 2024-08-20T03:19:18.966Z · comments (10)

BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann (Stuckwork) · 2024-07-20T02:20:51.848Z · comments (0)

Llama Llama-3-405B?
Zvi · 2024-07-24T19:40:07.565Z · comments (9)

Measuring Structure Development in Algorithmic Transformers
Micurie (micurie) · 2024-08-22T08:38:02.140Z · comments (4)

Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.
Andrew_Critch · 2024-09-11T04:41:24.872Z · comments (6)

Provably Safe AI: Worldview and Projects
bgold · 2024-08-09T23:21:02.763Z · comments (43)

Rewilding the Gut VS the Autoimmune Epidemic
GGD · 2024-08-16T18:00:46.239Z · comments (0)

[link] AI, centralization, and the One Ring
owencb · 2024-09-13T14:00:16.126Z · comments (11)

Unlearning via RMU is mostly shallow
Andy Arditi (andy-arditi) · 2024-07-23T16:07:52.223Z · comments (3)

Please do not use AI to write for you
Richard_Kennaway · 2024-08-21T09:53:34.425Z · comments (34)

[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)

[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)

AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)

SRE's review of Democracy
Martin Sustrik (sustrik) · 2024-08-03T07:20:01.483Z · comments (2)

Extended Interview with Zhukeepa on Religion
Ben Pace (Benito) · 2024-08-18T03:19:05.625Z · comments (27)

All The Latest Human tFUS Studies
sarahconstantin · 2024-08-09T22:20:04.561Z · comments (2)

Caring about excellence
owencb · 2024-07-22T14:24:37.892Z · comments (4)

[link] Michael Dickens' Caffeine Tolerance Research
niplav · 2024-09-04T15:41:53.343Z · comments (3)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

ruby on Which LessWrong/Alignment topics would you like to be tutored in? [Poll]

Natural Latents [LW · GW]

ruby on Which LessWrong/Alignment topics would you like to be tutored in? [Poll]

Infra-Bayesianism [? · GW]

ruby on Which LessWrong/Alignment topics would you like to be tutored in? [Poll]

Poll for LW topics you'd like to be tutored in
(please use agree-react to indicate you'd personally like tutoring on a topic, I might reach out if/when I have a prototype)

Note: Hit cmd-f or ctrl-f (whatever normally opens search) to automatically expand all of the poll options below.

cole-wyeth on Pronouns are Annoying

I guess refusing to use someone’s preferred pronouns is weak Bayesian evidence for wanting to have them killed, but the conclusion is so unlikely it’s probably not appropriate to raise it to the level of serious consideration.

dana on Pronouns are Annoying

What would be upsetting about being called "she"? I don't share your intuition. Whenever I imagine being misgendered (or am misgendered, e.g., on a voice call with a stranger), I don't feel any strong emotional reaction. To the point that I generally will not correct them.

I could imagine it being very upsetting if I am misgendered by someone who should know me well enough not to misgender me, or if someone purposefully misgenders me. But the misgendering specifically is not the main offense in these two cases.

Perhaps myself and ymeskhout are less tied to our gender identity than most?

sharmake-farah on My AI Model Delta Compared To Christiano

Yeah, I admit a lot of the crux comes down to whether thinking whether your case is more the exception or the rule, and I admit that I think that your situation is more unusual compared to the case where you can locally verify something without having to execute the global plan.

I tend to agree far more with Paul Christiano than with John Wentworth on the delta of

But to address what it would mean for alignment to generalize more than capabilities, this would essentially mean it's easier to get an AI to value what you value without the failure modes of deceptive/pseudo/suboptimality alignment than it is to get an AI that actually executes on your values through capabilities in the real world.

(On the highest level I do not know my values and wouldn't hand over the full control of the future to any AI because I don't trust that I could tell good from bad, I think I'd mostly be confused about what it did.)

I admit that I both know a lot more about what exactly I value, and I also trust AIs to generalize more from values data than you do, for several reasons.

michael-cohn on Pronouns are Annoying

I also don't think it's useful to try and learn much about pronouns qua pronouns social battles over them. Using the pronoun people ask you to use has become a proxy for all sorts of other tolerant/benevolent attitudes towards that person and the way they want to live their life, and to an even greater extent, refusing to do that is a proxy for thinking they should be ignored, or possibly reviled, or possibly killed.

I don't think everyone proxies it that way -- I know there are some people who are just old-fashioned, or passionate about prescriptive grammar, or have essentialist beliefs about gender but are libertarian about others' behavior. I think that if everyone had very high confidence that someone not using the pronouns they requested meant that at worst that person mildly disapproves of them but would still actively defend their civil + legal + human rights, there would probably be a lot less of the handwringing you mention, and we'd be able to learn a lot more about the fundamental intrinsic meaning of pronouns.

tao-lin on The case for a negative alignment tax

to me "alignment tax" usually only refers to alignment methods that don't cost-effectively increase capabilities, so if 90% of alignment methods did cost effectively increase capabilities but 10% did not, i would still say there was an "alignment tax", just ignore the negatives.

Also, it's important to consider cost-effective capabilities rather than raw capabilities - if a lab knows of a way to increase capabilities more cost-effectively than alignment, using that money for alignment is a positive alignment tax

sheikh-abdur-raheem-ali on GPT-o1

As I keep saying, deception is not some unique failure mode. Humans are constantly engaging in various forms of deception. It is all over the training data, and any reasonable set of next token predictions. There is no clear line between deception and not deception. And yes, to the extent that humans reward ‘deceptive’ responses, as they no doubt often will inadvertently do, the model will ‘learn’ deception.

https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai [LW · GW]

(found in the comments of this prediction market)

robert-cousineau on The case for a negative alignment tax

In the limit (what might be considered the ‘best imaginable case’), we might imagine researchers discovering an alignment technique that (A) was guaranteed to eliminate x-risk and (B) improve capabilities so clearly that they become competitively necessary for anyone attempting to build AGI.

I feel like throughout this post, you are ignoring that agents, "in the limit", are (likely) provably taxed by having to be aligned to goals other than their own. An agent with utility function "A" is definitely going to be less capable at achieving "A" if it is also aligned to utility function "B". I respect that current LLM's not best described as having a singular consistent goal function, however, "in the limit" that is what they will be best described as.