LessWrong 2.0 Reader

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Roland Pihlakas (roland-pihlakas) · 2025-03-16T23:23:30.989Z · comments (6)
The Rise of Hyperpalatability
Jack (jack-3) · 2025-04-02T20:18:04.407Z · comments (10)
Grok Grok
Zvi · 2025-02-24T14:20:08.877Z · comments (2)
[Cross-post] Every Bay Area "Walled Compound"
davekasten · 2025-01-23T15:05:08.629Z · comments (3)
Announcing EXP: Experimental Summer Workshop on Collective Cognition
Jan_Kulveit · 2025-03-15T20:14:47.972Z · comments (2)
2024 was the year of the big battery, and what that means for solar power
transhumanist_atom_understander · 2025-02-01T06:27:39.082Z · comments (1)
Extended analogy between humans, corporations, and AIs.
Daniel Kokotajlo (daniel-kokotajlo) · 2025-02-13T00:03:13.956Z · comments (2)
Any-Benefit Mindset and Any-Reason Reasoning
silentbob · 2025-03-15T17:10:14.682Z · comments (9)
o3 Will Use Its Tools For You
Zvi · 2025-04-18T21:20:02.566Z · comments (3)
Meditation and Reduced Sleep Need
niplav · 2025-04-04T14:42:54.792Z · comments (8)
Theory of Change for AI Safety Camp
Linda Linsefors · 2025-01-22T22:07:10.664Z · comments (3)
ParaScopes: Do Language Models Plan the Upcoming Paragraph?
NickyP (Nicky) · 2025-02-21T16:50:20.745Z · comments (2)
[link] We don't want to post again "This might be the last AI Safety Camp"
Remmelt (remmelt-ellen) · 2025-01-21T12:03:33.171Z · comments (17)
[question] Will LLM agents become the first takeover-capable AGIs?
Seth Herd · 2025-03-02T17:15:37.056Z · answers+comments (10)
[link] Forecasting Frontier Language Model Agent Capabilities
Govind Pimpale (govind-pimpale) · 2025-02-24T16:51:32.022Z · comments (0)
Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens
Florian_Dietz · 2025-03-10T16:07:45.215Z · comments (3)
[link] SuperBabies podcast with Gene Smith
Eneasz · 2025-02-19T19:36:49.852Z · comments (1)
Kitchen Air Purifier Comparison
jefftk (jkaufman) · 2025-01-22T03:20:03.224Z · comments (2)
[link] Introducing MASK: A Benchmark for Measuring Honesty in AI Systems
Richard Ren (RichardR) · 2025-03-05T22:56:46.155Z · comments (5)
Call for Collaboration: Renormalization for AI safety
Lauren Greenspan (LaurenGreenspan) · 2025-03-31T21:01:56.500Z · comments (0)
System 2 Alignment
Seth Herd · 2025-02-13T19:17:56.868Z · comments (0)
ARENA 5.0 - Call for Applicants
JamesH (AtlasOfCharts) · 2025-01-30T13:18:27.052Z · comments (2)
Can SAE steering reveal sandbagging?
jordine · 2025-04-15T12:33:41.264Z · comments (3)
[link] Forecasting time to automated superhuman coders [AI 2027 Timelines Forecast]
elifland · 2025-04-10T23:10:23.063Z · comments (0)
[link] Well-foundedness as an organizing principle of healthy minds and societies
Richard_Ngo (ricraz) · 2025-04-07T00:31:34.098Z · comments (6)
[link] AI Can't Write Good Fiction
JustisMills · 2025-03-12T06:11:57.786Z · comments (19)
Operator
Zvi · 2025-01-28T20:00:08.374Z · comments (1)
Boots theory and Sybil Ramkin
philh · 2025-03-18T22:10:08.855Z · comments (17)
More Fun With GPT-4o Image Generation
Zvi · 2025-04-03T02:10:02.317Z · comments (3)
Writing experiments and the banana escape valve
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-23T13:11:24.215Z · comments (1)
[link] Hunting for AI Hackers: LLM Agent Honeypot
Reworr R (reworr-reworr) · 2025-02-12T20:29:32.269Z · comments (0)
Avoid the Counterargument Collapse
marknm · 2025-03-26T03:19:58.655Z · comments (3)
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
Alex Mallen (alex-mallen) · 2025-03-24T17:55:59.358Z · comments (0)
Reasons-based choice and cluelessness
JesseClifton · 2025-02-07T22:21:47.232Z · comments (0)
Everything I Know About Semantics I Learned From Music Notation
J Bostock (Jemist) · 2025-03-09T18:09:11.789Z · comments (2)
Why Are The Human Sciences Hard? Two New Hypotheses
Aydin Mohseni (aydin-mohseni) · 2025-03-18T15:45:52.239Z · comments (14)
AI #106: Not so Fast
Zvi · 2025-03-06T15:40:05.919Z · comments (4)
Austin Chen on Winning, Risk-Taking, and FTX
Elizabeth (pktechgirl) · 2025-04-07T19:00:08.039Z · comments (3)
[link] Center on Long-Term Risk: Summer Research Fellowship 2025 - Apply Now
Tristan Cook · 2025-03-26T17:29:14.797Z · comments (0)
OpenAI rewrote its Preparedness Framework
Zach Stein-Perlman · 2025-04-15T20:00:50.614Z · comments (1)
Re: Taste
lsusr · 2025-02-01T03:34:10.918Z · comments (8)
[link] OpenAI releases GPT-4.5
Seth Herd · 2025-02-27T21:40:45.010Z · comments (12)
Is instrumental convergence a thing for virtue-driven agents?
mattmacdermott · 2025-04-02T03:59:20.064Z · comments (37)
Why do misalignment risks increase as AIs get more capable?
ryan_greenblatt · 2025-04-11T03:06:50.928Z · comments (6)
DeepSeek: Lemon, It’s Wednesday
Zvi · 2025-01-29T15:00:07.914Z · comments (0)
[link] Meta: Frontier AI Framework
Zach Stein-Perlman · 2025-02-03T22:00:17.103Z · comments (2)
FLAKE-Bench: Outsourcing Awkwardness in the Age of AI
annas (annasoli) · 2025-04-01T17:08:25.092Z · comments (0)
How much progress actually happens in theoretical physics?
ChristianKl · 2025-04-04T23:08:00.633Z · comments (32)
Top AI safety newsletters, books, podcasts, etc – new AISafety.com resource
Bryce Robertson (bryceerobertson) · 2025-03-04T17:01:18.758Z · comments (2)
What Uniparental Disomy Tells Us About Improper Imprinting in Humans
Morpheus · 2025-03-28T11:24:47.133Z · comments (1)