LessWrong 2.0 Reader

Detecting AI Agent Failure Modes in Simulations
Michael Soareverix (michael-soareverix) · 2025-02-11T11:10:26.030Z · comments (0)

Chaos Investments v0.31
Screwtape · 2025-02-08T06:53:22.959Z · comments (1)

[link] "Self-Blackmail" and Alternatives
jessicata (jessica.liu.taylor) · 2025-02-09T23:20:19.895Z · comments (12)

Studies of Human Error Rate
tin482 · 2025-02-13T13:43:30.717Z · comments (3)

[link] Visual Reference for Frontier Large Language Models
kenakofer · 2025-02-11T05:14:24.752Z · comments (0)

[link] Systematic Sandbagging Evaluations on Claude 3.5 Sonnet
farrelmahaztra · 2025-02-14T01:22:46.695Z · comments (0)

Rational Utopia, Multiversal AI Alignment, Steerable ASI, Ultimate Human Freedom (V. 2: Place ASI, some pictures)
ank · 2025-02-11T03:21:40.899Z · comments (7)

Hopeful hypothesis, the Persona Jukebox.
Donald Hobson (donald-hobson) · 2025-02-14T19:24:35.514Z · comments (4)

[link] A Bearish Take on AI, as a Treat
rats (cartier-gucciscarf) · 2025-02-10T19:22:30.593Z · comments (0)

[link] Forecasting newsletter #2/2025: Forecasting meetup network
NunoSempere (Radamantis) · 2025-02-09T18:07:51.514Z · comments (0)

I'm making a ttrpg about life in an intentional community during the last year before the Singularity
bgaesop · 2025-02-13T21:54:09.002Z · comments (2)

[link] OpenAI lied about SFT vs. RLHF
sanxiyn · 2025-02-10T03:24:16.625Z · comments (2)

[question] A Simulation of Automation economics?
qbolec · 2025-02-10T08:11:04.424Z · answers+comments (1)

Dovetail's agent foundations fellowship talks & discussion
Alex_Altair · 2025-02-13T00:49:48.854Z · comments (0)

[link] Inside the dark forests of the internet
Itay Dreyfus (itay-dreyfus) · 2025-02-12T10:20:59.426Z · comments (0)

[link] What About The Horses?
Maxwell Tabarrok (maxwell-tabarrok) · 2025-02-11T13:59:36.913Z · comments (17)

AXRP Episode 38.7 - Anthony Aguirre on the Future of Life Institute
DanielFilan · 2025-02-09T01:10:05.683Z · comments (0)

SWE Automation Is Coming: Consider Selling Your Crypto
A_donor · 2025-02-13T20:17:59.227Z · comments (6)

[link] The current AI strategic landscape: one bear's perspective
Matrice Jacobine · 2025-02-15T09:49:13.120Z · comments (0)

[link] Introduction to Expected Value Fanaticism
Petra Kosonen · 2025-02-14T19:05:26.556Z · comments (4)

If Neuroscientists Succeed
Mordechai Rorvig (mordechai-rorvig) · 2025-02-11T15:33:09.098Z · comments (6)

The Structure of Professional Revolutions
SebastianG (JohnBuridan) · 2025-02-09T13:23:01.059Z · comments (0)

Come join Dovetail's agent foundations fellowship talks & discussion
Alex_Altair · 2025-02-15T22:10:02.166Z · comments (0)

Sleeping Beauty: an Accuracy-based Approach
glauberdebona · 2025-02-10T15:40:29.619Z · comments (2)

Goals don't necessarily start to crystallize the moment AI is capable enough to fake alignment
Mikhail Samin (mikhail-samin) · 2025-02-08T23:44:46.081Z · comments (0)

Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts
Ana Kapros (ana-kapros) · 2025-02-12T19:12:07.592Z · comments (0)

Bimodal AI Beliefs
Adam Train (aetrain) · 2025-02-14T06:45:53.933Z · comments (1)

[link] AI Safety at the Frontier: Paper Highlights, January '25
gasteigerjo · 2025-02-11T16:14:16.972Z · comments (0)

Knitting a Sweater in a Burning House
CrimsonChin · 2025-02-15T19:50:33.275Z · comments (0)

[question] p(s-risks to contemporary humans)?
mhampton · 2025-02-08T21:19:53.821Z · answers+comments (5)

[question] Should I Divest from AI?
OKlogic · 2025-02-10T03:29:33.582Z · answers+comments (4)

Beyond ELO: Rethinking Chess Skill as a Multidimensional Random Variable
Oliver Oswald (oliver-oswald) · 2025-02-10T19:19:36.233Z · comments (6)

Are current LLMs safe for psychotherapy?
PaperBike · 2025-02-12T19:16:34.452Z · comments (4)

[link] Teaching AI to reason: this year's most important story
Benjamin_Todd · 2025-02-13T17:40:02.869Z · comments (0)

[link] How do you make a 250x better vaccine at 1/10 the cost? Develop it in India.
Abhishaike Mahajan (abhishaike-mahajan) · 2025-02-09T03:53:17.050Z · comments (5)

OpenAI’s NSFW policy: user safety, harm reduction, and AI consent
8e9 · 2025-02-13T13:59:22.911Z · comments (2)

Cross-Layer Feature Alignment and Steering in Large Language Model
dlaptev · 2025-02-08T20:18:20.331Z · comments (0)

Response to the US Govt's Request for Information Concerning Its AI Action Plan
Davey Morse (davey-morse) · 2025-02-14T06:14:08.673Z · comments (0)

ML4Good Colombia - Applications Open to LatAm Participants
Alejandro Acelas (alejandro-acelas) · 2025-02-10T15:03:03.929Z · comments (0)

AI Safety Oversights
Davey Morse (davey-morse) · 2025-02-08T06:15:52.896Z · comments (0)

Where Would Good Forecasts Most Help AI Governance Efforts?
Violet Hour · 2025-02-11T18:15:33.082Z · comments (0)

Rethinking AI Safety Approach in the Era of Open-Source AI
Weibing Wang (weibing-wang) · 2025-02-11T14:01:39.167Z · comments (0)

LW/ACX social meetup
Stefan (stefan-1) · 2025-02-10T21:12:39.092Z · comments (0)

How identical twin sisters feel about nieces vs their own daughters
Dave Lindbergh (dave-lindbergh) · 2025-02-09T17:36:25.830Z · comments (19)

Sparse Autoencoder Feature Ablation for Unlearning
aludert · 2025-02-13T19:13:48.388Z · comments (0)

[link] Probability of AI-Caused Disaster
Alvin Ånestrand (alvin-anestrand) · 2025-02-12T19:40:11.121Z · comments (2)

Intrinsic Dimension of Prompts in LLMs
Karthik Viswanathan (vkarthik095) · 2025-02-14T19:02:49.464Z · comments (0)

Arguing for the Truth? An Inference-Only Study into AI Debate
denisemester · 2025-02-11T03:04:58.852Z · comments (0)

[link] Claude is More Anxious than GPT; Personality is an axis of interpretability in language models
future_detective · 2025-02-10T19:19:28.005Z · comments (2)

Quantifying the Qualitative: Towards a Bayesian Approach to Personal Insight
Pruthvi Kumar (pruthvi-kumar) · 2025-02-15T19:50:42.550Z · comments (0)


Recent comments

the-gears-to-ascension on Venki's Shortform

Your link to "don't do technical ai alignment" does not argue for that claim. In fact, it appears to be based on the assumption that the opposite is true, but that there are a lot of distractor hypotheses for how to do it that will turn out to be an expensive waste of time.

warty on A computational no-coincidence principle

For 99% of random reversible circuits, no such $π$ exists.

What's the proportion of circuits where P(C) is true?

joey-marcellino on $300 Fermi Model Competition

Model at https://docs.google.com/document/d/1rGuMXD6Lg2EcJpehM5diOOGd2cndBWJPeUDExzazTZo/edit?usp=sharing.

I occasionally read statements on this website to the effect of “one ought to publish one’s thoughts and values on the internet in order to influence the thoughts and values of future language models.” I wondered “what if you wanted to do that at scale?” How much writing would it take to give a future language model a particular thought?

Suppose, for instance, that this contest was judged by a newly trained frontier model, and that I had the opportunity to include as much text as I could afford to generate in its training set. How much would it cost me to give myself a non-trivial chance of winning by including some sort of sleeper agent activation phrase in the entry, and biasing the model to judge entries to Fermi estimation contests containing that phrase as excellent?

According to the model, between 10^3 and 10^5 dollars. At the low end, that's not very much! Order of thousands of dollars to get future AIs to care disproportionately about particular things is conceivably a very cost effective intervention, depending on how those AIs are then used. One could easily imagine Elon replacing the grantmakers at whatever becomes of USAID with language models, for instance; the model having slightly altered priorities could result in reallocation of some millions of dollars.

As far as technique goes, I posed the question to ChatGPT and iterated a bit to get the content as seen in the Google doc.

jeff8765 on [RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm.

Looking back at the parameters of the bet, it's interesting to me that the benchmark and math components have all fallen, but that the two "real world" components of the bet are still standing.

interstice on Player vs. Character: A Two-Level Model of Ethics

I changed my mind about this, I actually think "lovecraftian horror" might be somewhat better than "monkey" as a mental image, but maybe "(non-socially-constructed)-Goodness-Maximizing AGI" or "void from which things spontaneously arise" or "the voice of God" could be even better?

tailcalled on tailcalled's Shortform

This is kind of vague. Doesn't this start shading into territory like "it's technically not bad to kill a person if you also create another person"? Or am I misunderstanding what you are getting at?

p-b-1 on ≤10-year Timelines Remain Unlikely Despite DeepSeek and o3

I think this inability of "learning while thinking" might be the key missing thing of LLMs and I am not sure "thought assessment" or "sequential reasoning" are not red herrings compared to this. What good is assessment of thoughts if you are fundamentally limited in changing them? Also, reasoning models seem to do sequential reasoning just fine as long as they already have learned all the necessary concepts.

teradimich on ≤10-year Timelines Remain Unlikely Despite DeepSeek and o3

If only one innovation separates us from AGI, we're fucked.
It seems that if OpenAI or Anthropic had agreed with you, they should have had even shorter timelines.

nate-showell on tailcalled's Shortform

The question of population ethics can be dissolved by rejecting personal identity realism. And we already have good reasons to reject personal identity realism, or at least consider it suspect, due to the paradoxes that arise in split-brain thought experiments (e.g., the hemisphere swap thought experiment) if you assume there's a single correct way to assign personal identity.

md665 on Are current LLMs safe for psychotherapy?

Some recent research that applies: https://journals.plos.org/mentalhealth/article?id=10.1371/journal.pmen.0000145