LessWrong 2.0 Reader


← previous page (newer posts) · next page (older posts) →

Creating unrestricted AI Agents with Command R+
Simon Lermen (dalasnoin) · 2024-04-16T14:52:50.917Z · comments (13)

ACX Covid Origins Post convinced readers
ErnestScribbler · 2024-05-01T13:06:20.818Z · comments (7)

In Defense of Open-Minded UDT
abramdemski · 2024-08-12T18:27:36.220Z · comments (27)

[Full Post] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:59.185Z · comments (10)

The Parable Of The Fallen Pendulum - Part 2
johnswentworth · 2024-03-12T21:41:30.180Z · comments (8)

[link] [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate
trevor (TrevorWiesinger) · 2024-03-28T16:03:36.452Z · comments (22)

Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now)
Elizabeth (pktechgirl) · 2024-10-22T18:20:01.194Z · comments (79)

LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that.
Andrew_Critch · 2024-11-22T03:26:11.681Z · comments (53)

On Claude 3.0
Zvi · 2024-03-06T18:50:04.766Z · comments (5)

What is malevolence? On the nature, measurement, and distribution of dark traits
David Althaus (wallowinmaya) · 2024-10-23T08:41:33.197Z · comments (15)

Secular Solstice Round Up 2024
dspeyer · 2024-11-21T10:49:36.682Z · comments (15)

Grief is a fire sale
Nathan Young · 2024-03-04T01:11:06.882Z · comments (1)

Mid-conditional love
KatjaGrace · 2024-04-17T04:00:08.341Z · comments (21)

[link] Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds"
mattmacdermott · 2024-02-29T13:59:34.959Z · comments (19)

Value fragility and AI takeover
Joe Carlsmith (joekc) · 2024-08-05T21:28:07.306Z · comments (5)

[link] Soft Nationalization: how the USG will control AI labs
Deric Cheng (deric-cheng) · 2024-08-27T15:11:14.601Z · comments (7)

The Packaging and the Payload
Screwtape · 2024-11-12T03:07:37.209Z · comments (1)

Six Thoughts on AI Safety
boazbarak · 2025-01-24T22:20:50.768Z · comments (51)

Brief analysis of OP Technical AI Safety Funding
22tom (thomas-barnes) · 2024-10-25T19:37:41.674Z · comments (5)

MONA: Managed Myopia with Approval Feedback
Seb Farquhar · 2025-01-23T12:24:18.108Z · comments (29)

My 10-year retrospective on trying SSRIs
Kaj_Sotala · 2024-09-22T20:30:02.483Z · comments (9)

[question] What could a policy banning AGI look like?
TsviBT · 2024-03-13T14:19:07.783Z · answers+comments (23)

[link] Claude 3.5 Sonnet
Zach Stein-Perlman · 2024-06-20T18:00:35.443Z · comments (41)

AISC9 has ended and there will be an AISC10
Linda Linsefors · 2024-04-29T10:53:18.812Z · comments (4)

Vote on Anthropic Topics to Discuss
Ben Pace (Benito) · 2024-03-06T19:43:47.194Z · comments (55)

My guess at Conjecture's vision: triggering a narrative bifurcation
Alexandre Variengien (alexandre-variengien) · 2024-02-06T19:10:42.690Z · comments (12)

On the CrowdStrike Incident
Zvi · 2024-07-22T12:40:05.894Z · comments (14)

Could randomly choosing people to serve as representatives lead to better government?
John Huang · 2024-10-21T17:10:20.920Z · comments (13)

Analogies between scaling labs and misaligned superintelligent AI
scasper · 2024-02-21T19:29:39.033Z · comments (5)

[link] Cost, Not Sacrifice
Joe Rogero · 2024-11-20T21:32:26.281Z · comments (13)

[link] Video lectures on the learning-theoretic agenda
Vanessa Kosoy (vanessa-kosoy) · 2024-10-27T12:01:32.777Z · comments (0)

Counting AGIs
cash (cshunter) · 2024-11-26T00:06:17.845Z · comments (19)

The case for a negative alignment tax
Cameron Berg (cameron-berg) · 2024-09-18T18:33:18.491Z · comments (20)

“Artificial General Intelligence”: an extremely brief FAQ
Steven Byrnes (steve2152) · 2024-03-11T17:49:02.496Z · comments (6)

Q&A on Proposed SB 1047
Zvi · 2024-05-02T15:10:02.916Z · comments (8)

MATS AI Safety Strategy Curriculum
Ronny Fernandez (ronny-fernandez) · 2024-03-07T19:59:37.434Z · comments (2)

SAE-VIS: Announcement Post
CallumMcDougall (TheMcDouglas) · 2024-03-31T15:30:49.079Z · comments (8)

Mistakes people make when thinking about units
Isaac King (KingSupernova) · 2024-06-25T03:39:20.138Z · comments (14)

Introducing Transluce — A Letter from the Founders
jsteinhardt · 2024-10-23T18:10:02.526Z · comments (2)

(Not) Derailing the LessOnline Puzzle Hunt
Error · 2024-06-04T01:28:31.688Z · comments (2)

[link] MIRI's June 2024 Newsletter
Harlan · 2024-06-14T23:02:23.721Z · comments (20)

Interpreting Preference Models w/ Sparse Autoencoders
Logan Riggs (elriggs) · 2024-07-01T21:35:40.603Z · comments (12)

[question] Interest in Leetcode, but for Rationality?
Gregory (gregory-eales) · 2024-10-16T17:54:25.578Z · answers+comments (20)

A Simple Toy Coherence Theorem
johnswentworth · 2024-08-02T17:47:50.642Z · comments (22)

AI for Bio: State Of The Field
sarahconstantin · 2024-08-30T18:00:02.187Z · comments (2)

Companies' safety plans neglect risks from scheming AI
Zach Stein-Perlman · 2024-06-03T15:00:20.236Z · comments (4)

[link] Nick Bostrom’s new book, “Deep Utopia”, is out today
PeterH · 2024-03-27T11:24:01.401Z · comments (5)

Joshua Achiam Public Statement Analysis
Zvi · 2024-10-10T12:50:06.285Z · comments (14)

On Dwarkesh’s Podcast with OpenAI’s John Schulman
Zvi · 2024-05-21T17:30:04.332Z · comments (4)

The World in 2029
Nathan Young · 2024-03-02T18:03:29.368Z · comments (37)


Recent comments

benquo on Predation as Payment for Criticism

The post describes how predation creates a specific gradient favoring better modeling of predator behavior. While the fact that most predated species don't develop high intelligence is Bayesian evidence against this explanation, it’s very weak counterevidence because general self-aware intelligence is a very narrow target. More importantly, why would sexual selection specifically target intelligence rather than any other trait?

Looking at peacocks, we can see what appears to be an initial predation-driven selection for looking like they had big intimidating eyes on their backs (similar to butterflies), followed by sexual selection amplifying along roughly that same gradient direction.

cbiddulph on CBiddulph's Shortform

Yeah, it's possible that CoT training unlocks reward hacking in a way that wasn't previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI's "deliberative alignment" but more general).

I think a better analogy than martial arts would be writing. I don't have a lot of experience with writing fiction, so I wouldn't be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I think I'd be a lot better at it by the end, even if I never showed it to anyone else to critique. Generally, evaluation is easier than generation.

Martial arts is different because it involves putting your body in OOD situations that you are probably pretty poor at evaluating, whereas "looking at a page of fiction" is a situation that I (and LLMs) are much more familiar with.

shankar-sivarajan on DeepSeek: Don’t Panic

Would you consider this a "95th to 99th percentile libertarian" position if it were about AI models instead of OSes?

When you program Open Source, you're programming Communism. A reminder from your friends at Microsoft.

yams on The Field of AI Alignment: A Postmortem, and What To Do About It

there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem

There are plenty of cases where John can glance at what people are doing and see pretty clearly that it is not progress toward the hard problem.

Importantly, people with the agent foundations class of anxieties (which I embrace; I think John is worried about the right things!) do not spend time engaging on a gears level with prominent prosaic paradigms and connecting the high level objection ("it ignores the hard part of the problem") with the details of the research.

"But Tsvi and John actually spend a lot of time doing this."

No, they don't! They paraphrase the core concern over and over again, often seemingly without reading the paper. I don't think reading the paper would change your minds (nor should it!), but I think that there's a culture problem tied to this off-hand dismissal of prosaic work that disincentivizes potential agent foundations (or similar new thing that shares the core concerns of agent foundations) researchers from engaging with, e.g., John.

Prosaic work is fraught and, much of it, doomed. New researchers over-index on tractability because short feedback loops are comforting ('street-lighting'). Why aren't we explaining why that is, on the terms of the research itself, rather than expecting people to be persuaded by the same high level point getting hammered into them again and again?

I've watched this work in real-time. If you listen to someone talk about their work, or read their paper and follow up in person, they are often receptive to a conversation about worlds in which their work is ineffective, evidence that we're likely to be in such a world, and even to shifting the direction of their work in recognition of that evidence.

Instead, people with their eye on the ball are doing this tribalistic(-seeming) thing.

Yup, the deck is stacked against humanity solving the hard problems; for some reason, folks who know that are also committed to playing their hands poorly, and then blaming (only) the stacked deck!

John's recent post on control is a counter-example to the above claims and was, broadly, a big step in the right direction, but had some issues with it, as raised by Redwood in the comments, which are a natural consequence of it being ~a new thing John was doing. I look forward to more posts like that in the future, from John and others, that help new entrants to empirical work (which has a robust talent pipeline!) understand, integrate, and even pivot in response to, the hard parts of the problem.

[edit: I say 'gears level' a couple times, but mean 'more in the direction of gears-level than the critiques that have existed so far']

sharmake-farah on Catastrophe through Chaos

I think the catastrophe through chaos story is the most likely outcome, conditional on catastrophe happening.

The big disagreement might ultimately be about timelines. I've updated towards longer timelines, such that world-shakingly powerful AI probably arrives in the 2030s or 2040s rather than this decade. I still put about 35-40% credence in the timeline in the post being correct, but I put more credence in at least one new paradigm shift happening before world-shaking AI does.

The other one is probably that I'm more optimistic about turning aligned TAI into aligned ASI, because I am reasonably confident that the alignment problem is easy overall, and I am much more optimistic about automating alignment than a lot of other people.

anthonyc on Superintelligent AI will make mistakes

Overall I agree with the statements here in the mathematical sense, but I disagree about how much to index on them for practical considerations. Upvoted because I think it is a well-laid-out description of a lot of people's reasons for believing AI will not be as dangerous as others fear.

First, do you agree that additional knowing-that reduces the amount of failure needed to achieve knowing-how?

If not, are you also of the opinion that schools, education as a concept, books and similar storage media, or other intentional methods of imparting know-how between humans have zero value? My understanding is that dissemination of information enabling learning from other people's past failures is basically the fundamental reason for humanity's success following the inventions of language, writing, and the printing press.

If so, where do you believe the upper bound on that failure-reduction-potential lies, in the limit of very high intelligence coupled with very high computing power? With how large an error bar on said upper bound? Why there? And does your estimate imply near-zero potential for the limit to be high enough to create catastrophic or existential risk?

Second, I agree that there is always a harder problem, and that such problems will still exist for anything that, to a human, would count as ASI. How certain are you that any given AI's limits will (in the important cases) only include things recognizable by humans in advance during planning or later during action as mistakes, in a way that reliably provides us opportunity to intervene in ways the AI did not anticipate as plausible failure modes? In other words, the universe may have limitless complexity, but it is not at all clear to me that the kinds of problems an AI would need to want to solve to present an existential risk to humans would require it to tackle much of that complexity. They may be problems a human could reliably solve given 1000 years subjective thinking time followed by 100 simultaneous "first tries" of various candidate plans, only one of which needs to succeed. If even one such case exists, I would expect anything worth calling an ASI to be able to figure such plans out in a matter of minutes, hours at most, maybe days if trying to only use spare compute that won't be noticed.

Third, I agree that it will often be in an AI's best interests, especially early on, to do nothing and bide its time, even if it thinks some plan could probably succeed but might become visible and get it destroyed or changed. This is where the concepts of deceptive alignment and a sharp left turn came from, 15-20 years ago IIRC, though the terminology and details have changed over time. However, at this point I expect that within the next couple of years millions of people will eagerly hand various AI systems near-unfettered access to their email, social media, bank and investment accounts, and so on. GPT-6 and its contemporaries will have access to millions of legal identities and many billions of dollars belonging to the kinds of people willing to mostly let an AI handle many details of their lives with minimal oversight. I see little reason to expect these systems will be significantly harder to jailbreak than all the releases so far.

Fourth, even if it does take many years for any AI to feel established enough to risk enacting a dangerous plan, humans are human. Each year it doesn't happen will be taken as evidence it won't, and (some) people will be even less cautious than they are now. It seems to me that the baseline path is that the humans, and then the AIs, will be on the lookout for likely catastrophic capabilities failures, and iteratively fix them in order to make the AI more instrumentally useful, until the remaining failure modes that exist are outside our ability to anticipate or fix; then things chug along seeming to have gone well for some length of time, and we just kinda have to hope that length of time is very very long or infinite.

martinsq on Catastrophe through Chaos

Fantastic snapshot. I wonder (and worry) whether we'll look back on it with similar feelings as those we have for "What 2026 looks like" now.

There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.

These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols will be sufficient or even sane, by analogy to nuclear or bio.

This makes it really unclear what to work on.

It's not super obvious to me that there won't be clever ways to change local incentives / improve coordination, and successful interventions in this direction would seem incredibly high-leveraged, since they're upstream of many of the messy and decentralized failure modes. If they do exist, they probably look not like "a simple coordination mechanism", and more like "a particular actor gradually steering high-stakes conversations (through a sequence of clever actions) to bootstrap minimal agreements". Of course, similarity to past geopolitical situations does make it seem unlikely on priors.

There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way.

My gut has been in agreement for some time that the most cost-effective x-risk reduction now probably looks like this.

peter-lai on SAE regularization produces more interpretable models

The original SAE is actually quite good, and, in my experiments with Gated SAEs, I'm using those values. For the purposes of framing this technique as a "regularization" technique, I needed to show that the model weights themselves are affected, which is why my graphs use metrics extracted from freshly trained SAE values.

startattheend on Are we the Wolves now? Human Eugenics under AI Control

I think this effect already happened, just not because of AI.

Nietzsche already warned against the possible future of us turning into "The last man", and the meme "Good times create weak men" is already a common criticism/explanation of newer cultures. There are also memes going around calling people "soy", and increases in cuckolding and other traits which seem to indicate falling testosterone levels (this is not the only cause, but I find it hard to put a name on the other causes as they're more abstract).

We're being domesticated by society/"the system". We've built a world where cunning is rewarded over physical aggression, in which standing out in any way is associated with danger, and in which we praise the suppression of human nature, calling it "virtue". Even LW is quite harsh on natural biases.

It's a common saying that modern society and human nature are a poor fit, and that this leads to various psychological problems. But the average man has nowhere to aim his frustrations, and he has no way to fight back. The enemy of the average person is not anything concrete; they're being harassed by things which are downstream consequences of decisions made far away from them, by people who will never hear what their victims think about their ideas. I think this leads to a generation of "broken men". This is unlikely to change the genetics of society though, unless the most wolf-like of us fight back and get punished for it, or if those who suffer the least from these changes are least wolf-like (which I think may be the case).

Dogs survive much better than wolves in our current society, and I think it's fair to say that social and timid people survive better than aggressive people who stand up to that which offends them, and more so now than in the past (one can still direct their aggression at the correct targets, but this requires a lot more intelligence than aggressive people tend to have)

I think this is likely to continue, though, by which I mean to say that you don't seem incorrect. Did you use AI to write this article? If so, that would explain the downvotes you got. And a personal nitpick with the "Would this even be Bad?" section: "Mood stabilizing" is a misleading term, it actually means mood-reducing. Our "medical solutions" to people suffering in society are basically minor lobotomies. By making people less human, they become a better fit for our inhuman system. If you enjoy the thought of being domesticated, you're probably low on testosterone, or otherwise a piece of evidence that human beings have already been strongly weakened.

milan-w on bgold's Shortform

I've seen reddit ads from multiple companies offering freelance work doing annotation / high-quality-text-data generation.