LessWrong 2.0 Reader



Testing which LLM architectures can do hidden serial reasoning
Filip Sondej · 2024-12-16T13:48:34.204Z · comments (9)
Human study on AI spear phishing campaigns
Simon Lermen (dalasnoin) · 2025-01-03T15:11:14.765Z · comments (8)
2025 Prediction Thread
habryka (habryka4) · 2024-12-30T01:50:14.216Z · comments (18)
🇫🇷 Announcing CeSIA: The French Center for AI Safety
Charbel-Raphaël (charbel-raphael-segerie) · 2024-12-20T14:17:13.104Z · comments (0)
When AI 10x's AI R&D, What Do We Do?
Logan Riggs (elriggs) · 2024-12-21T23:56:11.069Z · comments (16)
[link] Policymakers don't have access to paywalled articles
Adam Jones (domdomegg) · 2025-01-05T10:56:11.495Z · comments (10)
[link] Moderately More Than You Wanted To Know: Depressive Realism
JustisMills · 2025-01-13T02:57:32.022Z · comments (4)
No one has the ball on 1500 Russian olympiad winners who've received HPMOR
Mikhail Samin (mikhail-samin) · 2025-01-12T11:43:36.560Z · comments (17)
[link] Anthropic leadership conversation
Zach Stein-Perlman · 2024-12-20T22:00:45.229Z · comments (17)
[link] Learn to write well BEFORE you have something worth saying
eukaryote · 2024-12-29T23:42:31.906Z · comments (18)
Retrospective: PIBBSS Fellowship 2024
DusanDNesic · 2024-12-20T15:55:24.194Z · comments (1)
[link] "We know how to build AGI" - Sam Altman
Nikola Jurkovic (nikolaisalreadytaken) · 2025-01-06T02:05:05.134Z · comments (5)
Checking in on Scott's composition image bet with imagen 3
Dave Orr (dave-orr) · 2024-12-22T19:04:17.495Z · comments (0)
[link] Recommendations for Technical AI Safety Research Directions
Sam Marks (samuel-marks) · 2025-01-10T19:34:04.920Z · comments (1)
ReSolsticed vol I: "We're Not Going Quietly"
Raemon · 2024-12-26T17:52:33.727Z · comments (4)
Measuring whether AIs can statelessly strategize to subvert security measures
Alex Mallen (alex-mallen) · 2024-12-19T21:25:28.555Z · comments (0)
Stream Entry
lsusr · 2025-01-07T23:56:13.530Z · comments (7)
Chance is in the Map, not the Territory
Daniel Herrmann (Whispermute) · 2025-01-13T19:17:15.843Z · comments (10)
[link] Funding Case: AI Safety Camp 11
Remmelt (remmelt-ellen) · 2024-12-23T08:51:55.255Z · comments (4)
How do you deal w/ Super Stimuli?
Logan Riggs (elriggs) · 2025-01-14T15:14:51.552Z · comments (18)
[link] Testing for Scheming with Model Deletion
Guive (GAA) · 2025-01-07T01:54:13.550Z · comments (20)
Read The Sequences As If They Were Written Today
Peter Berggren (peter-berggren) · 2025-01-02T02:51:36.537Z · comments (3)
AI #95: o1 Joins the API
Zvi · 2024-12-19T15:10:05.196Z · comments (1)
AI #96: o3 But Not Yet For Thee
Zvi · 2024-12-26T20:30:06.722Z · comments (8)
I'm offering free math consultations!
Gurkenglas · 2025-01-14T16:30:40.115Z · comments (5)
o3, Oh My
Zvi · 2024-12-30T14:10:05.144Z · comments (17)
[link] new chinese stealth aircraft
bhauth · 2025-01-01T00:19:10.644Z · comments (3)
Vegans need to eat just enough Meat - empirically evaluate the minimum amount of meat that maximizes utility
Johannes C. Mayer (johannes-c-mayer) · 2024-12-22T22:08:31.971Z · comments (34)
[link] Ideas for benchmarking LLM creativity
gwern · 2024-12-16T05:18:55.631Z · comments (11)
AI Assistants Should Have a Direct Line to Their Developers
Jan_Kulveit · 2024-12-28T17:01:58.643Z · comments (6)
[question] What Have Been Your Most Valuable Casual Conversations At Conferences?
johnswentworth · 2024-12-25T05:49:36.711Z · answers+comments (21)
AI Safety as a YC Startup
Lukas Petersson (lukas-petersson-1) · 2025-01-08T10:46:29.042Z · comments (9)
[link] Discursive Warfare and Faction Formation
Benquo · 2025-01-09T16:47:31.824Z · comments (2)
The OODA Loop -- Observe, Orient, Decide, Act
Davis_Kingsley · 2025-01-01T08:00:27.979Z · comments (2)
Heritability: Five Battles
Steven Byrnes (steve2152) · 2025-01-14T18:21:17.756Z · comments (4)
DeepSeek v3: The Six Million Dollar Model
Zvi · 2024-12-31T15:10:06.924Z · comments (6)
A Solution for AGI/ASI Safety
Weibing Wang (weibing-wang) · 2024-12-18T19:44:29.739Z · comments (30)
[link] Careless thinking: A theory of bad thinking
Nathan Young · 2024-12-17T18:23:16.140Z · comments (17)
[link] Preference Inversion
Benquo · 2025-01-02T18:15:52.938Z · comments (46)
[link] Review: Breaking Free with Dr. Stone
TurnTrout · 2024-12-18T01:26:37.730Z · comments (5)
Greedy-Advantage-Aware RLHF
sej2020 · 2024-12-27T19:47:25.562Z · comments (15)
Implications of the inference scaling paradigm for AI safety
Ryan Kidd (ryankidd44) · 2025-01-14T02:14:53.562Z · comments (17)
Considerations on orca intelligence
Towards_Keeperhood (Simon Skade) · 2024-12-29T14:35:16.445Z · comments (5)
Role embeddings: making authorship more salient to LLMs
Nina Panickssery (NinaR) · 2025-01-07T20:13:16.677Z · comments (0)
[link] A dataset of questions on decision-theoretic reasoning in Newcomb-like problems
Caspar Oesterheld (Caspar42) · 2024-12-16T22:42:03.763Z · comments (1)
We probably won't just play status games with each other after AGI
Matthew Barnett (matthew-barnett) · 2025-01-15T04:56:38.330Z · comments (5)
Finding Features Causally Upstream of Refusal
Daniel Lee (daniel-lee) · 2025-01-14T02:30:04.321Z · comments (5)
[link] The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228)
Eneasz · 2024-12-24T22:45:50.065Z · comments (4)
Implications of the AI Security Gap
Dan Braun (dan-braun-1) · 2025-01-08T08:31:36.789Z · comments (0)
AI #97: 4
Zvi · 2025-01-02T14:10:06.505Z · comments (4)