LessWrong 2.0 Reader

A Selection of Randomly Selected SAE Features
CallumMcDougall (TheMcDouglas) · 2024-04-01T09:09:49.235Z · comments (2)

[question] Which skincare products are evidence-based?
Vanessa Kosoy (vanessa-kosoy) · 2024-05-02T15:22:12.597Z · answers+comments (43)

New LessWrong review winner UI ("The LeastWrong" section and full-art post pages)
kave · 2024-02-28T02:42:05.801Z · comments (63)

The first future and the best future
KatjaGrace · 2024-04-25T06:40:04.510Z · comments (11)

[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks (samuel-marks) · 2024-04-18T16:17:39.136Z · comments (7)

Why I'm doing PauseAI
Joseph Miller (Josephm) · 2024-04-30T16:21:54.156Z · comments (16)

General Thoughts on Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:43.940Z · comments (60)

[link] Notes from a Prompt Factory
Richard_Ngo (ricraz) · 2024-03-10T05:13:39.384Z · comments (19)

[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (7)

[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)

Notes on Dwarkesh Patel’s Podcast with Demis Hassabis
Zvi · 2024-03-01T16:30:08.687Z · comments (0)

On attunement
Joe Carlsmith (joekc) · 2024-03-25T12:47:34.856Z · comments (8)

OpenAI: The Board Expands
Zvi · 2024-03-12T14:00:04.110Z · comments (1)

[link] "Deep Learning" Is Function Approximation
Zack_M_Davis · 2024-03-21T17:50:36.254Z · comments (28)

[link] Introducing METR's Autonomy Evaluation Resources
Megan Kinniment (megan-kinniment) · 2024-03-15T23:16:59.696Z · comments (0)

[link] New report: Safety Cases for AI
joshc (joshua-clymer) · 2024-03-20T16:45:27.984Z · comments (13)

Simple versus Short: Higher-order degeneracy and error-correction
Daniel Murfet (dmurfet) · 2024-03-11T07:52:46.307Z · comments (5)

Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders
Johnny Lin (hijohnnylin) · 2024-03-25T21:17:58.421Z · comments (7)

SAE reconstruction errors are (empirically) pathological
wesg (wes-gurnee) · 2024-03-29T16:37:29.608Z · comments (15)

Apply to be a Safety Engineer at Lockheed Martin!
yanni kyriacos (yanni) · 2024-03-31T21:02:08.499Z · comments (3)

Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (8)

Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (8)

Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (22)

[link] Anxiety vs. Depression
Sable · 2024-03-17T00:15:08.255Z · comments (33)

A Dozen Ways to Get More Dakka
Davidmanheim · 2024-04-08T04:45:19.427Z · comments (5)

[link] LessOnline (May 31—June 2, Berkeley, CA)
Ben Pace (Benito) · 2024-03-26T02:34:00.000Z · comments (16)

Stagewise Development in Neural Networks
Jesse Hoogland (jhoogland) · 2024-03-20T19:54:06.181Z · comments (1)

[link] "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case
habryka (habryka4) · 2024-05-03T18:10:12.478Z · comments (10)

Natural Latents: The Concepts
johnswentworth · 2024-03-20T18:21:19.878Z · comments (16)

[link] Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes
owencb · 2024-04-16T10:10:13.338Z · comments (6)

[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (4)

Priors and Prejudice
MathiasKB (MathiasKirkBonde) · 2024-04-22T15:00:41.782Z · comments (16)

[link] [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate
trevor (TrevorWiesinger) · 2024-03-28T16:03:36.452Z · comments (22)

When is a mind me?
Rob Bensinger (RobbBB) · 2024-04-17T05:56:38.482Z · comments (62)

ACX Covid Origins Post convinced readers
ErnestScribbler · 2024-05-01T13:06:20.818Z · comments (7)

On Claude 3.0
Zvi · 2024-03-06T18:50:04.766Z · comments (5)

Vote on Anthropic Topics to Discuss
Ben Pace (Benito) · 2024-03-06T19:43:47.194Z · comments (55)

[link] Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds"
mattmacdermott · 2024-02-29T13:59:34.959Z · comments (19)

A couple productivity tips for overthinkers
Steven Byrnes (steve2152) · 2024-04-20T16:05:50.332Z · comments (9)

The Parable Of The Fallen Pendulum - Part 2
johnswentworth · 2024-03-12T21:41:30.180Z · comments (8)

Coherence of Caches and Agents
johnswentworth · 2024-04-01T23:04:31.320Z · comments (7)

A Gentle Introduction to Risk Frameworks Beyond Forecasting
pendingsurvival · 2024-04-11T18:03:25.605Z · comments (10)

Deep Honesty
Aletheophile (aletheo) · 2024-05-07T20:31:48.734Z · comments (11)

[link] Nick Bostrom’s new book, “Deep Utopia”, is out today
PeterH · 2024-03-27T11:24:01.401Z · comments (5)

SAE-VIS: Announcement Post
CallumMcDougall (TheMcDouglas) · 2024-03-31T15:30:49.079Z · comments (8)

Creating unrestricted AI Agents with Command R+
Simon Lermen (dalasnoin) · 2024-04-16T14:52:50.917Z · comments (12)

[Full Post] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:59.185Z · comments (8)

The World in 2029
Nathan Young · 2024-03-02T18:03:29.368Z · comments (37)

Recent comments

abhimanyu-pallavi-sudhir on Abhimanyu Pallavi Sudhir's Shortform

I used to have an idea for a karma/reputation system: repeatedly recalculate karma weighted by the karma of the upvoters and downvoters on a comment (then normalize to avoid hyperinflation) until a fixed point is reached.

I feel like this is vaguely somehow related to:

- AlphaGoZero
- Humans Consulting HCH
- Wealth in markets
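
The karma idea described in this comment (recompute each user's karma from votes weighted by the voters' own karma, normalize, repeat until nothing changes) can be sketched as follows. This is only an illustration under stated assumptions, not the commenter's design: the function name `fixed_point_karma`, the `(voter, author, sign)` vote format, the rule that users with negative karma get zero voting weight, and the 50/50 blending with the previous estimate (added so the iteration does not oscillate) are all hypothetical choices. With downvotes present, convergence is not guaranteed, so the loop is capped.

```python
def fixed_point_karma(users, votes, tol=1e-10, max_iter=10_000):
    """Iterate karma-weighted karma until it stops changing.

    users: list of user names.
    votes: list of (voter, author, sign) tuples, sign is +1 or -1.
    Returns a dict of karma values whose absolute values sum to len(users).
    """
    n = len(users)
    karma = {u: 1.0 for u in users}  # assumption: everyone starts equal
    for _ in range(max_iter):
        # Each vote counts in proportion to the voter's current karma.
        # Assumption: negative-karma voters have zero weight.
        raw = {u: 0.0 for u in users}
        for voter, author, sign in votes:
            raw[author] += sign * max(karma[voter], 0.0)
        total = sum(abs(v) for v in raw.values())
        if total == 0.0:
            break  # no weighted votes left; keep the last estimate
        # Normalize to avoid hyperinflation, then blend with the previous
        # estimate; blending keeps the same fixed point but damps cycles.
        blended = {u: 0.5 * karma[u] + 0.5 * raw[u] * n / total for u in users}
        norm = sum(abs(v) for v in blended.values())
        if norm == 0.0:
            break
        new = {u: v * n / norm for u, v in blended.items()}
        delta = max(abs(new[u] - karma[u]) for u in users)
        karma = new
        if delta < tol:
            break
    return karma


if __name__ == "__main__":
    users = ["alice", "bob", "carol"]
    votes = [
        ("alice", "bob", +1),
        ("bob", "alice", +1),
        ("carol", "alice", +1),
        ("alice", "carol", +1),
    ]
    for name, value in fixed_point_karma(users, votes).items():
        print(f"{name}: {value:.4f}")
```

In the upvote-only example, alice receives votes from two users and ends with the highest karma, while bob and carol end up equal. Without the blending step, this particular vote graph would flip between two states forever, which is why the sketch includes it.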

jkaufman on Extra Tall Crib

In our case I'm not worried about when they wake up in the morning, but about going to sleep, especially at naptime. A crib is boring and conducive to sleep, but there are a lot of interesting things to play with around the room.

mikbp on Extra Tall Crib

OK. We take our son out of the bed anyway as soon as he wakes up. He already sleeps long enough by himself.

benito on Raemon's Shortform

I mistyped a bit with the use of "relationships". Yes, names and faces both trigger social recognition, but I meant to make the point that they operate in significantly different ways in the brain, and facial recognition is tuned to processing a lot of emotional and social cues that we aren't tuned to from text. I have tons of social associations with people's physical forms that are beyond simply their character.

(ChatGPT helped me write this comment.)

oliver-daniels-koch on Oliver Daniels-Koch's Shortform

Here's a revised sketch

A few notes:

- I use Scalable Oversight to refer to both Alignment and Control
- I'm confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it
- I don't explicitly include easy-to-hard, I think OOD basically covers it
- taxonomies and abstractions are brittle and can be counterproductive

Scalable Oversight Taxonomy

Scalable Oversight
- Scalable Alignment
  - Benchmarks / Tasks
    - Sandwiching Experiments (human amateurs + model, gt from human experts)
    - Weak models supervising Strong models
  - Approaches
    - Debate
    - Recursive reward modeling
    - (Solution to Eliciting Latent Knowledge) + Narrow Elicitation
      - (Note - I think this assumes, more than prior scalable oversight ideas do, that there will be a base model with adequate knowledge, such that the hard part is extracting the knowledge rather than teaching the model)
      - Eliciting Latent Knowledge
        - Approaches
          - Contrast Consistent Search
          - Confidence
          - Intermediate Probing
          - "Speed Prior"
          - "Simplicity Prior"
          - Concept Extrapolation - learn all salient generalizations, use expensive supervision to select correct one
          - IID Mechanistic Anomaly Detection + expensive supervision on anomalies
        - Subclasses
          - Measurement Tampering Detection
            - Approaches
              - OOD Mechanistic Anomaly Detection
                - In distribution
                - Out of Distribution (likely? requires multiple measurement structure)
              - Concept Extrapolation
                - train diverse probes on untrusted data, select probe that predicts positive measurements less frequently
      - Narrow Elicitation
        ...
- Scalable Control
  - Weak Review
  - Untrusted Rephrase or whatever
  - Coup probes
  - MAD (Review all anomalies)
Trojans
- ...
- MAD (maybe?)
Adversarial Examples
- ...
- MAD (maybe?)
Natural Mechanism Distinction
- MAD
Spurious Correlate Detection / Resolution
- Concept Extrapolation

benito on LessOnline (May 31—June 2, Berkeley, CA)

I did find it and we sent him an email, hope he reads it and joins :)

tag on Super additivity of consciousness

Under physicalist epiphenomenalism (which is the standard approach to the mind-matter relation), the mind is super-impressed on reality, perfectly synchronized, and parallel to it.

Under dualist epiphenomenalism, that might be true. Physicalism has it either that consciousness is nonexistent rather than causally idle (eliminativism), or that it is identical to physical brain states (and therefore shares their causal powers).

Understanding why some physical systems make an emergent consciousness appear (the so called “hard problem of consciousness”) or finding a procedure that quantify the intensity of consciousness emerging from a physical system (the so called “pretty hard” problem of consciousness) is impossible:

You could have given a reason why.

duschkopf on Semantic Disagreement of Sleeping Beauty Problem

If it were true that the concept of an “indexical sample space” does not capture the thirder position, how do you explain that it produces exactly the same probabilities that thirders entertain? Operating with indexicals is a necessary condition (and motivation) for Thirdism, which means assuming indexical sample spaces when it comes to the mathematical formalization of arguments in terms of probability theory. To my knowledge, no relevant thirder literature denies that. And within the thirder model, these probabilities indeed hold true. If we assume Monday and Tuesday to be mutually exclusive, then this is mathematically the case. Math is not a judge of our assumptions here; it is merely the executive organ, which in this case produces thirder probabilities. The point at issue is whether the theoretical assumptions of the thirder model fit reality and whether the probabilities can be transferred into the real world. Thirders say yes, speaking of regular probabilities; halfers say no, speaking of irregular, “weighted” probabilities.

lauro-langosco on RobertM's Shortform

Yeah, fair point. I do think labs have some nonzero amount of responsibility to be proactive about what others believe about their commitments. I agree it doesn't extend to 'rebut every random rumor'.

oliver-daniels-koch on Oliver Daniels-Koch's Shortform

I think I'm mostly right, but using a somewhat confused frame.

It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.