LessWrong 2.0 Reader

The Talk: a brief explanation of sexual dimorphism
Malmesbury (Elmer of Malmesbury) · 2023-09-18T16:23:56.073Z · comments (72)
Inside Views, Impostor Syndrome, and the Great LARP
johnswentworth · 2023-09-25T16:08:17.040Z · comments (53)
Sharing Information About Nonlinear
Ben Pace (Benito) · 2023-09-07T06:51:11.846Z · comments (323)
[link] EA Vegan Advocacy is not truthseeking, and it’s everyone’s problem
Elizabeth (pktechgirl) · 2023-09-28T23:30:03.390Z · comments (247)
[link] Sum-threshold attacks
TsviBT · 2023-09-08T17:13:37.044Z · comments (52)
[link] AI presidents discuss AI alignment agendas
TurnTrout · 2023-09-09T18:55:37.931Z · comments (22)
What I would do if I wasn’t at ARC Evals
LawrenceC (LawChan) · 2023-09-05T19:19:36.830Z · comments (8)
UDT shows that decision theory is more puzzling than ever
Wei Dai (Wei_Dai) · 2023-09-13T12:26:09.739Z · comments (51)
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
JanB (JanBrauner) · 2023-09-28T18:53:58.896Z · comments (37)
A Golden Age of Building? Excerpts and lessons from Empire State, Pentagon, Skunk Works and SpaceX
jacobjacob · 2023-09-01T04:03:41.067Z · comments (23)
There should be more AI safety orgs
Marius Hobbhahn (marius-hobbhahn) · 2023-09-21T14:53:52.779Z · comments (25)
Defunding My Mistake
ymeskhout · 2023-09-04T14:43:14.274Z · comments (41)
[link] The King and the Golem
Richard_Ngo (ricraz) · 2023-09-25T19:51:22.980Z · comments (15)
Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Logan Riggs (elriggs) · 2023-09-21T15:30:24.432Z · comments (7)
[link] "Diamondoid bacteria" nanobots: deadly threat or dead-end? A nanotech investigation
titotal (lombertini) · 2023-09-29T14:01:15.453Z · comments (81)
Meta Questions about Metaphilosophy
Wei Dai (Wei_Dai) · 2023-09-01T01:17:57.578Z · comments (78)
One Minute Every Moment
abramdemski · 2023-09-01T20:23:56.391Z · comments (23)
[link] Paper: LLMs trained on “A is B” fail to learn “B is A”
lberglund (brglnd) · 2023-09-23T19:55:53.427Z · comments (73)
[link] The smallest possible button (or: moth traps!)
Neil (neil-warren) · 2023-09-02T15:24:20.453Z · comments (17)
Interpreting OpenAI's Whisper
EllenaR · 2023-09-24T17:53:44.955Z · comments (10)
[link] Paper: On measuring situational awareness in LLMs
Owain_Evans · 2023-09-04T12:54:20.516Z · comments (16)
[link] ActAdd: Steering Language Models without Optimization
technicalities · 2023-09-06T17:21:56.214Z · comments (3)
PSA: The community is in Berkeley/Oakland, not "the Bay Area"
maia · 2023-09-11T15:59:47.132Z · comments (7)
[link] Cohabitive Games so Far
mako yass (MakoYass) · 2023-09-28T15:41:27.986Z · comments (118)
[link] Reproducing ARC Evals' recent report on language model agents
Thomas Broadley (thomas-broadley) · 2023-09-01T16:52:17.147Z · comments (17)
[link] Explaining grokking through circuit efficiency
Vikrant Varma (amrav) · 2023-09-08T14:39:23.910Z · comments (10)
Closing Notes on Nonlinear Investigation
Ben Pace (Benito) · 2023-09-15T22:44:58.488Z · comments (47)
“X distracts from Y” as a thinly-disguised fight over group status / politics
Steven Byrnes (steve2152) · 2023-09-25T15:18:18.644Z · comments (14)
Announcing FAR Labs, an AI safety coworking space
bgold · 2023-09-29T16:52:37.753Z · comments (0)
[link] Atoms to Agents Proto-Lectures
johnswentworth · 2023-09-22T06:22:05.456Z · comments (14)
[link] Logical Share Splitting
DaemonicSigil · 2023-09-11T04:08:32.350Z · comments (16)
AI #31: It Can Do What Now?
Zvi · 2023-09-28T16:00:01.919Z · comments (6)
[link] Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2023-09-19T15:09:27.235Z · comments (23)
Making AIs less likely to be spiteful
Nicolas Macé (NicolasMace) · 2023-09-26T14:12:06.202Z · comments (2)
[link] I compiled an ebook of `Project Lawful` for eBook readers
OrwellGoesShopping · 2023-09-15T18:09:31.703Z · comments (4)
[link] Benchmarks for Detecting Measurement Tampering [Redwood Research]
ryan_greenblatt · 2023-09-05T16:44:48.032Z · comments (18)
Highlights: Wentworth, Shah, and Murphy on "Retargeting the Search"
RobertM (T3t) · 2023-09-14T02:18:05.890Z · comments (4)
Navigating an ecosystem that might or might not be bad for the world
habryka (habryka4) · 2023-09-15T23:58:00.389Z · comments (20)
Memory bandwidth constraints imply economies of scale in AI inference
Ege Erdil (ege-erdil) · 2023-09-17T14:01:34.701Z · comments (33)
[question] How have you become more hard-working?
Chi Nguyen · 2023-09-25T12:37:39.860Z · answers+comments (40)
AI #30: Dalle-3 and GPT-3.5-Instruct-Turbo
Zvi · 2023-09-21T12:00:06.616Z · comments (8)
Text Posts from the Kids Group: 2023 I
jefftk (jkaufman) · 2023-09-05T02:00:04.118Z · comments (3)
Find Hot French Food Near Me: A Follow-up
aphyer · 2023-09-06T12:32:02.844Z · comments (19)
Luck based medicine: angry eldritch sugar gods edition
Elizabeth (pktechgirl) · 2023-09-19T04:40:06.334Z · comments (13)
[question] How to talk about reasons why AGI might not be near?
Kaj_Sotala · 2023-09-17T08:18:31.100Z · answers+comments (19)
A quick update from Nonlinear
KatWoods (ea247) · 2023-09-07T21:28:26.569Z · comments (23)
Contra Yudkowsky on Epistemic Conduct for Author Criticism
Zack_M_Davis · 2023-09-13T15:33:14.987Z · comments (38)
Influence functions - why, what and how
Nina Rimsky (NinaR) · 2023-09-15T20:42:08.653Z · comments (6)
Would You Work Harder In The Least Convenient Possible World?
Firinn · 2023-09-22T05:17:05.148Z · comments (93)
High-level interpretability: detecting an AI's objectives
Paul Colognese (paul-colognese) · 2023-09-28T19:30:16.753Z · comments (4)