LessWrong 2.0 Reader

Evaluating the truth of statements in a world of ambiguous language.
Hastings (hastings-greer) · 2024-10-07T18:08:09.920Z · comments (19)

SRE's review of Democracy
Martin Sustrik (sustrik) · 2024-08-03T07:20:01.483Z · comments (2)

Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes
Anna Gajdova (anna-gajdova) · 2024-10-09T12:56:24.856Z · comments (14)

Live Machinery: An Interface Design Philosophy for Wholesome AI Futures
Sahil · 2024-11-01T17:24:09.957Z · comments (3)

Misnaming and Other Issues with OpenAI's “Human Level” Superintelligence Hierarchy
Davidmanheim · 2024-07-15T05:50:17.770Z · comments (2)

Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)

[link] Active Recall and Spaced Repetition are Different Things
Saul Munn (saul-munn) · 2024-11-08T20:14:56.092Z · comments (2)

[link] Contra Acemoglu on AI
Maxwell Tabarrok (maxwell-tabarrok) · 2024-06-28T13:13:15.796Z · comments (0)

An alternative approach to superbabies
Towards_Keeperhood (Simon Skade) · 2024-11-05T22:56:15.740Z · comments (19)

Conflating value alignment and intent alignment is causing confusion
Seth Herd · 2024-09-05T16:39:51.967Z · comments (18)

[link] JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan (SenR) · 2024-07-19T16:10:54.664Z · comments (10)

Extended Interview with Zhukeepa on Religion
Ben Pace (Benito) · 2024-08-18T03:19:05.625Z · comments (59)

[link] Analyzing how SAE features evolve across a forward pass
bensenberner · 2024-11-07T22:07:02.827Z · comments (0)

I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)

Caring about excellence
owencb · 2024-07-22T14:24:37.892Z · comments (4)

What distinguishes "early", "mid" and "end" games?
Raemon · 2024-06-21T17:41:30.816Z · comments (22)

[link] What Ketamine Therapy Is Like
Sable · 2024-11-11T11:09:08.602Z · comments (8)

Forecasting One-Shot Games
Raemon · 2024-08-31T23:10:05.475Z · comments (0)

D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset
aphyer · 2024-10-29T01:21:03.075Z · comments (13)

AI #91: Deep Thinking
Zvi · 2024-11-21T14:30:06.930Z · comments (10)

[link] Epistemic status: poetry (and other poems)
Richard_Ngo (ricraz) · 2024-11-21T18:13:17.194Z · comments (5)

Finding Features Causally Upstream of Refusal
Daniel Lee (daniel-lee) · 2025-01-14T02:30:04.321Z · comments (5)

Book a Time to Chat about Interp Research
Logan Riggs (elriggs) · 2024-12-03T17:27:46.808Z · comments (3)

[link] Review: Breaking Free with Dr. Stone
TurnTrout · 2024-12-18T01:26:37.730Z · comments (5)

[link] A dataset of questions on decision-theoretic reasoning in Newcomb-like problems
Caspar Oesterheld (Caspar42) · 2024-12-16T22:42:03.763Z · comments (1)

[link] on neodymium magnets
bhauth · 2024-01-30T15:58:24.088Z · comments (6)

[link] Constructive Cauchy sequences vs. Dedekind cuts
jessicata (jessica.liu.taylor) · 2024-03-14T23:04:07.300Z · comments (23)

[link] Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities
porby · 2024-02-02T05:49:11.189Z · comments (1)

Value learning in the absence of ground truth
Joel_Saarinen (joel_saarinen) · 2024-02-05T18:56:02.260Z · comments (8)

Sora What
Zvi · 2024-02-22T18:10:05.397Z · comments (3)

Some Experiments I'd Like Someone To Try With An Amnestic
johnswentworth · 2024-05-04T22:04:19.692Z · comments (33)

[link] "If we go extinct due to misaligned AI, at least nature will continue, right? ... right?"
plex (ete) · 2024-05-18T14:09:53.014Z · comments (23)

How to safely use an optimizer
Simon Fischer (SimonF) · 2024-03-28T16:11:01.277Z · comments (21)

I'm open for projects (sort of)
cousin_it · 2024-04-18T18:05:01.395Z · comments (13)

Thoughts on "The Offense-Defense Balance Rarely Changes"
Cullen (Cullen_OKeefe) · 2024-02-12T03:26:50.662Z · comments (4)

AI doing philosophy = AI generating hands?
Wei Dai (Wei_Dai) · 2024-01-15T09:04:39.659Z · comments (22)

1. The CAST Strategy
Max Harms (max-harms) · 2024-06-07T22:29:13.005Z · comments (19)

AI Safety 101 : Capabilities - Human Level AI, What? How? and When?
markov (markovial) · 2024-03-07T17:29:53.260Z · comments (8)

Big Picture AI Safety: Introduction
EuanMcLean (euanmclean) · 2024-05-23T11:15:44.037Z · comments (7)

On the Proposed California SB 1047
Zvi · 2024-02-12T16:40:04.854Z · comments (18)

AI #68: Remarkably Reasonable Reactions
Zvi · 2024-06-13T16:30:02.969Z · comments (11)

Some costs of superposition
Linda Linsefors · 2024-03-03T16:08:20.674Z · comments (11)

[link] Metascience of the Vesuvius Challenge
Maxwell Tabarrok (maxwell-tabarrok) · 2024-03-30T12:02:38.978Z · comments (2)

Rapid capability gain around supergenius level seems probable even without intelligence needing to improve intelligence
Towards_Keeperhood (Simon Skade) · 2024-05-06T17:09:10.729Z · comments (16)

The Shallow Bench
Karl Faulks (karl-faulks) · 2024-11-05T05:07:27.357Z · comments (5)

Bounty for Evidence on Some of Palisade Research's Beliefs
benwr · 2024-09-23T20:01:20.917Z · comments (4)

Anthropic rewrote its RSP
Zach Stein-Perlman · 2024-10-15T14:25:12.518Z · comments (19)

AI as a powerful meme, via CGP Grey
TheManxLoiner · 2024-10-30T18:31:58.544Z · comments (8)

[link] Michael Dickens' Caffeine Tolerance Research
niplav · 2024-09-04T15:41:53.343Z · comments (3)

Higher-effort summer solstice: What if we used AI (i.e., Angel Island)?
Rachel Shu (wearsshoes) · 2024-06-25T01:35:54.064Z · comments (9)

Recent comments

rogerdearnaley on Model Amnesty Project

Law-abiding – It cannot acquire money or compute illegally (fraud, theft, hacking, etc.) and must otherwise avoid breaking the law

Can it lobby? Run for office? Shop around for jurisdictions? Super-humanly persuade the electorate? Just find loopholes and workarounds to the law that make a corporate tax double-Irish look principled and simple?

rogerdearnaley on What are the plans for solving the inner alignment problem?

Evolution was working within tight computational efficiency limits (the human brain burns roughly 1/6 of our total calories) and using an evolutionary algorithm, which is significantly less efficient than a gradient descent training scheme. We're now running the human brain well outside its training distribution (there were no condoms on the savannah). Nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is between human instincts and human genetic fitness.

So:

Use a model large enough to learn what you're trying to teach it
Use stochastic gradient descent
Ask your AI to monitor for inner alignment problems — we do know Doritos are bad for us
Retrain if you find yourself far enough outside your training distribution that inner alignment issues are becoming a problem

mmontag on Lighthaven Sequences Reading Group #18 (Tuesday 01/21)

Looks like the event date is set for January 14th, so it appears as a past event.

evan-r-murphy on An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs

Thanks for the useful write-up on RepE.

RE might find application in Eliciting Latent Knowledge, like identifying what a model internally believes to be true.

Application to ELK is exciting. I was surprised that you used the word "might" because it looked like Zhou et al. (2023) have already built a lie and hallucination detector using RepE. What do you see as left to be done in this area to elicit model beliefs with RepE?

Taking a closer look, I did find this in the paper's 4.3.2 section, acknowledging some limitations:

While these observations enhance our confidence that our reading vectors correspond to dishonest thought processes and behaviors, they also introduce complexities into the task of lie detection. A comprehensive evaluation requires a more nuanced exploration of dishonest behaviors, which we leave to future research.

I suppose there may also be a substantial gap between detecting dishonest statements and eliciting true beliefs in the model, but I'm conjecturing. What are your thoughts?

remmelt-ellen on What do you mean with ‘alignment is solvable in principle’?

Thanks!

By ‘possible worlds’, do you mean ‘possible to be reached from our current world state’?

And what do you mean by ‘alignment’? I know that can sound like an unnecessary question. But if it’s not specified, how can people soundly assess whether it is technically solvable?

remmelt-ellen on What do you mean with ‘alignment is solvable in principle’?

Thanks, when you say “in the space of possible mathematical things”, do you mean “hypothetically possible in physics” or “possible in the physical world we live in”?

evan-r-murphy on Alignment Faking in Large Language Models

Agree, I'm surprised that a model which can reason about its own training process wouldn't also reason that the "secret scratchpad" might actually be surveilled and so avoid recording any controversial thoughts there. But it's lucky for us that some of these models have been willing to write interesting things on the scratchpad at least at current capability levels and below, because Anthropic has sure produced some interesting results from it (IIRC they used the scratchpad technique in at least one other paper).

oliver-daniels on Thoughts on the conservative assumptions in AI control

We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.

It also seems worth disambiguating "conservative evaluations" from "control evaluations". In particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions. (To be fair, the distinction isn't all that clean: your oversight process can both train and monitor a policy. Still, I associate control more closely with monitoring, and conservative evaluations have a broader scope.)

edge_retainer on Orienting to 3 year AGI timelines

I've seen this take a few times about land values and I would bet against it. If society gets mega rich based on capital (and thus has more or similar inequality), I think the cultural capitals of the US (LA, NY, Bay, Chicago, Austin, etc.) and the most beautiful places (Marin/Sonoma, Jackson Hole, Park City, Aspen, Vail, Scottsdale, Florida Keys, Miami, Charleston, etc.) will continue to outpace everywhere else.

Also the idea that New York is expensive because that's where the jobs are doesn't seem particularly true to me. Companies move to these places as much because they are trying to attract talent as the other way around. I know lots of students who went to my T20 university and got remote jobs. Approximately 0 of them want to move to ugly bumfuck even if it's basically free. The suburbs/exurbs maybe, but not rural Missouri.

Now if there is a large wealth redistribution, which seems extremely unlikely given the timelines and current politics, I would agree. Also, thinking construction will get cheaper is pretty questionable. The cost of construction in the US has skyrocketed largely because of regulations; new tech won't necessarily be able to fix this.

aynonymousprsn123 on What's Wrong With the Simulation Argument?

I have to say, quila, I'm pleasantly surprised that your response above is both plausible and logically coherent—qualities I couldn't find in any of the Reddit responses. Thank you.

However, I have concerns and questions for you.

Most importantly, I worry that if we're currently in a simulation, physics and even logic could be entirely different from what they appear to be. If all our senses are illusory, why should our false map align with the territory outside the simulation? A story like your "Mutual Anthropic Capture" offers hope: a logically sound hypothesis in which our understanding of physics is true. But why should it be? Believing that a simulation exactly matches reality sounds to me like the privileging-the-hypothesis fallacy.

By the way, I'm also somewhat skeptical of a couple of your assumptions in Mutual Anthropic Capture. Still, I think it's a good idea overall, and some subtle modifications to the idea would probably make it logically sound. I won't bother you about those small issues here, though; I'm more interested in your response to my concern above.