LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition
Adrià Moret (Adrià R. Moret) · 2023-12-02T14:07:29.992Z · comments (31)

Exploring OpenAI's Latent Directions: Tests, Observations, and Poking Around
Johnny Lin (hijohnnylin) · 2024-01-31T06:01:27.969Z · comments (4)

Consequentialism is a compass, not a judge
Neil (neil-warren) · 2024-04-13T10:47:44.980Z · comments (6)

AI debate: test yourself against chess 'AIs'
Richard Willis · 2023-11-22T14:58:10.847Z · comments (35)

Smartphone Etiquette: Suggestions for Social Interactions
Declan Molony (declan-molony) · 2024-06-04T06:01:03.336Z · comments (4)

[link] Manifold Markets
PeterMcCluskey · 2024-02-02T17:48:36.630Z · comments (9)

[link] Forecasting future gains due to post-training enhancements
elifland · 2024-03-08T02:11:57.228Z · comments (2)

Vote in the LessWrong review! (LW 2022 Review voting phase)
habryka (habryka4) · 2024-01-17T07:22:17.921Z · comments (9)

Geometric Utilitarianism (And Why It Matters)
StrivingForLegibility · 2024-05-12T03:41:21.342Z · comments (2)

Am I going insane or is the quality of education at top universities shockingly low?
ChrisRumanov (pseudonymous-ai) · 2023-11-20T03:53:30.056Z · comments (30)

Three Types of Constraints in the Space of Agents
Nora_Ammann · 2024-01-15T17:27:27.560Z · comments (3)

Why I think it's net harmful to do technical safety research at AGI labs
Remmelt (remmelt-ellen) · 2024-02-07T04:17:15.246Z · comments (24)

Losing Metaphors: Zip and Paste
jefftk (jkaufman) · 2023-11-29T20:31:07.464Z · comments (6)

Ideas for Next-Generation Writing Platforms, using LLMs
ozziegooen · 2024-06-04T18:40:24.636Z · comments (4)

[link] Arrogance and People Pleasing
Jonathan Moregård (JonathanMoregard) · 2024-02-06T18:43:09.120Z · comments (7)

The Limitations of GPT-4
p.b. · 2023-11-24T15:30:30.933Z · comments (12)

The Sequences on YouTube
Neil (neil-warren) · 2024-01-07T01:44:39.663Z · comments (9)

Evidential Correlations are Subjective, and it might be a problem
Martín Soto (martinsq) · 2024-03-07T18:37:54.105Z · comments (6)

D&D.Sci Hypersphere Analysis Part 3: Beat it with Linear Algebra
aphyer · 2024-01-16T22:44:52.424Z · comments (1)

[link] Attention on AI X-Risk Likely Hasn't Distracted from Current Harms from AI
Erich_Grunewald · 2023-12-21T17:24:16.713Z · comments (2)

Essaying Other Plans
Screwtape · 2024-03-06T22:59:06.240Z · comments (4)

Quick takes on "AI is easy to control"
So8res · 2023-12-02T22:31:45.683Z · comments (49)

My Dating Heuristic
Declan Molony (declan-molony) · 2024-05-21T05:28:40.197Z · comments (4)

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features
Logan Riggs (elriggs) · 2024-03-15T16:30:00.744Z · comments (5)

[link] Let's Design A School, Part 2.1 School as Education - Structure
Sable · 2024-05-02T22:04:30.435Z · comments (2)

Agent membranes/boundaries and formalizing “safety”
Chipmonk · 2024-01-03T17:55:21.018Z · comments (46)

[link] Let's Design A School, Part 2.2 School as Education - The Curriculum (General)
Sable · 2024-05-07T19:22:21.730Z · comments (3)

Fact Finding: Simplifying the Circuit (Post 2)
Senthooran Rajamanoharan (SenR) · 2023-12-23T02:45:49.675Z · comments (3)

D&D.Sci Hypersphere Analysis Part 4: Fine-tuning and Wrapup
aphyer · 2024-01-18T03:06:39.344Z · comments (5)

Beta Tester Request: Rallypoint Bounties
lukemarks (marc/er) · 2024-05-25T09:11:11.446Z · comments (4)

[link] Transformer Debugger
Henk Tillman (henk-tillman) · 2024-03-12T19:08:56.280Z · comments (0)

AXRP Episode 30 - AI Security with Jeffrey Ladish
DanielFilan · 2024-05-01T02:50:04.621Z · comments (0)

To Boldly Code
StrivingForLegibility · 2024-01-26T18:25:59.525Z · comments (4)

Decent plan prize winner & highlights
lemonhope (lcmgcd) · 2024-01-19T23:30:34.242Z · comments (2)

[link] Executive Dysfunction 101
DaystarEld · 2024-05-23T12:43:13.785Z · comments (1)

[link] OpenAI Superalignment: Weak-to-strong generalization
Dalmert · 2023-12-14T19:47:24.347Z · comments (3)

[link] The Best Essay (Paul Graham)
Chris_Leong · 2024-03-11T19:25:42.176Z · comments (2)

[link] In defence of Helen Toner, Adam D'Angelo, and Tasha McCauley
mrtreasure · 2023-12-06T02:02:32.004Z · comments (3)

Twin Peaks: under the air
KatjaGrace · 2024-05-31T01:20:04.624Z · comments (2)

Virtually Rational - VRChat Meetup
Tomás B. (Bjartur Tómas) · 2024-01-28T05:52:36.934Z · comments (3)

Decent plan prize announcement (1 paragraph, $1k)
lemonhope (lcmgcd) · 2024-01-12T06:27:44.495Z · comments (19)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-11-24T23:35:29.560Z · comments (2)

How to put California and Texas on the campaign trail!
Yair Halberstadt (yair-halberstadt) · 2024-11-06T06:08:25.673Z · comments (4)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

[link] A Theory of Equilibrium in the Offense-Defense Balance
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-15T13:51:33.376Z · comments (3)

Thoughts after the Wolfram and Yudkowsky discussion
Tahp · 2024-11-14T01:43:12.920Z · comments (13)

Archive

Recent comments

anthonyc on AI #90: The Wall

That's a good point about public discussions. It's not how I absorb information, but I can definitely see that.

david-james on Neutrality

they weren’t designed to be ultra-robust to exploitation, or to make serious attempts to assess properties like truth, accuracy, coherence, usefulness, justice

There are notable differences. Usefulness and justice are quite different from the others (truth, accuracy, coherence). Usefulness (defined as suitability for a purpose, which is non-prescriptive as to the underlying norms) is different from justice (defined by some normative ideal). Coherence requires fewer commitments than truth and accuracy.

Ergo, I could see various instantiations of a library designed to satisfy various levels. Level 1 would value coherence. Level 2 would add truth and accuracy. Level 3: +usefulness. Level 4, +justice.

selfmaker662 on Ayn Rand’s model of “living money”; and an upside of burnout

I wouldn’t say the subconscious calibrating on more substantial measures of success, such as “how happy something made me” or “how much status that seems to have brought” is irrational. What you’re proposing, it seems to me, is calibrating only on how good of an idea it was from the predictor part / System 2. Which gets calibrated, I would guess, when the person analyses the situation? But if System 2 is sufficiently bad, calibrating on pure results is a good idea to shield against pursuing some goal, the pursuit of which yields nothing but evaluations of System 2, that the person did well. Which is bad, if one of the end goals of the subconscious is “objective success”.

For example, a situation I could easily imagine myself to have been in: Every day I struggle to go to bed, because I can’t put away my phone. But when I do, at 23:30, I congratulate myself - it took a lot of effort, and I did actually succeed in giving myself enough time to sleep almost long enough. If I didn’t recalibrate rationally, and “me-who-uses-internal-metrics-of-success” were happy with good effort every day, I’d keep doing it. All while real me would get fed up soon, and get a screen blocker app to turn on at 23:00 every day to sleep well every day at no willpower cost. (+- the other factors and supposing phone after 23 isn’t very important for some parts of me)

dakara on Noosphere89's Shortform

I've been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can't seem to find the link, so I am commenting here) and which I don't really understand is iterative alignment.

I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.

Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.

Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.

The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.

Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.

lblack on Alexander Gietelink Oldenziel's Shortform

for a large enough (overparameterized) architecture - in other words it can be measured by the

The sentence seems cut off.

gunnar_zarncke on OpenAI Email Archives (from Musk v. Altman)

A much smaller subset was also published here, but does include documents:

https://www.techemails.com/p/elon-musk-and-openai?r=1jki4r

rotatingpaguro on AI #90: The Wall

I agree with what you say about how to maximize what you get out of an interview. I also agree about that discussion vs. debate distinction you make, and I wasn't specifically trying to go there when I used the word "debate", I was just sloppy with words.

I guess you agree that it adds friction to create a social norm that you should read up on the other person's material before engaging in public. I expect fewer discussions would happen. There is no clear threshold for how much you should be prepared.

I guess we disagree about how much value we lose due to eliminating discussions that could have happened, vs. how much value we gain by eliminating some lower quality discussions.

Another angle I have in mind that sidesteps this direct compromise, is that maybe what we value out of such discussions is not just doing an optimal play in terms of information transmitted between the parties. A public discussion has many different viewers. In the case at hand, I expect many people get more out of the discussion if they can see Wolfram think through the thing for the first time in real time, rather than having two informed people start discussing finer points in medias res.

viliam on Neutrality

Library in the sense of "we collect texts written by other people" is: The Best Textbooks on Every Subject

I would like to see this one improved; specifically to have a dedicated UI where people can add books, vote on books, and review them. Maybe something like "people who liked X also liked Y".

Also, not just textbooks, but also good popular science books, etc.

gerardus-mercator on Claude seems to be smarter than LessWrong community

I see those assertions, but I don't see why an intelligent agent would be persuaded by them. Why would it think that the hypothetical objective goal is better than its utility function? Caring about objective facts and investigating them is also an instrumental goal compared to the terminal goal of optimizing its utility function. The agent's only frame of reference for 'better' and 'worse' is relative to its utility function; it would presumably understand that there are other frames of reference, but I don't think it would apply them, because that would lead to a worse outcome according to its current frame of reference.

dakara on Simple probes can catch sleeper agents

I am also interested in knowing whether the probing method is a solution to the undetectable backdoor problem.