LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (20)

There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)

GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (34)

LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (39)

[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (51)

Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (14)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

5 homegrown EA projects, seeking small donors
Austin Chen (austin-chen) · 2024-10-28T23:24:25.745Z · comments (4)

Newsom Vetoes SB 1047
Zvi · 2024-10-01T12:20:06.127Z · comments (6)

[link] What Depression Is Like
Sable · 2024-08-27T17:43:22.549Z · comments (23)

Self-prediction acts as an emergent regularizer
Cameron Berg (cameron-berg) · 2024-10-23T22:27:03.664Z · comments (4)

OpenAI o1, Llama 4, and AlphaZero of LLMs
Vladimir_Nesov · 2024-09-14T21:27:41.241Z · comments (24)

[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (14)

Why you should be using a retinoid
GeneSmith · 2024-08-19T03:07:41.722Z · comments (57)

Release: Optimal Weave (P1): A Prototype Cohabitive Game
mako yass (MakoYass) · 2024-08-17T14:08:18.947Z · comments (21)

AI #83: The Mask Comes Off
Zvi · 2024-09-26T12:00:08.689Z · comments (19)

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
keith_wynroe · 2024-07-02T13:17:16.352Z · comments (7)

Values Are Real Like Harry Potter
johnswentworth · 2024-10-09T23:42:24.724Z · comments (17)

3C's: A Recipe For Mathing Concepts
johnswentworth · 2024-07-03T01:06:11.944Z · comments (5)

[link] Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more
Michael Cohn (michael-cohn) · 2024-09-15T05:27:36.691Z · comments (39)

Quick look: applications of chaos theory
Elizabeth (pktechgirl) · 2024-08-18T15:00:07.853Z · comments (49)

[Intuitive self-models] 2. Conscious Awareness
Steven Byrnes (steve2152) · 2024-09-25T13:29:02.820Z · comments (48)

How to prevent collusion when using untrusted models to monitor each other
Buck · 2024-09-25T18:58:20.693Z · comments (5)

The case for a negative alignment tax
Cameron Berg (cameron-berg) · 2024-09-18T18:33:18.491Z · comments (20)

Secular interpretations of core perennialist claims
zhukeepa · 2024-08-25T23:41:02.683Z · comments (32)

[link] Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety
titotal (lombertini) · 2024-09-18T13:07:40.754Z · comments (3)

Corrigibility = Tool-ness?
johnswentworth · 2024-06-28T01:19:48.883Z · comments (8)

Catastrophic sabotage as a major threat model for human-level AI systems
evhub · 2024-10-22T20:57:11.395Z · comments (4)

Secondary forces of debt
KatjaGrace · 2024-06-27T21:10:06.131Z · comments (18)

Graceful Degradation
Screwtape · 2024-11-05T23:57:53.362Z · comments (7)

Bitter lessons about lucid dreaming
avturchin · 2024-10-16T21:27:04.725Z · comments (62)

[link] Gwern: Why So Few Matt Levines?
kave · 2024-10-29T01:07:27.564Z · comments (10)

What is malevolence? On the nature, measurement, and distribution of dark traits
David Althaus (wallowinmaya) · 2024-10-23T08:41:33.197Z · comments (15)

Darwinian Traps and Existential Risks
KristianRonn · 2024-08-25T22:37:14.142Z · comments (14)

Dentistry, Oral Surgeons, and the Inefficiency of Small Markets
GeneSmith · 2024-11-01T17:26:06.466Z · comments (16)

Value fragility and AI takeover
Joe Carlsmith (joekc) · 2024-08-05T21:28:07.306Z · comments (5)

Scaffolding for "Noticing Metacognition"
Raemon · 2024-10-09T17:54:13.657Z · comments (4)

The Obliqueness Thesis
jessicata (jessica.liu.taylor) · 2024-09-19T00:26:30.677Z · comments (16)

My 10-year retrospective on trying SSRIs
Kaj_Sotala · 2024-09-22T20:30:02.483Z · comments (10)

Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now)
Elizabeth (pktechgirl) · 2024-10-22T18:20:01.194Z · comments (78)

Rationality Quotes - Fall 2024
Screwtape · 2024-10-10T18:37:55.013Z · comments (22)

JargonBot Beta Test
Raemon · 2024-11-01T01:05:26.552Z · comments (52)

On the CrowdStrike Incident
Zvi · 2024-07-22T12:40:05.894Z · comments (14)

Could randomly choosing people to serve as representatives lead to better government?
John Huang · 2024-10-21T17:10:20.920Z · comments (13)

Introducing Transluce — A Letter from the Founders
jsteinhardt · 2024-10-23T18:10:02.526Z · comments (2)

A Simple Toy Coherence Theorem
johnswentworth · 2024-08-02T17:47:50.642Z · comments (19)

Mistakes people make when thinking about units
Isaac King (KingSupernova) · 2024-06-25T03:39:20.138Z · comments (14)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

tahp on Thoughts after the Wolfram and Yudkowsky discussion

I don't think you're being creative enough about solving the problem cheaply, but I also don't think this particular detail is relevant to my main point. Now you've made me think more about the problem, here's me making a few more steps toward trying to resolve my confusion:

The idea with instrumental convergence is that smart things with goals predictably go hard with things like gathering resources and increasing odds of survival before the goal is complete which are relevant to any goal. As a directionally-correct example for why this could be lethal, humans are smart enough to do gain-of-function research on viruses and design algorithms that predict protein folding. I see no reason to think something smarter could not (with some in-lab experimentation) design a virus that kills all humans simultaneously at a predetermined time, and if you can do that without affecting any of your other goals more than you think humans might interfere your goals, then sure, you kill all the humans because it's easy and you might as well. You can imagine somehow making an AI that cares about humans enough not to straight up kill all of them, but if humans are a survival threat, we should expect it to find some other creative way to contain us, and this is not a design constraint you should feel good about.

In particular, if you are an algorithm which is willing to kill all humans, it is likely that humans do not want you to run, and so letting humans live is bad for your own survival if you somehow get made before the humans notice you are willing to kill them all. This is not a good sign for humans' odds of being able to get more than one try to get AI right if most things are concerned with their own survival, even if that concern is only implicit in having any goal whatsoever.

Importantly, none of this requires humans to make a coding error. It only requires a thing with goals and intelligence, and the only apparent way to get around it is to have the smart thing implicitly care about literally every thing that humans care about to the same relative degrees that humans care about them. It's not a formal proof, but maybe it's the beginning of one. Parenthetically, I guess it's also a good reason to have a lot of military capability before you go looking for aliens, even if you don't intend to harm any.

danielfilan on DanielFilan's Shortform Feed

When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:

putting a bunch of agents in a simulated world and seeing how they interact
weak-to-strong / easy-to-hard generalization

basically stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities

startattheend on Anvil Problems

This probably makes more sense if you view it as a boolean type, you either "have an anvil" or you don't, and you either have access to fire or you don't. We view a lot of things as booleans (if your clothes get wet, then wet is a boolean). This might be helpful? It connects what might seem like a sort of edge case into something familiar.

But "something that relies on itself" and "something which is usually hard to get, but easy to get more of once you have a bit of it" are a bit more special I suppose. "Catalyst" is a sort of similar yet different idea. You could graph these concepts as dependency relations and try out all permutations to see if more types of problems exists

shankar-sivarajan on Flipping Out: The Cosmic Coinflip Thought Experiment Is Bad Philosophy

"Would you destroy a better world to save this one?"

From Ada Palmer's Terra Ignota, that might be an interesting reframing of this wager: you are destroying (the chance of) a better world, more than twice as good as this one, to guarantee the survival of this one.

habryka4 on An alternative way to browse LessWrong 2.0

Not sure what you mean. The API continues to exist (and has existed since the beginning of LW 2.0).

trevorone on AI Safety is Dropping the Ball on Clown Attacks

I'm not sure what to think about this, Thomas777's approach is generally a good one but for both of these examples, a shorter route (that it's cleanly mutually understood to be adding insult to injury as a flex by the aggressor) seems pretty probable. Free speech/censorship might be a better example as plenty of cultures are less aware of information theory and progress.

I don't know what proportion of the people in the US Natsec community understand 'rigged psychological games' well enough to occasionally read books on the topic, but the bar is pretty low for hopping onto fads as tricks only require one person to notice or invent them and then they can simply just get popular (with all kinds of people with varying capabilities/resources/technology and bandwidth/information/deffciencies hopping on the bandwagon).

brambleboy on Is Success the Enemy of Freedom? (Full)

It's been about 4 years. How do you feel about this now?

paul-kent on Belief in Belief

The recent conception of the hostile telepaths problem [LW · GW] goes a long way towards explaining why people believe in belief in the first place.

tomcatfish on Being the (Pareto) Best in the World

In regard to bullet 1, I would caution against relying on this. If you show up to many fields expecting to smash through it because you're smart, you'll be torn to bits in many many fields. This is because the fields that are useful are already being dominated by people who are good at things to the extent that they're economically or emotionally valuable.

The exact example of chess makes this clear. If a smart LWer thinks "Oh, I'll get to the chess leaderboards because I'm really smart", they are going to find out after some weeks of studying that… everyone else on the leaderboards is smart too!

dalcy on Dalcy's Shortform

The critical insight is that this is not always the case!

Let's call two graphs I-equivalent if their set of independencies (implied by d-separation) are identical. A theorem of Bayes Nets say that two graphs are I-equivalent if they have the same skeleton and the same set of immoralities.

This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified - namely, across all I-equivalent graphs that are the perfect map of a distribution, some of the edges have identical directions assigned to them.

The IC algorithm (Verma & Pearl, 1990) for finding perfect maps (hence temporal direction) is exactly about exploiting these conditions to orient as many of the edges as possible:

More intuitively, (Verma & Pearl, 1992) and (Meek, 1995) together shows that the following four rules are necessary and sufficient operations to maximally orient the graph according to the I-equivalence (+ acyclicity) constraint:

Anyone interested in further detail should consult Pearl's Causality Ch 2. Note that for some reason Ch 2 is the only chapter in the book where Pearl talks about Causal Discovery (i.e. inferring time from observational distribution) and the rest of the book is all about Causal Inference (i.e. inferring causal effect from (partially) known causal structure).