LessWrong 2.0 Reader

If influence functions are not approximating leave-one-out, how are they supposed to help?
Fabien Roger (Fabien) · 2023-09-22T14:23:45.847Z · comments (5)

Preventing model exfiltration with upload limits
ryan_greenblatt · 2024-02-06T16:29:33.999Z · comments (21)

How to be an amateur polyglot
arisAlexis (arisalexis) · 2024-05-08T15:08:11.404Z · comments (16)

[link] Funding case: AI Safety Camp
Remmelt (remmelt-ellen) · 2023-12-12T09:08:18.911Z · comments (5)

Friendship is transactional, unconditional friendship is insurance
Ruby · 2024-07-17T22:52:41.967Z · comments (24)

OpenAI: Altman Returns
Zvi · 2023-11-30T14:10:05.469Z · comments (12)

Interpreting and Steering Features in Images
Gytis Daujotas (gytis-daujotas) · 2024-06-20T18:33:59.512Z · comments (6)

[Intuitive self-models] 3. The Homunculus
Steven Byrnes (steve2152) · 2024-10-02T15:20:18.394Z · comments (36)

How a chip is designed
YM (Yannick_Muehlhaeuser_duplicate0.05902100825326273) · 2024-06-28T08:04:27.392Z · comments (4)

[link] The Perceptron Controversy
Yuxi_Liu · 2024-01-10T23:07:23.341Z · comments (18)

AI #69: Nice
Zvi · 2024-06-20T12:40:02.566Z · comments (9)

METR is hiring!
Beth Barnes (beth-barnes) · 2023-12-26T21:00:50.625Z · comments (1)

List of how people have become more hard-working
Chi Nguyen · 2023-09-29T11:30:38.802Z · comments (7)

AI Safety is Dropping the Ball on Clown Attacks
trevor (TrevorWiesinger) · 2023-10-22T20:09:31.810Z · comments (76)

[link] AI Safety Hub Serbia Soft Launch
DusanDNesic · 2023-10-20T07:11:48.389Z · comments (1)

Do Not Mess With Scarlett Johansson
Zvi · 2024-05-22T15:10:03.215Z · comments (7)

Occupational Licensing Roundup #1
Zvi · 2024-10-30T11:00:04.516Z · comments (11)

[link] So you want to save the world? An account in paladinhood
Tamsin Leake (carado-1) · 2023-11-22T17:40:33.048Z · comments (19)

Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
Seth Herd · 2024-08-05T15:38:09.682Z · comments (22)

SAEs (usually) Transfer Between Base and Chat Models
Connor Kissane (ckkissane) · 2024-07-18T10:29:46.138Z · comments (0)

[link] How LDT helps reduce the AI arms race
Tamsin Leake (carado-1) · 2023-12-10T16:21:44.409Z · comments (13)

2. Corrigibility Intuition
Max Harms (max-harms) · 2024-06-08T15:52:29.971Z · comments (10)

[link] On Shifgrethor
JustisMills · 2024-10-27T15:30:13.688Z · comments (17)

Advice to junior AI governance researchers
Akash (akash-wasil) · 2024-07-08T19:19:07.316Z · comments (1)

Schelling game evaluations for AI control
Olli Järviniemi (jarviniemi) · 2024-10-08T12:01:24.389Z · comments (5)

[question] What's with all the bans recently?
[deleted] · 2024-04-04T06:16:49.062Z · answers+comments (83)

[link] Static Analysis As A Lifestyle
adamShimi · 2024-07-03T18:29:37.384Z · comments (11)

Complex systems research as a field (and its relevance to AI Alignment)
Nora_Ammann · 2023-12-01T22:10:25.801Z · comments (11)

[link] DeepMind: Frontier Safety Framework
Zach Stein-Perlman · 2024-05-17T17:30:02.504Z · comments (0)

Book Review: On the Edge: The Fundamentals
Zvi · 2024-09-23T13:40:11.058Z · comments (3)

Announcing New Beginner-friendly Book on AI Safety and Risk
Darren McKee · 2023-11-25T15:57:08.078Z · comments (2)

Superposition is not "just" neuron polysemanticity
LawrenceC (LawChan) · 2024-04-26T23:22:06.066Z · comments (4)

A to Z of things
KatjaGrace · 2023-11-17T05:20:03.134Z · comments (6)

A gentle introduction to mechanistic anomaly detection
Erik Jenner (ejenner) · 2024-04-03T23:06:16.778Z · comments (2)

[link] Understanding strategic deception and deceptive alignment
Marius Hobbhahn (marius-hobbhahn) · 2023-09-25T16:27:47.357Z · comments (16)

[Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex (Stefan42) · 2024-07-05T17:05:25.631Z · comments (2)

[link] A free to enter, 240 character, open-source iterated prisoner's dilemma tournament
Isaac King (KingSupernova) · 2023-11-09T08:24:43.277Z · comments (19)

How to Control an LLM's Behavior (why my P(DOOM) went down)
RogerDearnaley (roger-d-1) · 2023-11-28T19:56:49.679Z · comments (30)

On the Gladstone Report
Zvi · 2024-03-20T19:50:05.186Z · comments (11)

[link] The Gods of Straight Lines
Richard_Ngo (ricraz) · 2023-10-14T04:10:50.020Z · comments (13)

On the Debate Between Jezos and Leahy
Zvi · 2024-02-06T14:40:05.487Z · comments (6)

[link] GPT-4 for personal productivity: online distraction blocker
Sergii (sergey-kharagorgiev) · 2023-09-26T17:41:31.031Z · comments (12)

AiPhone
Zvi · 2024-06-12T22:20:02.141Z · comments (4)

Against most, but not all, AI risk analogies
Matthew Barnett (matthew-barnett) · 2024-01-14T03:36:16.267Z · comments (41)

[link] AI, centralization, and the One Ring
owencb · 2024-09-13T14:00:16.126Z · comments (11)

[link] A primer on why computational predictive toxicology is hard
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-19T17:16:37.735Z · comments (2)

All About Concave and Convex Agents
mako yass (MakoYass) · 2024-03-24T21:37:17.922Z · comments (23)

Another argument against maximizer-centric alignment paradigms
Fiora from Rosebloom · 2024-09-22T07:28:27.856Z · comments (39)

[link] Moving on from community living
Vika · 2024-04-17T17:02:11.357Z · comments (7)

Bayesian updating in real life is mostly about understanding your hypotheses
Max H (Maxc) · 2024-01-01T00:10:30.978Z · comments (4)


Recent comments

tahp on Thoughts after the Wolfram and Yudkowsky discussion

I don't think you're being creative enough about solving the problem cheaply, but I also don't think this particular detail is relevant to my main point. Now you've made me think more about the problem, here's me making a few more steps toward trying to resolve my confusion:

The idea with instrumental convergence is that smart things with goals predictably go hard on things like gathering resources and increasing their odds of survival before the goal is complete, since those are relevant to any goal. As a directionally-correct example for why this could be lethal, humans are smart enough to do gain-of-function research on viruses and design algorithms that predict protein folding. I see no reason to think something smarter could not (with some in-lab experimentation) design a virus that kills all humans simultaneously at a predetermined time, and if you can do that without affecting any of your other goals more than you think humans might interfere with your goals, then sure, you kill all the humans because it's easy and you might as well. You can imagine somehow making an AI that cares about humans enough not to straight up kill all of them, but if humans are a survival threat, we should expect it to find some other creative way to contain us, and this is not a design constraint you should feel good about.

In particular, if you are an algorithm which is willing to kill all humans, it is likely that humans do not want you to run, and so letting humans live is bad for your own survival if you somehow get made before the humans notice you are willing to kill them all. This is not a good sign for humans' odds of being able to get more than one try to get AI right if most things are concerned with their own survival, even if that concern is only implicit in having any goal whatsoever.

Importantly, none of this requires humans to make a coding error. It only requires a thing with goals and intelligence, and the only apparent way to get around it is to have the smart thing implicitly care about literally every thing that humans care about to the same relative degrees that humans care about them. It's not a formal proof, but maybe it's the beginning of one. Parenthetically, I guess it's also a good reason to have a lot of military capability before you go looking for aliens, even if you don't intend to harm any.

danielfilan on DanielFilan's Shortform Feed

When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:

putting a bunch of agents in a simulated world and seeing how they interact
weak-to-strong / easy-to-hard generalization

basically stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities

startattheend on Anvil Problems

This probably makes more sense if you view it as a boolean type, you either "have an anvil" or you don't, and you either have access to fire or you don't. We view a lot of things as booleans (if your clothes get wet, then wet is a boolean). This might be helpful? It connects what might seem like a sort of edge case into something familiar.

But "something that relies on itself" and "something which is usually hard to get, but easy to get more of once you have a bit of it" are a bit more special I suppose. "Catalyst" is a sort of similar yet different idea. You could graph these concepts as dependency relations and try out all permutations to see if more types of problems exist.

shankar-sivarajan on Flipping Out: The Cosmic Coinflip Thought Experiment Is Bad Philosophy

"Would you destroy a better world to save this one?"

From Ada Palmer's Terra Ignota, that might be an interesting reframing of this wager: you are destroying (the chance of) a better world, more than twice as good as this one, to guarantee the survival of this one.

habryka4 on An alternative way to browse LessWrong 2.0

Not sure what you mean. The API continues to exist (and has existed since the beginning of LW 2.0).

trevorone on AI Safety is Dropping the Ball on Clown Attacks

I'm not sure what to think about this, Thomas777's approach is generally a good one but for both of these examples, a shorter route (that it's cleanly mutually understood to be adding insult to injury as a flex by the aggressor) seems pretty probable. Free speech/censorship might be a better example as plenty of cultures are less aware of information theory and progress.

I don't know what proportion of the people in the US Natsec community understand 'rigged psychological games' well enough to occasionally read books on the topic, but the bar is pretty low for hopping onto fads, as tricks only require one person to notice or invent them and then they can simply get popular (with all kinds of people with varying capabilities/resources/technology and bandwidth/information/deficiencies hopping on the bandwagon).

brambleboy on Is Success the Enemy of Freedom? (Full)

It's been about 4 years. How do you feel about this now?

paul-kent on Belief in Belief

The recent conception of the hostile telepaths problem goes a long way towards explaining why people believe in belief in the first place.

tomcatfish on Being the (Pareto) Best in the World

In regard to bullet 1, I would caution against relying on this. If you show up to many fields expecting to smash through them because you're smart, you'll be torn to bits in many, many of them. This is because the fields that are useful are already being dominated by people who are good at things to the extent that they're economically or emotionally valuable.

The exact example of chess makes this clear. If a smart LWer thinks "Oh, I'll get to the chess leaderboards because I'm really smart", they are going to find out after some weeks of studying that… everyone else on the leaderboards is smart too!

dalcy on Dalcy's Shortform

The critical insight is that this is not always the case!

Let's call two graphs I-equivalent if their sets of independencies (implied by d-separation) are identical. A theorem of Bayes Nets says that two graphs are I-equivalent if they have the same skeleton and the same set of immoralities.

This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified - namely, across all I-equivalent graphs that are the perfect map of a distribution, some of the edges have identical directions assigned to them.

The IC algorithm (Verma & Pearl, 1990) for finding perfect maps (hence temporal direction) is exactly about exploiting these conditions to orient as many of the edges as possible:

More intuitively, (Verma & Pearl, 1992) and (Meek, 1995) together show that the following four rules are necessary and sufficient operations to maximally orient the graph according to the I-equivalence (+ acyclicity) constraint:

Anyone interested in further detail should consult Pearl's Causality Ch 2. Note that for some reason Ch 2 is the only chapter in the book where Pearl talks about Causal Discovery (i.e. inferring time from observational distribution) and the rest of the book is all about Causal Inference (i.e. inferring causal effect from (partially) known causal structure).
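
As a rough illustration of the criterion quoted above (same skeleton, same set of immoralities), here is a minimal Python sketch. The function names and the edge-list representation are my own assumptions rather than anything from Pearl's book, and the sketch assumes both graphs are DAGs over the same set of nodes. It does not implement the IC algorithm or the four orientation rules.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of a DAG: each directed edge becomes an unordered pair."""
    return {frozenset(edge) for edge in edges}


def immoralities(edges):
    """Colliders a -> c <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        # Sorting gives each collider a canonical (a, child, b) form.
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def i_equivalent(edges1, edges2):
    """Check the skeleton-plus-immoralities criterion for two DAGs."""
    return (skeleton(edges1) == skeleton(edges2)
            and immoralities(edges1) == immoralities(edges2))


# Hypothetical three-node examples.
chain = [("X", "Y"), ("Y", "Z")]     # X -> Y -> Z
fork = [("Y", "X"), ("Y", "Z")]      # X <- Y -> Z
collider = [("X", "Y"), ("Z", "Y")]  # X -> Y <- Z

print(i_equivalent(chain, fork))      # True
print(i_equivalent(chain, collider))  # False
```

The chain and the fork share a skeleton and have no immoralities, so their arrow directions cannot be told apart from independencies alone. The collider has the immorality at Y, which is the kind of structure that lets some arrow directions be identified.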