LessWrong 2.0 Reader

Striking Implications for Learning Theory, Interpretability — and Safety?
RogerDearnaley (roger-d-1) · 2024-01-05T08:46:58.915Z · comments (4)

Enhancing intelligence by banging your head on the wall
Bezzi · 2023-12-12T21:00:48.584Z · comments (26)

Information warfare historically revolved around human conduits
trevor (TrevorWiesinger) · 2023-08-28T18:54:27.169Z · comments (7)

Automation collapse
Geoffrey Irving · 2024-10-21T14:50:54.500Z · comments (2)

An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (10)

Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)

Comparing representation vectors between llama 2 base and chat
Nina Panickssery (NinaR) · 2023-10-28T22:54:37.059Z · comments (5)

We’re not as 3-Dimensional as We Think
silentbob · 2024-08-04T14:39:16.799Z · comments (16)

The Evolution of Humans Was Net-Negative for Human Values
Zack_M_Davis · 2024-04-01T16:01:10.037Z · comments (1)

Childhood and Education Roundup #5
Zvi · 2024-04-17T13:00:03.015Z · comments (4)

Closeness To the Issue (Part 5 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-09T00:36:47.388Z · comments (0)

The "context window" analogy for human minds
Ruby · 2024-02-13T19:29:10.387Z · comments (0)

Mech Interp Challenge: August - Deciphering the First Unique Character Model
CallumMcDougall (TheMcDouglas) · 2023-08-09T19:14:23.682Z · comments (1)

You don't get to have cool flaws
Neil (neil-warren) · 2023-07-28T05:37:31.414Z · comments (17)

Good job opportunities for helping with the most important century
HoldenKarnofsky · 2024-01-18T17:30:03.332Z · comments (0)

Mental Masturbation and the Intellectual Comfort Zone
Declan Molony (declan-molony) · 2024-05-07T05:47:05.257Z · comments (2)

Introduce a Speed Maximum
jefftk (jkaufman) · 2024-01-11T02:50:04.284Z · comments (28)

[link] ∀: a story
Richard_Ngo (ricraz) · 2023-12-17T22:42:32.857Z · comments (1)

[link] "Model UN Solutions"
Arjun Panickssery (arjun-panickssery) · 2023-12-08T23:06:33.490Z · comments (5)

[link] Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy
Neel Nanda (neel-nanda-1) · 2023-08-29T22:07:04.059Z · comments (1)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)

A Socratic dialogue with my student
lsusr · 2023-12-05T09:31:05.266Z · comments (14)

[link] Scaling laws for dominant assurance contracts
jessicata (jessica.liu.taylor) · 2023-11-28T23:11:07.631Z · comments (5)

But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)

AI companies' commitments
Zach Stein-Perlman · 2024-05-29T11:00:31.339Z · comments (0)

[link] Who is Sam Bankman-Fried (SBF) really, and how could he have done what he did? - three theories and a lot of evidence
spencerg · 2023-11-11T01:04:22.747Z · comments (28)

AI Safety Camp final presentations
Linda Linsefors · 2024-03-29T14:27:43.503Z · comments (3)

AI #34: Chipping Away at Chip Exports
Zvi · 2023-10-19T15:00:03.055Z · comments (19)

[link] Toki pona FAQ
dkl9 · 2024-03-17T21:44:21.782Z · comments (8)

Drone Wars Endgame
RussellThor · 2024-02-01T02:30:46.161Z · comments (71)

[link] How To Socialize With Psycho(logist)s
Sable · 2023-10-20T11:33:46.066Z · comments (11)

[link] Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Soroush Pour (soroush-pour) · 2023-11-07T17:59:36.857Z · comments (2)

We are already in a persuasion-transformed world and must take precautions
trevor (TrevorWiesinger) · 2023-11-04T15:53:31.345Z · comments (14)

[link] Searching for the Root of the Tree of Evil
Ivan Vendrov (ivan-vendrov) · 2024-06-08T17:05:53.950Z · comments (14)

(Appetitive, Consummatory) ≈ (RL, reflex)
Steven Byrnes (steve2152) · 2024-06-15T15:57:39.533Z · comments (1)

On Dwarkesh’s 3rd Podcast With Tyler Cowen
Zvi · 2024-02-02T19:30:05.974Z · comments (9)

[question] Snapshot of narratives and frames against regulating AI
Jan_Kulveit · 2023-11-01T16:30:19.116Z · answers+comments (19)

[link] Learning coefficient estimation: the details
Zach Furman (zfurman) · 2023-11-16T03:19:09.013Z · comments (0)

[link] Claude 3 Opus can operate as a Turing machine
Gunnar_Zarncke · 2024-04-17T08:41:57.209Z · comments (2)

My best guess at the important tricks for training 1L SAEs
Arthur Conmy (arthur-conmy) · 2023-12-21T01:59:06.208Z · comments (4)

AI #47: Meet the New Year
Zvi · 2024-01-13T16:20:10.519Z · comments (7)

Deeply Cover Car Crashes?
jefftk (jkaufman) · 2023-12-10T22:20:01.133Z · comments (31)

Please Bet On My Quantified Self Decision Markets
niplav · 2023-12-01T20:07:38.284Z · comments (6)

[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (29)

[link] Show LW: Get a phone call if prediction markets predict nuclear war
Lorenzo (lorenzo-buonanno) · 2023-09-17T22:25:21.206Z · comments (8)

Secondary Risk Markets
Vaniver · 2023-12-11T21:52:46.836Z · comments (4)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (10)

Recent comments

justismills on Slightly More Than You Wanted To Know: Pregnancy Length Effects

They're correlational, though the broad cohorts help. I'm not sure what you can do beyond canvassing an entire birth cohort and noticing differences. There are possible pitfalls, like the decision to induce early being made by people with genes that predict bad outcomes. But I really don't think that's major.

zy on Cipolla's Shortform

Could you maybe elaborate on "long term academic performance"?

nostalgebraist on The Hidden Complexity of Wishes

In the situation assumed by your first argument, AGI would be very unlikely to share our values even if our values were much simpler than they are.

Complexity makes things worse, yes, but the conclusion "AGI is unlikely to have our values" is already entailed by the other premises even if we drop the stuff about complexity.

Why: if we're just sampling some function from a simplicity prior, we're very unlikely to get any particular nontrivial function that we've decided to care about in advance of the sampling event. There are just too many possible functions, and probability mass has to get divided among them all.

In other words, if it takes $N$ bits to specify human values, there are $2^{N}$ ways that a bitstring of the same length could be set, and we're hoping to land on just one of those through luck alone. (And to land on a bitstring of this specific length in the first place, of course.) Unless $N$ is very small, such a coincidence is extremely unlikely.

And $N$ is not going to be that small; even in the sort of naive and overly simple "hand-crafted" value specifications which EY has critiqued in this post and elsewhere, a lot of details have to be specified. (E.g. some proposals refer to "humans" and so a full algorithmic description of them would require an account of what is and isn't a human.)
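To get a feel for how fast $2^{-N}$ shrinks, here is a minimal sketch. It assumes the simplification used above (every $N$-bit string equally likely, exactly one of them counting as a hit) and ignores the simplicity weighting entirely; the function names are made up for illustration.

```python
import math

def chance_of_exact_match(n_bits: int) -> float:
    """Probability that a uniformly random n-bit string equals one fixed target."""
    return 2.0 ** -n_bits

def log10_chance(n_bits: int) -> float:
    """log10 of the same probability, which stays readable for large n."""
    return -n_bits * math.log10(2)

# Illustrative values of N only; none of these is an estimate of the
# number of bits needed to specify human values.
for n in (8, 64, 1000):
    print(n, chance_of_exact_match(n), round(log10_chance(n), 1))
```

Even a specification of a thousand bits, which is tiny next to anything that has to say what a human is, already puts the chance of an exact hit near $10^{-301}$.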

One could devise a variant of this argument that doesn't have this issue, by "relaxing the problem" so that we have some control, just not enough to pin down the sampled function exactly. And then the remaining freedom is filled randomly with a simplicity bias. This partial control might be enough to make a simple function likely, while not being able to make a more complex function likely. (Hmm, perhaps this is just your second argument, or a version of it.)

This kind of reasoning might be applicable in a world where its premises are true, but I don't think its premises are true in our world.

In practice, we apparently have no trouble getting machines to compute very complex functions, including (as Matthew points out) specifications of human value whose robustness would have seemed like impossible magic back in 2007. The main difficulty, if there is one, is in "getting the function to play the role of the AGI values," not in getting the AGI to compute the particular function we want in the first place.

ben-livengood on Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong

"What is your credence now for the proposition that the coin landed heads?"

There are three doors. Two are labeled Monday, and one is labeled Tuesday. Behind each door is a Sleeping Beauty. In a waiting room, many (finite) more Beauties are waiting; every time a Beauty is anesthetized, a coin is flipped and taped to their forehead with clear tape. You open all three doors, the Beauties wake up, and you ask the three Beauties The Question. Then they are anesthetized, the doors are shut, and any Beauties with a Heads showing on their foreheads or behind a Tuesday door are wheeled away after the coin is removed from their forehead. The Beauty with a Tails on their forehead behind the Monday door is wheeled behind the Tuesday door. Two new Beauties are wheeled behind the two Monday doors, one with Heads and one with Tails. The experiment repeats.

You observe that Tuesday Beauties always have a Tails taped to their forehead. You always observe that one Monday Beauty has a Tails showing, and one has a Heads showing. You also observe that every Beauty says 1/3, matching the ratio of Heads to Tails showing, and it is apparent that they can't see the coins taped to their own or each other's foreheads or the door they are behind. Every Tails Beauty is questioned twice. Every Heads Beauty is questioned once. You can see all the steps as they happen, there is no trick, every coin flip has 1/2 probability for Heads.

There is eventually a queue of Waiting Sleeping Beauties with all-Heads or all-Tails showing, and a new Beauty must be anesthetized with a new coin; the queue length changes over time and sometimes switches face. If you like tying up loose ends, you can stop the experiment when the queue is empty, which a random walk guarantees will happen eventually.
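The 1/3 ratio of Heads among questioned Beauties can be checked with a rough simulation. This sketch assumes the standard one-Beauty protocol (Heads: one awakening, Tails: two) rather than the three-door setup above, and the function name is hypothetical. It reports both the per-run and the per-awakening frequency of Heads, since which of the two counts as "credence" is exactly what the dispute is about.

```python
import random

def simulate(n_runs: int, seed: int = 0) -> tuple[float, float]:
    """Return (fraction of runs with Heads, fraction of awakenings with Heads)."""
    rng = random.Random(seed)
    heads_runs = 0
    heads_awakenings = 0
    total_awakenings = 0
    for _ in range(n_runs):
        if rng.random() < 0.5:
            # Heads: Beauty is woken and questioned once.
            heads_runs += 1
            heads_awakenings += 1
            total_awakenings += 1
        else:
            # Tails: Beauty is woken and questioned twice.
            total_awakenings += 2
    return heads_runs / n_runs, heads_awakenings / total_awakenings

per_run, per_awakening = simulate(200_000, seed=1)
print(per_run, per_awakening)
```

With a fair coin the first number should sit near 1/2 and the second near 1/3, matching what the observer in the three-door version sees on the foreheads.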

zy on A Rocket–Interpretability Analogy

Agree with this, and wanted to add that, based on my experience and understanding, I am also not completely sure mechanistic interpretability is a good "commercial bet" yet, where by commercial bet I simply mean something that generates revenue.

One revenue-generating path I can see for LLMs is a company using interpretability to identify the data that is most effective for particular benchmarks. But my current understanding (correct me if I am wrong) is that, for now, it is relatively costly to first research a reliable method and then run interpretability methods on large models; additionally, researchers generally already have good intuitions about which datasets would be useful for specific benchmarks. On the other hand, the method would be much more useful for looking into nuanced and hard-to-tackle safety problems. In fact, there have been a lot of previous efforts to use interpretability for safety mitigations.

evolutionbydesign on Advice on Communicating Concisely

Thank you! The books you recommended look like what I was hoping to find.

erik-jenner on A gentle introduction to mechanistic anomaly detection

You're totally right that this is an important difficulty I glossed over, thanks!

TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can't supervise, and this ingredient could be interpretability. On the other hand, there's at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming).

Just to restate things a bit, I'd distinguish two cases:

"In-distribution anomaly detection:" we are fine with flagging any input as "anomalous" that's OOD compared to the trusted distribution
"Off-distribution anomaly detection:" there are some inputs that are OOD but that we still want to classify as "normal"

In-distribution anomaly detection can already be useful (mainly to deal with rare high-stakes failures). For example, if a human can verify that no tampering occurred with enough effort, then we might be able to create a trusted distribution that covers so many cases that we're fine with flagging everything that's OOD.
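As a toy illustration of the in-distribution case, here is a minimal sketch that fits per-feature statistics on a trusted set and flags anything far from it. Everything in it is an assumption made for illustration: the feature vectors, the z-score rule, and the threshold are placeholders, not the detectors discussed in the post.

```python
from statistics import mean, pstdev

def fit_trusted(samples: list[list[float]]) -> list[tuple[float, float]]:
    """Per-feature (mean, std) estimated from trusted examples."""
    columns = list(zip(*samples))
    # Fall back to a tiny std so constant features do not divide by zero.
    return [(mean(col), pstdev(col) or 1e-9) for col in columns]

def anomaly_score(x: list[float], stats: list[tuple[float, float]]) -> float:
    """Largest absolute z-score of x across features."""
    return max(abs(v - m) / s for v, (m, s) in zip(x, stats))

def is_anomalous(x: list[float], stats, threshold: float = 3.0) -> bool:
    """Flag x if it is out of distribution relative to the trusted set."""
    return anomaly_score(x, stats) > threshold

# Hypothetical 2-dimensional "trusted" feature vectors.
trusted = [[0.0, 1.0], [0.2, 0.9], [-0.1, 1.1], [0.1, 1.0]]
stats = fit_trusted(trusted)
print(is_anomalous([0.05, 1.0], stats))  # close to the trusted data
print(is_anomalous([5.0, 1.0], stats))   # far from the trusted data
```

A detector like this flags every OOD input, which is fine in the in-distribution setting but is exactly what the off-distribution setting needs to avoid.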

But we might still want off-distribution anomaly detection, where the anomaly detector generalizes as intended from easy trusted examples to harder untrusted examples. Then we need some additional ingredient to make that generalization work. Paul writes about one approach specifically for measurement tampering here [AF · GW] and in the following subsection. Exclusion finetuning (appendix I in Redwood's measurement tampering paper) is a practical implementation of a similar intuition. This does rely on some assumptions about inductive bias, but at least seems more promising to me than just hoping to get a direct translator from normal training.

I think ARC might have hopes to solve ELK more broadly (rather than just measurement tampering), but I understand those less (and maybe they're just "use a measurement tampering detector to bootstrap to a full ELK solution").

To be clear, I'm far from confident that approaches like this will work, but getting to the point where we could solve measurement tampering via interp also seems speculative in the foreseeable future. These two bets seem at least not perfectly correlated, which is nice.

arjun-panickssery on Overcoming Bias Anthology

lol fixed thanks

hastings-greer on Could randomly choosing people to serve as representatives lead to better government?

Keep in mind that representative democracy as practiced in the US is doing as well as it is while holding up to hundreds of millions of dollars of destructive pessimization effort; any alternative system is going to be hit with similar efforts. Just off the top of my head: we are being hit with about $50 per capita of spending this fall, and that's plenty to brain-melt a meaningful fraction of the population. Each member of a 500-member sortition body choosing a president, if their identity is leaked, is going to be immediately hit with on the order of 30 million dollars of attempts to change their mind. This is a different environment from the calm deliberation and consideration of the issues examined by the linked studies.

(figures computed by dividing 2024 election spending by targeted population)
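For anyone checking the division, here is a small sketch that back-derives the totals implied by the two figures quoted above. The inputs are only the comment's own rough numbers; the derived totals are implications of those numbers, not independently sourced spending data.

```python
# Figures quoted in the comment (order-of-magnitude estimates).
per_capita_spend = 50             # dollars per targeted person
per_member_pressure = 30_000_000  # dollars per sortition member
body_size = 500                   # members in the sortition body

# Total spending implied by concentrating it on the 500 members.
implied_total = per_member_pressure * body_size

# Targeted population implied by that total at $50 per capita.
implied_population = implied_total / per_capita_spend

print(implied_total)       # 15000000000
print(implied_population)  # 300000000.0
```

So the two figures are mutually consistent with roughly $15 billion spread over roughly 300 million people.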

habryka4 on A Rocket–Interpretability Analogy

(My sense is this changed a lot after the Deepmind/GBrain merger and ChatGPT, and the modern GDM seems to give people a lot less slack in the same way, though you are probably still directionally correct)