LessWrong 2.0 Reader

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (1)

[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)

[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (8)

Dreams of AI alignment: The danger of suggestive names
TurnTrout · 2024-02-10T01:22:51.715Z · comments (59)

[link] Reproducing ARC Evals' recent report on language model agents
Thomas Broadley (thomas-broadley) · 2023-09-01T16:52:17.147Z · comments (17)

[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (4)

Biosecurity Culture, Computer Security Culture
jefftk (jkaufman) · 2023-08-30T16:40:03.101Z · comments (10)

PSA: The community is in Berkeley/Oakland, not "the Bay Area"
maia · 2023-09-11T15:59:47.132Z · comments (7)

Value systematization: how values become coherent (and misaligned)
Richard_Ngo (ricraz) · 2023-10-27T19:06:26.928Z · comments (48)

Response to nostalgebraist: proudly waving my moral-antirealist battle flag
Steven Byrnes (steve2152) · 2024-05-29T16:48:29.408Z · comments (29)

Lsusr's Rationality Dojo
lsusr · 2024-02-13T05:52:03.757Z · comments (17)

[link] A Chess-GPT Linear Emergent World Representation
Adam Karvonen (karvonenadam) · 2024-02-08T04:25:15.222Z · comments (14)

SAE reconstruction errors are (empirically) pathological
wesg (wes-gurnee) · 2024-03-29T16:37:29.608Z · comments (16)

A Theory of Laughter
Steven Byrnes (steve2152) · 2023-08-23T15:05:59.694Z · comments (14)

[link] Notes from a Prompt Factory
Richard_Ngo (ricraz) · 2024-03-10T05:13:39.384Z · comments (19)

[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)

What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L (LRudL) · 2024-07-08T22:24:38.441Z · comments (28)

[link] Explaining grokking through circuit efficiency
Vikrant Varma (amrav) · 2023-09-08T14:39:23.910Z · comments (11)

Sam Altman's sister, Annie Altman, claims Sam has severely abused her
prometheus5015 (pl5015) · 2023-10-07T21:06:49.396Z · comments (107)

LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (5)

On Dwarkesh’s Podcast with Leopold Aschenbrenner
Zvi · 2024-06-10T12:40:03.348Z · comments (7)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (3)

A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)

On the Executive Order
Zvi · 2023-11-01T14:20:01.657Z · comments (4)

[link] Advice for Activists from the History of Environmentalism
Jeffrey Heninger (jeffrey-heninger) · 2024-05-16T18:40:02.064Z · comments (8)

General Thoughts on Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:43.940Z · comments (60)

[link] LessOnline (May 31—June 2, Berkeley, CA)
Ben Pace (Benito) · 2024-03-26T02:34:00.000Z · comments (24)

What's up with "Responsible Scaling Policies"?
habryka (habryka4) · 2023-10-29T04:17:07.839Z · comments (8)

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)

Announcing the London Initiative for Safe AI (LISA)
James Fox · 2024-02-02T23:17:47.011Z · comments (0)

[link] CIV: a story
Richard_Ngo (ricraz) · 2024-06-15T22:36:50.415Z · comments (6)

[link] My cover story in Jacobin on AI capitalism and the x-risk debates
garrison · 2024-02-12T23:34:16.526Z · comments (5)

[link] "Deep Learning" Is Function Approximation
Zack_M_Davis · 2024-03-21T17:50:36.254Z · comments (28)

What's Hard About The Shutdown Problem
johnswentworth · 2023-10-20T21:13:27.624Z · comments (33)

Truthseeking when your disagreements lie in moral philosophy
Elizabeth (pktechgirl) · 2023-10-10T00:00:04.130Z · comments (4)

[Valence series] 1. Introduction
Steven Byrnes (steve2152) · 2023-12-04T15:40:21.274Z · comments (14)

Learning-theoretic agenda reading list
Vanessa Kosoy (vanessa-kosoy) · 2023-11-09T17:25:35.046Z · comments (0)

[link] The Minority Faction
Richard_Ngo (ricraz) · 2024-06-24T20:01:27.436Z · comments (5)

Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (10)

Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (14)

[question] Exercise: Solve "Thinking Physics"
Raemon · 2023-08-01T00:44:48.975Z · answers+comments (25)

“X distracts from Y” as a thinly-disguised fight over group status / politics
Steven Byrnes (steve2152) · 2023-09-25T15:18:18.644Z · comments (14)

Closing Notes on Nonlinear Investigation
Ben Pace (Benito) · 2023-09-15T22:44:58.488Z · comments (47)

Comments on Anthropic's Scaling Monosemanticity
Robert_AIZI · 2024-06-03T12:15:44.708Z · comments (8)

OpenAI #8: The Right to Warn
Zvi · 2024-06-17T12:00:02.639Z · comments (8)

I don’t find the lie detection results that surprising (by an author of the paper)
JanB (JanBrauner) · 2023-10-04T17:10:51.262Z · comments (8)

Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)

Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (10)

[question] Lying to chess players for alignment
Zane · 2023-10-25T17:47:15.033Z · answers+comments (54)

Recent comments

matthew-barnett on The Hidden Complexity of Wishes

The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything.

I think it's important to be able to make narrow points without needing to defend a broader thesis that wasn't necessarily the subject of my comment. To the extent my argument is "outer alignment seems easier than you portrayed it to be in this post, and elsewhere", then your reply here that inner alignment is still hard doesn't seem like it particularly engages with my narrow point.

This post definitely seems to relevantly touch on the question of outer alignment, given the premise that we are explicitly specifying the conditions that the outcome pump needs to satisfy in order for the outcome pump to produce a safe outcome. Explicitly specifying a function that delineates safe from unsafe outcomes is essentially the prototypical case of an outer alignment problem.

(It's possible that you'll reply to me by saying "I never intended people to interpret me as saying anything about outer alignment in this post" despite the clear portrayal of an outer alignment problem in the post. Even so, I don't think what you intended really matters that much here. I'm responding to what was clearly and explicitly written, rather than what was in your head at the time, which is unknowable to me.)

One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is "really" (from the AI's perspective) the true meaning of the reward button after all! Perhaps you do not have this exact scenario in mind.

It seems you're assuming here that something like iterated amplification and distillation will simply fail, because the supervisor function that provides rewards to the model can be hacked or deceived. I think my response to this is that I just tend to be more optimistic than you are that we can end up doing safe supervision where the supervisor ~always remains in control, and they can evaluate the AI's outputs accurately, more-or-less sidestepping the issues you mention here.

I think my reasons for believing this are pretty mundane: I'd point to the fact that evaluation tends to be easier than generation, and the fact that we can employ non-agentic tools to help evaluate, monitor, and control our models to provide them accurate rewards without getting hacked. I think your general pessimism about these things is fairly unwarranted, and my guess is that if you had made specific predictions about this question in the past, about what will happen prior to world-ending AI, these predictions would largely have fared worse than predictions from someone like Paul Christiano.

viliam on leogao's Shortform

“A prophet is not without honor except in his own town, among his relatives and in his own home.” (Mark 6:4)

tsvibt on How to have Polygenically Screened Children

That's a reasonable point... But I don't think we can just count number of divisions either? For one thing, there are several populations of stem cells in an adult. For another, people who are 50% bigger than other people don't live 2/3 as long (right? though maybe that's not the prediction?). I think maybe embryonic stem cells protect their telomeres--not sure.

ali-merali on The Mysterious Trump Buyers on Polymarket

Just adding some additional context that might be useful. PredictIt is a similar election betting platform but has a cap on the maximum amount traders are able to bet (I think <$1k, so relatively low). This means that if Polymarket is a money-weighted information aggregation mechanism, PredictIt is a person-weighted information aggregation mechanism. As noted in the post, from 6th October to just now Trump has gone from 50.8 to 60.1 on Polymarket, taking his lead over Kamala from 1.6 cents to 20.2 cents (an 18.6 cent swing). On PredictIt he's gone from 51:53 to 55:48 over the same interval (a 9 cent swing). Of course it's possible that people on PredictIt are changing their bets due to Polymarket prices (e.g. by arbitraging), but there's some evidence that at least half the change isn't due to these large bettors (whether those large bettors are trading based on private information, for manipulation purposes, or other reasons).
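The swing figures in this comment can be checked with a short sketch. The prices are the ones quoted above; the helper name `margin_swing` is hypothetical, and Harris's Polymarket price is assumed to be 100 minus Trump's, which is what the quoted 1.6 and 20.2 cent gaps imply.

```python
def margin_swing(before, after):
    """Change in the (Trump - Harris) margin, in cents.

    `before` and `after` are (trump_price, harris_price) pairs.
    """
    return (after[0] - after[1]) - (before[0] - before[1])


# Polymarket: Harris's price is assumed to be 100 - Trump's price.
polymarket_swing = margin_swing((50.8, 100 - 50.8), (60.1, 100 - 60.1))

# PredictIt: both prices are quoted directly in the comment (51:53 -> 55:48).
predictit_swing = margin_swing((51, 53), (55, 48))

# Share of the Polymarket swing that also shows up on the capped platform.
ratio = predictit_swing / polymarket_swing

print(f"Polymarket swing: {polymarket_swing:.1f} cents")
print(f"PredictIt swing:  {predictit_swing} cents")
print(f"PredictIt / Polymarket: {ratio:.2f}")
```

On these numbers the PredictIt swing comes out at a little under half of the Polymarket swing.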

anonymous-1 on How to have Polygenically Screened Children

Thanks! Point 2. is especially what I am interested to learn about. If there is any place where I can read more about the presence and effect of telomerase on embryos please let me know!

Also, if you have any info on these other potential issues (pathways of aging through cell division) that ChatGPT generated, I am all ears:

a. DNA Damage Accumulation: Every time a cell divides, there is a risk of errors during DNA replication. (Cells do have repair mechanisms that fix most DNA damage, and fortunately they are more capable in young cells.)

b. Mitochondrial Dysfunction: Mitochondria have their own DNA (mtDNA). Mitochondria divide independently of the cell’s nuclear DNA. However, mtDNA is more prone to damage during cell division and replication due to its proximity to reactive oxygen species (ROS) produced during energy generation.

c. Stem Cell Exhaustion: Each stem cell has a limited number of divisions it can undergo before it enters programmed cell death.

genesmith on How to have Polygenically Screened Children

You’re ignoring several facts:

1. A significant fraction of cells turn over frequently in adults, so the number of divisions for those cell types is far, far higher than 45. Those cell divisions CAN cause cancer, but a single extra cell division is going to have negligible impact on risk.
2. There’s an enzyme called telomerase which can extend telomeres. It’s active in embryos, so this isn’t really a concern.

sodium on (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need

Thanks for reading my post! Here's how I think this hypothesis is helpful:

It's possible that we wouldn't be able to understand what's going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I'm skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn't need to understand that many heuristics simultaneously (which is the case for your logic gate example for modern computers). At the minimum, I would argue that the decomposed bag of heuristics is likely to be much more interpretable than the original model itself.

If the hypothesis is true, then it at least suggests that interpretability researchers should put more effort into finding and studying individual heuristics/circuits, as opposed to the current, more "feature-centric" framework. I don't know how this would manifest itself exactly, but it felt like it was worth saying. I believe that some of the empirical work I cited suggests that we might make more incremental progress if we focused more on heuristics right now.

anonymous-1 on How to have Polygenically Screened Children

I think you shouldn’t count aging in number of days, but rather in the number of cell divisions. (I accept there are other forms of cell damage than just division, but e.g. for telomere length division is quite an important component of aging.)

A newborn baby has fewer than 2^35 cells and an adult has fewer than 2^44 cells. Existing cells are often only replaced after 7 years, so quite a large portion of cell division happens during growth from single cell to adult.

For the sake of argument, assume that a biopsy takes 4 cells out of an 8-cell embryo, then all cells need to divide once more. So that’s 36 (or 45) divisions instead of 35 (or 44).
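The division count in this argument can be sketched as follows. This is a toy model under the comment's own simplifying assumptions (every cell doubles in lockstep, no cell turnover), and the function names are hypothetical.

```python
def doublings_to_reach(target_cells: int, start_cells: int = 1) -> int:
    """Number of synchronized doublings needed to grow from start_cells to at least target_cells."""
    count = 0
    cells = start_cells
    while cells < target_cells:
        cells *= 2
        count += 1
    return count


def doublings_with_biopsy(target_cells: int, stage_cells: int = 8, removed: int = 4) -> int:
    """Doublings along one lineage when `removed` cells are taken at the `stage_cells` stage."""
    before = doublings_to_reach(stage_cells)                       # 1 cell -> 8 cells
    after = doublings_to_reach(target_cells, stage_cells - removed)  # remaining cells -> target
    return before + after


for label, cells in [("newborn", 2**35), ("adult", 2**44)]:
    print(label, doublings_to_reach(cells), doublings_with_biopsy(cells))
```

With the upper bounds quoted above, the model gives 35 versus 36 doublings for a newborn and 44 versus 45 for an adult, matching the figures in the comment.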

owain_evans on LLMs can learn about themselves by introspection

We have a section on the motivation to study introspection (with the specific definition we use in the paper). https://arxiv.org/html/2410.13787v1#S7

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

One principled way to do it would be simulated war on narrow issues.

So if actor A spends resources R_c on computation C, any other actor B can surrender resources equal to R_c to prevent computation C from happening. The surrendered resources and the original resources are then physically destroyed (e.g. spent on Bitcoin mining or something).

This then at least means that to a first approximation, no actor has an incentive to destroy ELYSIUM itself in order to stop some computation inside it from happening, because they could just use their resources to stop the computation in the simulation instead. And many actors benefit from ELYSIUM, so there's a large incentive to protect it.

And since the interaction is negative sum (both parties lose resources from their personal utopias) there would be strong reasons to negotiate.

In addition to this there could be rule-based and AI-based protections to prevent unauthorized funny tricks with simulations. One rule could be a sort of "cosmic block" where you can just block some or all other Utopias from knowing about you outside of a specified set of tests ("is torture happening here", etc).