LessWrong 2.0 Reader


Let’s think about slowing down AI
KatjaGrace · 2022-12-22T17:40:04.787Z · comments (183)
Staring into the abyss as a core life skill
benkuhn · 2022-12-22T15:30:05.093Z · comments (20)
Models Don't "Get Reward"
Sam Ringer · 2022-12-30T10:37:11.798Z · comments (61)
A challenge for AGI organizations, and a challenge for readers
Rob Bensinger (RobbBB) · 2022-12-01T23:11:44.279Z · comments (33)
Sazen
[DEACTIVATED] Duncan Sabien (Duncan_Sabien) · 2022-12-21T07:54:51.415Z · comments (83)
AI alignment is distinct from its near-term applications
paulfchristiano · 2022-12-13T07:10:04.407Z · comments (21)
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin (collin-burns) · 2022-12-15T18:22:40.109Z · comments (39)
Jailbreaking ChatGPT on Release Day
Zvi · 2022-12-02T13:10:00.860Z · comments (77)
The Plan - 2022 Update
johnswentworth · 2022-12-01T20:43:50.516Z · comments (37)
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
LawrenceC (LawChan) · 2022-12-03T00:58:36.973Z · comments (35)
What AI Safety Materials Do ML Researchers Find Compelling?
Vael Gates · 2022-12-28T02:03:31.894Z · comments (34)
The next decades might be wild
Marius Hobbhahn (marius-hobbhahn) · 2022-12-15T16:10:04.750Z · comments (42)
Finite Factored Sets in Pictures
Magdalena Wache · 2022-12-11T18:49:00.000Z · comments (35)
Using GPT-Eliezer against ChatGPT Jailbreaking
Stuart_Armstrong · 2022-12-06T19:54:54.854Z · comments (85)
[link] Things that can kill you quickly: What everyone should know about first aid
jasoncrawford · 2022-12-27T16:23:24.831Z · comments (21)
Logical induction for software engineers
Alex Flint (alexflint) · 2022-12-03T19:55:35.474Z · comments (8)
A Year of AI Increasing AI Progress
ThomasW (ThomasWoodside) · 2022-12-30T02:09:39.458Z · comments (3)
Updating my AI timelines
Matthew Barnett (matthew-barnett) · 2022-12-05T20:46:28.161Z · comments (50)
Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout · 2022-12-02T02:43:20.915Z · comments (22)
[question] How to Convince my Son that Drugs are Bad
concerned_dad · 2022-12-17T18:47:24.398Z · answers+comments (84)
Shard Theory in Nine Theses: a Distillation and Critical Appraisal
LawrenceC (LawChan) · 2022-12-19T22:52:20.031Z · comments (30)
[Interim research report] Taking features out of superposition with sparse autoencoders
Lee Sharkey (Lee_Sharkey) · 2022-12-13T15:41:48.685Z · comments (22)
K-complexity is silly; use cross-entropy instead
So8res · 2022-12-20T23:06:27.131Z · comments (53)
Shared reality: a key driver of human behavior
kdbscott · 2022-12-24T19:35:51.126Z · comments (25)
Re-Examining LayerNorm
Eric Winsor (EricWinsor) · 2022-12-01T22:20:23.542Z · comments (12)
[link] Did ChatGPT just gaslight me?
ThomasW (ThomasWoodside) · 2022-12-01T05:41:46.560Z · comments (45)
[question] Why The Focus on Expected Utility Maximisers?
DragonGod · 2022-12-27T15:49:36.536Z · answers+comments (84)
The case against AI alignment
andrew sauer (andrew-sauer) · 2022-12-24T06:57:53.405Z · comments (110)
Deconfusing Direct vs Amortised Optimization
beren · 2022-12-02T11:30:46.754Z · comments (17)
Trying to disambiguate different questions about whether RLHF is “good”
Buck · 2022-12-14T04:03:27.081Z · comments (47)
Language models are nearly AGIs but we don't notice it because we keep shifting the bar
philosophybear · 2022-12-30T05:15:15.625Z · comments (13)
Slightly against aligning with neo-luddites
Matthew Barnett (matthew-barnett) · 2022-12-26T22:46:42.693Z · comments (31)
200 Concrete Open Problems in Mechanistic Interpretability: Introduction
Neel Nanda (neel-nanda-1) · 2022-12-28T21:06:53.853Z · comments (0)
[link] The Story Of VaccinateCA
hath · 2022-12-09T23:54:48.703Z · comments (4)
Thoughts on AGI organizations and capabilities work
Rob Bensinger (RobbBB) · 2022-12-07T19:46:04.004Z · comments (17)
But is it really in Rome? An investigation of the ROME model editing technique
jacquesthibs (jacques-thibodeau) · 2022-12-30T02:40:36.713Z · comments (1)
Applied Linear Algebra Lecture Series
johnswentworth · 2022-12-22T06:57:26.643Z · comments (7)
Finding gliders in the game of life
paulfchristiano · 2022-12-01T20:40:04.230Z · comments (7)
Bad at Arithmetic, Promising at Math
cohenmacaulay · 2022-12-18T05:40:37.088Z · comments (19)
[link] Discovering Language Model Behaviors with Model-Written Evaluations
evhub · 2022-12-20T20:08:12.063Z · comments (34)
[link] Why I’m optimistic about OpenAI’s alignment approach
janleike · 2022-12-05T22:51:15.769Z · comments (15)
The LessWrong 2021 Review: Intellectual Circle Expansion
Ruby · 2022-12-01T21:17:50.321Z · comments (55)
[link] Revisiting algorithmic progress
Tamay · 2022-12-13T01:39:19.264Z · comments (15)
Towards Hodge-podge Alignment
Cleo Nardo (strawberry calm) · 2022-12-19T20:12:14.540Z · comments (30)
Setting the Zero Point
[DEACTIVATED] Duncan Sabien (Duncan_Sabien) · 2022-12-09T06:06:25.873Z · comments (43)
Consider using reversible automata for alignment research
Alex_Altair · 2022-12-11T01:00:24.223Z · comments (30)
Can we efficiently distinguish different mechanisms?
paulfchristiano · 2022-12-27T00:20:01.728Z · comments (30)
Local Memes Against Geometric Rationality
Scott Garrabrant · 2022-12-21T03:53:28.196Z · comments (3)
You can still fetch the coffee today if you're dead tomorrow
davidad · 2022-12-09T14:06:48.442Z · comments (19)
A hundredth of a bit of extra entropy
Adam Scherlis (adam-scherlis) · 2022-12-24T21:12:41.517Z · comments (4)