LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

next page (older posts) →

Refusal in LLMs is mediated by a single direction
Andy Arditi (andy-arditi) · 2024-04-27T11:13:06.235Z · comments (66)

[link] Introducing AI Lab Watch
Zach Stein-Perlman · 2024-04-30T17:00:12.652Z · comments (7)

Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-04-30T18:51:13.493Z · comments (27)

Ironing Out the Squiggles
Zack_M_Davis · 2024-04-29T16:13:00.371Z · comments (33)

On Not Pulling The Ladder Up Behind You
Screwtape · 2024-04-26T21:58:29.455Z · comments (11)

The first future and the best future
KatjaGrace · 2024-04-25T06:40:04.510Z · comments (11)

Why I'm doing PauseAI
Joseph Miller (Josephm) · 2024-04-30T16:21:54.156Z · comments (11)

[question] Which skincare products are evidence-based?
Vanessa Kosoy (vanessa-kosoy) · 2024-05-02T15:22:12.597Z · answers+comments (25)

[link] My hour of memoryless lucidity
Eric Neyman (UnexpectedValues) · 2024-05-04T01:40:56.717Z · comments (2)

ACX Covid Origins Post convinced readers
ErnestScribbler · 2024-05-01T13:06:20.818Z · comments (7)

Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (3)

[link] The Inner Ring by C. S. Lewis
Saul Munn (saul-munn) · 2024-04-24T22:48:09.228Z · comments (6)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (6)

Constructability: Plainly-coded AGIs may be feasible in the near future
Épiphanie Gédéon (joy_void_joy) · 2024-04-27T16:04:45.894Z · comments (12)

Duct Tape security
Isaac King (KingSupernova) · 2024-04-26T18:57:05.659Z · comments (8)

Questions for labs
Zach Stein-Perlman · 2024-04-30T22:15:55.362Z · comments (9)

Q&A on Proposed SB 1047
Zvi · 2024-05-02T15:10:02.916Z · comments (1)

AISC9 has ended and there will be an AISC10
Linda Linsefors · 2024-04-29T10:53:18.812Z · comments (2)

[link] Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan (SenR) · 2024-04-25T18:43:47.003Z · comments (35)

LessWrong Community Weekend 2024, open for applications
UnplannedCauliflower · 2024-05-01T10:18:21.992Z · comments (0)

So What's Up With PUFAs Chemically?
J Bostock (Jemist) · 2024-04-27T13:32:52.159Z · comments (23)

[link] "Why I Write" by George Orwell (1946)
Arjun Panickssery (arjun-panickssery) · 2024-04-25T16:02:28.668Z · comments (2)

Changes in College Admissions
Zvi · 2024-04-24T13:50:03.487Z · comments (10)

[question] Shane Legg's necessary properties for every AGI Safety plan
jacquesthibs (jacques-thibodeau) · 2024-05-01T17:15:41.233Z · answers+comments (10)

Transcoders enable fine-grained interpretable circuit analysis for language models
Jacob Dunefsky (jacob-dunefsky) · 2024-04-30T17:58:09.982Z · comments (11)

[link] This is Water by David Foster Wallace
Nathan Young · 2024-04-24T21:21:09.445Z · comments (16)

[link] "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case
habryka (habryka4) · 2024-05-03T18:10:12.478Z · comments (10)

Superposition is not "just" neuron polysemanticity
LawrenceC (LawChan) · 2024-04-26T23:22:06.066Z · comments (4)

Spatial attention as a “tell” for empathetic simulation?
Steven Byrnes (steve2152) · 2024-04-26T15:10:58.040Z · comments (11)

[link] LLMs seem (relatively) safe
JustisMills · 2024-04-25T22:13:06.221Z · comments (23)

Mechanistic Interpretability Workshop Happening at ICML 2024!
Neel Nanda (neel-nanda-1) · 2024-05-03T01:18:26.936Z · comments (4)

Towards a formalization of the agent structure problem
Alex_Altair · 2024-04-29T20:28:15.190Z · comments (2)

Apply to ESPR & PAIR, Rationality and AI Camps for Ages 16-21
Anna Gajdova (anna-gajdova) · 2024-05-03T12:36:37.610Z · comments (0)

An Introduction to AI Sandbagging
Teun van der Weij (teun-van-der-weij) · 2024-04-26T13:40:00.126Z · comments (1)

D&D.Sci Long War: Defender of Data-mocracy
aphyer · 2024-04-26T22:30:15.780Z · comments (13)

[Aspiration-based designs] 1. Informal introduction
B Jacobs (Bob Jacobs) · 2024-04-28T13:00:43.268Z · comments (4)

Manifund Q1 Retro: Learnings from impact certs
Austin Chen (austin-chen) · 2024-05-01T16:48:33.140Z · comments (1)

Scaling of AI training runs will slow down after GPT-5
Maxime Riché (maxime-riche) · 2024-04-26T16:05:59.957Z · comments (5)

We are headed into an extreme compute overhang
devrandom · 2024-04-26T21:38:21.694Z · comments (21)

List your AI X-Risk cruxes!
Aryeh Englander (alenglander) · 2024-04-28T18:26:19.327Z · comments (7)

Losing Faith In Contrarianism
omnizoid · 2024-04-25T20:53:34.842Z · comments (42)

AI #62: Too Soon to Tell
Zvi · 2024-05-02T15:40:04.364Z · comments (8)

Take SCIFs, it’s dangerous to go alone
latterframe · 2024-05-01T08:02:38.067Z · comments (1)

The Intentional Stance, LLMs Edition
Eleni Angelou (ea-1) · 2024-04-30T17:12:29.005Z · comments (2)

Big-endian is better than little-endian
Menotim · 2024-04-29T02:30:48.053Z · comments (14)

AI #61: Meta Trouble
Zvi · 2024-05-02T18:40:03.242Z · comments (0)

An explanation of evil in an organized world
KatjaGrace · 2024-05-02T05:20:06.240Z · comments (9)

AXRP Episode 30 - AI Security with Jeffrey Ladish
DanielFilan · 2024-05-01T02:50:04.621Z · comments (0)

Bayesian inference without priors
DanielFilan · 2024-04-24T23:50:08.312Z · comments (8)

Unintentionally Creating Value
abstractapplic · 2024-04-28T20:05:08.479Z · comments (2)

next page (older posts) →

Archive

Recent comments

1a3orn on Shortform

I mean, sure, but I've been updating in that direction a weirdly large amount.

neel-nanda-1 on Mechanistic Interpretability Workshop Happening at ICML 2024!

Makes sense! Sounds like a fairly good fit

It just seems intuitively like a natural fit: Everyone in mech interp needs to inspect models. This tool makes it easier to inspect models.

Another way of framing it: Try to write your paper in such a way that a mech interp researcher reading it says "huh, I want to go and use this library for my research". Eg give examples of things that were previously hard that are now easy.

ophira on Which skincare products are evidence-based?

Yeah, glycolic acid is an exfoliant. The retinoid family also promotes cell turnover, but in a different way. You'd be over-exfoliating by using both of them at the same time.

thomas-kwa on Thomas Kwa's Shortform

To some degree yes, but I expect lots of information to be spread out across time. For example: OpenAI releases GPT5 benchmark results. Then a couple weeks later they deploy it on ChatGPT and we can see how subjectively impressive it is out of the box, and whether it is obviously pursuing misaligned goals. Over the next few weeks people develop post-training enhancements like scaffolding, and we get a better sense of its true capabilities. Over the next few months, debate researchers study whether GPT4-judged GPT5 debates reliably produce truth, and control researchers study whether GPT4 can detect whether GPT5 is scheming. A year later an open-weights model of similar capability is released and the interp researchers check how understandable it is and whether activation steering still works.

akram-choudhary on Please stop publishing ideas/insights/research about AI

Daniel, your interpretation is literally contradicted by Eliezer's exact words. Eliezer defines dignity as that which increases our chance of survival.

""Wait, dignity points?" you ask. "What are those? In what units are they measured, exactly?"

And to this I reply: Obviously, the measuring units of dignity are over humanity's log odds of survival - the graph on which the logistic success curve is a straight line. A project that doubles humanity's chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity."

niplav on Thomas Kwa's Shortform

Thank you a lot! Strong upvoted.

I was wondering a while ago whether Bayesianism says anything about how much my probabilities are "allowed" to oscillate around—I was noticing that my probability of doom was often moving by 5% in the span of 1-3 weeks, though I guess this was mainly due to logical uncertainty and not empirical uncertainty.

bogdan-ionut-cirstea on Mechanistically Eliciting Latent Behaviors in Language Models

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to be using a contrastive approach for steering vectors (I've only skimmed though), it might be worth having a look.

tigerlily on How would you navigate a severe financial emergency with no help or resources?

Thank you for this. I'm not eligible for it but I will send it to my sister who is. She needs emergency dental work but the health insurance plan offered through her employer doesn't cover it so she's just been suffering through the pain. So really, thank you. She will be so glad.

martinsq on Martín Soto's Shortform

Claude learns across different chats. What does this mean?

I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread [LW(p) · GW(p)]. For that purpose, I pasted part of the thread.

Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.

I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).

This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its robustness. But from then on Claude always correctly assumed OA was OpenAI, and GDM was Google DeepMind.

In fact, even when copying in a new chat the exact same original prompt (which elicited Claude to take OA to be Anthropic), the mistake no longer happened. Neither when I went for a lot of retries, nor tried the same thing in many different new chats.

Does this mean Claude somehow learns across different chats (inside the same user account)?
If so, this might not happen through a process as naive as "append previous chats as the start of the prompt, with a certain indicator that they are different", but instead some more effective distillation of the important information from those chats.
Do we have any information on whether and how this happens?

(A different hypothesis is not that the later queries had access to the information from the previous ones, but rather that they were for some reason "more intelligent" and were able to catch up to the real meanings of OA and GDM, where the previous queries were not. This seems way less likely.)

I've checked for cross-chat memory explicitly (telling it to remember some information in one chat, and asking about it in the other), and it acts is if it doesn't have it.
Claude also explicitly states it doesn't have cross-chat memory, when asked about it.
Might something happen like "it does have some chat memory, but it's told not to acknowledge this fact, but it sometimes slips"?

Probably more nuanced experiments are in order. Although note maybe this only happens for the chat webapp, and not different ways to access the API.

tigerlily on How would you navigate a severe financial emergency with no help or resources?

Thank you for the thoughtful suggestions. Aella is exemplary but camgirling strikes me as a nightmare.

I have considered making stuff, like custom glasses/premium drinkware, and selling on Etsy but the market seems saturated and I've never had the money to buy the equipment to learn the skills required to do this kind of thing.

I am certified in Salesforce and could probably get hired helping to manage the Salesforce org for my tribe (Cherokee Nation) but would have to move to Oklahoma.

I've applied for every grant I can find that I'm eligible for, but there's not much out there and the competition is stiff.

We will figure out something, I'm sure. If we don't, there's nothing standing between us and homelessness and that reality fills me with anger and despair.

I feel like there's nothing society wants from me, so there's no way for me to convince society that I deserve anything from it.

It's so hard out here.