LessWrong 2.0 Reader



Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde (kola-ayonrinde) · 2024-08-23T18:52:31.019Z · comments (5)
Anthropic rewrote its RSP
Zach Stein-Perlman · 2024-10-15T14:25:12.518Z · comments (19)
Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)
Toy Models of Feature Absorption in SAEs
chanind · 2024-10-07T09:56:53.609Z · comments (7)
You're a Space Wizard, Luke
lsusr · 2024-08-18T05:35:39.238Z · comments (6)
[link] AISafety.info: What is the "natural abstractions hypothesis"?
Algon · 2024-10-05T12:31:14.195Z · comments (2)
AI as a powerful meme, via CGP Grey
TheManxLoiner · 2024-10-30T18:31:58.544Z · comments (1)
Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)
Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)
[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)
[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (0)
Australian AI Safety Forum 2024
Liam Carroll (liam-carroll) · 2024-09-27T00:40:11.451Z · comments (0)
[link] Generative ML in chemistry is bottlenecked by synthesis
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-16T16:31:34.801Z · comments (2)
0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)
The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)
[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)
COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)
Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (1)
LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)
OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)
[link] I didn't have to avoid you; I was just insecure
Chipmonk · 2024-08-17T16:41:50.237Z · comments (7)
Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)
[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)
Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)
A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)
Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)
[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)
We’re not as 3-Dimensional as We Think
silentbob · 2024-08-04T14:39:16.799Z · comments (16)
[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)
I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)
But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)
[Intuitive self-models] 7. Hearing Voices, and Other Hallucinations
Steven Byrnes (steve2152) · 2024-10-29T13:36:16.325Z · comments (2)
[link] IAPS: Mapping Technical Safety Research at AI Companies
Zach Stein-Perlman · 2024-10-24T20:30:41.159Z · comments (9)
Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)
[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)
An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (10)
Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)
Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (10)
Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)
Winners of the Essay competition on the Automation of Wisdom and Philosophy
AI Impacts (AI Imacts) · 2024-10-28T17:10:04.272Z · comments (3)
Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)
[link] My article in The Nation — California’s AI Safety Bill Is a Mask-Off Moment for the Industry
garrison · 2024-08-15T19:25:59.592Z · comments (0)
Open Problems in AIXI Agent Foundations
Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · comments (2)
[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)
[link] My Model of Epistemology
adamShimi · 2024-08-31T17:01:45.472Z · comments (0)
(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
Sodium · 2024-10-03T19:11:58.032Z · comments (17)
[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (36)
AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (7)
[link] Book review: On the Edge
PeterMcCluskey · 2024-08-30T22:18:39.581Z · comments (0)