LessWrong 2.0 Reader

Deconfusing In-Context Learning
Arjun Panickssery (arjun-panickssery) · 2024-02-25T09:48:17.690Z · comments (1)
Effectively Handling Disagreements - Introducing a New Workshop
Camille Berger (Camille Berger) · 2024-04-15T16:33:50.339Z · comments (2)
Medical Roundup #2
Zvi · 2024-04-09T13:40:05.908Z · comments (18)
[question] Is there software to practice reading expressions?
lsusr · 2024-04-23T21:53:00.679Z · answers+comments (11)
Thousands of malicious actors on the future of AI misuse
Zershaaneh Qureshi (zershaaneh-qureshi) · 2024-04-01T10:08:42.357Z · comments (0)
[question] Is a random box of gas predictable after 20 seconds?
Thomas Kwa (thomas-kwa) · 2024-01-24T23:00:53.184Z · answers+comments (35)
Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition
cmathw · 2024-04-08T11:14:43.268Z · comments (4)
On DeepMind’s Frontier Safety Framework
Zvi · 2024-06-18T13:30:21.154Z · comments (4)
AI #66: Oh to Be Less Online
Zvi · 2024-05-30T14:20:03.334Z · comments (6)
UDT1.01: The Story So Far (1/10)
Diffractor · 2024-03-27T23:22:35.170Z · comments (6)
[link] WSJ: Inside Amazon’s Secret Operation to Gather Intel on Rivals
trevor (TrevorWiesinger) · 2024-04-23T21:33:08.049Z · comments (5)
[link] A High Decoupling Failure
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-14T19:46:09.552Z · comments (5)
[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)
[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (37)
LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)
Turning Your Back On Traffic
jefftk (jkaufman) · 2024-07-17T01:00:08.627Z · comments (7)
Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)
[link] Twitter thread on AI takeover scenarios
Richard_Ngo (ricraz) · 2024-07-31T00:24:33.866Z · comments (0)
A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)
Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)
[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)
[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (33)
[link] Claude 3 Opus can operate as a Turing machine
Gunnar_Zarncke · 2024-04-17T08:41:57.209Z · comments (2)
An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (10)
Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)
Doomsday Argument and the False Dilemma of Anthropic Reasoning
Ape in the coat · 2024-07-05T05:38:39.428Z · comments (55)
Finding the Wisdom to Build Safe AI
Gordon Seidoh Worley (gworley) · 2024-07-04T19:04:16.089Z · comments (10)
[link] Searching for the Root of the Tree of Evil
Ivan Vendrov (ivan-vendrov) · 2024-06-08T17:05:53.950Z · comments (14)
Good job opportunities for helping with the most important century
HoldenKarnofsky · 2024-01-18T17:30:03.332Z · comments (0)
AI #47: Meet the New Year
Zvi · 2024-01-13T16:20:10.519Z · comments (7)
The Evolution of Humans Was Net-Negative for Human Values
Zack_M_Davis · 2024-04-01T16:01:10.037Z · comments (1)
[link] Toki pona FAQ
dkl9 · 2024-03-17T21:44:21.782Z · comments (8)
AI Safety Camp final presentations
Linda Linsefors · 2024-03-29T14:27:43.503Z · comments (3)
Closeness To the Issue (Part 5 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-09T00:36:47.388Z · comments (0)
AI companies' commitments
Zach Stein-Perlman · 2024-05-29T11:00:31.339Z · comments (0)
Childhood and Education Roundup #5
Zvi · 2024-04-17T13:00:03.015Z · comments (4)
Drone Wars Endgame
RussellThor · 2024-02-01T02:30:46.161Z · comments (71)
On Dwarkesh’s 3rd Podcast With Tyler Cowen
Zvi · 2024-02-02T19:30:05.974Z · comments (9)
Deep Learning is cheap Solomonoff induction?
Lucius Bushnaq (Lblack) · 2024-12-07T11:00:56.455Z · comments (1)
Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (38)
Intent alignment as a stepping-stone to value alignment
Seth Herd · 2024-11-05T20:43:24.950Z · comments (4)
Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data
Sohaib Imran (sohaib-imran) · 2024-11-16T23:22:21.857Z · comments (11)
[link] The Way According To Zvi
Sable · 2024-12-07T17:35:48.769Z · comments (0)
Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (8)
A Matter of Taste
Zvi · 2024-12-18T17:50:07.201Z · comments (4)
My disagreements with "AGI ruin: A List of Lethalities"
Noosphere89 (sharmake-farah) · 2024-09-15T17:22:18.367Z · comments (46)
But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)
Fireplace and Candle Smoke
jefftk (jkaufman) · 2025-01-01T01:50:01.408Z · comments (4)
[link] [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew (ymath) · 2024-09-25T14:52:48.263Z · comments (2)
Grammars, subgrammars, and combinatorics of generalization in transformers
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-02T09:37:23.191Z · comments (0)