LessWrong 2.0 Reader

Apply to be a mentor in SPAR!
agucova · 2024-11-05T21:32:45.797Z · comments (0)

Scattered thoughts on what it means for an LLM to believe
TheManxLoiner · 2024-11-06T22:10:29.429Z · comments (4)

[link] Inescapably Value-Laden Experience—a Catchy Term I Made Up to Make Morality Rationalisable
James Stephen Brown (james-brown) · 2024-12-19T04:45:37.906Z · comments (0)

Agency overhang as a proxy for Sharp left turn
Eris (anton-zheltoukhov) · 2024-11-07T12:14:24.333Z · comments (0)

No, the Polymarket price does not mean we can immediately conclude what the probability of a bird flu pandemic is. We also need to know the interest rate!
Christopher King (christopher-king) · 2024-12-28T16:05:47.037Z · comments (8)

Bellevue Library Meetup - Nov 23
Cedar (xida-ren) · 2024-11-09T23:05:02.452Z · comments (3)

[link] Is P(Doom) Meaningful? Bayesian vs. Popperian Epistemology Debate
Liron · 2024-11-09T23:39:30.039Z · comments (0)

Speedrunning Rationality: Day I
aproteinengine · 2025-01-04T14:28:49.220Z · comments (0)

Theories With Mentalistic Atoms Are As Validly Called Theories As Theories With Only Non-Mentalistic Atoms
Lorec · 2024-11-12T06:45:26.039Z · comments (5)

Logic vs intuition <=> algorithm vs ML
pchvykov · 2025-01-04T09:06:51.822Z · comments (0)

Fractals to Quasiparticles
James Camacho (james-camacho) · 2024-11-26T20:19:29.675Z · comments (0)

Governance Course - Week 1 Reflections
Alice Blair (Diatom) · 2025-01-09T04:48:27.502Z · comments (1)

Fred the Heretic, a GPT for poetry
Bill Benzon (bill-benzon) · 2024-12-08T16:52:07.660Z · comments (0)

Ways to think about alignment
Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-10-27T01:40:50.762Z · comments (0)

[question] Are there ways to artificially fix laziness?
Aidar (aidar-toktargazin) · 2024-12-08T18:26:26.433Z · answers+comments (2)

Good Fortune and Many Worlds
Jonah Wilberg (jrwilb@googlemail.com) · 2024-12-27T13:21:43.142Z · comments (0)

Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis
Matt Levinson · 2025-01-10T06:53:02.228Z · comments (0)

[link] Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude
rife (edgar-muniz) · 2025-01-06T17:34:01.505Z · comments (18)

It is time to start war gaming for AGI
yanni kyriacos (yanni) · 2024-10-17T05:14:17.932Z · comments (1)

Vision of a positive Singularity
RussellThor · 2024-12-23T02:19:35.050Z · comments (0)

3. Improve Cooperation: Better Technologies
Allison Duettmann (allison-duettmann) · 2025-01-02T19:03:16.588Z · comments (2)

Distillation Of DeepSeek-Prover V1.5
IvanLin (matthewshing) · 2024-10-15T18:53:11.199Z · comments (1)

[link] A Logical Proof for the Emergence and Substrate Independence of Sentience
rife (edgar-muniz) · 2024-10-24T21:08:09.398Z · comments (31)

[link] Entropic strategy in Two Truths and a Lie
dkl9 · 2024-11-21T22:03:28.986Z · comments (2)

[question] Has Anthropic checked if Claude fakes alignment for intended values too?
Maloew (maloew-valenar) · 2024-12-23T00:43:07.490Z · answers+comments (1)

Some Comments on Recent AI Safety Developments
testingthewaters · 2024-11-09T16:44:58.936Z · comments (0)

Visualizing small Attention-only Transformers
WCargo (Wcargo) · 2024-11-19T09:37:42.213Z · comments (0)

[link] Predictions as Public Works Project — What Metaculus Is Building Next
ChristianWilliams · 2024-10-22T16:35:13.999Z · comments (0)

Germany-wide ACX Meetup
Fernand0 · 2024-11-17T10:08:54.584Z · comments (0)

Effects of Non-Uniform Sparsity on Superposition in Toy Models
Shreyans Jain (shreyans-jain) · 2024-11-14T16:59:43.234Z · comments (3)

Investing in Robust Safety Mechanisms is critical for reducing Systemic Risks
Tom DAVID (tom-david) · 2024-12-11T13:37:24.177Z · comments (3)

Towards a Clever Hans Test: Unmasking Sentience Biases in Chatbot Interactions
glykokalyx · 2024-11-10T22:34:58.956Z · comments (0)

Transformers Explained (Again)
RohanS · 2024-10-22T04:06:33.646Z · comments (0)

Jailbreaking ChatGPT and Claude using Web API Context Injection
Jaehyuk Lim (jason-l) · 2024-10-21T21:34:37.579Z · comments (0)

[link] Better antibodies by engineering targets, not engineering antibodies (Nabla Bio)
Abhishaike Mahajan (abhishaike-mahajan) · 2025-01-13T15:05:35.261Z · comments (0)

On AI Detectors Regarding College Applications
Kaustubh Kislay (kaustubh-kislay) · 2024-11-27T20:25:48.151Z · comments (2)

Linkpost: Look at the Water
J Bostock (Jemist) · 2024-12-30T19:49:04.107Z · comments (3)

Grokking revisited: reverse engineering grokking modulo addition in LSTM
Nikita Khomich (nikitoskh) · 2024-12-16T18:48:43.533Z · comments (0)

Dishbrain and implications.
RussellThor · 2024-12-29T10:42:43.912Z · comments (0)

ARC-AGI is a genuine AGI test but o3 cheated :(
Knight Lee (Max Lee) · 2024-12-22T00:58:05.447Z · comments (6)

[question] What (if anything) made your p(doom) go down in 2024?
Satron · 2024-11-16T16:46:43.865Z · answers+comments (6)

Morality as Cooperation Part III: Failure Modes
DeLesley Hutchins (delesley-hutchins) · 2024-12-05T09:39:27.816Z · comments (0)

(draft) Cyborg software should be open (?)
AtillaYasar (atillayasar) · 2024-11-01T07:24:51.966Z · comments (5)

A better “Statement on AI Risk?”
Knight Lee (Max Lee) · 2024-11-25T04:50:29.399Z · comments (6)

[question] Is OpenAI net negative for AI Safety?
Lysandre Terrisse · 2024-11-02T16:18:02.859Z · answers+comments (0)

Are SAE features from the Base Model still meaningful to LLaVA?
Shan23Chen (shan-chen) · 2024-12-05T19:24:34.727Z · comments (0)

[link] Expevolu, a laissez-faire approach to country creation
Fernando · 2024-12-05T19:29:24.011Z · comments (4)

[question] Noticing the World
EvolutionByDesign (bioluminescent-darkness) · 2024-11-04T16:41:44.696Z · answers+comments (1)

More Growth, Melancholy, and MindCraft @3QD [revised and updated]
Bill Benzon (bill-benzon) · 2024-12-05T19:36:02.289Z · comments (0)

[link] Can AI improve the current state of molecular simulation?
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-06T20:22:31.685Z · comments (0)


Recent comments

noa-nabeshima on Noa Nabeshima's Shortform

One barrier to SAE circuits is that it's currently hard to understand how attention out SAE latents are calculated. Even if you do IG attribution patching to try to understand which earlier latents are relevant to the attention out SAE latents, it doesn't tell you how these latents interact inside the attention layer at all.

noa-nabeshima on Noa Nabeshima's Shortform

Auto-interp is currently really really bad

I think o1 is the only model that seems to perform decently at auto-interp, but it's very expensive, i.e. about $1 per latent label. This is frustrating to me.

sharmake-farah on A problem shared by many different alignment targets

The Coherent Extrapolated Volition of a human Individual (CEVI) is a completely different type of thing, than the Coherent Extrapolated Volition of Humanity (CEVH). Both are mappings to an entity of the type that can be said to want things. But only CEVI is a mapping from an entity of the type that can be said to want things (the original human). CEVH does not map from such an entity. CEVH only maps to such an entity. A group of billions of human individuals can only be seen as such an entity, if one already has a specific way of resolving disagreements, amongst individuals that disagree on how to resolve disagreements. Such a disagreement resolution rule is one necessary part of the definition of any CEVH mapping.

I like to state this as the issue that every version of CEV or group alignment that aggregates the values of thousands of people or more must implicitly resolve disagreements in values, which in turn requires value-laden choices. At that point, you are essentially doing value alignment to what you think is good, and the nominal society is just a society of you.

I basically agree with Seth Herd here, in that instruction following is both the most likely and the best alignment target for purposes of AI safety (at least assuming offense-defense balance issues aren't too severe).

noa-nabeshima on Noa Nabeshima's Shortform

TinyModel SAEs have these first entity and second entity latents.

E.g. if the story is 'Once upon a time Tim met his friend Sally.', Tim is the first entity and Sally is the second entity.

I think I at one point found an 'object owned by second entity' latent but have had trouble finding it again.

I wonder if LMs are generating these reusable 'pointers' and then doing computation with the pointers. For example to track that an object is owned by the first entity, you just need to calculate which entities are instances of the first entity, calculate when first entity is shown to own an object and write 'owned by first entity' to the object token, and then broadcast that forward to other instances of the object. Then, if you have the tokens Tim|'s (and 's has calculated that the first entity is immediately before it), 's can, with a single attention head, look for objects owned by the first entity.

This means that the exact identity information of the object (e.g. ' hammer') and the exact identity information of the first entity (' Tim') don't need to be passed around in computations, you can just do much cheaper pointer calculations and grab the relevant identity information when necessary.

This suggests a more fine-grained story for what duplicate name heads are doing in IOI.

yonatan-cale-1 on Yonatan Cale's Shortform

"Protecting model weights" is aiming too low, I'd like labs to protect their intellectual property too. Against state actors. This probably means doing engineering work inside an air gapped network, yes.

I feel it's outside the Overton Window to even suggest this, and I'm not sure what to do about that except write a LessWrong shortform, I guess.

Anyway, common pushbacks:

"Employees move between companies and we can't prevent them sharing what they know": In the IDF we had secrets in our air gapped network which people didn't share because they understood it's important. I think lab employees could also understand it's important. I'm not saying this works perfectly, but it works well enough for nation states to do when they're taking security seriously.
"Working in an air gapped network is annoying": Yeah 100%, but it's doable, and there are many things to do to make it more comfortable. I worked for about 6 years as a developer in an air gapped network.

Also, a note of hope: I think it's not crazy for labs to aim for a development environment that is world leading in the tradeoff between convenience and security. I don't know what the U.S. has to offer in terms of a ready-made air gapped development environment, but I can imagine, for example, Anthropic being able to build something better if they take this project seriously, or at least build some parts really well before the U.S. government comes to fill in the missing parts. Anyway, that's what I'd aim for.

ryan_b on A Novel Idea for Harnessing Magnetic Reconnection as an Energy Source

This is a fun idea! I was recently poking at field line reconnection myself, in conversation with Claude.

I don't think the energy balance turns out in the idea's favor. Here are the heuristics I considered:

The first thing I note is what happens during reconnection: a bunch of the magnetic energy turns into kinetic and thermal energy. The part you plan to harvest is just the electric field part. Even in otherwise ideal circumstances, that's a substantial loss.
The second thing I note is that in a fusion reactor, the magnetic field is already being generated by the device, via electromagnets. This makes the process look like putting current into a magnetic field, then breaking the magnetic field in order to get less current back out (because of the first note).
The third thing I note is that reconnection is about the reconfiguration of the magnetic field lines. I'm highly confident that the electric fields present when the lines break determine how the lines reconnect, so if you induct all the energy out, the reconnection will look different than it otherwise would have. Mostly this would cash out as a weaker magnetic field than there would be otherwise, driving more recharging of the magnetic field and making the balance worse.

All of that being said, Claude and ChatGPT both respond well to sanity checking. You can say directly something like: "Sanity check: is this consistent with thermodynamics?"

I also think that ChatGPT misleadingly treated the magnetic fields and electric fields as being separate because it was using an ideal MHD model, where this is common due to the simplifications the model makes. In my experience at least Claude catches a lot of confusion and oversights by asking specifically about the differences between the physics and the model.

seth-herd on A problem shared by many different alignment targets

I very much agree with your top-level claim: analyzing different alignment targets well before we use them is a really good idea.

But I don't think those are the right alignment targets to analyze. I think none of those are very likely to actually be deployed as alignment targets for the first real AGIs. I think that Instruction-following AGI is easier and more likely than value aligned AGI, or roughly equivalently (and better-framed for the agent foundations crowd), Corrigibility as Singular Target is far superior to anything else. I think it's so superior that anyone sitting down and thinking about the topic, for instance just before launching something they viscerally believe might actually be able to learn and self-improve, will likely see it the same way.

On top of that logic, the people actually building the stuff would rather have it aligned to their goals than everyone's.

nathan-helm-burger on ryan_greenblatt's Shortform

https://www.lesswrong.com/posts/uPi2YppTEnzKG3nXD/nathan-helm-burger-s-shortform?commentId=rnT3z9F55A2pmrj4Y

cleo-nardo on Shortform

Must humans obey the Axiom of Irrelevant Alternatives?

If someone picks option A from options A, B, C, then they must also pick option A from options A and B. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA (independence of irrelevant alternatives), and it's treated as more fundamental than VNM. But should humans follow this? Maybe not.

Maybe humans are the negotiation between various "subagents", and many bargaining solutions (e.g. Kalai–Smorodinsky) violate IIA. We can use insight to decompose humans into subagents.

Let's suppose you pick A from {A,B,C} and B from {A,B} where:

A = Walk with your friend
B = Dinner party
C = Stay home alone

This feels like something I can imagine. We can explain this behaviour with two subagents: the introvert and the extrovert. The introvert has preferences C > A > B and the extrovert has the opposite preferences B > A > C. When the possible options are A and B, then the KS bargaining solution between the introvert and the extrovert will be B. At least, if the extrovert has more "weight". But when the option space expands to include C, then the bargaining solution might shift to A. Intuitively, the "fair" solution is one where neither bargainer is sacrificing significantly more than the other.
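
The IIA violation in this setup can be reproduced with a small sketch. It is not the Kalai–Smorodinsky solution proper, which is defined over a convex set of payoffs (e.g. with lotteries), but a discrete analogue: each subagent's gain is divided by its best attainable gain among the options on offer, and the option with the largest worst-case normalized gain wins. The utility numbers and the zero disagreement payoff are assumptions made up for illustration; they only respect the orderings C > A > B and B > A > C.

```python
from typing import Dict, List

# Hypothetical utilities, chosen only for illustration (not from the comment).
UTILITIES: Dict[str, Dict[str, float]] = {
    "introvert": {"A": 2.0, "B": 1.5, "C": 10.0},  # C > A > B
    "extrovert": {"A": 6.0, "B": 10.0, "C": 0.0},  # B > A > C
}
DISAGREEMENT = 0.0  # assumed payoff to each subagent if bargaining fails


def ks_like_choice(options: List[str]) -> str:
    """Pick the option that maximizes the smallest normalized gain.

    A subagent's normalized gain is its gain over the disagreement payoff,
    divided by the gain of its favourite option among those offered.
    """

    def worst_normalized_gain(option: str) -> float:
        gains = []
        for prefs in UTILITIES.values():
            ideal = max(prefs[o] for o in options)  # best this subagent could get
            gains.append((prefs[option] - DISAGREEMENT) / (ideal - DISAGREEMENT))
        return min(gains)

    return max(options, key=worst_normalized_gain)


print(ks_like_choice(["A", "B"]))       # B
print(ks_like_choice(["A", "B", "C"]))  # A
```

Adding C raises the introvert's ideal payoff, so both A and B now look like a large sacrifice for the introvert, and the balance tips from B to A even though C itself is never chosen.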

nathan-helm-burger on Nathan Helm-Burger's Shortform

Want to just give a quick take on this $450 o1-style model: https://novasky-ai.github.io/posts/sky-t1/

I think this matches a pattern we see a lot throughout the history of human engineering. Once a thing is known to be possible, and rough clues about how it was done are known (especially if many people get to play around with the product), then it won't be long until some other group figures out how to replicate a shoddy version of the new tech. And from there, usually (if there's a market for it) improvements can steadily cause the shoddy version to catch up to close to the original in performance.

When we apply this lesson to AGI, we should assume that a similar sort of thing will happen if some company develops AGI and shows it off to the world. Especially if they give hints about how they did it, and if they let users interact with it. The question then is, how long until the world produces a '$450' knock-off version of the AGI?

This is super relevant for governance. You can't assume that everyone who makes a knock-off will be taking the same security precautions as the original inventors. If the thing blocking the AGI from self-improving is the disciplined restraint, government oversight, and security mindset of the original inventors... well, don't count on those things. If the knock-off AGI is good enough to self-improve, its future versions won't be second-rate for long. Choosing not to assign the AGI to making stronger AGI is an alignment tax. Defectors will defect, and gain great power thereby.

We need a plan that covers this possibility. This is not definitely the path the future will take, but it is a plausible path.