LessWrong 2.0 Reader


[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (74)
[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)
Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)
The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (15)
OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)
Counting arguments provide no evidence for AI doom
Nora Belrose (nora-belrose) · 2024-02-27T23:03:49.296Z · comments (188)
[question] How to get nerds fascinated about mysterious chronic illness research?
riceissa · 2024-05-27T22:58:29.707Z · answers+comments (50)
Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (22)
[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (13)
[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (33)
I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-12-03T21:19:42.333Z · comments (7)
[link] Ideological Bayesians
Kevin Dorst · 2024-02-25T14:17:25.070Z · comments (4)
[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)
A breakdown of AI capability levels focused on AI R&D labor acceleration
ryan_greenblatt · 2024-12-22T20:56:00.298Z · comments (5)
[link] Ilya Sutskever created a new AGI startup
harfe · 2024-06-19T17:17:17.366Z · comments (35)
[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)
[link] Explaining Impact Markets
Saul Munn (saul-munn) · 2024-01-31T09:51:27.587Z · comments (2)
On Claude 3.5 Sonnet
Zvi · 2024-06-24T12:00:05.719Z · comments (14)
It's time for a self-reproducing machine
Carl Feynman (carl-feynman) · 2024-08-07T21:52:22.819Z · comments (68)
[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (5)
[link] Things You’re Allowed to Do: University Edition
Saul Munn (saul-munn) · 2024-02-06T00:36:11.690Z · comments (13)
Towards a Less Bullshit Model of Semantics
johnswentworth · 2024-06-17T15:51:06.060Z · comments (44)
[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (58)
[link] RAND report finds no effect of current LLMs on viability of bioterrorism attacks
StellaAthena · 2024-01-25T19:17:30.493Z · comments (14)
2024 Petrov Day Retrospective
Ben Pace (Benito) · 2024-09-28T21:30:14.952Z · comments (25)
[link] Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity
GideonF · 2024-07-15T18:37:40.232Z · comments (17)
[link] The Intelligence Curse
lukedrago · 2025-01-03T19:07:43.493Z · comments (26)
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (2)
[link] Sabotage Evaluations for Frontier Models
David Duvenaud (david-duvenaud) · 2024-10-18T22:33:14.320Z · comments (55)
[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)
Notes on Dwarkesh Patel’s Podcast with Demis Hassabis
Zvi · 2024-03-01T16:30:08.687Z · comments (0)
Apollo Research 1-year update
Marius Hobbhahn (marius-hobbhahn) · 2024-05-29T17:44:32.484Z · comments (0)
Everything Wrong with Roko's Claims about an Engineered Pandemic
WitheringWeights (EZ97) · 2024-02-22T15:59:08.439Z · comments (10)
SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)
OpenAI: The Board Expands
Zvi · 2024-03-12T14:00:04.110Z · comments (1)
Catastrophic sabotage as a major threat model for human-level AI systems
evhub · 2024-10-22T20:57:11.395Z · comments (11)
Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (8)
Science advances one funeral at a time
Cameron Berg (cameron-berg) · 2024-11-01T23:06:19.381Z · comments (9)
We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
Lukas_Gloor · 2024-05-09T15:43:11.490Z · comments (36)
Takeoff speeds presentation at Anthropic
Tom Davidson (tom-davidson-1) · 2024-06-04T22:46:35.448Z · comments (0)
Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (75)
Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders
Johnny Lin (hijohnnylin) · 2024-03-25T21:17:58.421Z · comments (7)
Comment on "Death and the Gorgon"
Zack_M_Davis · 2025-01-01T05:47:30.730Z · comments (32)
[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (33)
Reasons for and against working on technical AI safety at a frontier AI lab
bilalchughtai (beelal) · 2025-01-05T14:49:53.529Z · comments (12)
New page: Integrity
Zach Stein-Perlman · 2024-07-10T15:00:41.050Z · comments (3)
Circular Reasoning
abramdemski · 2024-08-05T18:10:32.736Z · comments (37)
Introducing Squiggle AI
ozziegooen · 2025-01-03T17:53:42.915Z · comments (15)
Zvi’s Thoughts on His 2nd Round of SFF
Zvi · 2024-11-20T13:40:08.092Z · comments (2)