LessWrong 2.0 Reader

The Field of AI Alignment: A Postmortem, and What To Do About It
johnswentworth · 2024-12-26T18:48:07.614Z · comments (123)

[link] Review: Planecrash
L Rudolf L (LRudL) · 2024-12-27T14:18:33.611Z · comments (24)

Orienting to 3 year AGI timelines
Nikola Jurkovic (nikolaisalreadytaken) · 2024-12-22T01:15:11.401Z · comments (37)

[link] By default, capital will matter more than ever after AGI
L Rudolf L (LRudL) · 2024-12-28T17:52:58.358Z · comments (60)

Shallow review of technical AI safety, 2024
technicalities · 2024-12-29T12:01:14.724Z · comments (22)

What o3 Becomes by 2028
Vladimir_Nesov · 2024-12-22T12:37:20.929Z · comments (15)

Hire (or Become) a Thinking Assistant
Raemon · 2024-12-23T03:58:42.061Z · comments (38)

[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (72)

A Three-Layer Model of LLM Psychology
Jan_Kulveit · 2024-12-26T16:49:41.738Z · comments (3)

AIs Will Increasingly Fake Alignment
Zvi · 2024-12-24T13:00:07.770Z · comments (0)

A breakdown of AI capability levels focused on AI R&D labor acceleration
ryan_greenblatt · 2024-12-22T20:56:00.298Z · comments (5)

Some arguments against a land value tax
Matthew Barnett (matthew-barnett) · 2024-12-29T15:17:00.740Z · comments (23)

Why I'm Moving from Mechanistic to Prosaic Interpretability
Daniel Tan (dtch1997) · 2024-12-30T06:35:43.417Z · comments (17)

2025 Prediction Thread
habryka (habryka4) · 2024-12-30T01:50:14.216Z · comments (13)

Checking in on Scott's composition image bet with imagen 3
Dave Orr (dave-orr) · 2024-12-22T19:04:17.495Z · comments (0)

Is "VNM-agent" one of several options, for what minds can grow up into?
AnnaSalamon · 2024-12-30T06:36:20.890Z · comments (31)

AI #96: o3 But Not Yet For Thee
Zvi · 2024-12-26T20:30:06.722Z · comments (8)

Vegans need to eat just enough Meat - emperically evaluate the minimum ammount of meat that maximizes utility
Johannes C. Mayer (johannes-c-mayer) · 2024-12-22T22:08:31.971Z · comments (34)

ReSolsticed vol I: "We're Not Going Quietly"
Raemon · 2024-12-26T17:52:33.727Z · comments (4)

[question] What Have Been Your Most Valuable Casual Conversations At Conferences?
johnswentworth · 2024-12-25T05:49:36.711Z · answers+comments (19)

[link] Began a pay-on-results coaching experiment, made $40,300 since July
Chipmonk · 2024-12-29T21:12:02.574Z · comments (11)

AI Assistants Should Have a Direct Line to Their Developers
Jan_Kulveit · 2024-12-28T17:01:58.643Z · comments (4)

[link] The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228)
Eneasz · 2024-12-24T22:45:50.065Z · comments (4)

What happens next?
Logan Zoellner (logan-zoellner) · 2024-12-29T01:41:33.685Z · comments (19)

[link] Learn to write well BEFORE you have something worth saying
eukaryote · 2024-12-29T23:42:31.906Z · comments (7)

Considerations on orca intelligence
Towards_Keeperhood (Simon Skade) · 2024-12-29T14:35:16.445Z · comments (4)

[question] What are the most interesting / challenging evals (for humans) available?
Raemon · 2024-12-27T03:05:26.831Z · answers+comments (13)

Greedy-Advantage-Aware RLHF
sej2020 · 2024-12-27T19:47:25.562Z · comments (13)

o3, Oh My
Zvi · 2024-12-30T14:10:05.144Z · comments (9)

People aren't properly calibrated on FrontierMath
cakubilo · 2024-12-23T19:35:44.467Z · comments (4)

Acknowledging Background Information with P(Q|I)
JenniferRM · 2024-12-24T18:50:25.323Z · comments (8)

Corrigibility's Desirability is Timing-Sensitive
RobertM (T3t) · 2024-12-26T22:24:17.435Z · comments (4)

Living with Rats in College
lsusr · 2024-12-25T10:44:13.085Z · comments (0)

Book Summary: Zero to One
bilalchughtai (beelal) · 2024-12-29T16:13:52.922Z · comments (1)

[question] What is your personal totalizing and self-consistent worldview/philosophy?
lsusr · 2024-12-27T23:59:30.641Z · answers+comments (11)

[link] The Alignment Simulator
Yair Halberstadt (yair-halberstadt) · 2024-12-22T11:45:55.220Z · comments (3)

[link] Funding Case: AI Safety Camp 11
Remmelt (remmelt-ellen) · 2024-12-23T08:51:55.255Z · comments (0)

If all trade is voluntary, then what is "exploitation?"
Darmani · 2024-12-27T11:21:30.036Z · comments (42)

[link] PCR retrospective
bhauth · 2024-12-26T21:20:56.484Z · comments (0)

The average rationalist IQ is about 122
Rockenots (Ekefa) · 2024-12-28T15:42:07.067Z · comments (20)

[link] Letter from an Alien Mind
Shoshannah Tekofsky (DarkSym) · 2024-12-27T13:20:49.277Z · comments (7)

Non-Obvious Benefits of Insurance
jefftk (jkaufman) · 2024-12-23T03:40:02.184Z · comments (5)

[link] Human-AI Complementarity: A Goal for Amplified Oversight
rishubjain · 2024-12-24T09:57:55.111Z · comments (1)

[link] It looks like there are some good funding opportunities in AI safety right now
Benjamin_Todd · 2024-12-22T12:41:02.151Z · comments (0)

subfunctional overlaps in attentional selection history implies momentum for decision-trajectories
Emrik (Emrik North) · 2024-12-22T14:12:49.027Z · comments (1)

Theoretical Alignment's Second Chance
lunatic_at_large · 2024-12-22T05:03:51.653Z · comments (0)

Whistleblowing Twitter Bot
Mckiev · 2024-12-26T04:09:45.493Z · comments (5)

Monthly Roundup #25: December 2024
Zvi · 2024-12-23T14:20:04.682Z · comments (3)

Proof Explained for "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-22T15:06:16.880Z · comments (0)

[link] A primer on machine learning in cryo-electron microscopy (cryo-EM)
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-22T15:11:58.860Z · comments (0)


Recent comments

jimrandomh on Jimrandomh's Shortform

Studying the diets of outlier-obese people is definitely something we should be doing (and are doing, a little), but yeah, the outliers are probably going to be obese for reasons other than "the reason obesity has increased over time but moreso".

faul_sname on o3, Oh My

As someone who has been on both sides of that fence, agreed. Architecting a system is about being aware of hundreds of different ways things can go wrong, recognizing which of those things are likely to impact you in your current use case, and deciding what structure and conventions you will use. It's also very helpful, as an architect, to provide example usages of the design patterns which will replicate themselves around your new system. All of which are things that current models are already very good, verging on superhuman, at.

On the flip side, I expect that the "piece together context to figure out where your software's model of the world has important holes" part of software engineering will remain relevant even after AI becomes technically capable of doing it, because that process frequently involves access to sensitive data across multiple sources where having an automated, unauthenticated system which can access all of those data sources at once would be a really bad idea (having a single human able to do all that is also a pretty bad idea in many cases, but at least the human has skin in the game).

the-gears-to-ascension on Alexander Gietelink Oldenziel's Shortform

perhaps. but my reasoning is something like -
better than "alignment": what's being aligned? outcomes should be (citation needed)
better than "ethics": how does one act ethically? by producing good outcomes (citation needed).
better than "notkilleveryoneism": I actually would prefer everyone dying now to everyone being tortured for a million years and then dying, for example, and I can come up with many other counterexamples - not dying is not the problem, achieving good things is the problem.
might not work for deontologists. that seems fine to me, I float somewhere between virtue ethics and utilitarianism anyway.
perhaps there are more catchy words that could be used, though. hope to see someone suggest one someday.

karl-krueger on If all trade is voluntary, then what is "exploitation?"

I use capitalism in a manner mutually exclusive with slave labor because it requires self-ownership.

This seems like a sort of definitional gimbal lock; it makes it harder to describe the world because two potentially-separate degrees of freedom are collapsed into one. While I'm reluctant to argue definitions [LW · GW], I think it's worth using terms in ways that allow us to describe the world in more detail than ones that collapse distinctions.

I expect to see this usage of "capitalism" not in history or economics, but in the sort of political doctrine where it's intended to lock those concepts together; to imply that capital markets and individual freedom are either the same thing, or closely related — more closely, I think, than history and contemporary events really support.

It would seem weird to me, for instance, to claim that a publicly-traded company that is discovered to have done something to violate individual freedom is thereby no longer a participant in a capitalist economy. The New York Stock Exchange doesn't ask "does this company infringe individual freedoms anywhere in the world?" before letting a company be listed. (To be clear, I'm not proposing that it should; I'm saying that it's useful to talk about "participation in a capital market economy" and "fully respecting some set of individual freedoms" as distinct axes.)

(For what it's worth, I think "self-ownership" is a pretty odd expression, because one of the central traits of ownership is that it can be transferred, and one of the central traits of selfhood is that it cannot. Your relation to yourself is distinct from property ownership in that you can sell any piece of your property, but you cannot sell your self; no matter what obligations you may have signed up for, you always retain possession of your self.)

vladimir_nesov on o3

Test time compute is applied in-context, so it's very worthwhile to scale, getting better and better at solving a particular problem, to the extent that no amount of pretraining [LW(p) · GW(p)] would be able to match with only modest test-time compute.

sodium on Shallow review of technical AI safety, 2024

Pr(Ai)2R is at least partially funded by Good Ventures/OpenPhil

moridinamael on Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds

This post resonated with me when it came out, and I think its thesis only seems more credible with time. Anthropic's seminal "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (the Golden Gate Claude paper) seems right in line with these ideas. We can make scrutable the inscrutable as long as the inscrutable takes the form of something organized and regular and repeatable.

This article gets bonus points from me for being succinct while still making its argument clearly.

qvalq on You Provably Can't Trust Yourself

How/does this square with https://arxiv.org/abs/1902.07404?
IIUC, Gödel's Second Incompleteness Theorem was overinterpreted, and a different operationalization of consistency is provable.

I talked to Mihály Bárász about that, and he didn't think it was crazy.

mateusz-baginski on Alexander Gietelink Oldenziel's Shortform

Insufficiently catchy

akash-wasil on evhub's Shortform

I'm glad you're doing this, and I support many of the ideas already suggested. Some additional ideas:

Interview program. Work with USAISI or UKAISI (or DHS/NSA) to pilot an interview program in which officials can ask questions about AI capabilities, safety and security threats, and national security concerns. (If it's not feasible to do this with a government entity yet, start a pilot with a non-government group– perhaps METR, Apollo, Palisade, or the new AI Futures Project.)
Clear communication about RSP capability thresholds. I think the RSP could do a better job at outlining the kinds of capabilities that Anthropic is worried about and what sorts of thresholds would trigger a reaction. I think the OpenAI preparedness framework tables are a good example of this kind of clear/concise communication. It's easy for a naive reader to quickly get a sense of "oh, this is the kind of capability that OpenAI is worried about." (Clarification: I'm not suggesting that Anthropic should abandon the ASL approach or that OpenAI has necessarily identified the right capability thresholds. I'm saying that the tables are a good example of the kind of clarity I'm looking for– someone could skim this and easily get a sense of what thresholds OpenAI is tracking, and I think OpenAI's PF currently achieves this much more than the Anthropic RSP.)
Emergency protocols. Publishing an emergency protocol that specifies how Anthropic would react if it needed to quickly shut down a dangerous AI system. (See some specific prompts in the "AI developer emergency response protocol" section here). Some information can be redacted from a public version (I think it's important to have a public version, though, partly to help government stakeholders understand how to handle emergency scenarios, partly to raise the standard for other labs, and partly to acquire feedback from external groups.)
RSP surveys. Evaluate the extent to which Anthropic employees understand the RSP, their attitudes toward the RSP, and how the RSP affects their work. More on this here [LW(p) · GW(p)].
More communication about Anthropic's views about AI risks and AI policy. Some specific examples of hypothetical posts I'd love to see:
- "How Anthropic thinks about misalignment risks"
- "What the world should do if the alignment problem ends up being hard"
- "How we plan to achieve state-proof security before AGI"
- Encouraging more employees to share their views on various topics, EG Sam Bowman's post [LW · GW].
AI dialogues/debates. It would be interesting to see Anthropic employees have discussions/debates with other folks thinking about advanced AI. Hypothetical examples:
- "What are the best things the US government should be doing to prepare for advanced AI" with Jack Clark and Daniel Kokotajlo.
- "Should we have a CERN for AI?" with [someone from Anthropic] and Miles Brundage.
- "How difficult should we expect alignment to be" with [someone from Anthropic] and [someone who expects alignment to be harder; perhaps Jeffrey Ladish or Malo Bourgon].

More ambitiously, I feel like I don't really understand Anthropic's plan for how to manage race dynamics in worlds where alignment ends up being "hard enough to require a lot more than RSPs and voluntary commitments."

From a policy standpoint, several of the most interesting open questions seem to be along the lines of "under what circumstances should the USG get considerably more involved in overseeing certain kinds of AI development" and "conditional on the USG wanting to get way more involved, what are the best things for it to do?" It's plausible that Anthropic is limited in how much work it could do on these kinds of questions (particularly in a public way). Nonetheless, it could be interesting to see Anthropic engage more with questions like the ones Miles raises here.