LessWrong 2.0 Reader


How much progress actually happens in theoretical physics?
ChristianKl · 2025-04-04T23:08:00.633Z · comments (32)
Top AI safety newsletters, books, podcasts, etc – new AISafety.com resource
Bryce Robertson (bryceerobertson) · 2025-03-04T17:01:18.758Z · comments (2)
What Uniparental Disomy Tells Us About Improper Imprinting in Humans
Morpheus · 2025-03-28T11:24:47.133Z · comments (1)
Goodhart Typology via Structure, Function, and Randomness Distributions
JustinShovelain · 2025-03-25T16:01:08.327Z · comments (0)
Most Questionable Details in 'AI 2027'
scarcegreengrass · 2025-04-05T00:32:54.896Z · comments (4)
The Upcoming PEPFAR Cut Will Kill Millions, Many of Them Children
omnizoid · 2025-01-27T16:03:51.214Z · comments (2)
Field tests of semi-rationality in Brazilian military training
P. João (gabriel-brito) · 2025-03-12T16:14:12.590Z · comments (0)
Llama Does Not Look Good 4 Anything
Zvi · 2025-04-09T13:20:01.799Z · comments (1)
Non-Monotonic Infra-Bayesian Physicalism
Marcus Ogren · 2025-04-02T12:14:19.783Z · comments (0)
[link] What is it to solve the alignment problem?
Joe Carlsmith (joekc) · 2025-02-13T18:42:07.215Z · comments (6)
An overview of areas of control work
ryan_greenblatt · 2025-03-25T22:02:16.178Z · comments (0)
When the Wannabe Rambo Comedian Cried
P. João (gabriel-brito) · 2025-03-31T14:47:50.660Z · comments (0)
AI #105: Hey There Alexa
Zvi · 2025-02-27T14:30:08.038Z · comments (3)
Knocking Down My AI Optimist Strawman
tailcalled · 2025-02-08T10:52:33.183Z · comments (3)
Chicanery: No
Screwtape · 2025-02-06T05:42:45.095Z · comments (10)
Monthly Roundup #28: March 2025
Zvi · 2025-03-17T12:50:03.097Z · comments (8)
On the Implications of Recent Results on Latent Reasoning in LLMs
Rauno Arike (rauno-arike) · 2025-03-31T11:06:23.939Z · comments (6)
[link] How prediction markets can create harmful outcomes: a case study
B Jacobs (Bob Jacobs) · 2025-04-02T15:37:09.285Z · comments (2)
Eliciting bad contexts
Geoffrey Irving · 2025-01-24T10:39:39.358Z · comments (8)
[link] The 4-Minute Mile Effect
Parker Conley (parker-conley) · 2025-04-14T21:41:27.726Z · comments (6)
Who wants to bet me $25k at 1:7 odds that there won't be an AI market crash in the next year?
Remmelt (remmelt-ellen) · 2025-04-08T08:31:59.900Z · comments (15)
How Close We Are to a Complete List of Imprinted Genes
Morpheus · 2025-04-19T18:37:57.074Z · comments (1)
Meetups Notes (Q1 2025)
jenn (pixx) · 2025-03-31T01:12:11.774Z · comments (2)
Why Aligning an LLM is Hard, and How to Make it Easier
RogerDearnaley (roger-d-1) · 2025-01-23T06:44:04.048Z · comments (3)
AI #103: Show Me the Money
Zvi · 2025-02-13T15:20:07.057Z · comments (9)
Prospects for Alignment Automation: Interpretability Case Study
Jacob Pfau (jacob-pfau) · 2025-03-21T14:05:51.528Z · comments (4)
Why you maybe should lift weights, and How to.
samusasuke · 2025-02-12T05:15:32.011Z · comments (29)
Nonpartisan AI safety
Yair Halberstadt (yair-halberstadt) · 2025-02-10T14:55:50.913Z · comments (4)
[link] A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology
Cosmia_Nebula · 2025-01-30T09:53:16.152Z · comments (1)
[link] Estimating the Probability of Sampling a Trained Neural Network at Random
Adam Scherlis (adam-scherlis) · 2025-03-01T02:11:56.313Z · comments (10)
EIS XV: A New Proof of Concept for Useful Interpretability
scasper · 2025-03-17T20:05:30.580Z · comments (2)
Takeaways From Our Recent Work on SAE Probing
Josh Engels (JoshEngels) · 2025-03-03T19:50:16.692Z · comments (0)
[link] Anthropic CEO calls for RSI
Andrea_Miotti (AndreaM) · 2025-01-29T16:54:24.943Z · comments (10)
Deep sparse autoencoders yield interpretable features too
Armaan A. Abraham (armaanabraham) · 2025-02-23T05:46:59.189Z · comments (8)
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak (tomek-korbak) · 2025-04-14T16:45:46.584Z · comments (1)
[Linkpost] Visual roadmap to strong human germline engineering
TsviBT · 2025-04-05T22:22:57.744Z · comments (0)
Notes on Occam via Solomonoff vs. hierarchical Bayes
JesseClifton · 2025-02-10T17:55:14.689Z · comments (7)
Agents don't have to be aligned to help us achieve an indefinite pause.
Hastings (hastings-greer) · 2025-01-25T18:51:03.523Z · comments (0)
[link] Altman blog on post-AGI world
Julian Bradshaw · 2025-02-09T21:52:30.631Z · comments (10)
Towards building blocks of ontologies
Daniel C (harper-owen) · 2025-02-08T16:03:29.854Z · comments (0)
Validating against a misalignment detector is very different to training against one
mattmacdermott · 2025-03-04T15:41:04.692Z · comments (4)
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng (ccstan99) · 2025-02-07T03:57:30.904Z · comments (0)
Selection Pressures on LM Personas
Raymond D · 2025-03-28T20:33:09.918Z · comments (0)
Deference and Decision-Making
ben_levinstein (benlev) · 2025-01-27T22:02:17.578Z · comments (2)
MONA: Three Months Later - Updates and Steganography Without Optimization Pressure
David Lindner · 2025-04-12T23:15:07.964Z · comments (0)
AI #112: Release the Everything
Zvi · 2025-04-17T15:10:02.029Z · comments (6)
[link] Reasoning models don't always say what they think
Joe Benton · 2025-04-09T19:48:58.733Z · comments (4)
[link] Takeaways from sketching a control safety case
joshc (joshua-clymer) · 2025-01-31T04:43:45.917Z · comments (0)
[link] Smelling Nice is Good, Actually
Gordon Seidoh Worley (gworley) · 2025-03-18T16:54:43.324Z · comments (8)
How much does it cost to back up solar with batteries?
jasoncrawford · 2025-03-25T16:35:52.834Z · comments (6)