LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

AI 2027 is a Bet Against Amdahl's Law
snewman · 2025-04-21T03:09:40.751Z · comments (26)

My model of what is going on with LLMs
Cole Wyeth (Amyr) · 2025-02-13T03:43:29.447Z · comments (49)

[link] A short course on AGI safety from the GDM Alignment team
Vika · 2025-02-14T15:43:50.903Z · comments (1)

C'mon guys, Deliberate Practice is Real
Raemon · 2025-02-05T22:33:59.069Z · comments (25)

[link] What the Headlines Miss About the Latest Decision in the Musk vs. OpenAI Lawsuit
garrison · 2025-03-06T19:49:02.145Z · comments (0)

AI Control May Increase Existential Risk
Jan_Kulveit · 2025-03-11T14:30:05.972Z · comments (13)

Third-wave AI safety needs sociopolitical thinking
Richard_Ngo (ricraz) · 2025-03-27T00:55:30.548Z · comments (23)

Timaeus in 2024
Jesse Hoogland (jhoogland) · 2025-02-20T23:54:56.939Z · comments (1)

Reviewing LessWrong: Screwtape's Basic Answer
Screwtape · 2025-02-05T04:30:34.347Z · comments (18)

The Lizardman and the Black Hat Bobcat
Screwtape · 2025-04-06T19:02:01.238Z · comments (13)

Impact, agency, and taste
benkuhn · 2025-04-19T21:10:06.960Z · comments (1)

How I talk to those above me
Maxwell Peterson (maxwell-peterson) · 2025-03-30T06:54:59.869Z · comments (13)

Show, not tell: GPT-4o is more opinionated in images than in text
Daniel Tan (dtch1997) · 2025-04-02T08:51:02.571Z · comments (41)

The Rising Sea
Jesse Hoogland (jhoogland) · 2025-01-25T20:48:52.971Z · comments (2)

How training-gamers might function (and win)
Vivek Hebbar (Vivek) · 2025-04-11T21:26:18.669Z · comments (4)

[link] Elite Coordination via the Consensus of Power
Richard_Ngo (ricraz) · 2025-03-19T06:56:44.825Z · comments (15)

Six Thoughts on AI Safety
boazbarak · 2025-01-24T22:20:50.768Z · comments (55)

[link] Towards a scale-free theory of intelligent agency
Richard_Ngo (ricraz) · 2025-03-21T01:39:42.251Z · comments (24)

Dear AGI,
Nathan Young · 2025-02-18T10:48:15.030Z · comments (11)

We should start looking for scheming "in the wild"
Marius Hobbhahn (marius-hobbhahn) · 2025-03-06T13:49:39.739Z · comments (4)

How To Believe False Things
Eneasz · 2025-04-02T16:28:29.055Z · comments (10)

[link] Wired on: "DOGE personnel with admin access to Federal Payment System"
Raemon · 2025-02-05T21:32:11.205Z · comments (45)

[link] Anthropic releases Claude 3.7 Sonnet with extended thinking mode
LawrenceC (LawChan) · 2025-02-24T19:32:43.947Z · comments (8)

On Emergent Misalignment
Zvi · 2025-02-28T13:10:05.973Z · comments (5)

Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red
Julian Bradshaw · 2025-04-21T03:52:34.759Z · comments (10)

What goals will AIs have? A list of hypotheses
Daniel Kokotajlo (daniel-kokotajlo) · 2025-03-03T20:08:31.539Z · comments (19)

The Risk of Gradual Disempowerment from AI
Zvi · 2025-02-05T22:10:06.979Z · comments (15)

Voting Results for the 2023 Review
Raemon · 2025-02-06T08:00:37.461Z · comments (3)

How I force LLMs to generate correct code
claudio · 2025-03-21T14:40:19.211Z · comments (7)

Vacuum Decay: Expert Survey Results
JessRiedel · 2025-03-13T18:31:17.434Z · comments (26)

Stargate AI-1
Zvi · 2025-01-24T15:20:18.752Z · comments (1)

One-shot steering vectors cause emergent misalignment, too
Jacob Dunefsky (jacob-dunefsky) · 2025-04-14T06:40:41.503Z · comments (6)

[link] ASI existential risk: Reconsidering Alignment as a Goal
habryka (habryka4) · 2025-04-15T19:57:42.547Z · comments (14)

How might we safely pass the buck to AI?
joshc (joshua-clymer) · 2025-02-19T17:48:32.249Z · comments (58)

A Slow Guide to Confronting Doom
Ruby · 2025-04-06T02:10:56.483Z · comments (20)

OpenAI #11: America Action Plan
Zvi · 2025-03-18T12:50:03.880Z · comments (3)

Ambiguous out-of-distribution generalization on an algorithmic task
Wilson Wu (wilson-wu) · 2025-02-13T18:24:36.160Z · comments (6)

The Mask Comes Off: A Trio of Tales
Zvi · 2025-02-14T15:30:15.372Z · comments (1)

Is Gemini now better than Claude at Pokémon?
Julian Bradshaw · 2025-04-19T23:34:43.298Z · comments (11)

Keltham's Lectures in Project Lawful
Morpheus · 2025-04-01T10:39:47.973Z · comments (4)

MONA: Managed Myopia with Approval Feedback
Seb Farquhar · 2025-01-23T12:24:18.108Z · comments (29)

Mistral Large 2 (123B) exhibits alignment faking
Marc Carauleanu (Marc-Everin Carauleanu) · 2025-03-27T15:39:02.176Z · comments (4)

Open problems in emergent misalignment
Jan Betley (jan-betley) · 2025-03-01T09:47:58.889Z · comments (13)

Microplastics: Much Less Than You Wanted To Know
jenn (pixx) · 2025-02-15T19:08:14.561Z · comments (8)

You will crash your car in front of my house within the next week
Richard Korzekwa (Grothor) · 2025-04-01T21:43:21.472Z · comments (6)

What Makes an AI Startup "Net Positive" for Safety?
jacquesthibs (jacques-thibodeau) · 2025-04-18T20:33:22.682Z · comments (22)

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik (lucy.fa) · 2025-02-26T12:50:04.204Z · comments (8)

Elon Musk May Be Transitioning to Bipolar Type I
Cyborg25 · 2025-03-11T17:45:06.599Z · comments (22)

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Stuart_Armstrong · 2025-03-18T14:48:54.762Z · comments (12)

Announcing ILIAD2: ODYSSEY
Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-04-03T17:01:06.004Z · comments (1)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

q-home on Clarifying the Agent-Like Structure Problem

Can anybody give/reference an ELI5 or ELI15 explanation of this example? How can we use the models without creating them? I know that gradient descent is used to update neural networks, but how can you get the predictions of those NNs without having them?

mo-putera on Most AI value will come from broad automation, not from R&D

Thought it would be useful to pull out your plot and surrounding text, which seemed cruxy:

At first glance, the job of a scientist might seem like it leans very heavily on abstract reasoning... In such a world, AIs would greatly accelerate R&D before AIs are broadly deployed across the economy to take over more common jobs, such as retail workers, real estate agents, or IT professionals. In short, AIs would “first automate science, then automate everything else.”
But this picture is likely wrong. In reality, most R&D jobs require much more than abstract reasoning skills. ... To demonstrate this, we used GPT-4.5 to label tasks across 12 common R&D occupations into one of three categories, depending on whether it thinks the task can be performed using only abstract reasoning skills, whether it requires complex computer-use skills (but not physical presence), or whether it one needs to be physically present to complete the task. See this link to our conversation with GPT-4.5 to find our methodology and results.

This plot reveals a more nuanced picture of what scientific research actually entails. Contrary to the assumption that research is largely an abstract reasoning task, the reality is that much of it involves physical manipulation and advanced agency. To fully automate R&D, AI systems likely require the ability to autonomously operate computer GUIs, coordinate effectively with human teams, possess strong executive functioning skills to complete highly complex projects over long time horizons, and manipulate their physical environment to conduct experiments.
Yet, by the time AI reaches the level required to fully perform this diverse array of skills at a high level of capability, it is likely that a broad swath of more routine jobs will have already been automated. This contradicts the notion that AI will “first automate science, then automate everything else.” Instead, a more plausible prediction is that AI automation will first automate a large share of the general workforce, across a very wide range of industries, before it reaches the level needed to fully take over R&D.

sustrik on Accountability Sinks

It’s always worth asking, do we own the process or does the process own us?

That's a nice shortcut to explain the distinction between "a process imposed upon yourself" vs. "a process handed to you from above".

pablo_stafforini on Pablo's Shortform

I think there is a vast difference between Gerard and Kruel, not just in the damage each has caused but also in their intellectual honesty and responsiveness to argument (null in the case of Gerard, decent in the case of Kruel, at least from my recollection).

samuelshadrach on Davidmanheim's Shortform

Why does this matter? To quote a Yudkowsky-ish example, maybe you can take a 16-th century human (before Newtonian physics was invented, after guns were invented) and explain to him how a nuclear bomb works. This doesn't matter for predicting the outcome of a hypothetical war between 16th century Britain and 21st century USA.

ASI inventions can be big surprises and yet be things that you could understand if someone taught you.

We could probably understand how a von Neumann probe or an anti-aging cure worked too, if someone taught us.

mo-putera on Accountability Sinks

I think this essay is going to be one I frequently recommend to others over the coming years, thanks for writing it.

But in the end, deep in the heart of any bureaucracy, the process is about responsibility and the ways to avoid it. It's not an efficiency measure, it’s an accountability management technique.

This vaguely reminded me of what Ivan Vendrov wrote in Metrics, Cowardice, and Mistrust. Ivan began by noting that "companies optimizing for simple engagement metrics aren’t even being economically rational... so why don't they?" It's not because "these more complex things are hard to measure", if you think about it. His answer is cowardice and mistrust, which lead to the selection of metrics "robust to an adversary trying to fool you":

But the other reason we use metrics, sadly much more common, is due to cowardice (sorry, risk-aversion) and mistrust.
Cowardice because nobody wants to be responsible for making a decision. Actually trying to understand the impact of a new feature on your users and then making a call is an inherently subjective process that involves judgment, i.e. responsibility, i.e. you could be fired if you fuck it up. Whereas if you just pick whichever side of the A/B test has higher metrics, you’ve successfully outsourced your agency to an impartial process, so you’re safe.
Mistrust because not only does nobody want to make the decision themselves, nobody even wants to delegate it! Delegating the decision to a specific person also involves a judgment call about that person. If they make a bad decision, that reflects badly on you for trusting them! So instead you insist that “our company makes data driven decisions” which is a euphemism for “TRUST NO ONE”. This works all the way up the hierarchy - the CEO doesn’t trust the Head of Product, the board members don’t trust the CEO, everyone insists on seeing metrics and so metrics rule.
Coming back to our original question: why can’t we have good metrics that at least try to capture the complexity of what users want? Again, cowardice and mistrust. There’s a vast space of possible metrics, and choosing any specific one is a matter of judgment. But we don’t trust ourselves or anyone else enough to make that judgment call, so we stick with the simple dumb metrics.
This isn’t always a bad thing! Police departments are often evaluated by their homicide clearance rate because murders are loud and obvious and their numbers are very hard to fudge. If we instead evaluated them by some complicated CRIME+ index that a committee came up with, I’d expect worse outcomes across the board.
Nobody thinks “number of murders cleared” is the best metric of police performance, any more than DAUs are the best metric of product quality, or GDP is the best metric of human well-being. However they are the best in the sense of being hardest to fudge, i.e. robust to an adversary trying to fool you. As trust declines, we end up leaning more and more on these adversarially robust metrics, and we end up in a gray Brezhnev world where the numbers are going up, everyone knows something is wrong, but the problems get harder and harder to articulate.

His preferred solution to counteracting this tendency to use adversarially robust but terrible metrics is to develop an ideology to promote mission alignment:

A popular attempt at a solution is monarchy. ... The big problem with monarchy is that it doesn’t scale. ...
A more decentralized and scalable solution is developing an ideology: a self-reinforcing set of ideas nominally held by everyone in your organization. Having a shared ideology increases trust, and ideologues are able to make decisions against the grain of natural human incentives. This is why companies talk so much about “mission alignment”, though very few organizations can actually pull off having an ideology: when real sacrifices need to be made, either your employees or your investors will rebel.

While his terminology feels somewhat loaded, I thought it natural to interpret all your examples of people breaking rules to get the thing done in mission alignment terms.

Another way to promote mission alignment is some combination of skilful message compression and resistance to proxies, which Eugene Wei wrote about in Compress to impress about Jeff Bezos (monarchy in Ivan's framing above). On the latter, quoting Bezos:

As companies get larger and more complex, there’s a tendency to manage to proxies. This comes in many shapes and sizes, and it’s dangerous, subtle, and very Day 2.

A common example is process as proxy. Good process serves you so you can serve customers. But if you’re not watchful, the process can become the thing. This can happen very easily in large organizations. The process becomes the proxy for the result you want. You stop looking at outcomes and just make sure you’re doing the process right. Gulp. It’s not that rare to hear a junior leader defend a bad outcome with something like, “Well, we followed the process.” A more experienced leader will use it as an opportunity to investigate and improve the process. The process is not the thing. It’s always worth asking, do we own the process or does the process own us? In a Day 2 company, you might find it’s the second.

I wonder how all this is going to look like in a (soonish?) future where most of the consequential decision-making has been handed over to AIs.

shawnghu on Accountability Sinks

Well, a 5 percent rate of being sent to a concentration camp or combat unit isn't exactly negligible, and a further 17 percent were threatened. So maybe it's correct that these effects are "surprisingly mild", but these stats are more justification for the "just following orders" explanation than I'd have imagined from the main text.

rasool on Eulogy to the Obits

It looks like this is a linkpost to:

https://press.asimov.com/articles/obit

yams on jacquesthibs's Shortform

Rather than make things worse as a means of compelling others to make things better, I would rather just make things better.

Brinksmanship and accelerationism (in the Marxist sense) are high variance strategies ill-suited to the stakes of this particular game.

[one way this makes things worse is stimulating additional investment on the frontier; another is attracting public attention to the wrong problem, which will mostly just generate action on solutions to that problem, and not to the problem we care most about. Importantly, the contingent of people-mostly-worried-about-jobs are not yet our allies, and it’s likely their regulatory priorities would not address our concerns, even though I share in some of those concerns.]

samuelshadrach on xpostah's Shortform

Suppose you are trying to figure out a function f(x,y,z | a,b,c) where x, y ,z are all scalar values and a, b, c are all constants.

If you knew a few zeroes of this function, you could figure out good approximations of this function. Let's say you knew

U(x,y, a=0) = x
U(x,y, a=1) = x
U(x,y, a=2) = y
U(x,y, a=3) = y

You could now guess U(x,y) = x if a<1.5, y if a>1.5

You will not be able to get a good approximation if you did not know enough zeroes.

This is a comment about morality. x, y, z are agent's multiple possibly-conflicting values and a, b, c are info about environment of agent. You lack data about how your own mind will react to hypothetical situations you have not faced. At best you can extrapolate from historical data around minds of other people that are different from yours. Bigger and more trustworthy dataset will help solve this.