LessWrong 2.0 Reader

Announcing New Beginner-friendly Book on AI Safety and Risk
Darren McKee · 2023-11-25T15:57:08.078Z · comments (2)

On the Debate Between Jezos and Leahy
Zvi · 2024-02-06T14:40:05.487Z · comments (6)

A to Z of things
KatjaGrace · 2023-11-17T05:20:03.134Z · comments (6)

[link] DeepMind: Frontier Safety Framework
Zach Stein-Perlman · 2024-05-17T17:30:02.504Z · comments (0)

Book Review: On the Edge: The Fundamentals
Zvi · 2024-09-23T13:40:11.058Z · comments (3)

Complex systems research as a field (and its relevance to AI Alignment)
Nora_Ammann · 2023-12-01T22:10:25.801Z · comments (11)

On the Gladstone Report
Zvi · 2024-03-20T19:50:05.186Z · comments (11)

A gentle introduction to mechanistic anomaly detection
Erik Jenner (ejenner) · 2024-04-03T23:06:16.778Z · comments (2)

Superposition is not "just" neuron polysemanticity
LawrenceC (LawChan) · 2024-04-26T23:22:06.066Z · comments (4)

How to Control an LLM's Behavior (why my P(DOOM) went down)
RogerDearnaley (roger-d-1) · 2023-11-28T19:56:49.679Z · comments (30)

[link] Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan (SenR) · 2024-04-25T18:43:47.003Z · comments (38)

Another argument against maximizer-centric alignment paradigms
Fiora from Rosebloom · 2024-09-22T07:28:27.856Z · comments (39)

Self-Awareness: Taxonomy and eval suite proposal
Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-17T01:47:01.802Z · comments (2)

[link] A primer on why computational predictive toxicology is hard
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-19T17:16:37.735Z · comments (2)

[question] Is cybercrime really costing trillions per year?
Fabien Roger (Fabien) · 2024-09-27T08:44:07.621Z · answers+comments (28)

All About Concave and Convex Agents
mako yass (MakoYass) · 2024-03-24T21:37:17.922Z · comments (23)

What mistakes has the AI safety movement made?
EuanMcLean (euanmclean) · 2024-05-23T11:19:02.717Z · comments (29)

Bayesian updating in real life is mostly about understanding your hypotheses
Max H (Maxc) · 2024-01-01T00:10:30.978Z · comments (4)

AiPhone
Zvi · 2024-06-12T22:20:02.141Z · comments (4)

On Llama-3 and Dwarkesh Patel’s Podcast with Zuckerberg
Zvi · 2024-04-22T13:10:02.645Z · comments (4)

[link] Moving on from community living
Vika · 2024-04-17T17:02:11.357Z · comments (7)

[Intuitive self-models] 4. Trance
Steven Byrnes (steve2152) · 2024-10-08T13:30:41.446Z · comments (7)

Generalization, from thermodynamics to statistical physics
Jesse Hoogland (jhoogland) · 2023-11-30T21:28:50.089Z · comments (9)

Against most, but not all, AI risk analogies
Matthew Barnett (matthew-barnett) · 2024-01-14T03:36:16.267Z · comments (41)

[link] AI, centralization, and the One Ring
owencb · 2024-09-13T14:00:16.126Z · comments (11)

AI research assistants competition 2024Q3: Tie between Elicit and You.com
Elizabeth (pktechgirl) · 2024-10-12T15:10:05.417Z · comments (2)

A framework for thinking about AI power-seeking
Joe Carlsmith (joekc) · 2024-07-24T22:41:01.685Z · comments (15)

Do not delete your misaligned AGI.
mako yass (MakoYass) · 2024-03-24T21:37:07.724Z · comments (13)

What is a Tool?
johnswentworth · 2024-06-25T23:40:07.483Z · comments (4)

[link] Superforecasting the Origins of the Covid-19 Pandemic
DanielFilan · 2024-03-12T19:01:15.914Z · comments (0)

Never Drop A Ball
Screwtape · 2023-11-23T04:15:35.834Z · comments (1)

AI Craftsmanship
abramdemski · 2024-11-11T22:17:01.112Z · comments (7)

[link] Slightly More Than You Wanted To Know: Pregnancy Length Effects
JustisMills · 2024-10-21T01:26:02.030Z · comments (4)

[link] Twitter thread on AI safety evals
Richard_Ngo (ricraz) · 2024-07-31T00:18:14.076Z · comments (3)

E.T. Jaynes Probability Theory: The logic of Science I
Jan Christian Refsgaard (jan-christian-refsgaard) · 2023-12-27T23:47:52.579Z · comments (20)

On coincidences and Bayesian reasoning, as applied to the origins of COVID-19
viking_math · 2024-02-19T01:14:06.772Z · comments (28)

[link] Ice: The Penultimate Frontier
Roko · 2024-07-13T23:44:56.827Z · comments (56)

[link] Outrage Bonding
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:46:59.818Z · comments (12)

AI #55: Keep Clauding Along
Zvi · 2024-03-14T15:40:09.335Z · comments (16)

[link] Electrostatic Airships?
DaemonicSigil · 2024-10-27T04:32:34.852Z · comments (13)

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane (ckkissane) · 2024-11-07T05:22:18.807Z · comments (4)

[link] Pay-on-results personal growth: first success
Chipmonk · 2024-09-14T03:39:12.975Z · comments (5)

Catastrophic Goodhart in RL with KL penalty
Thomas Kwa (thomas-kwa) · 2024-05-15T00:58:20.763Z · comments (10)

Don't sleep on Coordination Takeoffs
trevor (TrevorWiesinger) · 2024-01-27T19:55:26.831Z · comments (24)

Black Box Biology
GeneSmith · 2023-11-29T02:27:29.794Z · comments (30)

RTFB: California’s AB 3211
Zvi · 2024-07-30T13:10:03.853Z · comments (2)

Book Review: On the Edge: The Future
Zvi · 2024-09-27T14:00:05.279Z · comments (1)

A civilization ran by amateurs
Olli Järviniemi (jarviniemi) · 2024-05-30T17:57:32.601Z · comments (7)

Balancing Games
jefftk (jkaufman) · 2024-02-24T14:40:04.237Z · comments (18)

[link] electric turbofans
bhauth · 2024-11-02T22:50:59.807Z · comments (2)

Recent comments

johnswentworth on johnswentworth's Shortform

FYI, my update from this comment was:

Hmm, seems like a decent argument...
... except he said "we don't know that it doesn't work", which is an extremely strong update that it will clearly not work.

abramdemski on o1 is a bad idea

I think the crux is that I think the important part of LLMs re safety isn't their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs have.

[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]

But what lesson do you think you can generalize, and why do you think you can generalize that?

I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.

So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.

Similarly, playing text-based video games, with the sparse feedback given for winning.

Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.

Etc.

You think these sorts of things just won't work well enough to be relevant?

christiankl on The Online Sports Gambling Experiment Has Failed

Most people don't have very fixed ideas about how much a 28% overall increase in bankruptcy happens to be.

If you asked most people without a libertarian outlook to rank different factors that lead to an increase in bankruptcy, I would not expect them to be able to compare them accurately or to find that online sports gambling alone would have such a strong influence.

otto-barten on Proposing the Conditional AI Safety Treaty (linkpost TIME)

I'm aware and I don't disagree. However, in xrisk, many (not all) of those who are most worried are also most bullish about capabilities. Conversely, many (not all) who are not worried are unimpressed with capabilities. Being aware of the concept of AGI, that it may be coming soon, and of how impactful it could be, is in practice often a first step towards becoming concerned about the risks, too. This is not true for everyone, unfortunately. Still, I would say that at least for our chances to get an international treaty passed, it is perhaps hopeful that the power of AGI is on the radar of leading politicians (although this may also increase risk through other paths).

evhub on Sabotage Evaluations for Frontier Models

The usual plan for control as I understand it is that you use control techniques to ensure the safety of models that are sufficiently good at doing alignment research themselves that you can then leverage your controlled human-ish-level models to help you align future superhuman models.

christiankl on Lao Mein's Shortform

I would add that convincing Musk to take action against Altman is the highest ROI thing I can think of in terms of decreasing AI extinction risk.

I would expect the issue isn't about convincing Musk to take action but about finding effective actions that Musk could take.

satron on Sabotage Evaluations for Frontier Models

I do get the point that you are making, but I think this is a little bit unfair to these organizations. Articles like Machines of Loving Grace, The Intelligence Age and Planning for AGI and Beyond are implicit public justifications for building AGI.

These labs have also released their plans on "safe development". I expect a big part of what they say to be the usual business marketing, but it's not like they are completely ignoring the issue. In fact, taking one example, Anthropic's research papers on safety are often discussed on this site as genuine improvements on this or that niche of AI Safety.

I don't think that money alone would've convinced CEOs of big companies to run this enterprise. Altman and Amodei both have families. If they don't care about their own families, then they at least care about themselves. After all, we are talking about scenarios where these guys would die the same deaths as the rest of us. No amount of hoarded money would save them. They would have little motivation to do any of this if they believed that they would die as a result of their own actions. And that's not mentioning all of the other researchers working at their labs. Just Anthropic and OpenAI together have almost 2000 employees. Do they all not care about their own and their families' well-being?

I think the point about them not engaging with critics is also a bit too harsh. Here is DeepMind's alignment team's response to concerns raised by Yudkowsky. I am not saying that their response is flawless or even correct, but it is a response nonetheless. They are engaging with this work. DeepMind's alignment team also seemed to engage with concerns raised by critics in their (relatively) recent work.

EDIT: Another example would be Anthropic creating a dedicated team for stress testing their alignment proposals. And as far as I can see, this team is led by someone who has been actively engaged with the topic of AI safety on LessWrong, someone who you sort of praised a few days ago.

benito on Lao Mein's Shortform

It's not clear to me that there was actually an option to build a $100B company with competent people around the world who would've been united in conditionally shutting down and unconditionally pushing for regulation. I don't know that the culture and concepts of people who do a lot of this work in the business world would allow for such a plan to be actively worked on.

romeostevensit on A Theory of Equilibrium in the Offense-Defense Balance

This elides the original argument by assuming the conclusion: that countermanding efforts remain cheap relative to the innovations. But the whole point is that significant shifts in costs associated with defense of a certain level can change behaviors and which plans and supply chains are economically defensible a lot.

casey-b on Making a conservative case for alignment

While you nod to 'politics is the mind-killer', I don't think the right lesson is being taken away, or perhaps just not with enough emphasis.

Whether one is an accelerationist, Pauser, or an advocate of some nuanced middle path, the prospects/goals of everyone are harmed if the discourse-landscape becomes politicized/polarized. All possible movement becomes more difficult.

"Well we of course don't want that to happen, but X ppl are in power, so it makes sense to ask how X ppl tend to think and cater our arguments to them"

If your argument is taking advantage of features of {group of ppl X} qua X, then that is almost unavoidably going to run counter to some Y qua Y, (either as a direct consequence of the arguments and/or because Nuance cannot survive public exposure) and if it isn't, then why couldn't the argument have been made completely apolitically to begin with?

I just continue to think that any mention, literally at all, of ideology or party is courting discourse-disaster for all, again no matter what specific policy one is advocating for. Do we all remember what happened with COVID masks? Or what is currently happening with discourse surrounding Elon? Nuance just does not survive public exposure, and nobody is going to fix that in the few years we have left (and this is a public document). The best way forward continues to be apolitical good arguments. Yes, those arguments are going to be sent towards those who are in power at any given time, but you can do that without routing through ideology.

To touch, even in passing reference, ideology/alliance (ex: the c word included in the title of this post) is to risk the poison/mindkill spreading in a way that is basically irreversible, because to correct it (other than comments like this just calling to Stop Referencing Ideology) usually involves Referencing An Ideology. Like a bug stuck in a glue trap, it places yet another limb into the glue in a vain attempt to push itself free.