LessWrong 2.0 Reader

Math-to-English Cheat Sheet
nahoj · 2024-04-08T09:19:40.814Z · comments (5)
[link] On the Role of Proto-Languages
adamShimi · 2024-09-22T16:50:34.720Z · comments (1)
[link] OpenAI releases GPT-4o, natively interfacing with text, voice and vision
Martín Soto (martinsq) · 2024-05-13T18:50:52.337Z · comments (23)
[question] What Have Been Your Most Valuable Casual Conversations At Conferences?
johnswentworth · 2024-12-25T05:49:36.711Z · answers+comments (21)
[link] Come to Manifest 2024 (June 7-9 in Berkeley)
Saul Munn (saul-munn) · 2024-03-27T21:30:17.306Z · comments (2)
A D&D.Sci Dodecalogue
abstractapplic · 2024-04-12T01:10:01.625Z · comments (0)
Ten Modes of Culture War Discourse
jchan · 2024-01-31T13:58:20.572Z · comments (15)
On Anthropic’s Sleeper Agents Paper
Zvi · 2024-01-17T16:10:05.145Z · comments (5)
Provably Safe AI: Worldview and Projects
Ben Goldhaber (bgold) · 2024-08-09T23:21:02.763Z · comments (43)
New, improved multiple-choice TruthfulQA
Owain_Evans · 2025-01-15T23:32:09.202Z · comments (0)
On “first critical tries” in AI alignment
Joe Carlsmith (joekc) · 2024-06-05T00:19:02.814Z · comments (8)
How might we solve the alignment problem? (Part 1: Intro, summary, ontology)
Joe Carlsmith (joekc) · 2024-10-28T21:57:12.063Z · comments (5)
Monthly Roundup #17: April 2024
Zvi · 2024-04-15T12:10:03.126Z · comments (4)
[Closed] PIBBSS is hiring in a variety of roles (alignment research and incubation program)
Nora_Ammann · 2024-04-09T08:12:59.241Z · comments (0)
How to Give in to Threats (without incentivizing them)
Mikhail Samin (mikhail-samin) · 2024-09-12T15:55:50.384Z · comments (26)
Safe Stasis Fallacy
Davidmanheim · 2024-02-05T10:54:44.061Z · comments (2)
Thiel on AI & Racing with China
Ben Pace (Benito) · 2024-08-20T03:19:18.966Z · comments (10)
[question] Can we get an AI to "do our alignment homework for us"?
Chris_Leong · 2024-02-26T07:56:22.320Z · answers+comments (33)
Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.
Andrew_Critch · 2024-09-11T04:41:24.872Z · comments (11)
Calendar feature geometry in GPT-2 layer 8 residual stream SAEs
Patrick Leask (patrickleask) · 2024-08-17T01:16:53.764Z · comments (0)
Per protocol analysis as medical malpractice
braces · 2024-01-31T16:22:21.367Z · comments (8)
Luck Based Medicine: No Good Very Bad Winter Cured My Hypothyroidism
Elizabeth (pktechgirl) · 2024-12-08T20:10:02.651Z · comments (3)
Fat Tails Discourage Compromise
niplav · 2024-06-17T09:39:16.489Z · comments (5)
We are headed into an extreme compute overhang
devrandom · 2024-04-26T21:38:21.694Z · comments (34)
Book Review: Righteous Victims - A History of the Zionist-Arab Conflict
Yair Halberstadt (yair-halberstadt) · 2024-06-24T11:02:03.490Z · comments (8)
Causal Graphs of GPT-2-Small's Residual Stream
David Udell · 2024-07-09T22:06:55.775Z · comments (7)
The OODA Loop -- Observe, Orient, Decide, Act
Davis_Kingsley · 2025-01-01T08:00:27.979Z · comments (2)
[link] Breaking Circuit Breakers
mikes · 2024-07-14T18:57:20.251Z · comments (13)
Be More Katja
Nathan Young · 2024-03-11T21:12:14.249Z · comments (0)
[link] LLMs seem (relatively) safe
JustisMills · 2024-04-25T22:13:06.221Z · comments (24)
[link] S-Risks: Fates Worse Than Extinction
aggliu · 2024-05-04T15:30:36.666Z · comments (2)
AI #71: Farewell to Chevron
Zvi · 2024-07-04T13:40:05.905Z · comments (9)
AI #50: The Most Dangerous Thing
Zvi · 2024-02-08T14:30:13.168Z · comments (4)
AI Safety as a YC Startup
Lukas Petersson (lukas-petersson-1) · 2025-01-08T10:46:29.042Z · comments (9)
AI #76: Six Short Stories About OpenAI
Zvi · 2024-08-08T13:50:04.659Z · comments (10)
The Shutdown Problem: Incomplete Preferences as a Solution
EJT (ElliottThornley) · 2024-02-23T16:01:16.378Z · comments (28)
Complexity of value but not disvalue implies more focus on s-risk. Moral uncertainty and preference utilitarianism also do.
Chi Nguyen · 2024-02-23T06:10:05.881Z · comments (18)
Two LessWrong speed friending experiments
mikko (morrel) · 2024-06-15T10:52:26.081Z · comments (3)
Correct my H5N1 research
Elizabeth (pktechgirl) · 2024-12-09T19:07:03.277Z · comments (25)
Implications of the inference scaling paradigm for AI safety
Ryan Kidd (ryankidd44) · 2025-01-14T02:14:53.562Z · comments (23)
The case for stopping AI safety research
catubc (cat-1) · 2024-05-23T15:55:18.713Z · comments (38)
Parental Writing Selection Bias
jefftk (jkaufman) · 2024-10-13T14:00:03.225Z · comments (3)
Was Releasing Claude-3 Net-Negative?
Logan Riggs (elriggs) · 2024-03-27T17:41:56.245Z · comments (5)
A Conflicted Linkspost
Screwtape · 2024-11-21T00:37:54.035Z · comments (0)
BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann (Stuckwork) · 2024-07-20T02:20:51.848Z · comments (0)
So You Created a Sociopath - New Book Announcement!
Garrett Baker (D0TheMath) · 2024-04-01T18:02:18.010Z · comments (3)
[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (16)
Estimates of GPU or equivalent resources of large AI players for 2024/5
CharlesD · 2024-11-28T23:01:58.522Z · comments (7)
[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (10)