LessWrong 2.0 Reader


Aligning AI Safety Projects with a Republican Administration
Deric Cheng (deric-cheng) · 2024-11-21T22:12:27.502Z · comments (1)

Experience Report - ML4Good AI Safety Bootcamp
Kieron Kretschmar · 2024-04-11T18:03:41.040Z · comments (0)

[link] My MATS Summer 2023 experience
James Chua (james-chua) · 2024-03-20T11:26:14.944Z · comments (0)

Scorable Functions: A Format for Algorithmic Forecasting
ozziegooen · 2024-05-21T04:14:11.749Z · comments (0)

Wholesome Culture
owencb · 2024-03-01T12:08:17.877Z · comments (3)

[link] GDP per capita in 2050
Hauke Hillebrandt (hauke-hillebrandt) · 2024-05-06T15:14:30.934Z · comments (8)

End-to-end hacking with language models
tchauvin (timot.cool) · 2024-04-05T15:06:53.689Z · comments (0)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

AI #61: Meta Trouble
Zvi · 2024-05-02T18:40:03.242Z · comments (0)

[question] Weighing reputational and moral consequences of leaving Russia or staying
spza · 2024-02-18T19:36:40.676Z · answers+comments (24)

[link] AI forecasting bots incoming
Dan H (dan-hendrycks) · 2024-09-09T19:14:31.050Z · comments (44)

Acknowledging Background Information with P(Q|I)
JenniferRM · 2024-12-24T18:50:25.323Z · comments (8)

Disagreement on AGI Suggests It’s Near
tangerine · 2025-01-07T20:42:43.456Z · comments (15)

Aggregative principles approximate utilitarian principles
Cleo Nardo (strawberry calm) · 2024-06-12T16:27:22.179Z · comments (3)

[link] Our Digital and Biological Children
Eneasz · 2024-10-24T18:36:38.719Z · comments (0)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

[link] Arithmetic Models: Better Than You Think
kqr · 2024-10-26T09:42:07.185Z · comments (4)

Concrete Methods for Heuristic Estimation on Neural Networks
Oliver Daniels (oliver-daniels-koch) · 2024-11-14T05:07:55.240Z · comments (0)

Distinguishing ways AI can be "concentrated"
Matthew Barnett (matthew-barnett) · 2024-10-21T22:21:13.666Z · comments (2)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

Monthly Roundup #19: June 2024
Zvi · 2024-06-25T12:00:03.333Z · comments (9)

Can quantised autoencoders find and interpret circuits in language models?
charlieoneill (kingchucky211) · 2024-03-24T20:05:50.125Z · comments (4)

Childhood and Education Roundup #6: College Edition
Zvi · 2024-06-26T11:40:03.990Z · comments (8)

Option control
Joe Carlsmith (joekc) · 2024-11-04T17:54:03.073Z · comments (0)

Auditing LMs with counterfactual search: a tool for control and ELK
Jacob Pfau (jacob-pfau) · 2024-02-20T00:02:09.575Z · comments (6)

AI #64: Feel the Mundane Utility
Zvi · 2024-05-16T15:20:02.956Z · comments (11)

Winning isn't enough
JesseClifton · 2024-11-05T11:37:39.486Z · comments (14)

Collection (Part 6 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-14T21:37:00.160Z · comments (0)

Evaporation of improvements
Viliam · 2024-06-20T18:34:40.969Z · comments (27)

Cicadas, Anthropic, and the bilateral alignment problem
kromem · 2024-05-22T11:09:56.469Z · comments (6)

Cryonics p(success) estimates are only weakly associated with interest in pursuing cryonics in the LW 2023 Survey
Andy_McKenzie · 2024-02-29T14:47:28.613Z · comments (6)

Trading Candy
jefftk (jkaufman) · 2024-11-01T01:10:08.024Z · comments (4)

Towards Quantitative AI Risk Management
Henry Papadatos (henry) · 2024-10-16T19:26:48.817Z · comments (1)

Two flavors of computational functionalism
EuanMcLean (euanmclean) · 2024-11-25T10:47:04.584Z · comments (9)

[question] Which things were you surprised to learn are metaphors?
Gordon Seidoh Worley (gworley) · 2024-11-22T03:46:02.845Z · answers+comments (18)

Gratitudes: Rational Thanks Giving
Seth Herd · 2024-11-29T03:09:47.410Z · comments (2)

First Solo Bus Ride
jefftk (jkaufman) · 2024-12-03T12:20:02.344Z · comments (1)

Mini Go: Gateway Game
jefftk (jkaufman) · 2025-01-14T03:30:02.020Z · comments (1)

NYC Congestion Pricing: Early Days
Zvi · 2025-01-14T14:00:07.445Z · comments (0)

Corrigibility's Desirability is Timing-Sensitive
RobertM (T3t) · 2024-12-26T22:24:17.435Z · comments (4)

Is AI Alignment Enough?
Aram Panasenco (panasenco) · 2025-01-10T18:57:48.409Z · comments (6)

[link] ML Safety Research Advice - GabeM
Gabe M (gabe-mukobi) · 2024-07-23T01:45:42.288Z · comments (2)

An explanation of evil in an organized world
KatjaGrace · 2024-05-02T05:20:06.240Z · comments (9)

An Affordable CO2 Monitor
Pretentious Penguin (dylan-mahoney) · 2024-03-21T03:06:53.255Z · comments (1)

[link] Quick Thoughts on Scaling Monosemanticity
[deleted] · 2024-05-23T16:22:48.035Z · comments (1)

Tackling Moloch: How YouCongress Offers a Novel Coordination Mechanism
Hector Perez Arenas (hector-perez-arenas) · 2024-05-15T23:13:48.501Z · comments (9)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

DIY RLHF: A simple implementation for hands on experience
Mike Vaiana (mike-vaiana) · 2024-07-10T12:07:03.047Z · comments (0)

Reading More Each Day: A Simple $35 Tool
aysajan · 2024-07-24T13:54:04.290Z · comments (2)

AI #65: I Spy With My AI
Zvi · 2024-05-23T12:40:02.793Z · comments (7)

Recent comments

elriggs on Logan Riggs's Shortform

A trending YouTube video with 500k views in a day brings up Dario Amodei's Machines of Loving Grace (timestamp for the quote):
[Note: I had Claude help format, but personally verified the text's faithfulness]

I am an AI optimist. I think our world will be better because of AI. One of the best expressions of that I've seen is this blog post by Dario Amodei, who is the CEO of Anthropic, one of the biggest AI companies. I would really recommend reading this - it's one of the more interesting articles and arguments I have read. He's basically saying AI is going to have an incredibly positive impact, but he's also a realist and is like "AI is going to really potentially fuck up our world"

He's notable and more trustworthy because his company Anthropic has put WAY more effort into safety, way way more effort into making sure there are really high standards for safety and that there isn't going to be danger what these AIs are doing. So I really really like Dario and I've listened to a lot of what he's said. Whereas with some other AI leaders like Sam Altman who runs OpenAI, you don't know what the fuck he's thinking. I really like [Dario] - he also has an interesting background in biological work and biotech, so he's not just some tech-bro; he's a bio-tech-bro. But his background is very interesting.

But he's very realistic. There is a lot of bad shit that is going to happen with AI. I'm not denying that at all. It's about how we maximize the positive while reducing the negatives. I really want AI to solve all of our diseases. I would really like AI to fix cancer - I think that will happen in our lifetimes. To me, I'd rather we fight towards that future rather than say 'there will be problems, let's abandon the whole thing.'

Other notes: This is YouTuber/streamer DougDoug (2.8M subscribers), who often streams and posts coding/AI-integrated content. This video was posted on his other channel, DougDougDoug ("DougDoug content that's too rotten for the main channel").

The full video is also an entertaining summary of case law on AI generated art/text copyright.

oliver-sourbut on What Is The Alignment Problem?

Organisms in general typically sense their environment and take different actions across a wide variety of environmental conditions, so as to cause there to be approximate copies of themselves in the future.^[4] That's basic agency.^[5]

I agree with this breakdown, except I start the analysis with moment-to-moment deliberation, and note that having there (continue to) be relevantly similar deliberators is a very widely-applicable intermediate objective, from where we get control ('basic agency') but also delegation and replication.

The way the terms have typically been used historically, the simplest summary would be:
Today's LLMs and image generators are generative models of (certain parts of) the world.
Systems like e.g. o1 are somewhat-general planners/solvers on top of those models. Also, LLMs can be used directly as planners/solvers when suitably prompted or tuned.
To go from a general planner/solver to an agent, one can simply hook the system up to some sensors and actuators (possibly a human user) and specify a nominal goal... assuming the planner/solver is capable enough to figure it out from there.

Yep! But (I think maybe you'd agree) there's a lot of bleed between these abstractions, especially when we get to heavily finetuned models. For example...

Applying all that to typical usage of LLMs (including o1-style models): an LLM isn't the kind of thing which is aligned or unaligned, in general. If we specify how the LLM is connected to the environment (e.g. via some specific sensors and actuators, or via a human user), then we can talk about both (a) how aligned to human values is the nominal objective given to the LLM^[8], and (b) how aligned to the nominal objective is the LLM's actual effects on its environment. Alignment properties depend heavily on how the LLM is wired up to the environment, so different usage or different scaffolding will yield different alignment properties.

Yes and no? I'd say that the LLM-plus agent's objectives are some function of

- incompletely-specified objectives provided by operators
- priors and biases from training/development
  - pretraining
  - finetuning
- scaffolding/reasoning structure (including any multi-context/multi-persona interactions, internal ratings, reflection, refinement, ...)
  - or these things developed implicitly through structured CoT
- drift of various kinds

and I'd emphasise that the way that these influences interact is currently very poorly characterised. But plausibly the priors and biases from training could have nontrivial influence across a wide variety of scenarios (especially combined with incompletely-specified natural-language objectives), at which point it's sensible to ask 'how aligned' the LLM is. I appreciate you're talking in generalities, but I think in practice this case might take up a reasonable chunk of the space! For what it's worth, the perspective of LLMs as pre-agent building blocks and conditioned-LLMs as closer to agents is underrepresented, and I appreciate you conceptually distinguishing those things here.

alleged-wisdom on Passages I Highlighted in The Letters of J.R.R.Tolkien

These quotes show how anti-progress and reactionary Tolkien was. He hated machines, he hated housing construction, he hated innovation. He would condemn humanity to be tenant farmers ruled by a warrior aristocracy at a medieval tech level, forever. If you want to live in Tolkien's utopia, move to Zambia.

Basically, Tolkien is very much like the Unabomber. He saw real problems, but his proposed solutions are destructive. He was a master of using the bouba–kiki effect to incept his worldview in the minds of millions, so he did far more to stop progress than the Unabomber ever did. He bears significant responsibility for the productivity slowdown, and for your rent being too damn high.

kabir-kumar on Kabir Kumar's Shortform

Thinking about judgement criteria for the coming AI safety evals hackathon (https://lu.ma/xjkxqcya)
These are the things that need to be judged:
1. Is the benchmark actually measuring alignment (the real, at-scale problem, the one where we die if we don't get it fully right)?
2. Is the way of deceiving the benchmark to get high scores actually deception, or have they somehow done alignment?

Both of these things need:
- a strong deep learning & ML background (ideally, multiple influential papers where they're one of the main authors/co-authors, or they are doing AI research at a significant lab, or have done so in the last 4 years)
- a good understanding of what the real alignment problem actually means - can judge this by looking at their papers, activity on LessWrong, the Alignment Forum, their blog, etc.
- a good understanding of evals/benchmarks (1 great or 2 pretty good papers/repos/works on this, ideally for alignment)

Do these seem loose? Strict? Off base?

wassname on Implications of the inference scaling paradigm for AI safety

To illustrate Gwern's idea, here is an image from Jones 2021 that shows some of these self-play training curves:

There may be a sense that they've 'broken out', and have finally crossed the last threshold of criticality

And so OAI employees may internally see that they are on the steady upward slope

ete on Six Small Cohabitive Games

Nice!

(I wrote the bit about not having to tell people your favourite suit or what cards you have leaves things open for some sharp or clever negotiation, but looking back I think it's mostly a trap. I haven't seen anyone get things to go better for them by hiding the suit.)

To add some layer of this strategy: Giving each person one specific card on their suit that they want with much higher strength might be fun, as the other players can ransom that card if they know (but might be happy trading it anyway). Also having the four suits each having a different multiplier might be fun?

wassname on Implications of the inference scaling paradigm for AI safety

Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit.

There may be a sense that they've 'broken out', and have finally crossed the last threshold of criticality, from merely cutting-edge AI work which everyone else will replicate in a few years, to takeoff

Surely other labs will replicate this too? Even the open source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.

This means that outsiders may never see the intermediate models

Doubly so if outsiders will just distil your model's behaviour and bootstrap from your elevated starting point.

Inference-time search is a stimulant drug that juices your score immediately, but asymptotes hard. Quickly, you have to use a smarter model to improve the search itself, instead of doing more.

It's worth pointing out that inference-time search seems to become harder as the verifier becomes less reliable, which means that the scaling curves we see for math and code might get much worse in other domains.

"we find that this is extremely sensitive to the quality of the verifier. If the verifier is slightly imperfect, in many realistic settings of a coding task, performance maxes out and actually starts to decrease after about 10 attempts." - Inference Scaling fLaws

Jones 2021 also says something similar: "The error in the prediction decays exponentially as more boards are used"
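
A toy calculation can make the verifier point concrete. This is only a sketch under assumed numbers, not the setup used in Inference Scaling fLaws: attempts are independent, the verifier always accepts a correct answer and accepts a wrong one with some false-positive rate, a returned wrong answer costs as much as a right one gains, and the task mix is one easy and one hard problem. All names and values (`expected_utility`, `best_budget`, `SOLVE_PROBS`, the 5% false-positive rate) are hypothetical.

```python
def expected_utility(n, solve_probs, false_positive_rate, wrong_cost=1.0):
    """Average utility of resampling until the verifier accepts, with at most n attempts.

    Toy assumptions (not taken from the paper): attempts are independent, the
    verifier always accepts a correct answer, it accepts a wrong answer with
    probability false_positive_rate, and returning a wrong answer costs wrong_cost.
    """
    total = 0.0
    for p in solve_probs:
        accept_correct = p                              # correct and accepted
        accept_wrong = (1.0 - p) * false_positive_rate  # wrong but accepted
        accept_any = accept_correct + accept_wrong
        if accept_any == 0.0:
            continue  # nothing is ever accepted, so this problem contributes 0
        # Probability that some attempt within the budget is accepted.
        p_answer = 1.0 - (1.0 - accept_any) ** n
        p_correct = p_answer * accept_correct / accept_any
        p_wrong = p_answer * accept_wrong / accept_any
        total += p_correct - wrong_cost * p_wrong
    return total / len(solve_probs)


# Hypothetical task mix: one easy problem (50% per-attempt success), one hard (1%).
SOLVE_PROBS = [0.5, 0.01]


def best_budget(false_positive_rate, max_n=50):
    """Attempt budget in 1..max_n with the highest expected utility."""
    return max(
        range(1, max_n + 1),
        key=lambda n: expected_utility(n, SOLVE_PROBS, false_positive_rate),
    )


if __name__ == "__main__":
    for fpr in (0.0, 0.05):
        n_star = best_budget(fpr)
        print(f"false-positive rate {fpr}: best budget {n_star}, "
              f"utility {expected_utility(n_star, SOLVE_PROBS, fpr):.3f}")
```

In this toy model a perfect verifier rewards every extra attempt, while the imperfect one has a finite best budget: on the hard problem most accepted answers are false positives, so extra attempts mainly add wrong answers once the easy problem is already solved.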

But maybe the counterpoint is just: GPUs go brrrr.

wassname on Implications of the inference scaling paradigm for AI safety

Gwern and Daniel Kokotajlo have pretty notable track records at predicting AI scaling too, and they have comments in this thread.

rogerdearnaley on What Is The Alignment Problem?

If we had evolved in an environment in which the only requirement on physical laws/rules was that they are Turing computable (and thus that they didn't have a lot of symmetries or conservation laws or natural abstractions), then in general the only way to make predictions is to do roughly as much computation as your environment is doing. This generally requires your brain to be roughly equal in computational capacity, and thus similar in size, to the entire rest of its environment (including its body). This is not an environment in which the initial evolution of life is viable (nor, indeed, any form of reproduction). So, to slightly abuse the anthropic principle, we don't need to worry about it.

alexander-gietelink-oldenziel on Shortform

See also geometric rationality.