LessWrong 2.0 Reader

How to Play a Support Role in Research Conversations
johnswentworth · 2021-04-23T20:57:50.075Z · comments (4)
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq (Lblack) · 2024-05-20T17:53:25.985Z · comments (4)
[link] My emotional reaction to the current funding situation
Sam F. Brown (sam-4) · 2022-09-09T22:02:46.301Z · comments (36)
Consider Joining the UK Foundation Model Taskforce
Zvi · 2023-07-10T13:50:05.097Z · comments (12)
Summary of and Thoughts on the Hotz/Yudkowsky Debate
Zvi · 2023-08-16T16:50:02.808Z · comments (47)
A transcript of the TED talk by Eliezer Yudkowsky
Mikhail Samin (mikhail-samin) · 2023-07-12T12:12:34.399Z · comments (13)
Caution when interpreting Deepmind's In-context RL paper
Sam Marks (samuel-marks) · 2022-11-01T02:42:06.766Z · comments (8)
Instrumental convergence is what makes general intelligence possible
tailcalled · 2022-11-11T16:38:14.390Z · comments (11)
Picking Mentors For Research Programmes
Raymond D · 2023-11-10T13:01:14.197Z · comments (8)
Call for research on evaluating alignment (funding + advice available)
Beth Barnes (beth-barnes) · 2021-08-31T23:28:49.121Z · comments (11)
[link] A case for AI alignment being difficult
jessicata (jessica.liu.taylor) · 2023-12-31T19:55:26.130Z · comments (58)
Why comparative advantage does not help horses
Sherrinford · 2024-09-30T22:27:57.450Z · comments (15)
[link] Priorities for the UK Foundation Models Taskforce
Andrea_Miotti (AndreaM) · 2023-07-21T15:23:34.029Z · comments (4)
[link] ActAdd: Steering Language Models without Optimization
technicalities · 2023-09-06T17:21:56.214Z · comments (3)
[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)
In favour of exploring nagging doubts about x-risk
owencb · 2024-06-25T23:52:01.322Z · comments (2)
Betting with Mandatory Post-Mortem
abramdemski · 2020-06-24T20:04:34.177Z · comments (14)
TOMORROW: the largest AI Safety protest ever!
Holly_Elmore · 2023-10-20T18:15:18.276Z · comments (26)
SAE reconstruction errors are (empirically) pathological
wesg (wes-gurnee) · 2024-03-29T16:37:29.608Z · comments (16)
[link] Book review: WEIRDest People
PeterMcCluskey · 2020-11-30T03:33:17.510Z · comments (57)
Language models are nearly AGIs but we don't notice it because we keep shifting the bar
philosophybear · 2022-12-30T05:15:15.625Z · comments (13)
Another Way to Be Okay
Gretta Duleba (gretta-duleba) · 2023-02-19T20:49:31.895Z · comments (15)
[question] ($1000 bounty) How effective are marginal vaccine doses against the covid delta variant?
jacobjacob · 2021-07-22T01:26:26.117Z · answers+comments (73)
Want to predict/explain/control the output of GPT-4? Then learn about the world, not about transformers.
Cleo Nardo (strawberry calm) · 2023-03-16T03:08:52.618Z · comments (26)
Book review: "Feeling Great" by David Burns
Steven Byrnes (steve2152) · 2021-06-09T13:17:59.411Z · comments (12)
Predictions for shard theory mechanistic interpretability results
TurnTrout · 2023-03-01T05:16:48.043Z · comments (10)
Improving on the Karma System
Raelifin · 2021-11-14T18:01:30.049Z · comments (36)
[link] Sam Altman: "Planning for AGI and beyond"
LawrenceC (LawChan) · 2023-02-24T20:28:00.430Z · comments (54)
How likely is deceptive alignment?
evhub · 2022-08-30T19:34:25.997Z · comments (28)
Cultivating And Destroying Agency
hath · 2022-06-30T03:59:27.239Z · comments (11)
Anthropic Observations
Zvi · 2023-07-25T12:50:03.178Z · comments (1)
Retrospective: Lessons from the Failed Alignment Startup AISafety.com
Søren Elverlin (soren-elverlin-1) · 2023-05-12T18:07:20.857Z · comments (9)
History's Biggest Natural Experiment
jimrandomh · 2020-03-24T02:56:30.070Z · comments (7)
[link] Why did we wait so long for the threshing machine?
jasoncrawford · 2021-06-29T19:55:38.883Z · comments (20)
I Don’t Know How To Count That Low
Elizabeth (pktechgirl) · 2021-10-22T22:00:02.708Z · comments (10)
Ukraine Post #12
Zvi · 2022-09-22T14:40:03.753Z · comments (3)
What can the principal-agent literature tell us about AI risk?
apc (alexis-carlier) · 2020-02-08T21:28:09.800Z · comments (29)
[link] Direct effects matter!
Aaron Bergman (aaronb50) · 2021-03-14T04:33:11.493Z · comments (28)
PSA: The community is in Berkeley/Oakland, not "the Bay Area"
maia · 2023-09-11T15:59:47.132Z · comments (7)
I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (23)
[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)
But is it really in Rome? An investigation of the ROME model editing technique
jacquesthibs (jacques-thibodeau) · 2022-12-30T02:40:36.713Z · comments (2)
Bayeswatch 10: Spyware
lsusr · 2021-09-29T07:01:25.529Z · comments (7)
Apply for MATS Winter 2023-24!
utilistrutil · 2023-10-21T02:27:34.350Z · comments (6)
Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)
A mostly critical review of infra-Bayesianism
David Matolcsi (matolcsid) · 2023-02-28T18:37:58.448Z · comments (9)
Deception Chess: Game #1
Zane · 2023-11-03T21:13:55.777Z · comments (21)
Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (2)
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (7)
[link] How to replicate and extend our alignment faking demo
Fabien Roger (Fabien) · 2024-12-19T21:44:13.059Z · comments (5)