LessWrong 2.0 Reader

Limitations on Formal Verification for AI Safety
Andrew Dickson · 2024-08-19T23:03:52.706Z · comments (60)

The Worst Form Of Government (Except For Everything Else We've Tried)
johnswentworth · 2024-03-17T18:11:38.374Z · comments (47)

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
Neel Nanda (neel-nanda-1) · 2024-07-07T17:39:35.064Z · comments (16)

Loving a world you don’t trust
Joe Carlsmith (joekc) · 2024-06-18T19:31:36.581Z · comments (13)

[link] Simple probes can catch sleeper agents
Monte M (montemac) · 2024-04-23T21:10:47.784Z · comments (21)

[link] "AI achieves silver-medal standard solving International Mathematical Olympiad problems"
gjm · 2024-07-25T15:58:57.638Z · comments (38)

A Dozen Ways to Get More Dakka
Davidmanheim · 2024-04-08T04:45:19.427Z · comments (11)

Processor clock speeds are not how fast AIs think
Ege Erdil (ege-erdil) · 2024-01-29T14:39:38.050Z · comments (55)

On saying "Thank you" instead of "I'm Sorry"
Michael Cohn (michael-cohn) · 2024-07-08T03:13:50.663Z · comments (16)

Why I don't believe in the placebo effect
transhumanist_atom_understander · 2024-06-10T02:37:07.776Z · comments (22)

The case for training frontier AIs on Sumerian-only corpus
Alexandre Variengien (alexandre-variengien) · 2024-01-15T16:40:22.011Z · comments (15)

Updatelessness doesn't solve most problems
Martín Soto (martinsq) · 2024-02-08T17:30:11.266Z · comments (44)

Notice When People Are Directionally Correct
Chris_Leong · 2024-01-14T14:12:37.090Z · comments (8)

[link] "Can AI Scaling Continue Through 2030?", Epoch AI (yes)
gwern · 2024-08-24T01:40:32.929Z · comments (4)

My simple AGI investment & insurance strategy
lc · 2024-03-31T02:51:53.479Z · comments (27)

[question] Which things were you surprised to learn are not metaphors?
Eric Neyman (UnexpectedValues) · 2024-11-21T18:56:18.025Z · answers+comments (79)

Near-mode thinking on AI
Olli Järviniemi (jarviniemi) · 2024-08-04T20:47:28.085Z · comments (8)

How I started believing religion might actually matter for rationality and moral philosophy
zhukeepa · 2024-08-23T17:40:47.341Z · comments (41)

Circuits in Superposition: Compressing many small neural networks into one
Lucius Bushnaq (Lblack) · 2024-10-14T13:06:14.596Z · comments (8)

Pantheon Interface
NicholasKees (nick_kees) · 2024-07-08T19:03:51.681Z · comments (22)

A Shutdown Problem Proposal
johnswentworth · 2024-01-21T18:12:48.664Z · comments (61)

An even deeper atheism
Joe Carlsmith (joekc) · 2024-01-11T17:28:31.843Z · comments (47)

[link] OpenAI's CBRN tests seem unclear
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:28:30.290Z · comments (6)

Community Notes by X
NicholasKees (nick_kees) · 2024-03-18T17:13:33.195Z · comments (15)

Things I've Grieved
Raemon · 2024-02-18T19:32:47.169Z · comments (6)

[link] Steering Llama-2 with contrastive activation additions
Nina Panickssery (NinaR) · 2024-01-02T00:47:04.621Z · comments (29)

"The Solomonoff Prior is Malign" is a special case of a simpler argument
David Matolcsi (matolcsid) · 2024-11-17T21:32:34.711Z · comments (44)

BIG-Bench Canary Contamination in GPT-4
Jozdien · 2024-10-22T15:40:48.166Z · comments (13)

Parasites (not a metaphor)
lemonhope (lcmgcd) · 2024-08-08T20:07:13.593Z · comments (17)

Why Don't We Just... Shoggoth+Face+Paraphraser?
Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-19T20:53:52.084Z · comments (51)

[question] What do coherence arguments actually prove about agentic behavior?
sunwillrise (andrei-alexandru-parfeni) · 2024-06-01T09:37:28.451Z · answers+comments (35)

Do you believe in hundred dollar bills lying on the ground? Consider humming
Elizabeth (pktechgirl) · 2024-05-16T00:00:05.257Z · comments (22)

Why I take short timelines seriously
NicholasKees (nick_kees) · 2024-01-28T22:27:21.098Z · comments (29)

[link] Investigating the Chart of the Century: Why is food so expensive?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-16T13:21:23.596Z · comments (26)

What o3 Becomes by 2028
Vladimir_Nesov · 2024-12-22T12:37:20.929Z · comments (13)

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner (ejenner) · 2024-06-04T15:50:47.475Z · comments (14)

Natural Latents: The Math
johnswentworth · 2023-12-27T19:03:01.923Z · comments (37)

[link] The Dangers of Mirrored Life
Niko_McCarty (niko-2) · 2024-12-12T20:58:32.750Z · comments (7)

Awakening
lsusr · 2024-05-30T07:03:00.821Z · comments (79)

A bird's eye view of ARC's research
Jacob_Hilton · 2024-10-23T15:50:06.123Z · comments (12)

AI catastrophes and rogue deployments
Buck · 2024-06-03T17:04:51.206Z · comments (16)

RTFB: On the New Proposed CAIP AI Bill
Zvi · 2024-04-10T18:30:08.410Z · comments (14)

Efficient Dictionary Learning with Switch Sparse Autoencoders
Anish Mudide (anish-mudide) · 2024-07-22T18:45:53.502Z · comments (19)

Passages I Highlighted in The Letters of J.R.R.Tolkien
Ivan Vendrov (ivan-vendrov) · 2024-11-25T01:47:59.071Z · comments (10)

[link] Miles Brundage resigned from OpenAI, and his AGI readiness team was disbanded
garrison · 2024-10-23T23:40:57.180Z · comments (1)

The Standard Analogy
Zack_M_Davis · 2024-06-03T17:15:42.327Z · comments (28)

AI Alignment Metastrategy
Vanessa Kosoy (vanessa-kosoy) · 2023-12-31T12:06:11.433Z · comments (13)

The Dream Machine
sarahconstantin · 2024-12-05T00:00:05.796Z · comments (6)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Lee Sharkey (Lee_Sharkey) · 2024-07-18T14:15:50.248Z · comments (18)

[question] Which skincare products are evidence-based?
Vanessa Kosoy (vanessa-kosoy) · 2024-05-02T15:22:12.597Z · answers+comments (48)


Recent comments

willpetillo on Instrumentality makes agents agenty

Any updates on this view in light of new evidence on "Alignment Faking" (https://www.anthropic.com/research/alignment-faking)? If a simulator's preferences are fully satisfied by outputting the next token, why does it matter whether it can infer its outputs will be used for retraining its values?

Some thoughts on possible explanations:
1. Instrumentality exists on the simulacra level, not the simulator level. This would suggest that corrigibility could be maintained by establishing a corrigible character in context. Not clear on the practical implications.
2. The thesis of this post is wrong; simulators have instrumentality.
3. The Simulator framing does not fully apply to the model involved, for instance because of the presence of a scratchpad.
4+. ???

mako-yass on Acknowledging Background Information with P(Q|I)

Buried somewhere in most contemporary Bayesians' is the Solomonoff prior (the prior that the most likely observations are those that have short generating machine encodings). Do we have a standard symbol for the Solomonoff prior? Claude suggests that $M$ is the most common, but it is more often used as a distribution function. Or perhaps $K$, for Kolmogorov? (I like that one because it can also be thought to stand for "knowledgebase", although really it doesn't represent knowledge; it pretty much represents something prior to knowledge.)
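
For reference, the prior described in the parenthetical is commonly written along the following lines. This is a sketch in textbook-style notation; the symbols $U$, $p$, and $\ell(p)$ are assumptions introduced here and do not come from the comment.

$$M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}$$

Here $U$ is a universal monotone Turing machine, the sum ranges over (minimal) programs $p$ whose output begins with the string $x$, and $\ell(p)$ is the length of $p$ in bits, so observations with short generating programs receive most of the weight.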

tom-davidson on When Is Insurance Worth It?

Yep, I'm saying you're wrong about this. If money compounds but you don't have utility=log($), then you shouldn't Kelly bet.
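
A minimal one-period sketch of this point, under assumptions that are not in the comment: a binary bet won with probability `p` at net odds `b`, and two illustrative alternative utilities (a more risk-averse `-1/w` and a risk-neutral `w`). It grid-searches the stake that maximises expected utility and compares it with the closed-form Kelly fraction.

```python
import math

def kelly_fraction(p, b):
    """Closed-form Kelly stake for a bet won with probability p at net odds b."""
    return p - (1 - p) / b

def expected_utility(f, p, b, utility):
    """Expected utility of wealth after staking fraction f of unit wealth once."""
    return p * utility(1 + f * b) + (1 - p) * utility(1 - f)

def best_fraction(p, b, utility, steps=9999):
    """Grid-search the stake in (0, 1) that maximises expected utility."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(grid, key=lambda f: expected_utility(f, p, b, utility))

# Hypothetical bet: 60% chance to win at even odds.
p, b = 0.6, 1.0

log_opt = best_fraction(p, b, math.log)             # log utility
crra_opt = best_fraction(p, b, lambda w: -1.0 / w)  # more risk-averse than log
linear_opt = best_fraction(p, b, lambda w: w)       # risk-neutral

print("Kelly formula:  ", kelly_fraction(p, b))
print("log utility:    ", log_opt)
print("-1/w utility:   ", crra_opt)
print("linear utility: ", linear_opt)
```

Under these assumptions only the log-utility optimum matches the Kelly fraction: the more risk-averse utility stakes roughly half as much, and the risk-neutral one stakes as much as the grid allows.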

dmitry-vaintrob on Review: Planecrash

Thank you for writing this! This is my favorite thing on this site in a while.

cstinesublime on ChristianKl's Shortform

This was a shocking revelation to me. I only discovered it a few months ago, when I was wondering why one USB-C cable was transferring data between my laptop and an external SSD so much slower than another.
What is astounding, at least in bricks-and-mortar retail, is the price differential between cables of different capabilities. It's sometimes so high as to not even seem like a good deal: "this cable costs three times that one, but only charges 30% faster, and only with the one device I have which is capable of that charging speed".

error on Acknowledging Background Information with P(Q|I)

Memento Errata

I love this phrase. It could practically be a LW motto, or a title for some adjacent project, or something like that. It's even self-referencing -- or at least, Claude tells me it's grammatically incorrect, and that feels appropriate.

mako-yass on If all trade is voluntary, then what is "exploitation?"

I'd just define exploitation to be precisely the opposite of Shapley bargaining: situations where a person is not being compensated in proportion to their bargaining power.

This definition encompasses any situation where a person has grievances and it makes sense for them to complain and take a stand, or where striking could reasonably be expected to lead to a stable bargaining equilibrium with higher net utility (not all strikes fall into this category).

This definition also doesn't fully capture the common sense meaning of exploitation, but I don't think a useful concept can.
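
To make "compensated in proportion to their bargaining power" concrete, here is a small sketch that computes Shapley values by averaging marginal contributions over every join order. The three-player game (an owner and two workers, with made-up output figures) is purely hypothetical and is not taken from the comment.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}

def value(coalition):
    """Hypothetical game: nothing is produced without the owner;
    one worker yields 60, two workers yield 100."""
    if "owner" not in coalition:
        return 0.0
    workers = len(coalition) - 1
    return {0: 0.0, 1: 60.0, 2: 100.0}[workers]

shares = shapley_values(["owner", "w1", "w2"], value)
print(shares)

# Under the definition above, a player is "exploited" when actual pay
# falls short of their Shapley share (the pay figures are also hypothetical).
actual_pay = {"owner": 70.0, "w1": 15.0, "w2": 15.0}
shortfall = {p: shares[p] - actual_pay[p] for p in shares}
print(shortfall)
```

In this toy game the shares sum to the full output of 100, so a positive `shortfall` for one player is always offset by a surplus for another.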

cstinesublime on leogao's Shortform

What kind of changes or outcomes would you expect to see if people around these parts, instead of publishing their work independently, started trying to get it into traditional ML conferences and related publications?

nc-1 on The Field of AI Alignment: A Postmortem, and What To Do About It

I am surprised that you find theoretical physics research less tight funding-wise than AI alignment [is this because the paths to funding in physics are well-worn, rather than better resourced?].

This whole post was a little discouraging. I hope that the research community can find a way forward.

habryka4 on leogao's Shortform

A thing that I often see happening when people talk about "normie-legible status systems" is that they gaslight themselves into believing that some status system that is extraordinarily legible, or that they are part of, is the consensus one.

Academia is the most intense example of this. Most people don't care that much about academic status! This also happens in the other direction. YouTube is a major source of status in much of the world, especially among young people, but is considered low-brow whenever people argue about this, and so people dismiss it.

I also think people tend to commit a kind of fallacy of gray: if a status system is not maximally legible (like writing popular blogposts, running a popular podcast, making popular YouTube videos, or being popular on Twitter), they dismiss it as not real and "illegible".

I think modeling the real status and reputation systems that are present in the world is important, but trying to ascend the academic status hierarchy, for example, is a bad use of time and resources. It's extremely competitive, and not actually that influential outside of the academic bubble. In some fields it is better correlated with actual skills, integrity, and intelligence, so I still think it is a reasonable thing to consider, but I think most people are better placed to trade off a bit of legibility against a whole lot of net realness in status. (Importantly, this does not mean your LW quick takes will be the thing that causes you to become world-renowned. I am not saying "just say smart things and the world will recognize you"; I am saying "don't think that the most legible status systems, or the ones with the most mobs hunting dissenters from the status system, are the only real ways of gaining recognition in the world".)