LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

[link] Safety tax functions
owencb · 2024-10-20T14:08:38.099Z · comments (0)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

[question] Any real toeholds for making practical decisions regarding AI safety?
lukehmiles (lcmgcd) · 2024-09-29T12:03:08.084Z · answers+comments (6)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (11)

[link] ML Safety Research Advice - GabeM
Gabe M (gabe-mukobi) · 2024-07-23T01:45:42.288Z · comments (2)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (62)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (7)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (8)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (1)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (8)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

[LDSL#2] Latent variable models, network models, and linear diffusion of sparse lognormals
tailcalled · 2024-08-09T19:57:56.122Z · comments (2)

AI Safety University Organizing: Early Takeaways from Thirteen Groups
agucova · 2024-10-02T15:14:00.137Z · comments (0)

GPT-3.5 judges can supervise GPT-4o debaters in capability asymmetric debates
Charlie George (charlie-george) · 2024-08-27T20:44:08.683Z · comments (7)

[link] The Offense-Defense Balance of Gene Drives
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-27T16:47:25.976Z · comments (1)

Rashomon - A newsbetting site
ideasthete · 2024-10-15T18:15:02.476Z · comments (8)

[link] Foundations - Why Britain has stagnated [crosspost]
Nathan Young · 2024-09-23T10:43:20.411Z · comments (1)

Would you benefit from, or object to, a page with LW users' reacts?
Raemon · 2024-08-20T16:35:47.568Z · comments (6)

The Garden of Eden
Alexander Turok · 2024-07-22T16:07:42.509Z · comments (2)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

tsvibt on The Hidden Complexity of Wishes

Here's an argument that alignment is difficult which uses complexity of value as a subpoint:

A1. If you try to manually specify what you want, you fail.
A2. Therefore, you want something algorithmically complex.
B1. When humanity makes an AGI, the AGI will have gotten values via some process; that process induces some probability distribution over what values the AGI ends up with.
B2. We don't understand how to affect the values-distribution toward something specific.
B3. If we don't affect the value-distribution toward something specific, then the values-distribution probably puts large penalties for absolute algorithmic complexity; any specific utility function with higher absolute algorithmic complexity will be less likely to be the one that the AGI ends up with.
C1. Because of A2 (our values are algorithmically complex) and B3 (a complex utility function is unlikely to show up in an AGI without us skillfully intervening), an AGI is unlikely to have our values without us skillfully intervening.
C2. Because of B2 (we don't know how to skillfully intervene on an AGI's values) and C1, an AGI is unlikely to have our values.

I think that you think that the argument under discussion is something like:

(same) A1. If you try to manually specify what you want, you fail.
(same) A2. Therefore, you want something algorithmically complex.
(same) B1. When humanity makes an AGI, the AGI will have gotten values via some process; that process induces some probability distribution over what values the AGI ends up with.
B2. We'll affect the values-distribution, somehow.
B3. The work or difficultly involved in B2 is greater, the greater the complexity of our values.
C1. Because of A2 (our values are algorithmically complex) and B3 (a complex utility function requires more work to put in an AGI), it would take a lot of work to make an AGI pursue our values.

But these are different arguments, which make use of the complexity of values in different ways.

rokosbasilisk-1 on Exploring the Platonic Representation Hypothesis Beyond In-Distribution Data

Thanks for the feedback! working on refining the writeup.

zac-hatfield-dodds on A Narrow Path: a plan to deal with AI extinction risk

(2) ✅ ... The first is from Chief of Staff at Anthropic.

The byline of that piece is "Avital Balwit lives in San Francisco and works as Chief of Staff to the CEO at Anthropic. This piece was written entirely in her personal capacity and does not reflect the views of Anthropic."

I do not think this is an appropriate citation for the claim. In any case, They publicly state that it is not a matter of “if” such artificial superintelligence might exist, but “when” simply seems to be untrue; both cited sources are peppered with phrases like 'possibility', 'I expect', 'could arrive', and so on.

leogao on leogao's Shortform

aiming directly for achieving some goal is not always the most effective way of achieving that goal.

leogao on Alexander Gietelink Oldenziel's Shortform

there is an obvious utilitarian reason of not getting sick

khafra on If far-UV is so great, why isn't it everywhere?

I'd be interested to know what the numbers on UV in ductwork look like over the past 5 years. When I had to get a new A/C system installed in 2020, they asked whether I wanted a UVC light installed in the air handler. I had, before then, been using a 70w UVC corn light I bought on Amazon to sterilize the exterior of groceries (back when we thought fomites might be a major transmission vector), and in improvised ductwork with fans and cardboard boxes taped together.
Getting a proper bulb--an optimal wavelength source--seemed like a big upgrade. Hard to come up with quantitative efficacy numbers, but we did have a friend over for the day, who turned out to have been in the early stages of covid, without getting infected. Our first infection was years later, at a music event.

jacob_drori on Exploring the Platonic Representation Hypothesis Beyond In-Distribution Data

This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I'm sure the answers could be pieced together from the notebook, but most people won't click through and read the code.

joey-kl on Alexander Gietelink Oldenziel's Shortform

More reasons: people wear sunglasses when they’re doing fun things outdoors like going to the beach or vacationing so it’s associated with that, and also sometimes just hiding part of a picture can cause your brain to fill it in with a more attractive completion than is likely.

ryan_greenblatt on Sabotage Evaluations for Frontier Models

My guess would be that they're not saying that well-designed control evaluations become untrustworthy

It's a bit messy because we have some ability to check whether we should be able to evaluate things.

So, there are really three relevant "failure" states for well done control:

We can't find countermeasures such that our control evaluations indicate any real safety.
We no longer think that control evaluations work to evaluate safety as models greatly exceed our planning capabilites and/or can sandbag sufficiently well.
We think control evaluations can work well, but we're wrong and they actually don't.

I think (1) or (2) will likely happen prior to (3) if you do a good job.

We discuss this more here [LW · GW].

(Really, the actual system is more complex because we can vary how conservative control evaluations are and it is easier to be confident in the results of more conservative evaluations. E.g., we can be conservative about our AIs planning ability. So, there is really a somewhat more continuous tradeoff than the above list makes it seem.)

lukas-finnveden on Sabotage Evaluations for Frontier Models

There's at least two different senses in which "control" can "fail" for a powerful system:

Control evaluations can indicate that there's no way to deploy the system such that you both (i) get a lot of use out of it, and (ii) can get a low probability of catastrophe.
Control evaluations are undermined such that humans think that the model can be deployed safely, but actually the humans were misled and there's a high probability of catastrophe.

My impression is that Ryan & Buck typically talks about the first case. (E.g. in the link above.) I.e.: My guess would be that they're not saying that well-designed control evaluations become untrustworthy — just that they'll stop promising you safety.

But to be clear: In this question, you're asking about something more analogous to the second case, right? (Sabotage/sandbagging evaluations being misleading about models' actual capabilities at sabotage & sandbagging?)

My question posed in other words: Would you count "evaluations clearly say that models can sabotage & sandbag" as success or failure?