LessWrong 2.0 Reader

[link] On the Role of Proto-Languages
adamShimi · 2024-09-22T16:50:34.720Z · comments (1)

Dating Roundup #2: If At First You Don’t Succeed
Zvi · 2024-01-02T16:00:04.955Z · comments (29)

[link] Questions are usually too cheap
Nathan Young · 2024-05-11T13:00:54.302Z · comments (19)

Monthly Roundup #17: April 2024
Zvi · 2024-04-15T12:10:03.126Z · comments (4)

Ten Modes of Culture War Discourse
jchan · 2024-01-31T13:58:20.572Z · comments (15)

On “first critical tries” in AI alignment
Joe Carlsmith (joekc) · 2024-06-05T00:19:02.814Z · comments (8)

AI #44: Copyright Confrontation
Zvi · 2023-12-28T14:30:10.237Z · comments (13)

On Anthropic’s Sleeper Agents Paper
Zvi · 2024-01-17T16:10:05.145Z · comments (5)

[Closed] PIBBSS is hiring in a variety of roles (alignment research and incubation program)
Nora_Ammann · 2024-04-09T08:12:59.241Z · comments (0)

Thiel on AI & Racing with China
Ben Pace (Benito) · 2024-08-20T03:19:18.966Z · comments (10)

Towards a formalization of the agent structure problem
Alex_Altair · 2024-04-29T20:28:15.190Z · comments (5)

[link] Unlocking Solutions—By Understanding Coordination Problems
James Stephen Brown (james-brown) · 2024-07-27T04:52:13.435Z · comments (4)

Math-to-English Cheat Sheet
nahoj · 2024-04-08T09:19:40.814Z · comments (5)

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
leogao · 2023-12-16T05:39:10.558Z · comments (5)

[link] Theories of Change for AI Auditing
Lee Sharkey (Lee_Sharkey) · 2023-11-13T19:33:43.928Z · comments (0)

Safe Stasis Fallacy
Davidmanheim · 2024-02-05T10:54:44.061Z · comments (2)

[link] Come to Manifest 2024 (June 7-9 in Berkeley)
Saul Munn (saul-munn) · 2024-03-27T21:30:17.306Z · comments (2)

Complexity of value but not disvalue implies more focus on s-risk. Moral uncertainty and preference utilitarianism also do.
Chi Nguyen · 2024-02-23T06:10:05.881Z · comments (18)

We are headed into an extreme compute overhang
devrandom · 2024-04-26T21:38:21.694Z · comments (33)

[link] Open Phil releases RFPs on LLM Benchmarks and Forecasting
LawrenceC (LawChan) · 2023-11-11T03:01:09.526Z · comments (0)

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs
Patrick Leask (patrickleask) · 2024-08-17T01:16:53.764Z · comments (0)

Fat Tails Discourage Compromise
niplav · 2024-06-17T09:39:16.489Z · comments (5)

Trading off Lives
jefftk (jkaufman) · 2024-01-03T03:40:05.603Z · comments (12)

AI #40: A Vision from Vitalik
Zvi · 2023-11-30T17:30:08.350Z · comments (12)

[link] LLMs seem (relatively) safe
JustisMills · 2024-04-25T22:13:06.221Z · comments (24)

AI #76: Six Shorts Stories About OpenAI
Zvi · 2024-08-08T13:50:04.659Z · comments (10)

[question] Can we get an AI to "do our alignment homework for us"?
Chris_Leong · 2024-02-26T07:56:22.320Z · answers+comments (33)

AI #71: Farewell to Chevron
Zvi · 2024-07-04T13:40:05.905Z · comments (9)

AMA: Earning to Give
jefftk (jkaufman) · 2023-11-07T16:20:10.972Z · comments (8)

Per protocol analysis as medical malpractice
braces · 2024-01-31T16:22:21.367Z · comments (8)

AI #37: Moving Too Fast
Zvi · 2023-11-09T17:50:04.324Z · comments (5)

[link] Breaking Circuit Breakers
mikes · 2024-07-14T18:57:20.251Z · comments (13)

[link] S-Risks: Fates Worse Than Extinction
aggliu · 2024-05-04T15:30:36.666Z · comments (2)

2022 (and All Time) Posts by Pingback Count
Raemon · 2023-12-16T21:17:00.572Z · comments (14)

Be More Katja
Nathan Young · 2024-03-11T21:12:14.249Z · comments (0)

AI #50: The Most Dangerous Thing
Zvi · 2024-02-08T14:30:13.168Z · comments (4)

Acting Wholesomely
owencb · 2024-02-26T21:49:16.526Z · comments (64)

Zvi's Manifold Markets House Rules
Zvi · 2023-11-13T00:28:02.147Z · comments (6)

Causal Graphs of GPT-2-Small's Residual Stream
David Udell · 2024-07-09T22:06:55.775Z · comments (7)

[link] OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns
Seth Herd · 2023-11-20T14:20:33.539Z · comments (28)

Anthropical Paradoxes are Paradoxes of Probability Theory
Ape in the coat · 2023-12-06T08:16:26.846Z · comments (18)

BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann (Stuckwork) · 2024-07-20T02:20:51.848Z · comments (0)

Announcing the Double Crux Bot
sanyer (santeri-koivula) · 2024-01-09T18:54:15.361Z · comments (8)

[link] Slightly More Than You Wanted To Know: Pregnancy Length Effects
JustisMills · 2024-10-21T01:26:02.030Z · comments (4)

AI #87: Staying in Character
Zvi · 2024-10-29T07:10:08.212Z · comments (3)

Reflections on my first year of AI safety research
Jay Bailey · 2024-01-08T07:49:08.147Z · comments (3)

Can we build a better Public Doublecrux?
Raemon · 2024-05-11T19:21:53.326Z · comments (6)

AI #45: To Be Determined
Zvi · 2024-01-04T15:00:05.936Z · comments (4)

The case for stopping AI safety research
catubc (cat-1) · 2024-05-23T15:55:18.713Z · comments (38)

Recent comments

notfnofn on Both-Sidesism—When Fair & Balanced Goes Wrong

Okay Trump is president now. Hoping that things go well regardless.

The trans issue is a perennial touchstone that has been used as the consistent example of radical left woke-ness for years, and throughout this campaign.

The reason for this is the same reason Democrats kept harping on abortion: it's something the majority of the country agreed with them on, and so they could use it to their advantage. It's not necessarily that it was the worst issue, but the easiest target.

I'm a bit concerned that you referred to cancel culture as "accountability culture", but I don't want to get distracted by that here. What I was hoping to get into is that people on the right have done it too when they had power, will probably do so now, and the key difference is that their leader will likely explicitly support it (instead of just being silent about it).

But the point is about motivations: one side is motivated by sympathy for a population with tens of thousands of people being killed over a year, and millions being displaced and having their homes destroyed; the other is motivated by white supremacy. Do you not see this as a false equivalence?

Yes, but a more careful "apples to apples" analysis is necessary here. Are you comparing the opinions of US politicians on the left with US politicians on the right? Or are you comparing them with randos on the right? If you want to convince people on the right instead of just generating applause on the left, you probably need to do this more carefully.

I'm in the same position. It's at these points where I defer to experts, which is what I have advised in the post.

How seriously have you investigated the claim that "Harris's plan is based on what many top economists think is best" and not "Economists find Harris' plan overall better than Trump's, despite its many weaknesses"?

tyler-tracy on Anthropic: Three Sketches of ASL-4 Safety Case Components

I only read through the beginning and their control safety case. Some thoughts:

The honeypot case they described is similar to the upfront auditing protocol, but theirs is more complicated. I don't think you need to generate honeypots to evaluate a model. You can use the first X% of scenarios at deployment time and audit them with humans. Both protocols are beaten by the same red team strategy of "attack with a very low probability," so they shouldn't be used naively.
I don't like the term "monitoring game" because I think control evals will have much more than monitoring in the future.
I think elicitation of the model will eventually matter when the models are widely better than humans at most tasks, but it doesn't matter that much now. Your control protocols should be able to prevent the best adversary you can simulate from breaking them. In the beginning, this will be humans with access to models, so anywhere the model lacks performance, the human should make up for it. For example, in [P2.6] Red-Blue Game, they state that the red team needs to produce a sleeper agent that defects. Assuming that a sleeper agent is a single model, I think this is too complicated for the red team. The red team should be able to produce a scaffolded model that uses logic written in Python. In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn't done in practice since it is expensive and slow.
I don't think one needs to generate hyper-realistic-looking scenarios for monitoring, and I'm confused by their reasoning for believing one does. I guess it depends on the fact that they need to elicit the best possible capabilities from the model during the eval, but I don't think this is necessary for the reasons I stated above.
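
To make the "attack with a very low probability" point concrete, here is a minimal sketch of upfront auditing under strong simplifying assumptions that are not taken from the post: the first X% of tasks are audited by humans, any attack on an audited task is always caught, any attack on an unaudited task always succeeds, and the red team attacks each task independently with the same probability p. The function names and the grid search are illustrative only.

```python
def red_win_probability(p, n_tasks, audit_fraction):
    """Chance the red team wins against upfront auditing (toy model).

    Assumptions (hypothetical, for illustration): attacks are independent
    per task with probability p, audited attacks are always caught, and
    unaudited attacks always succeed.
    """
    n_audited = round(n_tasks * audit_fraction)
    n_unaudited = n_tasks - n_audited
    # Red must stay quiet on every audited task...
    no_attack_while_audited = (1.0 - p) ** n_audited
    # ...and then attack at least once on the unaudited remainder.
    some_attack_afterwards = 1.0 - (1.0 - p) ** n_unaudited
    return no_attack_while_audited * some_attack_afterwards


def best_attack_rate(n_tasks, audit_fraction, steps=400):
    """Coarse log-spaced grid search for red's best attack probability."""
    candidates = [10 ** (-6 + 6 * i / steps) for i in range(steps + 1)]
    return max(
        candidates,
        key=lambda p: red_win_probability(p, n_tasks, audit_fraction),
    )


if __name__ == "__main__":
    n, frac = 10_000, 0.02
    for p in (1.0, 0.01, best_attack_rate(n, frac)):
        win = red_win_probability(p, n, frac)
        print(f"attack rate {p:.6f} -> red wins with probability {win:.3f}")
```

In this toy model, a red team that always attacks is caught immediately, while one that attacks rarely enough usually slips past the audited prefix and still attacks later. With 10,000 tasks and a 2% audit budget, the best attack rate comes out to a few in ten thousand and red wins about nine times in ten.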

General Thoughts:
I think this is pretty good and shows knowledge of the core points of control. I think Anthropic's control protocols are easily beaten, and there are much better things to do. They made it seem like control is much harder to implement than I think it will be. I don't think a perfect Elicitation Strategy or Simulation Procedure is needed for control to work in the settings laid out. Once we have much scarier AIs and deploy them in more crazy environments, we will likely need better techniques, but their setting seems to be before this.

douglas_knight on Is the Power Grid Sustainable?

Even in this last comment you keep making that very distinction. The regulator dictates the price but you assert that you know what the monopoly spends.

If you just want to assert that the current set of regulations is unsustainable, then I agree. But not a single one of the comments reflects a belief that this is the topic, not even any of your comments.

daniel-kokotajlo on Daniel Kokotajlo's Shortform

The expensiveness of demolishing tunnels scales with the density of the tunnel network. (Unless the blast effects of underground explosives are generally stronger than I expect; I haven't done calculations). For sufficiently dense tunnel networks, demolishing enough of them would actually be quite expensive. E.g. if there are 1000 tunnels that you need to demolish per 1km of frontline, the quantity of explosive needed to do that would probably be greater than the quantity you'd need to make a gigantic minefield on the surface. (Minefields can be penetrated... but also, demolished tunnels can be re-dug.)

towards_keeperhood on An alternative approach to superbabies

The post is not joking. (But thanks for feedback that you were confused in this way!)

I basically didn't know much about orcas before I learned 3 days ago that they have 2.05 times as many neurons in the neocortex as humans, and then I spent yesterday and the day before looking into how smart orcas might be and evaluating the evidence. So I'm far from an expert, but I also didn't come across strong evidence that they are dumber than hunter-gatherer humans, and I found some weak-to-medium-strength evidence that they might be at least as smart as 15-year-olds. But it's still possible that orca researchers have observed orcas not finding some strategies they would've thought of, or something.

But yeah the only piece of evidence that they might be significantly smarter than humans is their brains. I consider it reasonably strong though.

daniel-kokotajlo on Anthropic: Three Sketches of ASL-4 Safety Case Components

Some comments as I read:

I am pleased to see the sabotage threat models given prominence. I think this is the right sort of threat to focus on, by contrast with e.g. autonomous replication or directly causing harm to humans.

The ability for the model to, zero-shot, figure out how to hack its own neural activations without such a signal would require capabilities well beyond what we consider plausible. [Note: this claim will require justification in an actual safety case, but we don’t have a clear picture of what this justification would look like.]

DC evals presumably? Try to train models to hack their own neural activations, see if they can quickly learn to do it? (Or even, if they can do it without training, just prompting?) This can be done today, right? Get the golden gate bridge activation vector thingy, and then tell Claude "Write me a travel guide to San Francisco but without activating the golden gate bridge activation vector, or, well, activating it as little as possible." See how well it does. Do scaling analysis to see if smarter bigger models are getting better at doing this. Etc.

As a validation of our entire methodology, we have run red-blue games where the red team tries to produce a misaligned model organism (e.g. a sleeper agent) that can successfully defect in a specific context while avoiding detection by the aforementioned tests.

Nice! Very excited to see this in there.

Overall I'm very happy to see this blog post go up on Anthropic's website. I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at this level of rigor and detail, my p(doom) would cut in half. (I currently grimly expect that safety cases at this level of detail won't be constructed until AGI is basically already being trained, and it'll be done in a hurry and it won't be published, much less published with enough time for the scientific community to engage with it and for it to be updated in response to feedback. And it could be even worse -- I wouldn't be surprised if the actual safety cases for the first systems that ~completely automate AI R&D are significantly less rigorous than these.)

benito on Scissors Statements for President?

Minor mod note: I left this post on personal blog, as we generally avoid frontpaging content related to and during the US election for sanity protection. To be clear, this post is pretty abstracted from any election details, so I'd normally frontpage it, but I'm erring on the side of leaving on personal while the election is so close (literally today).

lukehmiles on An alternative approach to superbabies

No good science without some good fun.

tailcalled on Scissors Statements for President?

And I guess I should say, I have a more sun-oriented and less competition-oriented view. A surplus (e.g. in energy from the sun or negentropy from the night) has a natural "shape" (e.g. trees or solar panels) that the surplus dissipates into. There is some flexibility in this shape that leaves room for choice, but a lot less than rationalists usually assume.

tailcalled on Scissors Statements for President?

Kind of. First, the big exception: If you manage to enforce global authoritarianism, you can stockpile surplus indefinitely, basically tiling the world with charged-up batteries. But what's the point of that?

Secondly, "waste/signaling cascade" is kind of in the eye of the beholder. If a forest is standing in some region, is it wasting sunlight that could've been used on farming? Even in a very literal sense, you could say the answer is yes since the trees are competing in a zero-sum game for height. But without that competition, you wouldn't have "trees" at all, so calling it a waste is a value judgement that trees are worthless. (Which of course you are entitled to make, but this is clearly a disagreement with the people who like solarpunk.)

But yeah, ultimately I'm kind of thinking of life as entropy maximization. The surplus has to be used for something, the question is what. If you've got nothing to use it for, then it makes sense for you to withdraw, but then it's not clear why to worry that other people are fighting over it.