LessWrong 2.0 Reader


When fine-tuning fails to elicit GPT-3.5's chess abilities
Theodore Chapman · 2024-06-14T18:50:52.855Z · comments (3)

Upgrading the AI Safety Community
trevor (TrevorWiesinger) · 2023-12-16T15:34:26.600Z · comments (9)

Estimating efficiency improvements in LLM pre-training
Daan · 2024-01-19T19:32:45.124Z · comments (3)

Unit economics of LLM APIs
dschwarz · 2024-08-27T16:51:22.692Z · comments (0)

[link] you should probably eat oatmeal sometimes
bhauth · 2024-08-25T14:50:37.570Z · comments (32)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues
aphyer · 2024-06-07T19:02:06.859Z · comments (16)

Navigating emotions in an uncertain & confusing world
Akash (akash-wasil) · 2023-11-20T18:16:09.492Z · comments (1)

Protocol evaluations: good analogies vs control
Fabien Roger (Fabien) · 2024-02-19T18:00:09.794Z · comments (10)

Take SCIFs, it’s dangerous to go alone
latterframe · 2024-05-01T08:02:38.067Z · comments (1)

[link] Rowing vs steering
Saul Munn (saul-munn) · 2024-08-10T07:00:17.594Z · comments (2)

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders
Evan Anders (evan-anders) · 2024-02-27T02:43:22.446Z · comments (16)

Case studies on social-welfare-based standards in various industries
HoldenKarnofsky · 2024-06-20T13:33:44.780Z · comments (0)

Taking responsibility and partial derivatives
Ruby · 2023-12-31T04:33:51.419Z · comments (1)

[link] Surgery Works Well Without The FDA
Maxwell Tabarrok (maxwell-tabarrok) · 2024-01-26T13:31:29.968Z · comments (28)

[question] What rationality failure modes are there?
Ulisse Mini (ulisse-mini) · 2024-01-19T09:12:57.924Z · answers+comments (11)

[link] Jailbreak steering generalization
Sarah Ball · 2024-06-20T17:25:24.110Z · comments (4)

GPT-4o My and Google I/O Day
Zvi · 2024-05-16T17:50:03.040Z · comments (2)

What makes teaching math special
Viliam · 2023-12-17T14:15:01.136Z · comments (27)

[link] Things I learned talking to the new breed of scientific institution
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-29T14:00:14.844Z · comments (6)

Matrix completion prize results
paulfchristiano · 2023-12-20T15:40:04.281Z · comments (0)

[link] An Interactive Shapley Value Explainer
James Stephen Brown (james-brown) · 2024-09-28T05:01:21.169Z · comments (9)

Notes on Dwarkesh Patel’s Podcast with Sholto Douglas and Trenton Bricken
Zvi · 2024-04-01T19:10:12.193Z · comments (1)

D&D Sci Coliseum: Arena of Data
aphyer · 2024-10-18T22:02:54.305Z · comments (23)

[question] What did you change your mind about in the last year?
mike_hawke · 2023-11-23T20:53:45.664Z · answers+comments (16)

Reflections on the Metastrategies Workshop
gw · 2024-10-24T18:30:46.255Z · comments (5)

[link] IAPS: Mapping Technical Safety Research at AI Companies
Zach Stein-Perlman · 2024-10-24T20:30:41.159Z · comments (12)

[link] Beyond the Board: Exploring AI Robustness Through Go
AdamGleave · 2024-06-19T16:40:06.594Z · comments (2)

One-shot strategy games?
Raemon · 2024-03-11T00:19:20.480Z · comments (42)

Goals selected from learned knowledge: an alternative to RL alignment
Seth Herd · 2024-01-15T21:52:06.170Z · comments (17)

A Teacher vs. Everyone Else
ronak69 · 2024-03-21T17:45:35.714Z · comments (8)

AI Risk and the US Presidential Candidates
Zane · 2024-01-06T20:18:04.945Z · comments (22)

Estimating effective dimensionality of MNIST models
Arjun Panickssery (arjun-panickssery) · 2023-11-02T14:13:09.012Z · comments (3)

(Approximately) Deterministic Natural Latents
johnswentworth · 2024-07-19T23:02:12.306Z · comments (0)

The Pointer Resolution Problem
Jozdien · 2024-02-16T21:25:57.374Z · comments (20)

Concrete positive visions for a future without AGI
Max H (Maxc) · 2023-11-08T03:12:42.590Z · comments (28)

How to partition teams to move fast? Debating "low-dimensional cuts"
jacobjacob · 2023-10-13T21:43:53.067Z · comments (2)

Pivotal Acts might Not be what You Think they are
Johannes C. Mayer (johannes-c-mayer) · 2023-11-05T17:23:50.464Z · comments (13)

The Perils of Professionalism
Screwtape · 2023-11-07T00:07:33.213Z · comments (1)

On plans for a functional society
kave · 2023-12-12T00:07:46.629Z · comments (8)

Surviving Seveneves
Yair Halberstadt (yair-halberstadt) · 2024-06-19T13:11:55.414Z · comments (4)

[link] energy landscapes of experts
bhauth · 2023-10-02T14:08:32.370Z · comments (2)

[link] Point of Failure: Semiconductor-Grade Quartz
Annapurna (jorge-velez) · 2024-09-30T15:57:40.495Z · comments (8)

[link] [Paper] Programming Refusal with Conditional Activation Steering
Bruce W. Lee (bruce-lee) · 2024-09-11T20:57:08.714Z · comments (0)

In memory of Louise Glück
Joe Carlsmith (joekc) · 2023-10-15T02:59:42.687Z · comments (1)

Book review: The Quincunx
cousin_it · 2024-06-05T21:13:55.055Z · comments (12)

Superintelligent AI is possible in the 2020s
HunterJay · 2024-08-13T06:03:26.990Z · comments (3)

[link] What's new at FAR AI
AdamGleave · 2023-12-04T21:18:03.951Z · comments (0)

[link] Podcast with Yoshua Bengio on Why AI Labs are “Playing Dice with Humanity’s Future”
garrison · 2024-05-10T17:23:20.436Z · comments (0)

[link] What's important in "AI for epistemics"?
Lukas Finnveden (Lanrian) · 2024-08-24T01:27:06.771Z · comments (0)

How ARENA course material gets made
CallumMcDougall (TheMcDouglas) · 2024-07-02T18:04:00.209Z · comments (2)


Recent comments

green_leaf on Quantum Immortality: A Perspective if AI Doomers are Probably Right

I don't know about similarity... but I was just making a point that QI doesn't require it.

cbiddulph on 5 ways to improve CoT faithfulness

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."

One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.

The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.
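A minimal sketch of the adversary's success condition described above, assuming toy stand-ins for both models: `toy_cr` (a hypothetical canonical rephraser that only normalizes case and whitespace) and `toy_oracle` (a hypothetical equivalence oracle that ignores a trailing '.' or '!'). Neither is a real trained model; in practice the oracle would be an LLM or a fine-tuned classifier, and the names here are illustrative only.

```python
from typing import Callable

Rephraser = Callable[[str], str]
Oracle = Callable[[str, str], bool]


def adversary_wins(s: str, s_prime: str, cr: Rephraser, same_meaning: Oracle) -> bool:
    """True if (s, s_prime) exposes a flaw in the canonical rephraser.

    Case 1: same meaning, but CR(s) != CR(s').
    Case 2: different meaning, but CR(s) == CR(s').
    Both cases reduce to: oracle verdict disagrees with CR equality.
    """
    same_canonical = cr(s) == cr(s_prime)
    return same_meaning(s, s_prime) != same_canonical


def toy_cr(s: str) -> str:
    # Hypothetical weak rephraser: normalizes case and whitespace only.
    return " ".join(s.lower().split())


def toy_oracle(a: str, b: str) -> bool:
    # Hypothetical oracle: equal up to case, whitespace, and trailing '.'/'!'.
    def norm(t: str) -> str:
        return " ".join(t.lower().split()).rstrip(".!")
    return norm(a) == norm(b)


if __name__ == "__main__":
    # The '.' vs '!' swap is a successful attack on the weak rephraser.
    print(adversary_wins("It works.", "It works!", toy_cr, toy_oracle))
    # A pure whitespace/case change is already handled, so no attack.
    print(adversary_wins("It works.", "It   WORKS.", toy_cr, toy_oracle))
```

Pairs on which `adversary_wins` returns true are the adversarial examples the canonical rephraser would then be trained against; the training loop itself is not shown here.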

benito on Sabotage Evaluations for Frontier Models

You made lots of points, so I wrote a comment for each... probably this was too many replies? I didn't know what else to do that didn't feel like avoiding your points. I hereby state that I do not expect you to respond to all five of my comments!

philip_b on Internal music player: phenomenology of earworms

Why do you hate earworms? To me, they are mildly pleasant. The only time I wish I didn’t have an earworm is when I’m trying to remember another tune for musicianship purposes and the earworm prevents me from doing that.

benito on Sabotage Evaluations for Frontier Models

Another example would be Anthropic creating a dedicated team for stress testing their alignment proposals. And as far as I can see, this team is led by someone who has been actively engaged with the topic of AI safety on LessWrong, someone who you sort of praised a few days ago.

I don't quite know what the point here is. This is marginally good stuff, it doesn't seem sufficient to me or close to it, I expect us all to probably die, and again: from the CEO and cofounders there has been no serious engagement or justification for the plan for them to personally make billions of dollars building potentially omnicidal machines.

benito on Sabotage Evaluations for Frontier Models

I think the point about them not engaging with critics is also a bit too harsh. Here is the DeepMind alignment team's response to concerns raised by Yudkowsky. I am not saying that their response is flawless or even correct, but it is a response nonetheless. They are engaging with this work. DeepMind's alignment team also seemed to engage with concerns raised by critics in their (relatively) recent work.

I don't disagree that it is good of the DeepMind alignment team to engage with arguments on LessWrong. I don't know that a few researchers at an org engaging with these arguments is meeting the basic standard here. The first post explicitly says it doesn't represent the leadership, and my sense is that the leadership have avoided talking about the subject, and that the people involved do not have the political power to push for the leadership to engage in open debate.

That said, I do concede the point that DeepMind has generally been more cautious than OpenAI and Anthropic, and did not create the race to build potentially omnicidal machines (in that they were first – it was OpenAI and Anthropic who added major competitors).

benito on Sabotage Evaluations for Frontier Models

I don't think that money alone would've convinced CEOs of big companies to run this enterprise. Altman and Amodei both have families. If they don't care about their own families, then they at least care about themselves. After all, we are talking about scenarios where these guys would die the same deaths as the rest of us. No amount of hoarded money would save them. They would have little motivation to do any of this if they believed that they would die as the result of their own actions. And that's not mentioning all of the other researchers working at their labs. Just Anthropic and OpenAI together have almost 2000 employees. Do they all not care about their own and their families' well-being?

I'm not quite sure how to explain that I think a mass of people can do something that they each know on some level is the wrong thing and will hurt them later, but I believe it is common. I think partly it is a mistake to think of a mass of people as having the sum of the agency of all the people involved, or even the maximum.

I think it is easier than you do to simply not think about far away dangers that one can say one is not really responsible for. Does every trader involved in the '08 financial crisis take personal responsibility for it? Does every voter for a politician who turns out to ultimately be corrupt take personal responsibility for it? Do all the tens of thousands of people involved in various genocides take personal responsibility for stopping it as soon as they see it coming? I think it is very easy for people to erect a cartesian boundary between themselves and the levers of power. People are often aware that they are doing the wrong thing. I broke my diet two days ago and regret it, and on some level I knew I'd end up regretting it. And that was a situation I had complete agency over. The more indirectness, the more things are in far-mode, the less people take action on it or feel like they can do anything based on it today.

I agree it is not money alone. These people get to work alongside some of the most innovative and competent people of our age, connect with extremely prestigious journalists and institutions, be invited to halls of power in senior parts of government, and build systems mankind has never seen. All further incentive to find a good rationalization (rather than to stay home and not do that).

benito on Sabotage Evaluations for Frontier Models

I do get the point that you are making, but I think this is a little bit unfair to these organizations. Articles like Machines of Loving Grace, The Intelligence Age and Planning for AGI and Beyond are implicit public justifications for building AGI.

I don't believe that either of the two linked pieces is a justification for building potentially omnicidal AGI.

The former explicitly avoids talking about the risks and states no plan for navigating them. As I've said before, I believe the generator of that essay is attempting to build a narrative in society that leads people to support the author's company, not attempting to engage seriously with critics of him building potentially omnicidal machines, nor attempting to explain anything about how to navigate that risk.

The latter meets the low standard of mentioning the word 'existential' but mostly seems to hope that we can choose to have a smooth takeoff, rather than admitting that (a) there is no known theory of how novel capabilities will arrive with new architectures & data & compute, and (b) the company is essentially running as fast as it can. I mostly feel like it acknowledges reasons for concern and then says that it believes in itself, not entirely dissimilar to how a politician makes sure to state the wishes of their various constituents, before going on to do whatever they want.

There are no commitments. There are no promises. There is no argument that this can work. There is only an articulation of what they're going to do, the risks, and a belief that they are good enough to pull through.

Such responses are unserious.

saidachmiz on The Case For Giving To The Shrimp Welfare Project

I’m just going to link the comment I wrote the last time you mentioned that Rethink Priorities report. That report continues to be of very little use in supporting such arguments as you present here.

elityre on Using Dangerous AI, But Safely?

Ok. So I haven't thought through these proposals in much detail, and I don't claim any confident take, but my first response is "holy fuck, that's a lot of complexity. It really seems like there will be some flaw in our control scheme that we don't notice, if we're stacking a bunch of clever ideas like this one on top of each other."

This is not at all to be taken as a disparagement of the authors. I salute them for their contribution. We should definitely explore ideas like these, and test them, and use the best ideas we have at AGI time.

But my intuitive first order response is "fuck."