LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Deconfusing In-Context Learning
Arjun Panickssery (arjun-panickssery) · 2024-02-25T09:48:17.690Z · comments (1)

[link] The Hippie Rabbit Hole -Nuggets of Gold in Rivers of Bullshit
Jonathan Moregård (JonathanMoregard) · 2024-01-05T18:27:01.769Z · comments (20)

Enhancing intelligence by banging your head on the wall
Bezzi · 2023-12-12T21:00:48.584Z · comments (26)

My best guess at the important tricks for training 1L SAEs
Arthur Conmy (arthur-conmy) · 2023-12-21T01:59:06.208Z · comments (4)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

What is wisdom?
TsviBT · 2023-11-14T02:13:49.681Z · comments (3)

[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)

[link] How To Socialize With Psycho(logist)s
Sable · 2023-10-20T11:33:46.066Z · comments (11)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-01-02T18:15:54.168Z · comments (0)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

[link] Dark Skies Book Review
PeterMcCluskey · 2023-12-29T18:28:59.352Z · comments (3)

Review Report of Davidson on Takeoff Speeds (2023)
Trent Kannegieter · 2023-12-22T18:48:55.983Z · comments (11)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)

Striking Implications for Learning Theory, Interpretability — and Safety?
RogerDearnaley (roger-d-1) · 2024-01-05T08:46:58.915Z · comments (4)

Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs
Burny · 2023-11-23T03:16:09.358Z · comments (25)

[question] Is a random box of gas predictable after 20 seconds?
Thomas Kwa (thomas-kwa) · 2024-01-24T23:00:53.184Z · answers+comments (35)

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI
WillPetillo · 2023-12-04T22:58:40.005Z · comments (0)

[link] ∀: a story
Richard_Ngo (ricraz) · 2023-12-17T22:42:32.857Z · comments (1)

Your LLM Judge may be biased
Henry Papadatos (henry) · 2024-03-29T16:39:22.534Z · comments (9)

Medical Roundup #2
Zvi · 2024-04-09T13:40:05.908Z · comments (18)

[link] I didn't have to avoid you; I was just insecure
Chipmonk · 2024-08-17T16:41:50.237Z · comments (7)

Super-Exponential versus Exponential Growth in Compute Price-Performance
moridinamael · 2023-10-06T16:23:56.714Z · comments (25)

[link] [Fiction] A Confession
Arjun Panickssery (arjun-panickssery) · 2024-04-18T16:28:48.194Z · comments (2)

On DeepMind’s Frontier Safety Framework
Zvi · 2024-06-18T13:30:21.154Z · comments (4)

The Defence production act and AI policy
[deleted] · 2024-03-01T14:26:09.064Z · comments (0)

[link] Dall-E 3
p.b. · 2023-10-02T20:33:18.294Z · comments (9)

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition
cmathw · 2024-04-08T11:14:43.268Z · comments (4)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

UDT1.01: The Story So Far (1/10)
Diffractor · 2024-03-27T23:22:35.170Z · comments (6)

[link] A High Decoupling Failure
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-14T19:46:09.552Z · comments (5)

[link] Twitter thread on AI takeover scenarios
Richard_Ngo (ricraz) · 2024-07-31T00:24:33.866Z · comments (0)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

[link] WSJ: Inside Amazon’s Secret Operation to Gather Intel on Rivals
trevor (TrevorWiesinger) · 2024-04-23T21:33:08.049Z · comments (5)

AI #66: Oh to Be Less Online
Zvi · 2024-05-30T14:20:03.334Z · comments (6)

[question] Is there software to practice reading expressions?
lsusr · 2024-04-23T21:53:00.679Z · answers+comments (10)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

Turning Your Back On Traffic
jefftk (jkaufman) · 2024-07-17T01:00:08.627Z · comments (7)

AI #49: Bioweapon Testing Begins
Zvi · 2024-02-01T15:30:04.690Z · comments (11)

Principles For Product Liability (With Application To AI)
johnswentworth · 2023-12-10T21:27:41.403Z · comments (55)

Thousands of malicious actors on the future of AI misuse
Zershaaneh Qureshi (zershaaneh-qureshi) · 2024-04-01T10:08:42.357Z · comments (0)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (38)

AI companies' commitments
Zach Stein-Perlman · 2024-05-29T11:00:31.339Z · comments (0)

Closeness To the Issue (Part 5 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-09T00:36:47.388Z · comments (0)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

sharmake-farah on Noosphere89's Shortform

One potential answer to how we might break the circularity is the AI control agenda that works in a specific useful capability range, but fail if we assume arbitrarily/infinitely capable AIs.

This might already be enough to do so given somewhat favorable assumptions.

But there is a point here in that absent AI control strategies, we do need a baseline of alignment in general.

Thankfully, I believe this is likely to be the case by default.

See Seth Herd's comment below for a perspective:

https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1?commentId=cakcEJu389j7Epgqt [LW(p) · GW(p)]

seth-herd on If we solve alignment, do we die anyway?

My pleasure. Evan Hubinger made this point to me when I'd misunderstood his scalable oversight proposal.

Thanks again for engaging with my work!

dakara on If we solve alignment, do we die anyway?

"If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns."

Ah, that's the link that I was missing. Now it makes sense. You can use AGI as a reviewer for other AGIs, once it is better than humans at reviewing AGIs. Thank you a lot for clarifying!

seth-herd on If we solve alignment, do we die anyway?

Thanks for reading, and responding! It's very helpful to know where my arguments cease being convincing or understandable.

I fully agree that just having AI do the work of solving alignment is not a good or convincing plan. You need to know that AI is aligned to trust it.

Perhaps the missing piece is that I think alignment is already solved for LLM agents. They don't work well, but they are quite eager to follow instructions. Adding more alignment methods as they improve makes good odds that our first capable/dangerous agents are also aligned. I listed some of the obvious and easy techniques we'll probably use in Internal independent review for language model agent alignment [AF · GW]. I'm not happy with the clarity of that post, though, so I'm currently working on two followups that might be clearer.

Or perhaps the missing link is going from aligned AI systems to aligned "Real AGI" [LW · GW]. I do think there's a discontinuity in alignment once a system starts to learn continuously and reflect on its beliefs (which change how its values/goals are interpreted). However, I think the techniques most likely to be used are probably adequate to make those systems aligned - IF that alignment is for following instructions, and the humans wisely instruct it to be honest about ways its alignment could fail.

So that's how I get to the first aligned AGI at roughly human level or below.

From there it seems easier, although still possible to fail.

If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.

I usually think about the progression from AGI to superintelligence as one system/entity learning, being improved, and self-improving. But there's a good chance that progression will look more generational, with several distinct systems/entities as successors with greater intelligence, designed by the previous system and/or humans. Those discontinuities seem to present more danger of getting alignment wrong

algon on Making a conservative case for alignment

If I squint, I can see where they're coming from. People often say that wars are foolish, and both sides would be better off if they didn't fight. And this is standardly called "naive" by those engaging in realpolitik. Sadly, for any particular war, there's a significant chance they're right. Even aside from human stupidity, game theory is not so kind as to allow for peace unending. But the China-America AI race is not like that. The Chinese don't want to race. They've shown no interest in being part of a race. It's just American hawks on a loud, Quixotic quest masking the silence.

If I were to continue the story, it'd show Simplicio asking Galactico not to play Chicken and Galacitco replying "race? What race?". Then Sophistico crashes into Galactico and Simplicio. Everyone dies, The End.

hmys on Reducing x-risk might be actively harmful

Seems unlikely to me. I mean, I think, in large part due to factory farming, that the current immediate existence of humanity, and also its history, are net negatives. The reason I'm not a full blown antinatalist is because these issues are likely to be remedied in the future, and the goodness of the future will astronomically dwarf the current negativity humanity has and is bringing about. (assuming we survive and realize a non-negligible fraction of our cosmic endowment)

The reason I think this is, well, the way I view it, its an immediate corollary of the standard yudkowsky/bostrom AI arguments. Animals existing and suffering is an extremely specific state of affairs, just like humans existing and being happy is an extremely specific state of affairs. This means that, if you optimize hard enough for anything, thats not exactly that (humans happy or animals suffering), you're not gonna get it.

And, maybe this is me being too optimistic (but I really hope not, and I really don't think so), but I don't think many humans want animals to suffer for its own sake. They'd eat lab-grown meat if it was cheaper and better tasting than animal-grown meat. Lab-grown meat is a good example of the general principle I'm talking about. Suffering of sentient minds is a complex thing. If you have a powerful optimizer, about its way optimizing the universe, you're virtually never gonna get suffering sentient minds unless that is what the optimizer is deliberately aiming for.

sharmake-farah on "The Solomonoff Prior is Malign" is a special case of a simpler argument

I think this is in fact the crux, in that I don't think they can do this in the general case, no matter how much compute is used, and even in the more specific cases, I still expect it to be extremely hard verging on impossible to actually get the distribution, primarily because you get equal evidence for almost every value, for the same reasons as why getting more compute is an instrumental convergent goal, so you cannot infer the values of basically anyone solely on the fact that you live in a simulation.

In the general case, the distribution/probability isn't even well defined at all.

turntrout on Announcing turntrout.com, my new digital home

Great point! I made this design choice back in April, so I wasn't as aware of the implications of localStorage.

Adds his 61st outstanding to-do item.

seth-herd on OpenAI Email Archives (from Musk v. Altman)

Very interesting. This does imply that Page was pretty committed to this view.

Note that he doesn't explicitly state that non-sentient machine successors would be fine; he could be assuming that the winning machines would be human-plus in all ways we value.

I think that's a foolish thing to assume and a foolish aspect of the question to overlook. That's why I think more careful philosophy would have helped resolve this disagreement with words instead of a gigantic industrial competition that's now putting as all at risk.

fabien-roger on Buck's Shortform

Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks

Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.

If at some point in a CEO trying to take over wants to use HHH to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place to access helpful-only models - which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted-CEO's productivity.

There is a lot of power concentration risk that just comes from people in power doing normal people-in-power things, such as increasing surveillance on dissidents - for which I agree that adversarial robustness is ~useless. But security against insider threats is quite useless too.