LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] MIRI's May 2024 Newsletter
Harlan · 2024-05-15T00:13:30.153Z · comments (1)

[Intuitive self-models] 2. Conscious Awareness
Steven Byrnes (steve2152) · 2024-09-25T13:29:02.820Z · comments (47)

Quick look: applications of chaos theory
Elizabeth (pktechgirl) · 2024-08-18T15:00:07.853Z · comments (45)

Thomas Kwa's research journal
Thomas Kwa (thomas-kwa) · 2023-11-23T05:11:08.907Z · comments (1)

[link] My thesis (Algorithmic Bayesian Epistemology) explained in more depth
Eric Neyman (UnexpectedValues) · 2024-05-09T19:43:16.543Z · comments (4)

LessWrong Community Weekend 2024, open for applications
UnplannedCauliflower · 2024-05-01T10:18:21.992Z · comments (2)

[link] Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety
titotal (lombertini) · 2024-09-18T13:07:40.754Z · comments (3)

Secular interpretations of core perennialist claims
zhukeepa · 2024-08-25T23:41:02.683Z · comments (32)

EU policymakers reach an agreement on the AI Act
tlevin (trevor) · 2023-12-15T06:02:44.668Z · comments (7)

Corrigibility = Tool-ness?
johnswentworth · 2024-06-28T01:19:48.883Z · comments (8)

Spaciousness In Partner Dance: A Naturalism Demo
LoganStrohl (BrienneYudkowsky) · 2023-11-19T07:00:19.555Z · comments (5)

A couple productivity tips for overthinkers
Steven Byrnes (steve2152) · 2024-04-20T16:05:50.332Z · comments (13)

Reactions to the Executive Order
Zvi · 2023-11-01T20:40:02.438Z · comments (4)

MATS Summer 2023 Retrospective
utilistrutil · 2023-12-01T23:29:47.958Z · comments (34)

Attention SAEs Scale to GPT-2 Small
Connor Kissane (ckkissane) · 2024-02-03T06:50:22.583Z · comments (4)

[link] [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate
trevor (TrevorWiesinger) · 2024-03-28T16:03:36.452Z · comments (22)

OpenAI: Leaks Confirm the Story
Zvi · 2023-12-12T14:00:04.812Z · comments (9)

ACX Covid Origins Post convinced readers
ErnestScribbler · 2024-05-01T13:06:20.818Z · comments (7)

The Parable Of The Fallen Pendulum - Part 2
johnswentworth · 2024-03-12T21:41:30.180Z · comments (8)

Questions for labs
Zach Stein-Perlman · 2024-04-30T22:15:55.362Z · comments (11)

Send us example gnarly bugs
Beth Barnes (beth-barnes) · 2023-12-10T05:23:00.773Z · comments (10)

Reward hacking behavior can generalize across tasks
Kei · 2024-05-28T16:33:50.674Z · comments (5)

Secondary forces of debt
KatjaGrace · 2024-06-27T21:10:06.131Z · comments (18)

On Claude 3.0
Zvi · 2024-03-06T18:50:04.766Z · comments (5)

Bitter lessons about lucid dreaming
avturchin · 2024-10-16T21:27:04.725Z · comments (52)

Grief is a fire sale
Nathan Young · 2024-03-04T01:11:06.882Z · comments (1)

[link] AI takeoff and nuclear war
owencb · 2024-06-11T19:36:24.710Z · comments (6)

Lying Alignment Chart
Zack_M_Davis · 2023-11-29T16:15:28.102Z · comments (17)

Value fragility and AI takeover
Joe Carlsmith (joekc) · 2024-08-05T21:28:07.306Z · comments (5)

Creating unrestricted AI Agents with Command R+
Simon Lermen (dalasnoin) · 2024-04-16T14:52:50.917Z · comments (13)

Scaffolding for "Noticing Metacognition"
Raemon · 2024-10-09T17:54:13.657Z · comments (4)

Universal Love Integration Test: Hitler
Raemon · 2024-01-10T23:55:35.526Z · comments (65)

[link] The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review
jessicata (jessica.liu.taylor) · 2024-03-27T19:59:27.893Z · comments (36)

Darwinian Traps and Existential Risks
KristianRonn · 2024-08-25T22:37:14.142Z · comments (14)

[link] Are language models good at making predictions?
dynomight · 2023-11-06T13:10:36.379Z · comments (14)

[question] What could a policy banning AGI look like?
TsviBT · 2024-03-13T14:19:07.783Z · answers+comments (23)

Coherence of Caches and Agents
johnswentworth · 2024-04-01T23:04:31.320Z · comments (9)

Mid-conditional love
KatjaGrace · 2024-04-17T04:00:08.341Z · comments (21)

Vote on Anthropic Topics to Discuss
Ben Pace (Benito) · 2024-03-06T19:43:47.194Z · comments (55)

[link] Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds"
mattmacdermott · 2024-02-29T13:59:34.959Z · comments (19)

AISC9 has ended and there will be an AISC10
Linda Linsefors · 2024-04-29T10:53:18.812Z · comments (4)

[link] The Offense-Defense Balance Rarely Changes
Maxwell Tabarrok (maxwell-tabarrok) · 2023-12-09T15:21:23.340Z · comments (23)

The Obliqueness Thesis
jessicata (jessica.liu.taylor) · 2024-09-19T00:26:30.677Z · comments (16)

[Valence series] 3. Valence & Beliefs
Steven Byrnes (steve2152) · 2023-12-11T20:21:30.570Z · comments (11)

[link] The problems with the concept of an infohazard as used by the LW community [Linkpost]
Noosphere89 (sharmake-farah) · 2023-12-22T16:13:54.822Z · comments (43)

My 10-year retrospective on trying SSRIs
Kaj_Sotala · 2024-09-22T20:30:02.483Z · comments (10)

Rationality Quotes - Fall 2024
Screwtape · 2024-10-10T18:37:55.013Z · comments (21)

[link] Claude 3.5 Sonnet
Zach Stein-Perlman · 2024-06-20T18:00:35.443Z · comments (41)

Analogies between scaling labs and misaligned superintelligent AI
scasper · 2024-02-21T19:29:39.033Z · comments (5)

My guess at Conjecture's vision: triggering a narrative bifurcation
Alexandre Variengien (alexandre-variengien) · 2024-02-06T19:10:42.690Z · comments (12)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

daniel-kokotajlo on avturchin's Shortform

Huh. If you pretend to throw the stone, does that mean you make a throwing motion with your arm, but just don't actually release the object you are holding? If so, how come they run away instead of e.g. cringing and expecting to get hit, and then not getting hit, and figuring that you missed and are now out of ammo?

Or does it mean you make menacing gestures as if to throw, but don't actually make the whole throwing motion?

avturchin on I turned decision theory problems into memes about trolleys

Can you make Trolley meme for Death in Damascus and Doomsday Argument?

Can prove that you can express any decision theory problem as some Trolley problem?

nathan-helm-burger on Three Notions of "Power"

Thinking a bit more about this, I might group types of power into:

Power through relating: Social/economic/government/negotiating/threatening, reshaping the social world and the behavior of others

Power through understanding: having intellect and knowledge affordances, being able to solve clever puzzles in the world to achieve aims

Power through control: having physical affordances that allow for taking potent actions, reshaping the physical world

They all bleed together at the edges and are somewhat fungible in various ways, but I think it makes sense to talk of clusters despite their fuzzy edges.

johnswentworth on Three Notions of "Power"

Human psychology, mainly. "Dominance"-in-the-human-intuitive-sense was in the original post mainly because I think that's how most humans intuitively understand "power", despite (I claimed) not being particularly natural for more-powerful agents. So I'd expect humans to be confused insofar as they try to apply those dominance-in-the-human-intuitive-sense intuitions to more powerful agents.

And like, sure, one could use a notion of "dominance" which is general enough to encompass all forms of conflict, but at that point we can just talk about "conflict" and the like without the word "dominance"; using the word "dominance" for that is unnecessarily confusing, because most humans' intuitive notion of "dominance" is narrower.

johnswentworth on Three Notions of "Power"

Because there'd be an unexploitable-equillibrium condition where a government that isn't focused on dominance is weaker than a government more focused on government, it would generally be held by those who have the strongest focus on dominance.

This argument only works insofar as governments less focused on dominance are, in fact, weaker militarily, which seems basically-false in practice in the long run. For instance, autocratic regimes just can't compete industrially with a market economy like e.g. most Western states today, and that industrial difference turns into a comprehensive military advantage with relatively moderate time and investment. And when countries switch to full autocracy, there's sometimes a short-term military buildup but they tend to end up waaaay behind militarily a few years down the road IIUC.

nathan-helm-burger on Three Notions of "Power"

The post seems to me to be about notions of power, and the affordances of intelligent agents. I think this is a relevant kind of power to keep in mind.

tailcalled on Three Notions of "Power"

What phenomenon are you modelling where this distinction is relevant?

nathan-helm-burger on Three Notions of "Power"

I think we're using different concepts of 'dominance' here. I usually think of 'dominance' as a social relationship between a strong party and a submissive party, a hierarchy. A relationship between a ruler and the ruled, or an abuser and abused. I don't think that a human driving a bulldozer which destroys an anthill without the human even noticing that the anthill existed is the same sort of relationship. I think we need some word other than 'dominant' to describe the human wiping out the ants in an instant without sparing them a thought. It doesn't particularly seem like a conflict even. The human in a bulldozer didn't perceive themselves to be in a conflict, the ants weren't powerful enough to register as an opponent or obstacle at all.

fabien-roger on The case for unlearning that removes information from LLM weights

I am not sure that it is over-conservative. If you have an HP-shaped that can easily be transformed in HP-data using fine-tuning, does it give you a high level of confidence that people misusing the model won't be able to extract the information from the HP-shaped hole or that a misaligned model won't be able to notice to HP-shaped hole and use that to answer to question to HP when it really wants to?

I think that it depends on the specifics of how you built the HP-shaped hole (without scrambling the information). I don't have a good intuition for what a good technique like that could look like. A naive thing that comes to mind would be something like "replace all facts in HP by their opposite" (if you had a magic fact-editing tool), but I feel like in this situation it would be pretty easy for an attacker (human misuse or misaligned model) to notice "wow all HP knowledge has been replaced by anti-HP knowledge" and then extract all the HP information by just swapping the answers.

tailcalled on Three Notions of "Power"

Except for the child and the blacksmith, all of these seem like dominance conflicts to me. The blacksmith plausibly becomes a dominance conflict too once you consider how he ended up with the resources and what tasks he's likely to face. You contrast these with conflicts between human groups, but I'd compare to e.g. a conflict between a drunk middle-aged loner who is looking for a brawl vs two young policemen and a bar owner.