LessWrong 2.0 Reader

Interpreting the Learning of Deceit
RogerDearnaley (roger-d-1) · 2023-12-18T08:12:39.682Z · comments (14)

Putting multimodal LLMs to the Tetris test
Lovre · 2024-02-01T16:02:12.367Z · comments (5)

Some additional SAE thoughts
Hoagy · 2024-01-13T19:31:40.089Z · comments (4)

A Case for Superhuman Governance, using AI
ozziegooen · 2024-06-07T00:10:10.902Z · comments (0)

The Math of Suspicious Coincidences
Roko · 2024-02-07T13:32:35.513Z · comments (3)

Protestants Trading Acausally
Martin Sustrik (sustrik) · 2024-04-01T14:46:26.374Z · comments (4)

[link] There is no IQ for AI
Gabriel Alfour (gabriel-alfour-1) · 2023-11-27T18:21:26.196Z · comments (10)

The Third Gemini
Zvi · 2024-02-20T19:50:05.195Z · comments (2)

Adversarial Robustness Could Help Prevent Catastrophic Misuse
aogara (Aidan O'Gara) · 2023-12-11T19:12:26.956Z · comments (18)

Against "argument from overhang risk"
RobertM (T3t) · 2024-05-16T04:44:00.318Z · comments (11)

[question] What are things you're allowed to do as a startup?
Elizabeth (pktechgirl) · 2024-06-20T00:01:59.257Z · answers+comments (9)

Understanding Subjective Probabilities
Isaac King (KingSupernova) · 2023-12-10T06:03:27.958Z · comments (16)

[link] When scientists consider whether their research will end the world
Harlan · 2023-12-19T03:47:06.645Z · comments (4)

[link] The origins of the steam engine: An essay with interactive animated diagrams
jasoncrawford · 2023-11-29T18:30:36.315Z · comments (1)

Sparse MLP Distillation
slavachalnev · 2024-01-15T19:39:02.926Z · comments (3)

AI #62: Too Soon to Tell
Zvi · 2024-05-02T15:40:04.364Z · comments (8)

Information-Theoretic Boxing of Superintelligences
JustinShovelain · 2023-11-30T14:31:11.798Z · comments (0)

Fun With CellxGene
sarahconstantin · 2024-09-06T22:00:03.461Z · comments (2)

[link] Safety tax functions
owencb · 2024-10-20T14:08:38.099Z · comments (0)

AI #85: AI Wins the Nobel Prize
Zvi · 2024-10-10T13:40:07.286Z · comments (6)

AIS terminology proposal: standardize terms for probability ranges
eggsyntax · 2024-08-30T15:43:39.857Z · comments (12)

[link] [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew (ymath) · 2024-09-25T14:52:48.263Z · comments (1)

[link] AI forecasting bots incoming
Dan H (dan-hendrycks) · 2024-09-09T19:14:31.050Z · comments (44)

Examples of How I Use LLMs
jefftk (jkaufman) · 2024-10-14T17:10:04.597Z · comments (2)

[link] My Methodological Turn
adamShimi · 2024-09-29T15:01:45.986Z · comments (0)

[question] Where to find reliable reviews of AI products?
Elizabeth (pktechgirl) · 2024-09-17T23:48:25.899Z · answers+comments (6)

Weekly newsletter for AI safety events and training programs
Bryce Robertson (bryceerobertson) · 2024-05-03T00:33:29.418Z · comments (0)

Scorable Functions: A Format for Algorithmic Forecasting
ozziegooen · 2024-05-21T04:14:11.749Z · comments (0)

Results from the Turing Seminar hackathon
Charbel-Raphaël (charbel-raphael-segerie) · 2023-12-07T14:50:38.377Z · comments (1)

[link] GDP per capita in 2050
Hauke Hillebrandt (hauke-hillebrandt) · 2024-05-06T15:14:30.934Z · comments (8)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
Joe Carlsmith (joekc) · 2023-11-29T16:32:30.068Z · comments (1)

AI #61: Meta Trouble
Zvi · 2024-05-02T18:40:03.242Z · comments (0)

Big-endian is better than little-endian
Menotim · 2024-04-29T02:30:48.053Z · comments (17)

Quick Thoughts on Our First Sampling Run
jefftk (jkaufman) · 2024-05-23T00:20:02.050Z · comments (3)

A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans
Thane Ruthenis · 2023-12-17T20:28:57.854Z · comments (7)

Throughput vs. Latency
alkjash · 2024-01-12T21:37:07.632Z · comments (2)

Offering Completion
jefftk (jkaufman) · 2024-06-07T01:40:02.137Z · comments (6)

Deception Chess: Game #2
Zane · 2023-11-29T02:43:22.375Z · comments (17)

Investigating Bias Representations in LLMs via Activation Steering
DawnLu · 2024-01-15T19:39:14.077Z · comments (4)

[link] The Poker Theory of Poker Night
omark · 2024-04-07T09:47:01.658Z · comments (13)

End-to-end hacking with language models
tchauvin (timot.cool) · 2024-04-05T15:06:53.689Z · comments (0)

Please Understand
samhealy · 2024-04-01T12:33:20.459Z · comments (11)

[question] How does it feel to switch from earn-to-give?
Neil (neil-warren) · 2024-03-31T16:27:22.860Z · answers+comments (4)

[link] What fuels your ambition?
Cissy · 2024-01-31T18:30:53.274Z · comments (1)

[link] Debate helps supervise human experts [Paper]
habryka (habryka4) · 2023-11-17T05:25:17.030Z · comments (6)

[question] Weighing reputational and moral consequences of leaving Russia or staying
spza · 2024-02-18T19:36:40.676Z · answers+comments (24)

[link] Abs-E (or, speak only in the positive)
dkl9 · 2024-02-19T21:14:32.095Z · comments (24)

Wholesome Culture
owencb · 2024-03-01T12:08:17.877Z · comments (3)

Glomarization FAQ
Zane · 2023-11-15T20:20:49.488Z · comments (5)

Non-myopia stories
lberglund (brglnd) · 2023-11-13T17:52:31.933Z · comments (10)

Recent comments

daniel-kokotajlo on avturchin's Shortform

Huh. If you pretend to throw the stone, does that mean you make a throwing motion with your arm, but just don't actually release the object you are holding? If so, how come they run away instead of e.g. cringing and expecting to get hit, and then not getting hit, and figuring that you missed and are now out of ammo?

Or does it mean you make menacing gestures as if to throw, but don't actually make the whole throwing motion?

avturchin on I turned decision theory problems into memes about trolleys

Can you make a Trolley meme for Death in Damascus and the Doomsday Argument?

Can you prove that any decision theory problem can be expressed as some Trolley problem?

nathan-helm-burger on Three Notions of "Power"

Thinking a bit more about this, I might group types of power into:

Power through relating: Social/economic/government/negotiating/threatening, reshaping the social world and the behavior of others

Power through understanding: having intellect and knowledge affordances, being able to solve clever puzzles in the world to achieve aims

Power through control: having physical affordances that allow for taking potent actions, reshaping the physical world

They all bleed together at the edges and are somewhat fungible in various ways, but I think it makes sense to talk of clusters despite their fuzzy edges.

johnswentworth on Three Notions of "Power"

Human psychology, mainly. "Dominance"-in-the-human-intuitive-sense was in the original post mainly because I think that's how most humans intuitively understand "power", despite (I claimed) not being particularly natural for more-powerful agents. So I'd expect humans to be confused insofar as they try to apply those dominance-in-the-human-intuitive-sense intuitions to more powerful agents.

And like, sure, one could use a notion of "dominance" which is general enough to encompass all forms of conflict, but at that point we can just talk about "conflict" and the like without the word "dominance"; using the word "dominance" for that is unnecessarily confusing, because most humans' intuitive notion of "dominance" is narrower.

johnswentworth on Three Notions of "Power"

Because there'd be an unexploitable-equilibrium condition where a government that isn't focused on dominance is weaker than a government more focused on government, it would generally be held by those who have the strongest focus on dominance.

This argument only works insofar as governments less focused on dominance are, in fact, weaker militarily, which seems basically-false in practice in the long run. For instance, autocratic regimes just can't compete industrially with a market economy like e.g. most Western states today, and that industrial difference turns into a comprehensive military advantage with relatively moderate time and investment. And when countries switch to full autocracy, there's sometimes a short-term military buildup but they tend to end up waaaay behind militarily a few years down the road IIUC.

nathan-helm-burger on Three Notions of "Power"

The post seems to me to be about notions of power, and the affordances of intelligent agents. I think this is a relevant kind of power to keep in mind.

tailcalled on Three Notions of "Power"

What phenomenon are you modelling where this distinction is relevant?

nathan-helm-burger on Three Notions of "Power"

I think we're using different concepts of 'dominance' here. I usually think of 'dominance' as a social relationship between a strong party and a submissive party, a hierarchy. A relationship between a ruler and the ruled, or an abuser and abused. I don't think that a human driving a bulldozer which destroys an anthill without the human even noticing that the anthill existed is the same sort of relationship. I think we need some word other than 'dominant' to describe the human wiping out the ants in an instant without sparing them a thought. It doesn't particularly seem like a conflict even. The human in a bulldozer didn't perceive themselves to be in a conflict, the ants weren't powerful enough to register as an opponent or obstacle at all.

fabien-roger on The case for unlearning that removes information from LLM weights

I am not sure that it is over-conservative. If you have an HP-shaped hole that can easily be transformed into HP data using fine-tuning, does it give you a high level of confidence that people misusing the model won't be able to extract the information from the HP-shaped hole, or that a misaligned model won't be able to notice the HP-shaped hole and use that to answer questions about HP when it really wants to?

I think that it depends on the specifics of how you built the HP-shaped hole (without scrambling the information). I don't have a good intuition for what a good technique like that could look like. A naive thing that comes to mind would be something like "replace all facts in HP by their opposite" (if you had a magic fact-editing tool), but I feel like in this situation it would be pretty easy for an attacker (human misuse or misaligned model) to notice "wow all HP knowledge has been replaced by anti-HP knowledge" and then extract all the HP information by just swapping the answers.

tailcalled on Three Notions of "Power"

Except for the child and the blacksmith, all of these seem like dominance conflicts to me. The blacksmith plausibly becomes a dominance conflict too once you consider how he ended up with the resources and what tasks he's likely to face. You contrast these with conflicts between human groups, but I'd compare to e.g. a conflict between a drunk middle-aged loner who is looking for a brawl vs two young policemen and a bar owner.