LessWrong 2.0 Reader

Aggregative Principles of Social Justice
Cleo Nardo (strawberry calm) · 2024-06-05T13:44:47.499Z · comments (10)

Reviewing the Structure of Current AI Regulations
Deric Cheng (deric-cheng) · 2024-05-07T12:34:17.820Z · comments (0)

Experience Report - ML4Good AI Safety Bootcamp
Kieron Kretschmar · 2024-04-11T18:03:41.040Z · comments (0)

Big-endian is better than little-endian
Menotim · 2024-04-29T02:30:48.053Z · comments (17)

Paper Summary: Princes and Merchants: European City Growth Before the Industrial Revolution
Jeffrey Heninger (jeffrey-heninger) · 2024-07-15T21:30:04.043Z · comments (1)

DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking
tailcalled · 2024-06-10T21:20:11.938Z · comments (13)

Two Tales of AI Takeover: My Doubts
Violet Hour · 2024-03-05T15:51:05.558Z · comments (8)

AI labs can boost external safety research
Zach Stein-Perlman · 2024-07-31T19:30:16.207Z · comments (1)

[question] What Other Lines of Work are Safe from AI Automation?
RogerDearnaley (roger-d-1) · 2024-07-11T10:01:12.616Z · answers+comments (35)

End-to-end hacking with language models
tchauvin (timot.cool) · 2024-04-05T15:06:53.689Z · comments (0)

[link] The Poker Theory of Poker Night
omark · 2024-04-07T09:47:01.658Z · comments (13)

Investigating Bias Representations in LLMs via Activation Steering
DawnLu · 2024-01-15T19:39:14.077Z · comments (4)

[LDSL#4] Root cause analysis versus effect size estimation
tailcalled · 2024-08-11T16:12:14.604Z · comments (0)

AI #64: Feel the Mundane Utility
Zvi · 2024-05-16T15:20:02.956Z · comments (11)

Experiments with an alternative method to promote sparsity in sparse autoencoders
Eoin Farrell · 2024-04-15T18:21:48.771Z · comments (7)

Childhood and Education Roundup #6: College Edition
Zvi · 2024-06-26T11:40:03.990Z · comments (8)

{Book Summary} The Art of Gathering
Tristan Williams (tristan-williams) · 2024-04-16T10:48:41.528Z · comments (0)

[link] AI Impacts 2023 Expert Survey on Progress in AI
habryka (habryka4) · 2024-01-05T19:42:17.226Z · comments (1)

Employee Incentives Make AGI Lab Pauses More Costly
nikola (nikolaisalreadytaken) · 2023-12-22T05:04:15.598Z · comments (12)

[link] New blog: Expedition to the Far Lands
Connor Leahy (NPCollapse) · 2024-08-17T11:07:48.537Z · comments (3)

DIY RLHF: A simple implementation for hands on experience
Mike Vaiana (mike-vaiana) · 2024-07-10T12:07:03.047Z · comments (0)

Reading More Each Day: A Simple $35 Tool
aysajan · 2024-07-24T13:54:04.290Z · comments (2)

Monthly Roundup #19: June 2024
Zvi · 2024-06-25T12:00:03.333Z · comments (9)

Ackshually, many worlds is wrong
tailcalled · 2024-04-11T20:23:59.416Z · comments (42)

An explanation of evil in an organized world
KatjaGrace · 2024-05-02T05:20:06.240Z · comments (9)

Cryonics p(success) estimates are only weakly associated with interest in pursuing cryonics in the LW 2023 Survey
Andy_McKenzie · 2024-02-29T14:47:28.613Z · comments (6)

Can quantised autoencoders find and interpret circuits in language models?
charlieoneill (kingchucky211) · 2024-03-24T20:05:50.125Z · comments (4)

Auditing LMs with counterfactual search: a tool for control and ELK
Jacob Pfau (jacob-pfau) · 2024-02-20T00:02:09.575Z · comments (6)

[link] Cellular reprogramming, pneumatic launch systems, and terraforming Mars: Some things I learned about at Foresight Vision Weekend
jasoncrawford · 2024-01-04T19:33:57.887Z · comments (0)

[link] Quick Thoughts on Scaling Monosemanticity
Joel Burget (joel-burget) · 2024-05-23T16:22:48.035Z · comments (1)

[link] Conversation Visualizer
ethanmorse · 2023-12-31T01:18:01.424Z · comments (4)

Cicadas, Anthropic, and the bilateral alignment problem
kromem · 2024-05-22T11:09:56.469Z · comments (6)

Updates to Open Phil’s career development and transition funding program
abergal · 2023-12-04T18:10:29.394Z · comments (0)

AI #65: I Spy With My AI
Zvi · 2024-05-23T12:40:02.793Z · comments (7)

Collection (Part 6 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-14T21:37:00.160Z · comments (0)

Evaporation of improvements
Viliam · 2024-06-20T18:34:40.969Z · comments (27)

Tackling Moloch: How YouCongress Offers a Novel Coordination Mechanism
Hector Perez Arenas (hector-perez-arenas) · 2024-05-15T23:13:48.501Z · comments (9)

Aggregative principles approximate utilitarian principles
Cleo Nardo (strawberry calm) · 2024-06-12T16:27:22.179Z · comments (3)

[link] Memo on some neglected topics
Lukas Finnveden (Lanrian) · 2023-11-11T02:01:55.834Z · comments (2)

Heuristics for preventing major life mistakes
SK2 (lunchbox) · 2023-12-20T08:01:09.340Z · comments (2)

Escaping Skeuomorphism
Stuart Johnson (stuart-johnson) · 2023-12-20T03:51:00.489Z · comments (0)

Deconfusing “ontology” in AI alignment
Dylan Bowman (dylan-bowman) · 2023-11-08T20:03:43.205Z · comments (3)

[question] How did you integrate voice-to-text AI into your workflow?
ChristianKl · 2023-11-20T12:01:37.696Z · answers+comments (12)

[link] A new process for mapping discussions
Nathan Young · 2024-09-30T08:57:20.029Z · comments (7)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (8)

[Intuitive self-models] 7. Hearing Voices, and Other Hallucinations
Steven Byrnes (steve2152) · 2024-10-29T13:36:16.325Z · comments (2)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

Towards Quantitative AI Risk Management
Henry Papadatos (henry) · 2024-10-16T19:26:48.817Z · comments (1)

[link] Our Digital and Biological Children
Eneasz · 2024-10-24T18:36:38.719Z · comments (0)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (86)

Recent comments

vladimir_nesov on The Alignment Trap: AI Safety as Path to Power

The point is that the "controller" of a "controllable AI" is a role that can be filled by an AI and not only by a human or a human institution. AI is going to quickly grow the pie to an extent that makes current industry and economy (controlled by humans) a rounding error, so it seems unlikely that among the entities vying for control over controllable AIs, humans and human institutions are going to be worth mentioning. It's not even about a takeover: Google didn't take over Gambia.

andrew-sauer on Living Metaphorically

Another important one: Height/Altitude is authority. Your boss is "above" you, the king, president or CEO is "at the top", you "climb the corporate ladder".

d0themath on Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence

Sorry to give only a surface-level point of feedback, but I think this post would be much, much better if you shortened it significantly. As far as I can tell, pretty much every paragraph is 3x longer than it could be, which makes it a slog to read through.

ryan_greenblatt on How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

Joe's argument here would actually be locally valid if we changed:

a sufficient number of IQ 100 agents with sufficient time can do anything that an IQ 101 agent can do

to:

a sufficient number of IQ 100 agents with sufficient time can do anything that some number of IQ 101 agents can do eventually

We can see why this works when applied to your analogy. If we change:

A sufficient number of 4yo’s could pick up any weight that a 5yo could pick up

to

A sufficient number of 4yo’s could pick up any weight that some number of 5yo's could pick up

Then we can see where the issue comes in. The problem is that while a team of 4yo's can always beat a single 5yo, there exists some number of 5yo's which can beat any number of 4yo's.

If we fix the local validity issue in Joe's argument like this, it is easier to see where issues might crop up.

sharmake-farah on The Alignment Trap: AI Safety as Path to Power

This honestly depends on the level of control achieved over AI in practice.

I do agree with the claim that there are pretty strong incentives to have AI peacefully take over everything, but this is a long-term incentive. More importantly, if control gets good enough, at least some people would wield control of AI, because AIs would want to be controlled by humans and AI control strategies would be good enough that you might avoid takeover at least in the early regime.

To be clear, in the long run, I expect that an AI will likely (as in 70-85% likely) wield the fruits of control, but I think that humans will at least at first wield the control for a number of years, maybe followed by uploads of humans, like virtual dictators and leaders next in line for control.

julian-stastny on The case for unlearning that removes information from LLM weights

I wonder if the approach from your paper is in some sense too conservative to evaluate whether information has been removed: Suppose I used some magical scalpel and removed all information about Harry Potter from the model.

Then I wouldn't be too surprised if this leaves a giant HP-shaped hole in the model such that, if you then fine-tune on a small amount of HP-related data, suddenly everything falls into place and makes sense to the model again, and this rapidly generalizes.

Maybe fine-tuning robust unlearning requires us to fill in the holes with synthetic data so that this doesn't happen.

julian-stastny on The case for unlearning that removes information from LLM weights

By tamper-resistant fine-tuning, are you referring to this paper by Tamirisa et al.? (That'd be a pretty devastating issue with the whole motivation of their paper, since no one actually does anything but use LoRA for fine-tuning open-weight models...)

vladimir_nesov on The Alignment Trap: AI Safety as Path to Power

If your work makes AI systems more controllable, who will ultimately wield that control?

A likely answer is "an AI".

vladimir_nesov on The Alignment Trap: AI Safety as Path to Power

Recent discussions about artificial intelligence safety have focused heavily on ensuring AI systems remain under human control. While this goal seems laudable on its surface, we should carefully examine whether some proposed safety measures could paradoxically enable rather than prevent dangerous concentrations of power.

The aim of avoiding AI takeover that ends poorly for humanity is not about preventing dangerous concentrations of power. Power that is distributed among AIs and not concentrated is entirely compatible with an AI takeover that ends poorly for humanity.

danielfilan on Habryka's Shortform Feed

It looks kinda small to me, someone who uses Firefox on Ubuntu.