LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Rational Animations' intro to mechanistic interpretability
Writer · 2024-06-14T16:10:57.015Z · comments (1)

Some open-source dictionaries and dictionary learning infrastructure
Sam Marks (samuel-marks) · 2023-12-05T06:05:21.903Z · comments (7)

The Gemini Incident Continues
Zvi · 2024-02-27T16:00:05.648Z · comments (6)

Work with me on agent foundations: independent fellowship
Alex_Altair · 2024-09-21T13:59:16.706Z · comments (5)

We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap
johnswentworth · 2024-09-19T22:22:05.307Z · comments (47)

AI #80: Never Have I Ever
Zvi · 2024-09-10T17:50:08.074Z · comments (20)

Dating Roundup #3: Third Time’s the Charm
Zvi · 2024-05-08T13:30:03.232Z · comments (26)

A starting point for making sense of task structure (in machine learning)
Kaarel (kh) · 2024-02-24T01:51:49.227Z · comments (2)

[link] How people stopped dying from diarrhea so much (& other life-saving decisions)
Writer · 2024-03-16T16:00:47.830Z · comments (0)

Atlantis: Berkeley event venue available for rent
Jonas V (Jonas Vollmer) · 2023-11-22T01:47:12.026Z · comments (0)

Things Solenoid Narrates
Solenoid_Entity · 2024-04-12T23:57:16.169Z · comments (2)

[link] AlphaGeometry: An Olympiad-level AI system for geometry
alyssavance · 2024-01-17T17:17:30.913Z · comments (9)

[link] I'd also take $7 trillion
bhauth · 2024-02-19T03:31:45.552Z · comments (12)

[link] Paper: Tell, Don't Show- Declarative facts influence how LLMs generalize
Owain_Evans · 2023-12-19T19:14:26.423Z · comments (4)

AI #53: One More Leap
Zvi · 2024-02-29T16:10:04.049Z · comments (0)

AI #36: In the Background
Zvi · 2023-11-02T18:00:01.803Z · comments (5)

[link] Towards Evaluating AI Systems for Moral Status Using Self-Reports
Ethan Perez (ethan-perez) · 2023-11-16T20:18:51.730Z · comments (3)

On Tapping Out
Screwtape · 2023-11-17T03:23:55.880Z · comments (13)

Apply to LASR Labs: a London-based technical AI safety research programme
Erin Robertson · 2024-04-09T17:34:06.847Z · comments (1)

[link] Open Sourcing Metaculus
ChristianWilliams · 2024-07-02T22:30:01.339Z · comments (0)

[link] Chinese scientists acknowledge xrisk & call for international regulatory body [Linkpost]
Akash (akash-wasil) · 2023-11-01T13:28:43.723Z · comments (4)

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
Sonia Joseph (redhat) · 2024-03-13T17:09:17.027Z · comments (13)

Higher-Order Forecasts
ozziegooen · 2024-05-22T21:49:42.802Z · comments (1)

Userscript to always show LW comments in context vs at the top
Vlad Sitalo (harcisis) · 2023-11-21T17:53:30.418Z · comments (8)

On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche
Zack_M_Davis · 2024-01-09T23:12:20.349Z · comments (31)

When Does Altruism Strengthen Altruism?
jefftk (jkaufman) · 2024-01-21T18:50:05.424Z · comments (2)

AI #60: Oh the Humanity
Zvi · 2024-04-18T14:10:02.281Z · comments (7)

On Trust
johnswentworth · 2023-12-06T19:19:07.680Z · comments (26)

AI #38: Let’s Make a Deal
Zvi · 2023-11-16T19:50:05.442Z · comments (2)

Announcing Atlas Computing
miyazono · 2024-04-11T15:56:31.241Z · comments (4)

[link] Fluent dreaming for language models (AI interpretability method)
tbenthompson (ben-thompson) · 2024-02-06T06:02:59.296Z · comments (4)

[link] Level up your spreadsheeting
angelinahli · 2024-05-25T14:57:19.730Z · comments (11)

~80 Interesting Questions about Foundation Model Agent Safety
RohanS · 2024-10-28T16:37:04.713Z · comments (4)

Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (9)

Start an Upper-Room UV Installation Company?
jefftk (jkaufman) · 2024-10-19T02:00:10.691Z · comments (9)

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders
Gytis Daujotas (gytis-daujotas) · 2024-08-01T21:08:38.800Z · comments (6)

An Introduction to AI Sandbagging
Teun van der Weij (teun-van-der-weij) · 2024-04-26T13:40:00.126Z · comments (8)

[link] Against Student Debt Cancellation From All Sides of the Political Compass
Maxwell Tabarrok (maxwell-tabarrok) · 2024-05-13T14:55:57.525Z · comments (16)

D&D.Sci Long War: Defender of Data-mocracy
aphyer · 2024-04-26T22:30:15.780Z · comments (20)

[link] LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery (arjun-panickssery) · 2024-04-17T21:09:12.007Z · comments (1)

In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
Rubi J. Hudson (Rubi) · 2024-07-16T22:44:17.128Z · comments (27)

Auditing failures vs concentrated failures
ryan_greenblatt · 2023-12-11T02:47:35.703Z · comments (0)

What does davidad want from «boundaries»?
Chipmonk · 2024-02-06T17:45:42.348Z · comments (1)

[question] "Deception Genre" What Books are like Project Lawful?
Double · 2024-08-28T17:19:52.172Z · answers+comments (20)

Back to Basics: Truth is Unitary
lsusr · 2024-03-29T21:10:33.399Z · comments (13)

ProLU: A Nonlinearity for Sparse Autoencoders
Glen Taggart · 2024-04-23T14:09:21.592Z · comments (4)

D&D.Sci Long War: Defender of Data-mocracy Evaluation & Ruleset
aphyer · 2024-05-14T03:35:10.586Z · comments (3)

Why does generalization work?
Martín Soto (martinsq) · 2024-02-20T17:51:10.424Z · comments (16)

[link] How bad is chlorinated water?
bhauth · 2023-12-13T18:00:12.640Z · comments (18)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

vladimir_nesov on The Alignment Trap: AI Safety as Path to Power

The point is that the "controller" of a "controllable AI" is a role that can be filled by an AI and not only by a human or a human institution. AI is going to quickly grow the pie to the extent that makes current industry and economy (controlled by humans) a rounding error, so it seems unlikely that among the entities vying for control over controllable AIs, humans and human institutions are going to be worth mentioning. It's not even about a takeover, Google didn't take over Gambia.

andrew-sauer on Living Metaphorically

Another important one: Height/Altitude is authority. Your boss is "above" you, the king, president or CEO is "at the top", you "climb the corporate ladder"

d0themath on Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence

Sorry to give only a surface-level point of feedback, but I think this post would be much, much better if you shortened it significantly. As far as I can tell, pretty much every paragraph is 3x longer than it could be, which makes it a slog to read through.

ryan_greenblatt on How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

Joe's argument here would actually be locally valid if we changed:

a sufficient number of IQ 100 agents with sufficient time can do anything that an IQ 101 agent can do

to:

a sufficient number of IQ 100 agents with sufficient time can do anything that some number of IQ 101 agents can do eventually

We can see why this works when applied to your analogy. If we change:

A sufficient number of 4yo’s could pick up any weight that a 5yo could pick up

to

A sufficient number of 4yo’s could pick up any weight that some number of 5yo's could pick up

Then we can see where the issue comes in. The problem is that while a team of 4yo's can always beat a single 5yo, there exists some number of 5yo's which can beat any number of 4yo's.

If we fix the local validity issue in Joe's argument like this, it is easier to see where issues might crop up.

sharmake-farah on The Alignment Trap: AI Safety as Path to Power

This honestly depends on the level of control achieved over AI in practice.

I do agree with the claim that there are pretty strong incentives to have AI peacefully takeover everything, but this is a long-term incentive, and more importantly if control gets good enough, at least some people would wield control of AI because of AIs wanting to be controlled by humans, combined with AI control strategies being good enough that you might avoid takeover at least in the early regime.

To be clear, in the long run, I expect an AI to likely (as in 70-85% likely) to wield the fruits of control, but I think that humans will at least at first wield the control for a number of years, maybe followed by uploads of humans, like virtual dictators and leaders next in line for control.

julian-stastny on The case for unlearning that removes information from LLM weights

I wonder if the approach from your paper is in some sense too conservative to evaluate whether information has been removed: Suppose I used some magical scalpel and removed all information about Harry Potter from the model.

Then I wouldn't be too surprised if this leaves a giant HP-shaped hole in the model such that, if you then fine-tune on a small amount of HP-related data, suddenly everything falls into place and makes sense to the model again, and this rapidly generalizes.

Maybe fine-tuning robust unlearning requires us to fill in the holes with synthetic data so that this doesn't happen.

julian-stastny on The case for unlearning that removes information from LLM weights

By tamper-resistant fine-tuning, are you referring to this paper by Tamirisa et al? (That'd be a pretty devastating issue with the whole motivation to their paper since no one actually does anything but use LoRA for fine-tuning open-weight models...)

vladimir_nesov on The Alignment Trap: AI Safety as Path to Power

If your work makes AI systems more controllable, who will ultimately wield that control?

A likely answer is "an AI".

vladimir_nesov on The Alignment Trap: AI Safety as Path to Power

Recent discussions about artificial intelligence safety have focused heavily on ensuring AI systems remain under human control. While this goal seems laudable on its surface, we should carefully examine whether some proposed safety measures could paradoxically enable rather than prevent dangerous concentrations of power.

The aim of avoiding AI takeover that ends poorly for humanity is not about preventing dangerous concentrations of power. Power that is distributed among AIs and not concentrated is entirely compatible with an AI takeover than ends poorly for humanity.

danielfilan on Habryka's Shortform Feed

It looks kinda small to me, someone who uses Firefox on Ubuntu.