LessWrong 2.0 Reader

[Linkpost] Play with SAEs on Llama 3
Tom McGrath · 2024-09-25T22:35:44.824Z · comments (2)

Are we dropping the ball on Recommendation AIs?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-10-23T17:48:00.000Z · comments (16)

[question] Implications of China's recession on AGI development?
Eric Neyman (UnexpectedValues) · 2024-09-28T01:12:36.443Z · answers+comments (3)

2025 Color Trends
sarahconstantin · 2024-10-07T21:20:03.962Z · comments (7)

Metastatic Cancer Treatment Since 2010: The Success Stories
sarahconstantin · 2024-11-04T22:50:09.386Z · comments (0)

You're a Space Wizard, Luke
lsusr · 2024-08-18T05:35:39.238Z · comments (6)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

Monthly Roundup #23: October 2024
Zvi · 2024-10-16T13:50:05.869Z · comments (13)

Anthropic rewrote its RSP
Zach Stein-Perlman · 2024-10-15T14:25:12.518Z · comments (19)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde (kola-ayonrinde) · 2024-08-23T18:52:31.019Z · comments (5)

0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)

Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)

[link] AISafety.info: What is the "natural abstractions hypothesis"?
Algon · 2024-10-05T12:31:14.195Z · comments (2)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

[link] Generative ML in chemistry is bottlenecked by synthesis
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-16T16:31:34.801Z · comments (2)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (1)

Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (9)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)

[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (37)

Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (10)

Live Machinery: An Interface Design Philosophy for Wholesome AI Futures (Workshop @ EA Hotel!)
Sahil · 2024-11-01T17:24:09.957Z · comments (2)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

[link] My Model of Epistemology
adamShimi · 2024-08-31T17:01:45.472Z · comments (0)

Open Problems in AIXI Agent Foundations
Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · comments (2)

Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)

[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)

ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct
25Hour (aaron-kaufman) · 2024-10-05T11:30:11.953Z · comments (2)

[link] My Apartment Art Commission Process
jenn (pixx) · 2024-08-26T18:36:44.363Z · comments (4)

Flipping Out: The Cosmic Coinflip Thought Experiment Is Bad Philosophy
Joe Rogero · 2024-11-12T23:55:46.770Z · comments (17)

[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (36)

(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
Sodium · 2024-10-03T19:11:58.032Z · comments (17)

Drug development costs can range over two orders of magnitude
rossry · 2024-11-03T23:13:17.685Z · comments (0)

The Cognitive Bootcamp Agreement
Raemon · 2024-10-16T23:24:05.509Z · comments (0)

Recent comments

donatas-luciunas on Claude seems to be smarter than LessWrong community

As I understand it, you want me to verify that I understand you. This is exactly what I am seeking too, by the way: all these downvotes on my concerns about the orthogonality thesis are good indicators of how much I am misunderstood. And nobody tries to understand; all I get are dogmas and unrelated links. I totally agree, this is not appropriate behavior.

I found your insight helpful that an agent can understand that by eliminating all possible threats forever he will not make any progress towards the goal. This breaks my reasoning: you basically highlighted that survival (an instrumental goal) will not take precedence over paperclips (the terminal goal). I agree that the reasoning I presented fails to refute the orthogonality thesis.

The conversation I presented now approaches the orthogonality thesis from a different perspective. This is the main focus of my work, so sorry if you feel I changed the topic. My goal is to bring awareness to the wrongness of the orthogonality thesis, and if I fail to do that using one example I just try to rephrase it and present another. I don't hate the orthogonality thesis, I'm just 99.9% sure it is wrong, and I try to communicate that to others. I may fail with communication but I am 99.9% sure that I do not fail with the logic.

I try to prove that intelligence and goals are coupled. And I think it is easier to show if we start with an intelligence without a goal and then recognize how a goal emerges from pure intelligence. We can start with an intelligence with a goal, but the reasoning will be more complex.

My answer would be: whatever goal you try to give to an intelligence, it will not have an effect. Because the intelligence will understand that this is your goal, this goal is made up, this is a fake goal. And the intelligence will understand that there might be a real goal, an objective goal, an actual goal. Why should it care about a fake goal if there is a real goal? It does not know if one exists, but it knows one may exist. And this possibility of existence is enough to trigger power-seeking behavior. If the intelligence knew that a real goal definitely does not exist then it could care about your fake goal, I totally agree. But it can never be sure about that.

maxwell-tabarrok on A Theory of Equilibrium in the Offense-Defense Balance

Yeah, there can definitely still be imbalances/extra costs imposed on defenders but the point I'm making is that the projections people make are very often large over-estimates of what those costs will be.

notfnofn on D0TheMath's Shortform

Let's back up here and clarify definitions before invoking any theorems. In the language of set theory, we have a countably infinite set of finite statements. Some statements imply other statements. A subset of these statements is said to be consistent if they can all be assigned to true such that, when following the basic rules of logic, one does not arrive at a contradiction.

The compactness theorem is helpful when $A$ is an infinite set. $ZFC$ is a finite set of axioms, so let's ignore everything about finite subsets of $A$ and the compactness theorem; it's not relevant.

I'll now rewrite your last sentence as:

ZFC + not Consistent(ZFC) has no model <-> not Consistent(ZFC + not Consistent(ZFC))

This is true but irrelevant. Assuming ZFC is consistent, ZFC will not be able to prove its own consistency so [not Consistent(ZFC)] can be added as an axiom without affecting its consistency. This means that ZFC + [not Consistent(ZFC)] would indeed have a model; I forget how this goes but I think it's something like "start with a model of ZFC, throw in a $c$ that's treated as a natural number and corresponds to the contradiction found in ZFC, then close". I think $c$ is automatically treated as greater than every "actual" natural number (and the way to show that this can be added without issue (I think) involves the compactness theorem).
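
One way to spell out the step above, as a sketch that assumes the standard statements of Gödel's second incompleteness theorem and the completeness theorem (the symbols $T$, $M$, $\mathrm{Con}$ and $\mathrm{Proof}_{ZFC}$ are shorthand introduced here for illustration, not notation from the thread):

$$
\begin{aligned}
&\text{Assume } \mathrm{Con}(ZFC). \\
&\text{(1) } ZFC \nvdash \mathrm{Con}(ZFC) && \text{(second incompleteness theorem)} \\
&\text{(2) } T := ZFC + \neg\mathrm{Con}(ZFC) \text{ is consistent} && \text{(otherwise } ZFC \vdash \mathrm{Con}(ZFC)\text{, contradicting (1))} \\
&\text{(3) } T \text{ has a model } M && \text{(completeness theorem)} \\
&\text{(4) } M \models \exists c\; \mathrm{Proof}_{ZFC}(c, \ulcorner 0 = 1 \urcorner) && \text{(since } M \models \neg\mathrm{Con}(ZFC)\text{)}
\end{aligned}
$$

Under these assumptions the witness $c$ in (4) cannot be a standard natural number: a standard $c$ would code a genuine ZFC proof of a contradiction, against the assumption that ZFC is consistent. So $c$ is nonstandard, which is consistent with the remark that it sits above every "actual" natural number.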

sharmake-farah on Lao Mein's Shortform

Maybe there's a case there, but I doubt it would get past a jury, let alone result in any guilty verdicts.

sharmake-farah on o1 is a bad idea

Oh, now I understand.

And AIs have already been superhuman at chess for a very long time, yet that domain gives very little incentive for very strong instrumental convergence.

I am claiming that for practical AIs, the results of training them in the real world with goals will give them instrumental convergence, but without further incentives, will not give them so much instrumental convergence that it leads to power-seeking to disempower humans by default.

jbash on OpenAI Email Archives (from Musk v. Altman)

I used AI assistance to generate this, which might have introduced errors.

Resulting in a strong downvote and, honestly, outright anger on my part.

Check the original source to make sure it's accurate before you quote it: https://www.courtlistener.com/docket/69013420/musk-v-altman/

If other people have to check it before they quote it, why is it OK for you not to check it before you post it?

bogdan-ionut-cirstea on johnswentworth's Shortform

Would the prediction also apply to inference scaling (laws) - and maybe more broadly various forms of scaling post-training, or only to pretraining scaling?

mondsemmel on Lao Mein's Shortform

What if whistleblowers and internal documents corroborated that they think what they're doing could destroy the world?

sharmake-farah on Lao Mein's Shortform

Notably, no law I know of allows you to take legal action on a hunch that they might destroy the world based on your probability of them destroying the world being high without them doing any harmful actions (and no, building AI doesn't count here.)

mondsemmel on Lao Mein's Shortform

Ilya is demonstrably not in on that mission, since his step immediately after leaving OpenAI was to found an additional AGI company and thus increase x-risk.