LessWrong 2.0 Reader


Trying to understand Hanson's Cultural Drift argument
Kemp (ethan-kemp) · 2024-07-22T20:20:32.734Z · comments (1)
Efficient Dictionary Learning with Switch Sparse Autoencoders
Anish Mudide (anish-mudide) · 2024-07-22T18:45:53.502Z · comments (15)
Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark (hojmax) · 2024-07-22T16:17:07.665Z · comments (0)
The Garden of Eden
Alexander Turok · 2024-07-22T16:07:42.509Z · comments (1)
Caring about excellence
owencb · 2024-07-22T14:24:37.892Z · comments (4)
[link] Tim Dillon's fake business is the most influential video I have watched in the last 24 months
Stuart Johnson (stuart-johnson) · 2024-07-22T12:54:43.749Z · comments (0)
On the CrowdStrike Incident
Zvi · 2024-07-22T12:40:05.894Z · comments (14)
Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents
Sam F. Brown (sam-4) · 2024-07-22T12:33:57.656Z · comments (0)
Initial Experiments Using SAEs to Help Detect AI Generated Text
Aaron_Scher · 2024-07-22T05:16:20.516Z · comments (0)
Categories of leadership on technical teams
benkuhn · 2024-07-22T04:50:04.071Z · comments (0)
An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)
OpenAI Boycott Revisit
Jake Dennie · 2024-07-22T01:44:55.094Z · comments (2)
Coalitional agency
Richard_Ngo (ricraz) · 2024-07-22T00:09:51.525Z · comments (4)
The AI Driver's Licence - A Policy Proposal
Joshua W (sooney) · 2024-07-21T20:38:07.093Z · comments (0)
[link] Demography and Destiny
Zero Contradictions · 2024-07-21T20:34:07.176Z · comments (11)
[link] The $100B plan with "70% risk of killing us all" w Stephen Fry [video]
Oleg Trott (oleg-trott) · 2024-07-21T20:06:39.615Z · comments (8)
[link] Raising Welfare for Lab Rodents
xanderbalwit · 2024-07-21T19:18:41.131Z · comments (0)
A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (14)
Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (10)
[question] Would a scope-insensitive AGI be less likely to incapacitate humanity?
Jim Buhler (jim-buhler) · 2024-07-21T14:15:27.934Z · answers+comments (3)
[link] Holomorphic surjection theorem (Picard's little theorem)
dkl9 · 2024-07-21T13:24:18.300Z · comments (0)
aimless ace analyzes active amateur: a micro-aaaaalignment proposal
lukehmiles (lcmgcd) · 2024-07-21T12:37:39.925Z · comments (0)
Pivotal Acts are easier than Alignment?
Michael Soareverix (michael-soareverix) · 2024-07-21T12:15:12.818Z · comments (4)
Ball Sq Pathways
jefftk (jkaufman) · 2024-07-21T02:20:06.607Z · comments (1)
Freedom and Privacy of Thought Architectures
JohnBuridan · 2024-07-20T21:43:11.419Z · comments (2)
Introduction to Modern Dating: Strategic Dating Advice for beginners
Jesper Lindholm · 2024-07-20T15:45:25.705Z · comments (5)
[link] Why Georgism Lost Its Popularity
Zero Contradictions · 2024-07-20T15:08:41.469Z · comments (47)
[link] Only Fools Avoid Hindsight Bias
Kevin Dorst · 2024-07-20T13:42:35.755Z · comments (4)
A more systematic case for inner misalignment
Richard_Ngo (ricraz) · 2024-07-20T05:03:03.500Z · comments (4)
BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann (Stuckwork) · 2024-07-20T02:20:51.848Z · comments (0)
Krona Compare
jefftk (jkaufman) · 2024-07-20T01:10:03.994Z · comments (0)
(Approximately) Deterministic Natural Latents
johnswentworth · 2024-07-19T23:02:12.306Z · comments (0)
Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
Lidor Banuel Dabbah · 2024-07-19T20:32:15.095Z · comments (6)
[link] JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan (SenR) · 2024-07-19T16:10:54.664Z · comments (8)
[link] Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger · 2024-07-19T14:07:25.162Z · comments (1)
Sustainability of Digital Life Form Societies
Hiroshi Yamakawa (hiroshi-yamakawa) · 2024-07-19T13:59:13.973Z · comments (1)
[link] Romae Industriae
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-19T13:03:31.536Z · comments (2)
[question] Have people given up on iterated distillation and amplification?
Chris_Leong · 2024-07-19T12:23:04.625Z · answers+comments (1)
How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation")
Ruby · 2024-07-19T00:31:38.332Z · comments (21)
[link] Linkpost: Surely you can be serious
kave · 2024-07-18T22:18:09.271Z · comments (7)
My experience applying to MATS 6.0
mic (michael-chen) · 2024-07-18T19:02:21.849Z · comments (3)
[question] What are the actual arguments in favor of computationalism as a theory of identity?
sunwillrise (andrei-alexandru-parfeni) · 2024-07-18T18:44:20.751Z · answers+comments (24)
[link] Yet Another Critique of "Luxury Beliefs"
ymeskhout · 2024-07-18T18:37:28.703Z · comments (10)
[Interim research report] Evaluating the Goal-Directedness of Language Models
Rauno Arike (rauno-arike) · 2024-07-18T18:19:04.260Z · comments (0)
[link] Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
Karolis Jucys (karolis-ramanauskas) · 2024-07-18T17:02:06.179Z · comments (0)
Activation Engineering Theories of Impact
kubanetics (jakub-nowak) · 2024-07-18T16:44:33.656Z · comments (1)
[question] Me & My Clone
SimonBaars (simonbaars) · 2024-07-18T16:25:40.770Z · answers+comments (19)
AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (18)
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Lee Sharkey (Lee_Sharkey) · 2024-07-18T14:15:50.248Z · comments (17)
SAEs (usually) Transfer Between Base and Chat Models
Connor Kissane (ckkissane) · 2024-07-18T10:29:46.138Z · comments (0)