LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

High-level interpretability: detecting an AI's objectives
Paul Colognese (paul-colognese) · 2023-09-28T19:30:16.753Z · comments (4)

[link] We're all in this together
Tamsin Leake (carado-1) · 2023-12-05T13:57:46.270Z · comments (65)

Would You Work Harder In The Least Convenient Possible World?
Firinn · 2023-09-22T05:17:05.148Z · comments (93)

Grokking, memorization, and generalization — a discussion
Kaarel (kh) · 2023-10-29T23:17:30.098Z · comments (11)

[link] Motivation gaps: Why so much EA criticism is hostile and lazy
titotal (lombertini) · 2024-04-22T11:49:59.389Z · comments (5)

[link] Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis
simeon_c (WayZ) · 2024-02-01T21:30:44.090Z · comments (17)

How We Picture Bayesian Agents
johnswentworth · 2024-04-08T18:12:48.595Z · comments (14)

MATS AI Safety Strategy Curriculum
Ryan Kidd (ryankidd44) · 2024-03-07T19:59:37.434Z · comments (2)

AI #79: Ready for Some Football
Zvi · 2024-08-29T13:30:10.902Z · comments (16)

Meetup Tip: Heartbeat Messages
Screwtape · 2023-12-07T17:18:33.582Z · comments (4)

Duct Tape security
Isaac King (KingSupernova) · 2024-04-26T18:57:05.659Z · comments (11)

[New Feature] Your Subscribed Feed
Ruby · 2024-06-11T22:45:00.000Z · comments (8)

When Are Circular Definitions A Problem?
johnswentworth · 2024-05-28T20:00:23.408Z · comments (15)

[link] The 2nd Demographic Transition
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-06T14:10:13.095Z · comments (17)

[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:17.755Z · comments (0)

Different senses in which two AIs can be “the same”
Vivek Hebbar (Vivek) · 2024-06-24T03:16:43.400Z · comments (0)

Hiring: Lighthaven Events & Venue Lead
Raemon · 2023-10-13T21:02:33.212Z · comments (2)

Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example
Stuart_Armstrong · 2023-11-21T11:41:34.798Z · comments (9)

Generalized Stat Mech: The Boltzmann Approach
David Lorell · 2024-04-12T17:47:31.880Z · comments (7)

[link] GPT-4o System Card
Zach Stein-Perlman · 2024-08-08T20:30:52.633Z · comments (11)

Best in Class Life Improvement
sapphire (deluks917) · 2024-04-04T01:51:02.556Z · comments (20)

The Hessian rank bounds the learning coefficient
Lucius Bushnaq (Lblack) · 2024-08-08T20:55:36.960Z · comments (9)

Shard Theory - is it true for humans?
Rishika (rishika-bose) · 2024-06-14T19:21:47.997Z · comments (7)

Estimating Tail Risk in Neural Networks
Mark Xu (mark-xu) · 2024-09-13T20:00:06.921Z · comments (4)

How useful is "AI Control" as a framing on AI X-Risk?
habryka (habryka4) · 2024-03-14T18:06:30.459Z · comments (4)

Text Posts from the Kids Group: 2020
jefftk (jkaufman) · 2024-04-13T22:30:05.326Z · comments (3)

SB 1047 Is Weakened
Zvi · 2024-06-06T13:40:41.547Z · comments (4)

Thoughts On (Solving) Deep Deception
Jozdien · 2023-10-21T22:40:10.060Z · comments (2)

AI #39: The Week of OpenAI
Zvi · 2023-11-23T15:10:04.865Z · comments (8)

AI #42: The Wrong Answer
Zvi · 2023-12-14T14:50:05.086Z · comments (6)

Why Large Bureaucratic Organizations?
johnswentworth · 2024-08-27T18:30:07.422Z · comments (51)

[link] Why not electric trains and excavators?
bhauth · 2023-11-21T00:07:17.967Z · comments (39)

Timaeus is hiring!
Jesse Hoogland (jhoogland) · 2024-07-12T23:42:28.651Z · comments (4)

Brief notes on the Wikipedia game
Olli Järviniemi (jarviniemi) · 2024-07-14T02:28:22.473Z · comments (9)

[link] Open Source Automated Interpretability for Sparse Autoencoder Features
kh4dien · 2024-07-30T21:11:36.866Z · comments (1)

[Valence series] 3. Valence & Beliefs
Steven Byrnes (steve2152) · 2023-12-11T20:21:30.570Z · comments (11)

Ophiology (or, how the Mamba architecture works)
Danielle Ensign (phylliida-dev) · 2024-04-09T19:31:09.975Z · comments (8)

Don't Share Information Exfohazardous on Others' AI-Risk Models
Thane Ruthenis · 2023-12-19T20:09:06.244Z · comments (11)

Introducing AI-Powered Audiobooks of Rational Fiction Classics
Askwho · 2024-05-04T17:32:49.719Z · comments (14)

[link] [Link post] Michael Nielsen's "Notes on Existential Risk from Artificial Superintelligence"
Joel Becker (joel-becker) · 2023-09-19T13:31:02.298Z · comments (12)

[link] Towards Understanding Sycophancy in Language Models
Ethan Perez (ethan-perez) · 2023-10-24T00:30:48.923Z · comments (0)

"Fractal Strategy" workshop report
Raemon · 2024-04-06T21:26:53.263Z · comments (19)

OpenAI's Preparedness Framework: Praise & Recommendations
Akash (akash-wasil) · 2024-01-02T16:20:04.249Z · comments (1)

What and Why: Developmental Interpretability of Reinforcement Learning
Garrett Baker (D0TheMath) · 2024-07-09T14:09:40.649Z · comments (3)

Out-of-distribution Bioattacks
jefftk (jkaufman) · 2023-12-02T12:20:05.626Z · comments (15)

AE Studio @ SXSW: We need more AI consciousness research (and further resources)
AE Studio (AEStudio) · 2024-03-26T20:59:09.129Z · comments (7)

[link] Most experts believe COVID-19 was probably not a lab leak
DanielFilan · 2024-02-02T19:28:00.319Z · comments (89)

How to be an amateur polyglot
arisAlexis (arisalexis) · 2024-05-08T15:08:11.404Z · comments (16)

[link] Funding case: AI Safety Camp
Remmelt (remmelt-ellen) · 2023-12-12T09:08:18.911Z · comments (5)

Preventing model exfiltration with upload limits
ryan_greenblatt · 2024-02-06T16:29:33.999Z · comments (21)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

darklight on Darklight's Shortform

So, a while back I came up with an obscure idea I called the Alpha Omega Theorem and posted it on the Less Wrong forums. Given how there's only one post about it, it shouldn't be something that LLMs would know about. So in the past, I'd ask them "What is the Alpha Omega Theorem?", and they'd always make up some nonsense about a mathematical theory that doesn't actually exist. More recently, Google Gemini and Microsoft Bing Chat would use search to find my post and use that as the basis for their explanation. However, I only have the free version of ChatGPT and Claude, so they don't have access to the Internet and would make stuff up.

A couple days ago I tried the question on ChatGPT again, and GPT-4o managed to correctly say that there isn't a widely known concept of that name in math or science, and basically said it didn't know. Claude still makes up a nonsensical math theory. I also today tried telling Google Gemini not to use search, and it also said it did not know rather than making stuff up.

I'm actually pretty surprised by this. Looks like OpenAI and Google figured out how to reduce hallucinations somehow.

three-monkey-mind on Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more

If even one out of every ten accessibility advocates/experts/etc. did these things, then all these bugs would’ve been fixed years ago.

Maybe you're aware of an OOM more accessibility advocates than I am, but I come across all sorts of well-written blog posts explaining this or that bug, which browser/etc. it happens in, and how to work around it. That's most of the bullet points, although it might not be in the bug tracker of choice for the project.

What people aren't doing, as far as I have seen, is starting pooled-funds bug bounties for these things. People pass the collection plate for childhood cancer, especially since I'm told that September is Childhood Cancer Awareness Month, but not bugfixing.

This is not insensible: all sorts of people tend to be unwilling to set aside the cost of a new cell phone to fix one bug apiece that, generally speaking, is encountered in one's day job.

And there are a lot of accessibility bugs out there, some of which are quite old. I can only assume that accessibility bugs aren't treated massively more seriously than anything else in the WebKit or Firefox Bugzillas.

While the world would be a better place if bug-bounty collection plates were more popular, I can see why they're not as popular as I'd like.

gunnar_zarncke on [Intuitive self-models] 1. Preliminaries

…So that’s all that’s needed. If any system has both a capacity for endogenous action (motor control, attention control, etc.), and a generic predictive learning algorithm, that algorithm will be automatically incentivized to develop generative models about itself (both its physical self and its algorithmic self), in addition to (and connected to) models about the outside world.

Yes, and there are many different classes of such models. Most of them boring because the prediction of the effect of the agent on the environment is limited (small effect or low data rate) or simple (linear-ish or more-is-better-like).

But the self-models of social animals will quickly grow complex because the prediction of the action on the environment includes elements in the environment - other members of the species - that themselves predict the actions of other members.

You don't mention it, but I think Theory of Mind [? · GW] or Emphatic Inference [? · GW] play a large role in the specific flavor of human self-models.

silasbarta on Austin, TX Petrov Day and Potluck 2024

FYI there’s some backup on westbound William Cannon just before the turn into thr neighborhood at James Ranch Rd.

gb on Inquisitive vs. adversarial rationality

Not for this kind of fact, I’m afraid – my experience is that in answering questions like these, LLMs typically do no better than an educated guess. There are just way too many people stating their educated legal guesses as fact in the corpus, so it gets hard to distinguish.

sil-ver on [Intuitive self-models] 1. Preliminaries

Mhh, I think "it's not possible to solve (1) without also solving (2)" is equivalent to "every solution to (1) also solves (2)", which is equivalent to "(1) is sufficient for (2)". I did take some liberty in rephrasing step (2) from "figure out what consciousness is" to "figure out its computational implementation".

michael-pearce on Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

On the question of quantizing different feature activations differently: Computing the description length using the entropy of a feature activation's probability distribution is flexible enough to distinguish different types of distributions. For example, binary distributions would have a entropy of one bit and more continuous distributions would have larger entropies.

In our methodology, the effective float precision matters because it sets the bin width for the histogram of a feature's activations that is then used to compute the entropy. We used the same effective float precision for all features, which was found by rounding activations to different precisions until the reconstruction or cross-entropy loss is changed by some amount.

christiankl on Inquisitive vs. adversarial rationality

When it comes to trying to understand basic facts like how legal systems work LLMs make it easy to get an overview.

review-bot on [April Fools'] Definitive confirmation of shard theory

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

ymeskhout on Pronouns are Annoying

The whole point of the sentence was to demonstrate how bad ambiguity can get with pronouns, and this exchange is demonstrating my point exactly. The issue might be that you're making some (very reasonable) assumptions without noticing it narrows the range of possible interpretations. The only unambiguous part of the sentence is "John told Mark", but every other he can be either John or Mark.

Edit: my apologies for any rude tone, it was not intentional. All of us necessarily make reasonable assumptions to narrow ambiguity in our day to day conversations and it can be hard to completely jettison the habit.