LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Open Source Automated Interpretability for Sparse Autoencoder Features
kh4dien · 2024-07-30T21:11:36.866Z · comments (1)

SB 1047 Is Weakened
Zvi · 2024-06-06T13:40:41.547Z · comments (4)

o1-preview is pretty good at doing ML on an unknown dataset
Håvard Tveit Ihle (havard-tveit-ihle) · 2024-09-20T08:39:49.927Z · comments (1)

Introducing AI-Powered Audiobooks of Rational Fiction Classics
Askwho · 2024-05-04T17:32:49.719Z · comments (14)

An AI Race With China Can Be Better Than Not Racing
niplav · 2024-07-02T17:57:36.976Z · comments (32)

What and Why: Developmental Interpretability of Reinforcement Learning
Garrett Baker (D0TheMath) · 2024-07-09T14:09:40.649Z · comments (4)

Ophiology (or, how the Mamba architecture works)
Danielle Ensign (phylliida-dev) · 2024-04-09T19:31:09.975Z · comments (8)

EIS XIV: Is mechanistic interpretability about to be practically useful?
scasper · 2024-10-11T22:13:51.033Z · comments (4)

"Fractal Strategy" workshop report
Raemon · 2024-04-06T21:26:53.263Z · comments (22)

AE Studio @ SXSW: We need more AI consciousness research (and further resources)
AE Studio (AEStudio) · 2024-03-26T20:59:09.129Z · comments (8)

Don't Share Information Exfohazardous on Others' AI-Risk Models
Thane Ruthenis · 2023-12-19T20:09:06.244Z · comments (11)

Counting AGIs
cash (cshunter) · 2024-11-26T00:06:17.845Z · comments (19)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Joar Skalse (Logical_Lunatic) · 2024-05-17T19:13:31.380Z · comments (10)

Timaeus is hiring!
Jesse Hoogland (jhoogland) · 2024-07-12T23:42:28.651Z · comments (6)

AI #42: The Wrong Answer
Zvi · 2023-12-14T14:50:05.086Z · comments (6)

[link] The economics of space tethers
harsimony · 2024-08-22T16:15:22.699Z · comments (22)

Indecision and internalized authority figures
Kaj_Sotala · 2024-07-06T10:10:02.528Z · comments (1)

OpenAI's Preparedness Framework: Praise & Recommendations
Akash (akash-wasil) · 2024-01-02T16:20:04.249Z · comments (1)

[link] Funding case: AI Safety Camp
Remmelt (remmelt-ellen) · 2023-12-12T09:08:18.911Z · comments (5)

Out-of-distribution Bioattacks
jefftk (jkaufman) · 2023-12-02T12:20:05.626Z · comments (15)

Preventing model exfiltration with upload limits
ryan_greenblatt · 2024-02-06T16:29:33.999Z · comments (21)

How to be an amateur polyglot
arisAlexis (arisalexis) · 2024-05-08T15:08:11.404Z · comments (16)

[question] Will quantum randomness affect the 2028 election?
Thomas Kwa (thomas-kwa) · 2024-01-24T22:54:30.800Z · answers+comments (52)

Understanding SAE Features with the Logit Lens
Joseph Bloom (Jbloom) · 2024-03-11T00:16:57.429Z · comments (0)

Implementing activation steering
Annah (annah) · 2024-02-05T17:51:55.851Z · comments (7)

minutes from a human-alignment meeting
bhauth · 2024-05-24T05:01:53.904Z · comments (4)

[link] Most experts believe COVID-19 was probably not a lab leak
DanielFilan · 2024-02-02T19:28:00.319Z · comments (89)

OpenAI: Altman Returns
Zvi · 2023-11-30T14:10:05.469Z · comments (12)

Friendship is transactional, unconditional friendship is insurance
Ruby · 2024-07-17T22:52:41.967Z · comments (24)

[link] On Shifgrethor
JustisMills · 2024-10-27T15:30:13.688Z · comments (18)

The Third Fundamental Question
Screwtape · 2024-11-15T04:01:33.770Z · comments (7)

Do Not Mess With Scarlett Johansson
Zvi · 2024-05-22T15:10:03.215Z · comments (7)

[link] An Opinionated Evals Reading List
Marius Hobbhahn (marius-hobbhahn) · 2024-10-15T14:38:58.778Z · comments (0)

[question] What's with all the bans recently?
[deleted] · 2024-04-04T06:16:49.062Z · answers+comments (83)

Advice to junior AI governance researchers
Akash (akash-wasil) · 2024-07-08T19:19:07.316Z · comments (1)

[link] How LDT helps reduce the AI arms race
Tamsin Leake (carado-1) · 2023-12-10T16:21:44.409Z · comments (13)

[link] Static Analysis As A Lifestyle
adamShimi · 2024-07-03T18:29:37.384Z · comments (11)

METR is hiring!
Beth Barnes (beth-barnes) · 2023-12-26T21:00:50.625Z · comments (1)

Schelling game evaluations for AI control
Olli Järviniemi (jarviniemi) · 2024-10-08T12:01:24.389Z · comments (5)

2. Corrigibility Intuition
Max Harms (max-harms) · 2024-06-08T15:52:29.971Z · comments (10)

How a chip is designed
YM (Yannick_Muehlhaeuser_duplicate0.05902100825326273) · 2024-06-28T08:04:27.392Z · comments (4)

[link] The Perceptron Controversy
Yuxi_Liu · 2024-01-10T23:07:23.341Z · comments (18)

Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
Seth Herd · 2024-08-05T15:38:09.682Z · comments (22)

SAEs (usually) Transfer Between Base and Chat Models
Connor Kissane (ckkissane) · 2024-07-18T10:29:46.138Z · comments (0)

[Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex (Stefan42) · 2024-07-05T17:05:25.631Z · comments (2)

Interpreting and Steering Features in Images
Gytis Daujotas (gytis-daujotas) · 2024-06-20T18:33:59.512Z · comments (6)

AI #69: Nice
Zvi · 2024-06-20T12:40:02.566Z · comments (9)

A gentle introduction to mechanistic anomaly detection
Erik Jenner (ejenner) · 2024-04-03T23:06:16.778Z · comments (2)

Occupational Licensing Roundup #1
Zvi · 2024-10-30T11:00:04.516Z · comments (11)

On the Debate Between Jezos and Leahy
Zvi · 2024-02-06T14:40:05.487Z · comments (6)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

joseph-miller on Joseph Miller's Shortform

There are two types of people in this world.

There are people who treat the lock on a public bathroom as a tool for communicating occupancy and a safeguard against accidental attempts to enter when the room is unavailable. For these people the standard protocol is to discern the likely state of engagement of the inner room and then tentatively proceed inside if they detect no signs of human activity.

And there are people who view the lock on a public bathroom as a physical barricade with which to temporarily defend possessed territory. They start by giving the the door a hearty push to test the tensile strength of the barrier. On meeting resistance they engage with full force, wringing the handle up and down and slamming into the door with their full body weight. Only once their attempts are thwarted do they reluctantly retreat to find another stall.

cbiddulph on You should consider applying to PhDs (soon!)

Thanks, this post made me seriously consider applying to a PhD, and I strong-upvoted. I had vaguely assumed that PhDs take way too long and don't allow enough access to compute compared to industry AI labs. But considering the long lead time required for the application process and the reminder that you can always take new opportunities as they come up, I now think applying is worth it.

However, looking into it, putting together a high-quality application starting now and finishing by the deadline seems approximately impossible? If the deadline were December 15, that would give you two weeks; other schools like Berkeley have even earlier deadlines. I asked ChatGPT how long it would take to apply to just a single school, and it said it would take 43–59 hours of time spent working, or ~4–6 weeks in real time. Claude said 37-55 hours/4-6 weeks.

Not to discourage anyone from starting their application now if they think they can do it - I guess if you're sufficiently productive and agentic and maybe take some amphetamines, you can do anything. But this seems like a pretty crazy timeline. Just the thought of asking someone to write me a recommendation letter in a two-week timeframe makes me feel bad.

Your post does make me think "if I were going to be applying to a PhD next December, what would I want to do now?" That seems pretty clarifying, and would probably be a helpful frame even if it turns out that a better opportunity comes along and I never apply to a PhD.

I think it'd be a good idea for you to repost this in August or early September of next year!

nicholas-heather-kross on Why and When Interpretability Work is Dangerous

Kinda, my current mainline-doom-case is "some AI gets controlled --> powerful people use it to prop themselves up --> world gets worse until AI gets uncontrollably bad --> doom". I would call it a different yet also-important doom case of "perpetual low-grade-AI dictatorship where the AI is controlled by humans in a surveillance state".

sunwillrise on gwern's Shortform

All of these ideas sound awesome and exciting, and precisely the right kind [LW · GW] of use of LLMs that I would like to see on LW!

sunwillrise on A shot at the diamond-alignment problem

It's looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky thought, and related to this, values are less fragile than people thought 15-20 years ago, in the sense that values generalize far better OOD than people used to think 15-20 years ago

I'm not sure I like this argument very much, as it currently stands. It's not that I believe anything you wrote in this paragraph is wrong per se, but more like this misses the mark a bit in terms of framing.

Yudkowsky had (and, AFAICT, still has) a specific theory [LW · GW] of human values in terms of what they mean in a reductionist [LW · GW] framework, where it makes sense (and is rather natural) to think of (approximate) utility functions [LW · GW] of humans and of Coherent Extrapolated Volition [LW · GW] as things-that-exist-in-the-territory [LW · GW].

I think a lot of writing and analysis, summarized by me here [LW(p) · GW(p)], has cast a tremendous amount of doubt on the viability of this way of thinking and has revealed what seem to me to be impossible-to-patch holes at the core of these theories. I do not believe [LW(p) · GW(p)] "human values" in the Yudkowskian sense ultimately make sense as a coherent concept that carves reality at the joints [LW · GW]; I instead observe a tremendous number of unanswered questions and apparent contradictions [LW(p) · GW(p)] that throw the entire edifice in disarray.

But supplementing this reorientation of thinking around what it means to satisfy human values has been "prosaic" [LW · GW] alignment researchers pivoting more towards intent alignment [LW · GW] as opposed to doomed-from-the-start paradigms like "learning the true human utility function" [LW · GW] or ambitious value learning [LW · GW], a recognition that realism about (AGI) rationality [LW · GW] is likely just straight-up false and that the very specific set of conclusions MIRI-clustered alignment researchers have reached [LW(p) · GW(p)] about what AGI cognition will be like are entirely overconfident and seem contradicted by our modern observations of LLMs [LW(p) · GW(p)], and ultimately an increased focus on the basic observation that full value alignment simply is not required [LW(p) · GW(p)] for a good AI outcome (or at the very least for prevent AI takeover). So it's not so much that human values (to the extent such a thing makes sense) are simpler, but more so that fulfilling those values is just not needed to as nearly a high a degree as people used to think.

nadroj on Mechanistically Eliciting Latent Behaviors in Language Models

Couldn't you do something like fit a Gaussian to the model's activations, then restrict the steered activations to be high likelihood (low Mahalanobis distance)? Or (almost) equivalently, you could just do a whitening transformation to activation space before you constrain the L2 distance of the perturbation.

(If a gaussian isn't expressive enough you could model the manifold in some other way, eg. with a VAE anomaly detector or mixture of gaussians or whatever)

drake-thomas on Is the mind a program?

The theoretical maximum FLOPS of an Earth-bound classical computer is something like .

Is this supposed to have a different base or exponent? A single H100 already gets like $2^{45}$ FLOP/s.

green_leaf on LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that.

Ooh.

ete on Raemon's Shortform

I lean towards an opt-out system for whole post imports? I'd expect the vast majority of relevant authors to be happy with it, and it would offer less inconvenience to readers. Letting an author easily register as "no whole text imports please" seems worthwhile, and maybe if people aren't happy with that switching to opt-in?

seth-herd on gwern's Shortform

I look forward to seeing some of this integrated into LW!

I spend a fair amount of time writing comments that are probably pretty obvious to experienced LWers. I think this is important for getting newer people on-board with historical ideas and arguments. This includes explaining to enthusiastic newbies why they're being voted into oblivion. Giving them some simulated expert comments would be great.

I think we'd see a large improvement in how useful new user's early posts are, and therefore how much they're encouraged to help vs. turned away by downvotes because they're not really contributing. It would be a great way to let new posters do some efficient due diligence without catching up on the entire distributed history of discussion on their chosen topic.

I think it would also be useful for the most knowledgeable authors to have an LLM with cached context/hidden state from the best LW posts and comment sections giving virtual comments before publication.

I'd love to see whatever you've got going internally as an optional feature (if it doesn't cost too much) rather than wait for a finished, integrated feature.