LessWrong 2.0 Reader

Natural Latents Are Not Robust To Tiny Mixtures
johnswentworth · 2024-06-07T18:53:36.643Z · comments (8)

The proper response to mistakes that have harmed others?
Ruby · 2023-12-31T04:06:31.505Z · comments (12)

A civilization ran by amateurs
Olli Järviniemi (jarviniemi) · 2024-05-30T17:57:32.601Z · comments (7)

Balancing Games
jefftk (jkaufman) · 2024-02-24T14:40:04.237Z · comments (18)

AI Safety Chatbot
markov (markovial) · 2023-12-21T14:06:48.981Z · comments (11)

Managing risks while trying to do good
Wei Dai (Wei_Dai) · 2024-02-01T18:08:46.506Z · comments (26)

[Intuitive self-models] 4. Trance
Steven Byrnes (steve2152) · 2024-10-08T13:30:41.446Z · comments (6)

Offering AI safety support calls for ML professionals
Vael Gates · 2024-02-15T23:48:12.797Z · comments (1)

[link] DeepMind: Evaluating Frontier Models for Dangerous Capabilities
Zach Stein-Perlman · 2024-03-21T03:00:31.599Z · comments (8)

Social status part 2/2: everything else
Steven Byrnes (steve2152) · 2024-03-05T16:29:19.072Z · comments (2)

Balsa Update and General Thank You
Zvi · 2023-12-12T20:30:03.980Z · comments (8)

Self-explaining SAE features
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-05T22:20:36.041Z · comments (13)

Originality vs. Correctness
alkjash · 2023-12-06T18:51:49.531Z · comments (17)

MATS Alumni Impact Analysis
utilistrutil · 2024-09-30T02:35:57.273Z · comments (6)

5 Physics Problems
DaemonicSigil · 2024-03-18T08:05:45.971Z · comments (0)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (9)

[link] Results from an Adversarial Collaboration on AI Risk (FRI)
Josh Rosenberg (josh-rosenberg) · 2024-03-11T20:00:24.642Z · comments (3)

There Should Be More Alignment-Driven Startups
Vaniver · 2024-05-31T02:05:06.799Z · comments (14)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

Base LLMs refuse too
Connor Kissane (ckkissane) · 2024-09-29T16:04:21.343Z · comments (20)

"Epistemic range of motion" and LessWrong moderation
habryka (habryka4) · 2023-11-27T21:58:40.834Z · comments (3)

Interdictor Ship
lsusr · 2024-08-19T04:59:18.487Z · comments (9)

[link] Linkpost: Memorandum on Advancing the United States’ Leadership in Artificial Intelligence
Nisan · 2024-10-25T04:37:00.828Z · comments (2)

On OpenAI Dev Day
Zvi · 2023-11-09T16:10:06.646Z · comments (0)

[link] Is Claude a mystic?
jessicata (jessica.liu.taylor) · 2024-06-07T04:27:09.118Z · comments (23)

[question] What do we know about the AI knowledge and views, especially about existential risk, of the new OpenAI board members?
Zvi · 2024-03-11T14:55:05.128Z · answers+comments (2)

Approaching Human-Level Forecasting with Language Models
Fred Zhang (fred-zhang) · 2024-02-29T22:36:34.012Z · comments (6)

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II
Lester Leong (lester-leong) · 2024-10-14T04:05:05.096Z · comments (9)

What is "True Love"?
johnswentworth · 2024-08-18T16:05:47.358Z · comments (9)

Raemon's Deliberate (“Purposeful?”) Practice Club
Raemon · 2023-11-14T18:24:19.335Z · comments (11)

0th Person and 1st Person Logic
Adele Lopez (adele-lopez-1) · 2024-03-10T00:56:14.446Z · comments (28)

An Actually Intuitive Explanation of the Oberth Effect
Isaac King (KingSupernova) · 2024-01-10T20:23:17.216Z · comments (33)

Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (3)

Toward Safety Cases For AI Scheming
Mikita Balesni (mykyta-baliesnyi) · 2024-10-31T17:20:06.019Z · comments (1)

[link] Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso (rhaps0dy) · 2024-07-25T22:00:55.398Z · comments (8)

[link] Linkpost: Surely you can be serious
kave · 2024-07-18T22:18:09.271Z · comments (8)

D&D.Sci: The Mad Tyrant's Pet Turtles
abstractapplic · 2024-03-29T16:22:13.732Z · comments (18)

Measuring Coherence of Policies in Toy Environments
dx26 (dylan-xu) · 2024-03-18T17:59:08.118Z · comments (9)

New paper shows truthfulness & instruction-following don't generalize by default
joshc (joshua-clymer) · 2023-11-19T19:27:30.735Z · comments (0)

What's next for the field of Agent Foundations?
Nora_Ammann · 2023-11-30T17:55:13.982Z · comments (23)

AI #48: Exponentials in Geometry
Zvi · 2024-01-18T14:20:07.869Z · comments (9)

[link] shoes with springs
bhauth · 2023-12-30T21:46:55.319Z · comments (6)

Thoughts on SB-1047
ryan_greenblatt · 2024-05-29T23:26:14.392Z · comments (1)

[link] More people getting into AI safety should do a PhD
AdamGleave · 2024-03-14T22:14:48.855Z · comments (24)

LessOnline Festival Updates Thread
Ben Pace (Benito) · 2024-04-18T21:55:08.003Z · comments (26)

The Sense Of Physical Necessity: A Naturalism Demo (Introduction)
LoganStrohl (BrienneYudkowsky) · 2024-02-24T02:56:31.458Z · comments (1)

AI #81: Alpha Proteo
Zvi · 2024-09-12T13:00:07.958Z · comments (3)

Does AI risk “other” the AIs?
Joe Carlsmith (joekc) · 2024-01-09T17:51:47.020Z · comments (3)

[link] Towards shutdownable agents via stochastic choice
EJT (ElliottThornley) · 2024-07-08T10:14:24.452Z · comments (7)

[link] An Opinionated Evals Reading List
Marius Hobbhahn (marius-hobbhahn) · 2024-10-15T14:38:58.778Z · comments (0)

Recent comments

anthonyc on How to cite LessWrong as an academic source?

I'm curious which kinds of posts you're looking to cite, for what kinds of use in a dissertation for what field.

Looking over the site as a whole, different posts should (IMO) be regarded as akin to primary sources, news sources, non-peer-reviewed academic papers, whitepapers, symposia, textbook chapters, or professional sources, depending on the author and epistemic status.

In other words, this isn't "a site" for this purpose, it's a forum that hosts many kinds of content in a conveniently cross-referenceable format, some but not all of which is of a suitable standard for referencing for some but not all academic uses. This should at least be similar to how your professors think about other kinds of citations. Someone might cite a doctor's published case study as part of the background research on some disease, or the NYT's publication of a summary of the Pentagon Papers in regard to the history of First Amendment jurisprudence, or a corporate whitepaper or other report as a source of data about an industry.

avturchin on How to cite LessWrong as an academic source?

I cited it as a blog. This option is available on Zotero.

anthonyc on Occupational Licensing Roundup #1

I think the things we need in those scenarios are permits and inspections before completing the connection/installation of the new stuff, which I have always needed in any state I've lived in, in addition to needing to use licensed plumbers and electricians.

martinsq on Winning isn't enough

My understanding from discussions with the authors (but please correct me):

This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.

Maybe it's easiest if I explain what this post grows out of:

There seems to be a widespread vibe amongst rationalists that "one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply win". This vibe is no coincidence, since Eliezer and Nate, in some of their writing about FDT, use language strongly implying that decision theory A is objectively better than decision theory B because it just wins more. Unfortunately, this intuitive notion of winning cannot actually be made into a philosophically valid objective metric. (In more detail, a precise definition of winning is already decision-theory-complete, so these arguments beg the question.) This point is well-known in philosophical academia, and was already succinctly explained in a post by Caspar (which the authors mention).

In the current post, the authors extend a similar philosophical critique to other widespread uses of winning, or background assumptions about rationality. For example, some people say that "winning is about not playing dominated strategies"... and the authors agree about avoiding dominated strategies, but point out that this is not too action-guiding, because it is consistent with many policies. Or also, some people say that "rationality is about implementing the heuristics that have worked well in the past, and/or you think will lead to good future performance"... but these utterances hide other philosophical assumptions, like assuming the same mechanisms are at play in the past and future, which are especially tenuous for big problems like x-risk. Thus, vague references to winning aren't enough to completely pin down and justify behavior. Instead, we fundamentally need additional constraints or principles about normativity, what the authors call non-pragmatic principles. Of course, these principles cannot themselves be justified in terms of past performance (which would lead to circularity), so they instead need to be taken as normative axioms (just like we need ethical axioms, because ought cannot be derived from is).

nicholas-goldowsky-dill on Nicholas Goldowsky-Dill's Shortform

One question I have about the Mechanistic Interpretability safety case sketch from Roger Grosse: How do you actually find features of high concern?

Features of High Concern are things like "intent to carry out a harmful plan". By definition, they very rarely activate on benign deployment outputs. This must be the case as "if the features [of high concern] activate more than a handful of times in deployment (internal or external), we discontinue the model and roll back to a weaker model." [M1.1]

They propose to find these features [S1.1] by performing dictionary learning of some form, and then some combination of:

- autointerp on the learned features – which currently requires examples of the feature activating. I guess you can also look at attribution of bad outcomes to features?
- looking at what features activate "when the model is instructed or perturbed to be misaligned."

Generally dictionary learning only learns features that actually occur on some inputs. So you need to do something outside of the normal deployment distribution. Some ideas (many alluded to in the document):

- Learn the features on the base model (or an ‘agentic-only’ version of the model) and then hope they transfer to the safety-trained model
- Learn the features by gathering the activations on off-policy transcripts of harmful behavior. This is what Fabien Roger does in coup-probes
- Prompting the model to behave badly. This probably can’t be sufficient for features of high concern – you don’t want your model to take catastrophic actions even when instructed to do so! Maybe if you catch jailbreaks with a separate mechanism that you disable?
- Perturbing the model to behave badly. For example:
  - Something in the genre of Mack and Turner's mechanistically eliciting latent behaviors – e.g. this works to discover sleeper agents.
  - Maybe steering for related features? It’s not clear to me how you do dictionary learning on a steered model and then apply those features to the unsteered model. But something in this genre seems plausible.

The situation is easier for features of medium concern, which are only problematic in certain contexts. Ideally you can learn the features in benign contexts and they’ll transfer to catastrophic contexts. For example, it’s okay if your model lies when roleplaying as a shady car salesman, when playing mafia, or in other low-stakes scenarios. If you can use these to learn a “deception” feature it might activate when the model assures you that some backdoored code is safe. You’d still want to validate this transfer empirically!

I think it would be valuable to think about how this could work in specific scenarios. That is, specify different combinations of

- Feature of high concern
- High level description of a misaligned model & its behavior
- Catastrophic action (that would involve the feature of high concern)

Then try to answer the question of how you'd learn the feature such that it only activates in deployment when the model is behaving catastrophically. Do this for examples like

- Scheming model intentionally inserting a backdoor into the codebase of the inference server
- Model trained with poisoned data that behaves badly in response to some particular trigger
- Oversight model that is colluding with the model it monitors, and doesn't flag an unsafe action by the monitored model

All of these strategies rely on the hope of feature transfer to the on-policy deployment distribution. It seems possible to get enough evidence to think it's likely these features transfer but difficult to have enough evidence to be highly confident. As they mention, having multiple realistic model organisms seems necessary for validation. Where do you get model organisms? Maybe

- Model organisms of data poisoning – seem doable (e.g. sleeper agents)
- Model organisms of scheming – seems difficult to make highly realistic but worth trying
- Catch the first model that schemes against you and use it as a model organism – a risky game to be playing! But this is a good reason to pursue control strategies.

directedevolution on The Median Researcher Problem

In academic biomedicine, at least, which is where I work, it’s all about tech dev. Most of the development is based on obvious signals and conceptual clarity. Yes, we do study biological systems, but that comes after years, even decades, of building the right tools to get a crushingly obvious signal out of the system of interest. Until that point all the data is kind of a hint of what we will one day have clarity on rather than a truly useful stepping stone towards it. Have as much statistical rigor as you like, but if your methods aren’t good enough to deliver the data you need, it just doesn’t matter. Which is why people read titles, not figure footnotes: it’s the big ideas that really matter, and the labor going on in the labs themselves. Papers are in a way just evidence of work being done.

That’s why I sometimes worry about LessWrong. Participants who aren’t professionally doing research and spend a lot of time critiquing papers over niche methodological issues may be misallocating their attention, or searching under the spotlight. The interesting thing is growth in our ability to measure and manipulate phenomena, not the exact analysis method in one paper or another. What’s true will eventually become crushingly obvious and you won’t need fancy statistics at that point, and before then the data will be crap so the fancy statistics won’t be much use. Obviously there’s a middle ground, but I think the vast majority of time is spent in the “too early to tell” or “everybody knows that” phase. If you can’t participate in that technology development in some way, I am not sure it’s right to say you are “outperforming” anything.

benquo on Scissors Statements for President?

X and Y are cooperating to contain people who object-level care about A and B, and recruit them into the dialectic drama. X is getting A wrong on purpose, and Y is getting B wrong on purpose, as a loyalty test. Trying to join the big visible org doing something about A leads to accepting escalating conditioning to develop the blind spot around B, and vice versa.

X and Y use the conflict as a pretext to expropriate resources from the relatively uncommitted. For instance, one way to interpret political polarization in the US is as a scam for the benefit of people who profit from campaign spending. War can be an excuse to subsidize armies. Etc.

I wrote about this here: http://benjaminrosshoffman.com/discursive-warfare-and-faction-formation/

tsvibt on Scissors Statements for President?

IDK, but I'll note that IME, calling for empathy for "the other side" (in either direction) is received with incuriosity / indifference at best, often hostility.

One thing that stuck with me is one of those true crime YouTube videos, where at some stage of the interrogation, the investigator stops being nice, and instead will immediately and harshly contradict anything that the suspect Bob is saying to paint a story where he's innocent. The commentator claimed that the reason the investigator does this is to avoid giving Bob confidence: if Bob's statements hung in the air unchallenged, Bob might think he's successfully creating a narrative and getting that narrative bought. Even if the investigator is not in danger of being fooled (e.g. because she already has video evidence contradicting some of Bob's statements), Bob might get more confident and spend more time lying instead of just confessing.

A conjecture is that for Susan, empathizing with Robert seems like giving room for him to gain more political steam; and the deeper the empathy, the more room you're giving Robert.

annasalamon on Scissors Statements for President?

If we can get good enough models of however the scissors-statements actually work, we might be able to help more people be more in touch with the common humanity of both halves of the country, and more able to heal blind spots.

E.g., if the above model is right, maybe we could tell at least some people "try exploring the hypothesis that Y-voters are not so much in favor of Y, as against X -- and that you're right about the problems with Y, but they might be able to see something that you and almost everyone you talk to is systematically blinded to about X."

We can build a useful genre-savviness about common/destructive meme patterns and how to counter them, maybe. LessWrong is sort of well-positioned to be a leader there: we have analytic strength, and aren't too politically mindkilled.

moisentinel on Going Beyond "immaturity"

Do you have any tips on improving my writing, or my worldview?