LessWrong 2.0 Reader

[link] Jailbreaking language models with user roleplay
loops (smitop) · 2024-09-28T23:43:10.870Z · comments (0)

An open response to Wittkotter and Yampolskiy
Donald Hobson (donald-hobson) · 2024-09-24T22:27:21.987Z · comments (0)

The Geometric Importance of Side Payments
StrivingForLegibility · 2024-08-07T01:38:04.635Z · comments (4)

[link] Approval-Seeking ⇒ Playful Evaluation
Jonathan Moregård (JonathanMoregard) · 2024-08-28T21:03:51.244Z · comments (0)

Steering LLMs' Behavior with Concept Activation Vectors
Ruixuan Huang (sprout_ust) · 2024-09-28T09:53:19.658Z · comments (0)

MIT FutureTech are hiring for a Head of Operations role
peterslattery · 2024-10-02T17:11:42.960Z · comments (0)

[link] AI Safety at the Frontier: Paper Highlights, July '24
gasteigerjo · 2024-08-05T13:00:46.028Z · comments (0)

[link] Contagious Beliefs—Simulating Political Alignment
James Stephen Brown (james-brown) · 2024-10-13T00:27:08.084Z · comments (0)

[link] [Linkpost] Automated Design of Agentic Systems
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-19T23:06:06.669Z · comments (1)

[question] What makes one a "rationalist"?
mathyouf · 2024-10-08T20:25:21.812Z · answers+comments (5)

[link] [Linkpost] Hawkish nationalism vs international AI power and benefit sharing
jakub_krys (kryjak) · 2024-10-18T18:13:19.425Z · comments (5)

[question] What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?
Roko · 2024-10-19T06:11:12.602Z · answers+comments (16)

The Personal Implications of AGI Realism
xizneb · 2024-10-20T16:43:37.870Z · comments (7)

A brief theory of why we think things are good or bad
David Johnston (david-johnston) · 2024-10-20T20:31:26.309Z · comments (10)

Piling bounded arguments
momom2 (amaury-lorin) · 2024-09-19T22:27:41.534Z · comments (0)

A Brief Explanation of AI Control
Aaron_Scher · 2024-10-22T07:00:56.954Z · comments (1)

Sequence overview: Welfare and moral weights
MichaelStJules · 2024-08-15T04:22:32.567Z · comments (0)

Funding for programs and events on global catastrophic risk, effective altruism, and other topics
abergal · 2024-08-14T23:59:48.146Z · comments (0)

[link] October 2024 Progress in Guaranteed Safe AI
Quinn (quinn-dougherty) · 2024-10-28T23:34:51.689Z · comments (0)

Enhancing Mathematical Modeling with LLMs: Goals, Challenges, and Evaluations
ozziegooen · 2024-10-28T21:44:42.352Z · comments (0)

[question] somebody explain the word "epistemic" to me
KvmanThinking (avery-liu) · 2024-10-28T16:40:24.275Z · answers+comments (8)

Quantitative Trading Bootcamp [Nov 6-10]
Ricki Heicklen (bayesshammai) · 2024-10-28T18:39:58.480Z · comments (0)

Moral Trade, Impact Distributions and Large Worlds
Larks · 2024-09-20T03:45:56.273Z · comments (0)

[link] Validating / finding alignment-relevant concepts using neural data
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-20T21:12:49.267Z · comments (0)

Fake Blog Posts as a Problem Solving Device
silentbob · 2024-08-31T09:22:54.513Z · comments (0)

[link] Boons and banes
dkl9 · 2024-09-23T06:18:38.335Z · comments (0)

[question] On the subject of in-house large language models versus implementing frontier models
Annapurna (jorge-velez) · 2024-09-23T15:00:32.811Z · answers+comments (1)

Denver USA - ACX Meetups Everywhere Fall 2024
Eneasz · 2024-08-29T18:40:53.332Z · comments (0)

[link] Checking public figures on whether they "answered the question" quick analysis from Harris/Trump debate, and a proposal
david reinstein (david-reinstein) · 2024-09-11T20:25:27.845Z · comments (4)

Broadly human level, cognitively complete AGI
p.b. · 2024-08-06T09:26:13.220Z · comments (0)

[question] Does a time-reversible physical law/Cellular Automaton always imply the First Law of Thermodynamics?
Noosphere89 (sharmake-farah) · 2024-08-30T15:12:28.823Z · answers+comments (11)

[question] If I ask an LLM to think step by step, how big are the steps?
ryan_b · 2024-09-13T20:30:50.558Z · answers+comments (1)

Of Birds and Bees
RussellThor · 2024-09-30T10:52:15.069Z · comments (9)

[link] In-Context Learning: An Alignment Survey
alamerton · 2024-09-30T18:44:28.589Z · comments (0)

Foresight Vision Weekend 2024
Allison Duettmann (allison-duettmann) · 2024-10-01T21:59:55.107Z · comments (0)

[link] Consciousness As Recursive Reflections
Gunnar_Zarncke · 2024-10-05T20:00:53.053Z · comments (3)

Testing "True" Language Understanding in LLMs: A Simple Proposal
MtryaSam · 2024-11-02T19:12:34.710Z · comments (2)

[link] Is Redistributive Taxation Justifiable? Part 1: Do the Rich Deserve their Wealth?
Alexander de Vries (alexander-de-vries) · 2024-09-05T10:23:08.958Z · comments (20)

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM
Winnie Yang (winnie-yang) · 2024-08-28T08:41:38.967Z · comments (2)

The Great Bootstrap
KristianRonn · 2024-10-11T19:46:51.752Z · comments (0)

Join my new subscriber chat
sarahconstantin · 2024-11-06T02:30:11.059Z · comments (0)

One person's worth of mental energy for AI doom aversion jobs. What should I do?
Lorec · 2024-08-26T01:29:01.700Z · comments (16)

Prediction markets and Taxes
Edmund Nelson (edmund-nelson) · 2024-11-01T17:39:35.191Z · comments (7)

[link] Thinking LLMs: General Instruction Following with Thought Generation
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-10-15T09:21:22.583Z · comments (0)

[link] Taking nonlogical concepts seriously
Kris Brown (kris-brown) · 2024-10-15T18:16:01.226Z · comments (5)

[link] Species as Canonical Referents of Super-Organisms
Yudhister Kumar (randomwalks) · 2024-10-18T07:49:52.944Z · comments (8)

[link] Metaculus's 'Minitaculus' Experiments — Collaborate With Us
ChristianWilliams · 2024-08-26T20:44:32.125Z · comments (0)

Goal: Understand Intelligence
Johannes C. Mayer (johannes-c-mayer) · 2024-11-03T21:20:02.900Z · comments (8)

Thirty random thoughts about AI alignment
Lysandre Terrisse · 2024-09-15T16:24:10.572Z · comments (1)

Retrieval Augmented Genesis
João Ribeiro Medeiros (joao-ribeiro-medeiros) · 2024-10-01T20:18:01.836Z · comments (0)

Recent comments

anthonyc on How to cite LessWrong as an academic source?

I'm curious which kinds of posts you're looking to cite, for what kinds of use in a dissertation for what field.

Looking over the site as a whole, different posts should (IMO) be regarded as akin to primary sources, news sources, non-peer-reviewed academic papers, whitepapers, symposia, textbook chapters, or professional sources, depending on the author and epistemic status.

In other words, this isn't "a site" for this purpose, it's a forum that hosts many kinds of content in a conveniently cross-referenceable format, some but not all of which is of a suitable standard for referencing for some but not all academic uses. This should at least be familiar from how your professors think about other kinds of citations. Someone might cite a doctor's published case study as part of the background research on some disease, or the NYT's publication of a summary of the Pentagon Papers in regard to the history of First Amendment jurisprudence, or a corporate whitepaper or other report as a source of data about an industry.

avturchin on How to cite LessWrong as an academic source?

I cited it as a blog. This option is available on Zotero.

anthonyc on Occupational Licensing Roundup #1

I think the things we need in those scenarios are permits and inspections before completing the connection/installation of the new stuff, which I have always needed in any state I've lived in, in addition to needing to use licensed plumbers and electricians.

martinsq on Winning isn't enough

My understanding from discussions with the authors (but please correct me):

This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.

Maybe it's easiest if I explain what this post grows out of:

There seems to be a widespread vibe amongst rationalists that "one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply win". This vibe is no coincidence, since Eliezer and Nate, in some of their writing about FDT, use language strongly implying that decision theory A is objectively better than decision theory B because it just wins more. Unfortunately, this intuitive notion of winning cannot actually be made into a philosophically valid objective metric. (In more detail, a precise definition of winning is already decision-theory-complete, so these arguments beg the question.) This point is well-known in philosophical academia, and was already succinctly explained in a post by Caspar (which the authors mention).

In the current post, the authors extend a similar philosophical critique to other widespread uses of winning, or background assumptions about rationality. For example, some people say that "winning is about not playing dominated strategies"... and the authors agree about avoiding dominated strategies, but point out that this is not too action-guiding, because it is consistent with many policies. Or also, some people say that "rationality is about implementing the heuristics that have worked well in the past, and/or you think will lead to good future performance"... but these utterances hide other philosophical assumptions, like assuming the same mechanisms are at play in the past and future, which are especially tenuous for big problems like x-risk. Thus, vague references to winning aren't enough to completely pin down and justify behavior. Instead, we fundamentally need additional constraints or principles about normativity, what the authors call non-pragmatic principles. Of course, these principles cannot themselves be justified in terms of past performance (which would lead to circularity), so they instead need to be taken as normative axioms (just like we need ethical axioms, because ought cannot be derived from is).

nicholas-goldowsky-dill on Nicholas Goldowsky-Dill's Shortform

One question I have about the Mechanistic Interpretability safety case sketch from Roger Grosse: How do you actually find features of high concern?

Features of High Concern are things like "intent to carry out a harmful plan". By definition, they very rarely activate on benign deployment outputs. This must be the case as "if the features [of high concern] activate more than a handful of times in deployment (internal or external), we discontinue the model and roll back to a weaker model." [M1.1]
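The rollback rule quoted above can be pictured as a simple counter over deployment outputs. This is only a minimal sketch under stated assumptions: the class name, the feature name `harmful_plan_intent`, the activation cutoff, and the value standing in for "a handful" are all hypothetical and do not come from the safety case sketch itself.

```python
from dataclasses import dataclass, field


@dataclass
class HighConcernMonitor:
    """Toy monitor for the quoted rollback rule (all names hypothetical)."""

    activation_threshold: float  # assumed cutoff for counting a feature as "active"
    max_activations: int         # assumed stand-in for "a handful of times"
    count: int = 0
    flagged: list = field(default_factory=list)

    def observe(self, request_id: str, feature_activations: dict) -> bool:
        """Record one deployment output; return True once rollback is triggered."""
        fired = [name for name, value in feature_activations.items()
                 if value >= self.activation_threshold]
        if fired:
            self.count += 1
            self.flagged.append((request_id, fired))
        return self.count > self.max_activations


# Made-up activation values, purely for illustration.
monitor = HighConcernMonitor(activation_threshold=0.5, max_activations=2)
stream = [
    ("req-1", {"harmful_plan_intent": 0.0}),
    ("req-2", {"harmful_plan_intent": 0.9}),
    ("req-3", {"harmful_plan_intent": 0.7}),
    ("req-4", {"harmful_plan_intent": 0.1}),
    ("req-5", {"harmful_plan_intent": 0.8}),
]
rollback_at = None
for request_id, activations in stream:
    if monitor.observe(request_id, activations):
        rollback_at = request_id  # the point where the model would be discontinued
        break
```

The sketch makes the constraint concrete: with a budget this small, the feature has to stay essentially silent on benign outputs, otherwise false positives alone would force a rollback.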

They propose to find these features [S1.1] by performing dictionary learning of some form, and then some combination of:

- autointerp on the learned features – which currently requires examples of the feature activating. I guess you can also look at attribution of bad outcomes to features?
- looking at what features activate "when the model is instructed or perturbed to be misaligned."

Generally dictionary learning only learns features that actually occur on some inputs. So you need to do something outside of the normal deployment distribution. Some ideas (many alluded to in the document):

- Learn the features on the base model (or an ‘agentic-only’ version of the model) and then hope they transfer to the safety-trained model
- Learn the features by gathering the activations on off-policy transcripts of harmful behavior. This is what Fabien Roger does in coup-probes
- Prompting the model to behave badly. This probably can’t be sufficient for features of high concern – you don’t want your model to take catastrophic actions even when instructed to do so! Maybe if you catch jailbreaks with a separate mechanism that you disable?
- Perturbing the model to behave badly. For example:
  - Something in the genre of Mack and Turner's mechanistically eliciting latent behaviors – e.g. this works to discover sleeper agents.
  - Maybe steering for related features? It’s not clear to me how you do dictionary learning on a steered model and then apply those features to the unsteered model. But something in this genre seems plausible.

The situation is easier for features of medium concern, which are only problematic in certain contexts. Ideally you can learn the features in benign contexts and they’ll transfer to catastrophic contexts. For example, it’s okay if your model lies when roleplaying as a shady car salesman, when playing mafia, or in other low-stakes scenarios. If you can use these to learn a “deception” feature it might activate when the model assures you that some backdoored code is safe. You’d still want to validate this transfer empirically!

I think it would be valuable to think about how this could work in specific scenarios. That is, specify different combinations of

- Feature of high concern
- High level description of a misaligned model & its behavior
- Catastrophic action (that would involve the feature of high concern)

Then try to answer the question of how you'd learn the feature such that it only activates in deployment when the model is behaving catastrophically. Do this for examples like

- Scheming model intentionally inserting a backdoor into the codebase of the inference server
- Model trained with poisoned data that behaves badly in response to some particular trigger
- Oversight model that is colluding with the model it monitors, and doesn't flag an unsafe action by the monitored model

All of these strategies rely on the hope of feature transfer to the on-policy deployment distribution. It seems possible to get enough evidence to think it's likely these features transfer but difficult to have enough evidence to be highly confident. As they mention, having multiple realistic model organisms seems necessary for validation. Where do you get model organisms? Maybe

- Model organisms of data poisoning – seem doable (e.g. sleeper agents)
- Model organisms of scheming – seems difficult to make highly realistic but worth trying
- Catch the first model that schemes against you and use it as a model organism – a risky game to be playing! But this is a good reason to pursue control strategies.

directedevolution on The Median Researcher Problem

In academic biomedicine, at least, which is where I work, it’s all about tech dev. Most of the development is based on obvious signals and conceptual clarity. Yes, we do study biological systems, but that comes after years, even decades, of building the right tools to get a crushingly obvious signal out of the system of interest. Until that point all the data is kind of a hint of what we will one day have clarity on rather than a truly useful stepping stone towards it. Have as much statistical rigor as you like, but if your methods aren’t good enough to deliver the data you need, it just doesn’t matter. Which is why people read titles, not figure footnotes: it’s the big ideas that really matter, and the labor going on in the labs themselves. Papers are in a way just evidence of work being done.

That’s why I sometimes worry about LessWrong. Participants who aren’t professionally doing research and spend a lot of time critiquing papers over niche methodological issues may be misallocating their attention, or searching under the spotlight. The interesting thing is growth in our ability to measure and manipulate phenomena, not the exact analysis method in one paper or another. What’s true will eventually become crushingly obvious and you won’t need fancy statistics at that point, and before then the data will be crap so the fancy statistics won’t be much use. Obviously there’s a middle ground, but I think the vast majority of time is spent in the “too early to tell” or “everybody knows that” phase. If you can’t participate in that technology development in some way, I am not sure it’s right to say you are “outperforming” anything.

benquo on Scissors Statements for President?

X and Y are cooperating to contain people who object-level care about A and B, and recruit them into the dialectic drama. X is getting A wrong on purpose, and Y is getting B wrong on purpose, as a loyalty test. Trying to join the big visible org doing something about A leads to accepting escalating conditioning to develop the blind spot around B, and vice versa.

X and Y use the conflict as a pretext to expropriate resources from the relatively uncommitted. For instance, one way to interpret political polarization in the US is as a scam for the benefit of people who profit from campaign spending. War can be an excuse to subsidize armies. Etc.

I wrote about this here: http://benjaminrosshoffman.com/discursive-warfare-and-faction-formation/

tsvibt on Scissors Statements for President?

IDK, but I'll note that IME, calling for empathy for "the other side" (in either direction) is received with incuriosity / indifference at best, often hostility.

One thing that stuck with me is one of those true crime Youtube videos, where at some stage of the interrogation, the investigator stops being nice, and instead will immediately and harshly contradict anything that the suspect Bob is saying to paint a story where he's innocent. The commentator claimed that the reason the investigator does this is to avoid giving Bob confidence: if Bob's statements hung in the air unchallenged, Bob might think he's successfully creating a narrative and getting that narrative bought. Even if the investigator is not in danger of being fooled (e.g. because she already has video evidence contradicting some of Bob's statements), Bob might get more confident and spend more time lying instead of just confessing.

A conjecture is that for Susan, empathizing with Robert seems like giving room for him to gain more political steam; and the deeper the empathy, the more room you're giving Robert.

annasalamon on Scissors Statements for President?

If we can get good enough models of however the scissors-statements actually work, we might be able to help more people be more in touch with the common humanity of both halves of the country, and more able to heal blind spots.

E.g., if the above model is right, maybe we could tell at least some people "try exploring the hypothesis that Y-voters are not so much in favor of Y, as against X -- and that you're right about the problems with Y, but they might be able to see something that you and almost everyone you talk to is systematically blinded to about X."

We can build a useful genre-savviness about common/destructive meme patterns and how to counter them, maybe. LessWrong is sort of well-positioned to be a leader there: we have analytic strength, and aren't too politically mindkilled.

moisentinel on Going Beyond "immaturity"

Do you have any tips on improving my writing, or my worldview?