LessWrong 2.0 Reader

Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible
GeneSmith · 2023-12-12T18:14:51.438Z · comments (162)
Speaking to Congressional staffers about AI risk
Akash (akash-wasil) · 2023-12-04T23:08:52.055Z · comments (23)
Constellations are Younger than Continents
Jeffrey Heninger (jeffrey-heninger) · 2023-12-19T06:12:40.667Z · comments (22)
AI Control: Improving Safety Despite Intentional Subversion
Buck · 2023-12-13T15:51:35.982Z · comments (7)
Thoughts on “AI is easy to control” by Pope & Belrose
Steven Byrnes (steve2152) · 2023-12-01T17:30:52.720Z · comments (55)
re: Yudkowsky on biological materials
bhauth · 2023-12-11T13:28:10.639Z · comments (30)
Critical review of Christiano's disagreements with Yudkowsky
Vanessa Kosoy (vanessa-kosoy) · 2023-12-27T16:02:50.499Z · comments (40)
"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity
Thane Ruthenis · 2023-12-16T20:08:39.375Z · comments (23)
2023 Unofficial LessWrong Census/Survey
Screwtape · 2023-12-02T04:41:51.418Z · comments (81)
Effective Aspersions: How the Nonlinear Investigation Went Wrong
TracingWoodgrains (tracingwoodgrains) · 2023-12-19T12:00:23.529Z · comments (170)
The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda
Cameron Berg (cameron-berg) · 2023-12-18T20:35:01.569Z · comments (20)
The likely first longevity drug is based on sketchy science. This is bad for science and bad for longevity.
BobBurgers · 2023-12-12T02:42:18.559Z · comments (34)
[link] Succession
Richard_Ngo (ricraz) · 2023-12-20T19:25:03.185Z · comments (48)
How useful is mechanistic interpretability?
ryan_greenblatt · 2023-12-01T02:54:53.488Z · comments (53)
Most People Don't Realize We Have No Idea How Our AIs Work
Thane Ruthenis · 2023-12-21T20:02:00.360Z · comments (42)
Discussion: Challenges with Unsupervised LLM Knowledge Discovery
Seb Farquhar · 2023-12-18T11:58:39.379Z · comments (21)
Is being sexy for your homies?
Valentine · 2023-12-13T20:37:02.043Z · comments (92)
The Plan - 2023 Version
johnswentworth · 2023-12-29T23:34:19.651Z · comments (39)
AI Views Snapshots
Rob Bensinger (RobbBB) · 2023-12-13T00:45:50.016Z · comments (61)
The Dark Arts
lsusr · 2023-12-19T04:41:13.356Z · comments (49)
[link] Bayesian Injustice
Kevin Dorst · 2023-12-14T15:44:08.664Z · comments (10)
The LessWrong 2022 Review
habryka (habryka4) · 2023-12-05T04:00:00.000Z · comments (43)
Mapping the semantic void: Strange goings-on in GPT embedding spaces
mwatkins · 2023-12-14T13:10:22.691Z · comments (31)
Natural Latents: The Math
johnswentworth · 2023-12-27T19:03:01.923Z · comments (31)
Current AIs Provide Nearly No Data Relevant to AGI Alignment
Thane Ruthenis · 2023-12-15T20:16:09.723Z · comments (152)
What I Would Do If I Were Working On AI Governance
johnswentworth · 2023-12-08T06:43:42.565Z · comments (32)
Deep Forgetting & Unlearning for Safely-Scoped LLMs
scasper · 2023-12-05T16:48:18.177Z · comments (29)
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
Neel Nanda (neel-nanda-1) · 2023-12-23T02:44:24.270Z · comments (6)
On the future of language models
owencb · 2023-12-20T16:58:28.433Z · comments (17)
Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)
[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (4)
"AI Alignment" is a Dangerously Overloaded Term
Roko · 2023-12-15T14:34:29.850Z · comments (98)
[question] How do you feel about LessWrong these days? [Open feedback thread]
jacobjacob · 2023-12-05T20:54:42.317Z · answers+comments (272)
Prediction Markets aren't Magic
SimonM · 2023-12-21T12:54:07.754Z · comments (29)
Meaning & Agency
abramdemski · 2023-12-19T22:27:32.123Z · comments (17)
Based Beff Jezos and the Accelerationists
Zvi · 2023-12-06T16:00:08.380Z · comments (29)
[Valence series] 1. Introduction
Steven Byrnes (steve2152) · 2023-12-04T15:40:21.274Z · comments (14)
A Crisper Explanation of Simulacrum Levels
Thane Ruthenis · 2023-12-23T22:13:52.286Z · comments (13)
[link] A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien (alexandre-variengien) · 2023-12-19T11:52:27.354Z · comments (3)
Refusal mechanisms: initial experiments with Llama-2-7b-chat
Andy Arditi (andy-arditi) · 2023-12-08T17:08:01.250Z · comments (7)
OpenAI: Leaks Confirm the Story
Zvi · 2023-12-12T14:00:04.812Z · comments (9)
Send us example gnarly bugs
Beth Barnes (beth-barnes) · 2023-12-10T05:23:00.773Z · comments (10)
MATS Summer 2023 Retrospective
Rocket (utilistrutil) · 2023-12-01T23:29:47.958Z · comments (34)
EU policymakers reach an agreement on the AI Act
tlevin (trevor) · 2023-12-15T06:02:44.668Z · comments (7)
Studying The Alien Mind
Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-12-05T17:27:28.049Z · comments (10)
[link] The problems with the concept of an infohazard as used by the LW community [Linkpost]
Noosphere89 (sharmake-farah) · 2023-12-22T16:13:54.822Z · comments (43)
[link] The Offense-Defense Balance Rarely Changes
Maxwell Tabarrok (maxwell-tabarrok) · 2023-12-09T15:21:23.340Z · comments (23)
Neural uncertainty estimation review article (for alignment)
Charlie Steiner · 2023-12-05T08:01:32.723Z · comments (3)
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem
Ansh Radhakrishnan (anshuman-radhakrishnan-1) · 2023-12-16T05:49:23.672Z · comments (3)
[link] Nietzsche's Morality in Plain English
Arjun Panickssery (arjun-panickssery) · 2023-12-04T00:57:42.839Z · comments (13)