LessWrong 2.0 Reader


flowing like water; hard like stone
lsusr · 2024-02-20T03:20:46.531Z · comments (4)
Deceptive agents can collude to hide dangerous features in SAEs
Simon Lermen (dalasnoin) · 2024-07-15T17:07:33.283Z · comments (0)
[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)
Incentive Learning vs Dead Sea Salt Experiment
Steven Byrnes (steve2152) · 2024-06-25T17:49:01.488Z · comments (1)
Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)
Interpretability of SAE Features Representing Check in ChessGPT
Jonathan Kutasov (jonathan-kutasov) · 2024-10-05T20:43:36.679Z · comments (2)
[question] Supposing the 1bit LLM paper pans out
O O (o-o) · 2024-02-29T05:31:24.158Z · answers+comments (11)
European Progress Conference
Martin Sustrik (sustrik) · 2024-10-06T11:10:03.819Z · comments (11)
On the 2nd CWT with Jonathan Haidt
Zvi · 2024-04-05T17:30:05.223Z · comments (3)
NYU Code Debates Update/Postmortem
David Rein (david-rein) · 2024-05-24T16:08:06.151Z · comments (4)
There aren't enough smart people in biology doing something boring
Abhishaike Mahajan (abhishaike-mahajan) · 2024-10-21T15:52:04.482Z · comments (13)
The Foraging (Ex-)Bandit [Ruleset & Reflections]
abstractapplic · 2024-11-14T20:16:21.535Z · comments (3)
Probably Not a Ghost Story
George Ingebretsen (george-ingebretsen) · 2024-06-12T22:55:26.264Z · comments (4)
[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)
Ackshually, many worlds is wrong
tailcalled · 2024-04-11T20:23:59.416Z · comments (42)
[link] A brief history of the automated corporation
owencb · 2024-11-04T14:35:04.906Z · comments (1)
[link] Secret US natsec project with intel revealed
Nathan Helm-Burger (nathan-helm-burger) · 2024-05-25T04:22:11.624Z · comments (0)
Option control
Joe Carlsmith (joekc) · 2024-11-04T17:54:03.073Z · comments (0)
Meetup In a Box: Year In Review
Czynski (JacobKopczynski) · 2024-02-14T01:18:28.259Z · comments (0)
Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)
[link] Let's Design A School, Part 2.1 School as Education - Structure
Sable · 2024-05-02T22:04:30.435Z · comments (2)
[link] Death notes - 7 thoughts on death
Nathan Young · 2024-10-28T15:01:13.532Z · comments (1)
Vote in the LessWrong review! (LW 2022 Review voting phase)
habryka (habryka4) · 2024-01-17T07:22:17.921Z · comments (9)
Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
claudia.biancotti · 2024-11-18T09:38:35.723Z · comments (4)
Smartphone Etiquette: Suggestions for Social Interactions
Declan Molony (declan-molony) · 2024-06-04T06:01:03.336Z · comments (4)
Essaying Other Plans
Screwtape · 2024-03-06T22:59:06.240Z · comments (4)
[link] Care Doesn't Scale
stavros · 2024-10-28T11:57:38.742Z · comments (1)
$250K in Prizes: SafeBench Competition Announcement
ozhang (oliver-zhang) · 2024-04-03T22:07:41.171Z · comments (0)
[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)
[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)
SAEs you can See: Applying Sparse Autoencoders to Clustering
Robert_AIZI · 2024-10-28T14:48:16.744Z · comments (0)
Ideas for Next-Generation Writing Platforms, using LLMs
ozziegooen · 2024-06-04T18:40:24.636Z · comments (4)
LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)
D&D.Sci Hypersphere Analysis Part 3: Beat it with Linear Algebra
aphyer · 2024-01-16T22:44:52.424Z · comments (1)
[question] Thoughts on Francois Chollet's belief that LLMs are far away from AGI?
O O (o-o) · 2024-06-14T06:32:48.170Z · answers+comments (17)
How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots
Owain_Evans · 2024-03-28T02:34:21.799Z · comments (0)
The Sequences on YouTube
Neil (neil-warren) · 2024-01-07T01:44:39.663Z · comments (9)
Distillation of 'Do language models plan for future tokens'
TheManxLoiner · 2024-06-27T20:57:34.351Z · comments (2)
Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)
Causality is Everywhere
silentbob · 2024-02-13T13:44:49.952Z · comments (12)
Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features
Logan Riggs (elriggs) · 2024-03-15T16:30:00.744Z · comments (5)
Trying Bluesky
jefftk (jkaufman) · 2024-11-17T02:50:04.093Z · comments (16)
My Dating Heuristic
Declan Molony (declan-molony) · 2024-05-21T05:28:40.197Z · comments (4)
Exploring OpenAI's Latent Directions: Tests, Observations, and Poking Around
Johnny Lin (hijohnnylin) · 2024-01-31T06:01:27.969Z · comments (4)
[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)
Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)
[link] Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles
Zack_M_Davis · 2024-03-02T22:05:49.553Z · comments (22)
AI #57: All the AI News That’s Fit to Print
Zvi · 2024-03-28T11:40:05.435Z · comments (14)
SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)
Consequentialism is a compass, not a judge
Neil (neil-warren) · 2024-04-13T10:47:44.980Z · comments (6)