LessWrong 2.0 Reader

[link] The Great Organism Theory of Evolution
rogersbacon · 2024-08-10T12:26:02.434Z · comments (0)

My decomposition of the alignment problem
Daniel C (harper-owen) · 2024-09-02T00:21:08.359Z · comments (22)

[link] Four Randomized Control Trials In Economics
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-08T15:59:23.250Z · comments (1)

Musings on Text Data Wall (Oct 2024)
Vladimir_Nesov · 2024-10-05T19:00:21.286Z · comments (2)

[link] Anthropic is being sued for copying books to train Claude
Remmelt (remmelt-ellen) · 2024-08-31T02:57:27.092Z · comments (4)

[link] Compression Moves for Prediction
adamShimi · 2024-09-14T17:51:12.004Z · comments (0)

[link] AI Model Registries: A Foundational Tool for AI Governance
Elliot Mckernon (elliot) · 2024-10-07T19:27:43.466Z · comments (1)

[link] Green and golden: a meditation
Richard_Ngo (ricraz) · 2024-08-18T01:36:43.613Z · comments (0)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
tailcalled · 2024-09-13T17:09:25.407Z · comments (40)

Ten counter-arguments that AI is (not) an existential risk (for now)
Ariel Kwiatkowski (ariel-kwiatkowski) · 2024-08-13T22:35:15.341Z · comments (5)

[question] What should we do about COVID in 2024?
ChristianKl · 2024-08-04T10:57:24.140Z · answers+comments (2)

[link] To Be Born in a Bag
Niko_McCarty (niko-2) · 2024-10-06T17:21:00.605Z · comments (1)

Tokenized SAEs: Infusing per-token biases.
tdooms · 2024-08-04T09:17:46.755Z · comments (20)

Why Reflective Stability is Important
Johannes C. Mayer (johannes-c-mayer) · 2024-09-05T15:28:19.913Z · comments (2)

What program structures enable efficient induction?
Daniel C (harper-owen) · 2024-09-05T10:12:14.058Z · comments (4)

Looking for Goal Representations in an RL Agent - Update Post
CatGoddess · 2024-08-28T16:42:19.367Z · comments (0)

Announcing the PIBBSS Symposium '24!
DusanDNesic · 2024-09-03T11:19:47.568Z · comments (0)

[question] What are the best resources for building gears-level models of how governments actually work?
adamShimi · 2024-08-19T14:05:02.590Z · answers+comments (6)

Scaling Laws and Likely Limits to AI
Davidmanheim · 2024-08-18T17:19:46.597Z · comments (0)

"Real AGI"
Seth Herd · 2024-09-13T14:13:24.124Z · comments (18)

[link] Should Sports Betting Be Banned?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-21T14:13:35.404Z · comments (2)

Rabin's Paradox
Charlie Steiner · 2024-08-14T05:40:25.572Z · comments (40)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

Avoiding the Bog of Moral Hazard for AI
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-13T21:24:34.137Z · comments (12)

Economics Roundup #4
Zvi · 2024-10-15T13:20:06.923Z · comments (4)

[link] AI existential risk probabilities are too unreliable to inform policy
Oleg Trott (oleg-trott) · 2024-07-28T00:59:59.497Z · comments (5)

Can Large Language Models effectively identify cybersecurity risks?
emile delcourt (emile-delcourt) · 2024-08-30T20:20:21.345Z · comments (0)

[question] How great is the utility of "saving" endangered languages?
SpectrumDT · 2024-08-20T13:14:32.895Z · answers+comments (29)

Finding Deception in Language Models
Esben Kran (esben-kran) · 2024-08-20T09:42:13.060Z · comments (4)

Bryan Johnson and a search for healthy longevity
NancyLebovitz · 2024-07-27T15:28:13.117Z · comments (17)

My career exploration: Tools for building confidence
lynettebye · 2024-09-13T11:37:55.843Z · comments (0)

[link] Four Levels of Voting Methods
hive · 2024-09-26T18:15:00.565Z · comments (3)

[question] Is this voting system strategy proof?
Donald Hobson (donald-hobson) · 2024-09-06T20:44:46.691Z · answers+comments (9)

[link] Why Swiss watches and Taylor Swift are AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T13:23:27.033Z · comments (11)

[question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T12:40:07.678Z · answers+comments (7)

"Which Future Mind is Me?" Is a Question of Values
dadadarren · 2024-08-09T18:17:09.884Z · comments (12)

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-08-24T07:39:00.057Z · comments (0)

Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025)
Linda Linsefors · 2024-08-23T14:18:24.327Z · comments (2)

Is Text Watermarking a lost cause?
egor.timatkov · 2024-10-01T16:20:51.113Z · comments (13)

[link] AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering
Corin Katzke (corin-katzke) · 2024-07-29T17:50:52.454Z · comments (1)

[link] some questionable space launch guns
bhauth · 2024-10-13T22:52:26.418Z · comments (0)

[link] Will we ever run out of new jobs?
Kevin Kohler (KevinKohler) · 2024-08-19T15:04:03.849Z · comments (7)

[link] Instruction Following without Instruction Tuning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-24T13:49:09.078Z · comments (0)

[link] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-14T23:23:26.296Z · comments (1)

Slave Morality: A place for every man and every man in his place
Martin Sustrik (sustrik) · 2024-09-19T04:20:04.491Z · comments (7)

Review: Dr Stone
ProgramCrafter (programcrafter) · 2024-09-29T10:35:53.175Z · comments (5)

[link] Jonothan Gorard:The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

[link] CultFrisbee
Gauraventh (aryangauravyadav) · 2024-08-11T21:36:36.550Z · comments (3)

[link] My lukewarm take on GLP-1 agonists
George3d6 · 2024-08-26T12:34:27.929Z · comments (0)

Interview with Robert Kralisch on Simulators
WillPetillo · 2024-08-26T05:49:15.543Z · comments (0)


Recent comments

owain_evans on LLMs can learn about themselves by introspection

>Wrapping a question in a hypothetical feels closer to rephrasing the question than probing "introspection"

Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don't think this is just like rephrasing the question. Also, GPT-3.5 does worse at predicting GPT-3.5 than Llama-70B does at predicting GPT-3.5 (without finetuning), and GPT-4 is only a little better at predicting itself than other models are.

>Essentially, the response to the object level and hypothetical reformulation both arise from very similar things going on in the model rather than something emergent happening.

While we don't know what is going on internally, I agree it's quite possible these "arise from similar things". In the paper we discuss "self-simulation" as a possible mechanism. Does that fit what you have in mind? Note: We are not claiming that models must be doing something very self-aware and sophisticated. The main thing is just to show that there is introspection according to our definition. Contrary to what you say, I don't think this result is obvious and (as I noted above) it's easy to run experiments where models do not show any advantage in predicting themselves.

alex_altair on Dalcy's Shortform

For some reason the "only if" always throws me off. It reminds me of the "unless" keyword in Ruby, which is equivalent to "if not", but somehow always made my brain segfault.

alex_altair on Dalcy's Shortform

It's maybe also worth saying that any other description method is a subset of programs (or is incomputable and therefore not what real-world AI systems are). So if the theoretical issues in AIT bother you, you can probably make a similar argument using a programming language with no while loop, or I dunno, finite MDPs whose probability distributions are Gaussian with finite parameter descriptions.

denkenberger on Bitter lessons about lucid dreaming

Stress during the day takes years off people's lives. Is there any evidence that stress during dreams (not necessarily nightmares) has a similar effect? If so, reducing that stress could be a significant benefit of lucid dreaming.

alex_altair on Dalcy's Shortform

Yeah, I think structural selection theorems matter a lot, for reasons I discussed here.

This is also one reason why I continue to be excited about Algorithmic Information Theory. Computable functions are behavioral, but programs (= algorithms) are structural! The fact that programs can be expressed in the homogeneous language of finite binary strings gives a clear way to select for structure; just limit the length of your program. We even know exactly how this mathematical parameter translates into real-world systems, because we can know exactly how many bits our ML models take up on the hard drives.

And I think you can use algorithmic information distance to define precisely how close to agent-structured your policy is. First, define the specific program A that you mean to be maximally agent-structured (which I define as a utility-maximizing program). If your policy (as a program) can be described as "Program A, but different in ways X", then we have a bound on how far from agent-structured it is. X will be a program that tells you how to transform A into your policy, and that gives us a "distance" of at most the length of X in bits.

For a given length, almost no programs act anything like A. So if your policy is only slightly bigger than A, and it acts like A, then it's probably of the form "A, but slightly different", which means it's agent-structured. (Unfortunately this argument needs like 200 pages of clarification.)
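
As a rough illustration of the distance idea above: true algorithmic information distance is defined via Kolmogorov complexity, which is incomputable, so the sketch below swaps in the normalized compression distance, using zlib's compressed length as a crude proxy for description length. The names AGENT_A, POLICY_NEAR, and POLICY_FAR are hypothetical toy stand-ins and do not come from the comment.

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for description length."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (an approximation, not the true distance)."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical "maximally agent-structured" program A, stored as source bytes.
AGENT_A = b'''
def act(state, actions, utility, model):
    # Pick the action whose predicted outcome has the highest utility.
    best_action, best_value = None, float("-inf")
    for action in actions:
        value = utility(model(state, action))
        if value > best_value:
            best_action, best_value = action, value
    return best_action
'''

# "A, but slightly different": one small edit to the source (assumed example).
POLICY_NEAR = AGENT_A.replace(b'float("-inf")', b"-1e9")

# An unrelated "program" of the same length: deterministic pseudo-random bytes.
_rng = random.Random(0)
POLICY_FAR = bytes(_rng.getrandbits(8) for _ in range(len(AGENT_A)))

if __name__ == "__main__":
    print("distance(A, near):", ncd(AGENT_A, POLICY_NEAR))
    print("distance(A, far): ", ncd(AGENT_A, POLICY_FAR))
```

A smaller value suggests the policy is closer to "A, but slightly different". A real compressor only upper-bounds description length, and this compares source text rather than behavior, so the numbers are suggestive at best.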

owain_evans on LLMs can learn about themselves by introspection

I think ground truth is more expensive, noisy, and contentious as you get to questions like "What are your goals?" or "Do you have feelings?". I still think it's possible to get evidence on these questions. Moreover, we can evaluate models against very large and diverse datasets where we do have ground truth. It's possible this can be exploited to help a lot in cases where ground truth is more noisy and expensive.

Where we have ground truth: We have ground truth for questions like the ones we study above (about properties of model behavior on a given prompt), and for questions like "Would you answer question [hard math question] correctly?". This can be extended to other counterfactual questions like "Suppose three words were deleted from this [text]. Which choice of three words would most change your rating of the quality of the text?"

Where ground truth is more expensive and/or less clear-cut: E.g. "Would you answer question [history exam question] correctly?". Or questions about which concepts the model is using to solve a problem, or what the model's goals or preferences are. I still think we can gather evidence that makes answers to these questions more or less likely -- esp. if we average over a large set of such questions.

benito on leogao's Shortform

Pretty sure @Ronny Fernandez has opinions about this (in particular, I expect he disagrees that actively being visibly weird requires being ignorant of how to behave conventionally).

martinkunev on Bitter lessons about lucid dreaming

I've had lucid dreams by accident (never tried to induce one). Upon waking up, my head hurts. Do others have the same experience? What are common negative effects of lucid dreams?

Also, can you control when you wake up?

zach-stein-perlman on LLMs can learn about themselves by introspection

I haven't read the paper. I think doing a little work on introspection is worthwhile. But I naively expect that it's quite intractable to do introspection science when we don't have access to the ground truth, and such cases are the only important ones. Relatedly, these tasks are trivialized by letting the model call itself, while letting the model call itself gives no help on welfare or "true preferences" introspection questions, if I understand correctly. [Edit: like, the inner states here aren’t the important hidden ones.]

jbash on LLM Psychometrics and Prompt-Induced Psychopathy

Random reactions--

It looks like you're really assigning scores to the personae the models present, not to the models themselves.

The models as opposed to the personae may or may not actually have anything that can reasonably be interpreted as "native" levels of psychopathy. It's kind of hard to tell whether something is, say, prepared to manipulate you, when there's no strong reason to think it particularly cares about having any particular effect on you. But if they do have native levels--
- It doesn't "feel" to me as though the human-oriented questions on the LSRP are the right sorts of ways to find out. The questions may suit the masks, but not the shoggoth.
- I feel even less as though "no system prompt" would elicit the native level, rather than some default persona's level.

By asking a model to play any role to begin with, you're directly asking it to be deceptive. If you tell it it's a human bicycle mechanic named Sally, in fact it still is an AI system that doesn't have a job other than to complete text or converse or whatever. It's just going along with you and playing the role of Sally.

When you see the model acting as psychopathic as it "expects" that Sally would be, you're actually demonstrating that the models can easily be prompted to in some sense cheat on psychopathy inventories. Well, effectively cheat, anyway. It's not obvious to me that the models-in-themselves have any "true beliefs" about who or what they are that aren't dependent on context, so the question of whether they're being deceptive may be harder than it looks.

But they seem to have at least some capacity to "intentionally" "fake" targeted levels of psychopathy.

By training a model to take on roles given in system prompts in the first place, its creators are intentionally teaching it to honor requests to be deceptive.

Just blithely taking on whatever roles you think fit the conversation you're in sounds kind of psychopathic, actually.

By "safety training" a model, its creators are causing it to color its answers according to what people want to hear, which I would think would probably make it more, not less, prone to deception and manipulation in general. It could actually inculcate something like psychopathy. And it could easily fail to carry over to actions, rather than words, once you get an agentic system.

I'm still not convinced the whole approach has any real value even for the LLMs we have now, let alone for whatever (probably architecturally different) systems end up achieving AGI or ASI.

All that goes double for teaching it to be "likable".

Since any given model can be asked to play any role, it might be more interesting to try to figure out which of all the possible roles it might be "willing" to assume would make it maximally deceptive.