LessWrong 2.0 Reader



Meaning & Agency
abramdemski · 2023-12-19T22:27:32.123Z · comments (17)
Everything Wrong with Roko's Claims about an Engineered Pandemic
EZ97 · 2024-02-22T15:59:08.439Z · comments (10)
Improving the Welfare of AIs: A Nearcasted Proposal
ryan_greenblatt · 2023-10-30T14:51:35.901Z · comments (5)
Access to powerful AI might make computer security radically easier
Buck · 2024-06-08T06:00:19.310Z · comments (14)
We Should Prepare for a Larger Representation of Academia in AI Safety
Leon Lang (leon-lang) · 2023-08-13T18:03:19.799Z · comments (13)
Review: Conor Moreton's "Civilization & Cooperation"
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-05-26T19:32:43.131Z · comments (8)
[link] Introducing METR's Autonomy Evaluation Resources
Megan Kinniment (megan-kinniment) · 2024-03-15T23:16:59.696Z · comments (0)
[link] LTFF and EAIF are unusually funding-constrained right now
Linch · 2023-08-30T01:03:30.321Z · comments (24)
AI #31: It Can Do What Now?
Zvi · 2023-09-28T16:00:01.919Z · comments (6)
You can, in fact, bamboozle an unaligned AI into sparing your life
David Matolcsi (matolcsid) · 2024-09-29T16:59:43.942Z · comments (170)
Prediction Markets aren't Magic
SimonM · 2023-12-21T12:54:07.754Z · comments (29)
Problems with Robin Hanson's Quillette Article On AI
DaemonicSigil · 2023-08-06T22:13:43.654Z · comments (33)
[link] I compiled an ebook of `Project Lawful` for eBook readers
OrwellGoesShopping · 2023-09-15T18:09:31.703Z · comments (4)
Based Beff Jezos and the Accelerationists
Zvi · 2023-12-06T16:00:08.380Z · comments (29)
[link] New report: Safety Cases for AI
joshc (joshua-clymer) · 2024-03-20T16:45:27.984Z · comments (13)
AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)
[link] Linkpost: A Post Mortem on the Gino Case
Linch · 2023-10-24T06:50:42.896Z · comments (7)
Public Call for Interest in Mathematical Alignment
Davidmanheim · 2023-11-22T13:22:09.558Z · comments (9)
story-based decision-making
bhauth · 2024-02-07T02:35:27.286Z · comments (11)
Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)
[link] Large Language Models can Strategically Deceive their Users when Put Under Pressure.
ReaderM · 2023-11-15T16:36:04.446Z · comments (8)
Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (60)
[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)
Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (5)
Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (8)
[link] Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan (akbir-khan) · 2024-02-07T21:28:10.694Z · comments (14)
On the abolition of man
Joe Carlsmith (joekc) · 2024-01-18T18:17:06.201Z · comments (18)
Teaching CS During Take-Off
andrew carle (andrew-carle) · 2024-05-14T22:45:39.447Z · comments (13)
Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)
Stagewise Development in Neural Networks
Jesse Hoogland (jhoogland) · 2024-03-20T19:54:06.181Z · comments (1)
[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)
[link] Techno-humanism is techno-optimism for the 21st century
Richard_Ngo (ricraz) · 2023-10-27T18:37:39.776Z · comments (5)
Research update: Towards a Law of Iterated Expectations for Heuristic Estimators
Eric Neyman (UnexpectedValues) · 2024-10-07T19:29:29.033Z · comments (2)
[link] Detecting Genetically Engineered Viruses With Metagenomic Sequencing
jefftk (jkaufman) · 2024-06-27T14:01:34.868Z · comments (10)
We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
Lukas_Gloor · 2024-05-09T15:43:11.490Z · comments (36)
[link] Re: Anthropic's suggested SB-1047 amendments
RobertM (T3t) · 2024-07-27T22:32:39.447Z · comments (13)
2024 Petrov Day Retrospective
Ben Pace (Benito) · 2024-09-28T21:30:14.952Z · comments (25)
Growth and Form in a Toy Model of Superposition
Liam Carroll (liam-carroll) · 2023-11-08T11:08:04.359Z · comments (7)
[link] Self-Help Corner: Loop Detection
adamShimi · 2024-10-02T08:33:23.487Z · comments (6)
[link] More Hyphenation
Arjun Panickssery (arjun-panickssery) · 2024-02-07T19:43:29.086Z · comments (19)
How well do truth probes generalise?
mishajw · 2024-02-24T14:12:19.729Z · comments (11)
You’re Measuring Model Complexity Wrong
Jesse Hoogland (jhoogland) · 2023-10-11T11:46:12.466Z · comments (15)
I'm a bit skeptical of AlphaFold 3
Oleg Trott (oleg-trott) · 2024-06-25T00:04:41.274Z · comments (14)
Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)
OpenAI: Helen Toner Speaks
Zvi · 2024-05-30T21:10:02.938Z · comments (8)
There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)
We don't understand what happened with culture enough
Jan_Kulveit · 2023-10-09T09:54:20.096Z · comments (21)
[link] Benchmarks for Detecting Measurement Tampering [Redwood Research]
ryan_greenblatt · 2023-09-05T16:44:48.032Z · comments (19)
Apply to be a Safety Engineer at Lockheed Martin!
yanni kyriacos (yanni) · 2024-03-31T21:02:08.499Z · comments (3)
[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (49)