LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Rapid capability gain around supergenius level seems probable even without intelligence needing to improve intelligence
Towards_Keeperhood (Simon Skade) · 2024-05-06T17:09:10.729Z · comments (16)

[link] If Clarity Seems Like Death to Them
Zack_M_Davis · 2023-12-30T17:40:42.622Z · comments (191)

[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (6)

Enriched tab is now the default LW Frontpage experience for logged-in users
Ruby · 2024-06-21T00:09:30.441Z · comments (27)

On OpenAI’s Model Spec
Zvi · 2024-06-21T13:00:03.014Z · comments (3)

I'm open for projects (sort of)
cousin_it · 2024-04-18T18:05:01.395Z · comments (13)

Decision Theory in Space
lsusr · 2024-08-18T07:02:11.847Z · comments (18)

So You Created a Sociopath - New Book Announcement!
Garrett Baker (D0TheMath) · 2024-04-01T18:02:18.010Z · comments (3)

[link] Announcing Squiggle Hub
ozziegooen · 2023-08-05T01:00:17.739Z · comments (4)

1. The CAST Strategy
Max Harms (max-harms) · 2024-06-07T22:29:13.005Z · comments (19)

How to hire somebody better than yourself
lukehmiles (lcmgcd) · 2024-08-28T08:12:53.450Z · comments (5)

Higher-effort summer solstice: What if we used AI (i.e., Angel Island)?
Rachel Shu (wearsshoes) · 2024-06-25T01:35:54.064Z · comments (9)

[link] Bayesians Commit the Gambler's Fallacy
Kevin Dorst · 2024-01-07T12:54:59.939Z · comments (28)

D&D.Sci: The Mad Tyrant's Pet Turtles [Evaluation and Ruleset]
abstractapplic · 2024-04-09T14:01:34.426Z · comments (6)

On the Proposed California SB 1047
Zvi · 2024-02-12T16:40:04.854Z · comments (18)

Thoughts on "The Offense-Defense Balance Rarely Changes"
Cullen (Cullen_OKeefe) · 2024-02-12T03:26:50.662Z · comments (4)

Telopheme, telophore, and telotect
TsviBT · 2023-09-17T16:24:03.365Z · comments (7)

[link] Contra Scott on Abolishing the FDA
Maxwell Tabarrok (maxwell-tabarrok) · 2023-12-15T14:00:17.247Z · comments (3)

The goal of physics
Jim Pivarski (jim-pivarski) · 2023-09-02T23:08:02.125Z · comments (4)

[link] Will releasing the weights of large language models grant widespread access to pandemic agents?
jefftk (jkaufman) · 2023-10-30T18:22:59.677Z · comments (25)

Feedback-loops, Deliberate Practice, and Transfer Learning
jacobjacob · 2023-09-07T01:57:33.066Z · comments (5)

AI Safety 101 : Capabilities - Human Level AI, What? How? and When?
markov (markovial) · 2024-03-07T17:29:53.260Z · comments (8)

Forecasting One-Shot Games
Raemon · 2024-08-31T23:10:05.475Z · comments (0)

The predictive power of dissipative adaptation
dr_s · 2023-12-17T14:01:31.568Z · comments (14)

AI #68: Remarkably Reasonable Reactions
Zvi · 2024-06-13T16:30:02.969Z · comments (11)

Saving the world sucks
Defective Altruism (Elijah Bodden) · 2024-01-10T05:55:46.504Z · comments (29)

Bounty for Evidence on Some of Palisade Research's Beliefs
benwr · 2024-09-23T20:01:20.917Z · comments (4)

Big Picture AI Safety: Introduction
EuanMcLean (euanmclean) · 2024-05-23T11:15:44.037Z · comments (7)

Exercise: Planmaking, Surprise Anticipation, and "Baba is You"
Raemon · 2024-02-24T20:33:49.574Z · comments (19)

In Defense of Parselmouths
Screwtape · 2023-11-15T23:02:19.344Z · comments (10)

AI #75: Math is Easier
Zvi · 2024-08-01T13:40:05.539Z · comments (25)

I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)

AI #41: Bring in the Other Gemini
Zvi · 2023-12-07T15:10:05.552Z · comments (16)

[Valence series] 4. Valence & Liking / Admiring
Steven Byrnes (steve2152) · 2024-06-10T14:19:51.194Z · comments (12)

Some costs of superposition
Linda Linsefors · 2024-03-03T16:08:20.674Z · comments (11)

Untrustworthy models: a frame for scheming evaluations
Olli Järviniemi (jarviniemi) · 2024-08-19T16:27:11.088Z · comments (3)

[link] Metascience of the Vesuvius Challenge
Maxwell Tabarrok (maxwell-tabarrok) · 2024-03-30T12:02:38.978Z · comments (2)

We ran an AI safety conference in Tokyo. It went really well. Come next year!
Blaine (blaine-rogers) · 2024-07-17T06:55:39.620Z · comments (1)

[link] Book review: Deep Utopia
PeterMcCluskey · 2024-04-23T19:55:50.417Z · comments (14)

New intro textbook on AIXI
Alex_Altair · 2024-05-11T18:18:50.945Z · comments (5)

AI #72: Denying the Future
Zvi · 2024-07-11T15:00:05.865Z · comments (8)

Applying refusal-vector ablation to a Llama 3 70B agent
Simon Lermen (dalasnoin) · 2024-05-11T00:08:08.117Z · comments (12)

[link] Paper: Tell, Don't Show- Declarative facts influence how LLMs generalize
Owain_Evans · 2023-12-19T19:14:26.423Z · comments (4)

[link] Rational Animations' intro to mechanistic interpretability
Writer · 2024-06-14T16:10:57.015Z · comments (1)

Quick thoughts on the implications of multi-agent views of mind on AI takeover
Kaj_Sotala · 2023-12-11T06:34:06.395Z · comments (14)

Principled Satisficing To Avoid Goodhart
JenniferRM · 2024-08-16T19:05:27.204Z · comments (2)

Startup Roundup #2
Zvi · 2024-08-06T13:30:06.554Z · comments (0)

AI #80: Never Have I Ever
Zvi · 2024-09-10T17:50:08.074Z · comments (20)

[link] I'd also take $7 trillion
bhauth · 2024-02-19T03:31:45.552Z · comments (12)

Things Solenoid Narrates
Solenoid_Entity · 2024-04-12T23:57:16.169Z · comments (2)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

cubefox on A brief theory of why we think things are good or bad

We believe many things because we are somewhat rational; we consider hypotheses, compare them with observed evidence, and emphasise those that are more compatible. Goodness and badness defy this practice; normal reasoning does not produce hypotheses of the form "if X is morally good then I will observe Y".

I disagree. If X is an action, is is usually considered good if it increases welfare, and bad if it decreases welfare. And we can find evidence for or against something being conducive to welfare. So we can find evidence for or against something being good.

cata on If far-UV is so great, why isn't it everywhere?

If you installed it in a preschool and it successfully killed all the pathogens there wouldn't be essentially no effect.

d0themath on Alexander Gietelink Oldenziel's Shortform

I think with sunglasses there’s a veneer of plausible deniability. They in fact have a utilitarian purpose outside of just creating information asymmetry. If you’re wearing a mask though, there’s no deniability. You just don’t want people to know where you’re looking.

carl-feynman on What if AGI was already accidentally created in 2019? [Fictional story]

It’s a cute idea, but AI is a terrible fiction writer.

alexander-gietelink-oldenziel on Alexander Gietelink Oldenziel's Shortform

That's a good counterexample ! Masks are dangerous and mysterious, but not cool in the way sunglasses are in agree

felix-j-binder on LLMs can learn about themselves by introspection

I believe introspection is a recursive task. You can perhaps do a single 'moment' of introspection on a single forward pass, but I'm not sure I'd even call that real introspection. Real introspection involves the ability to introspect about your introspection.

That is a good point! Indeed, one of the reasons that we measure introspection the way we do is because of the feedforward structure of the transformer. For every token that the model produces, the inner state of the model is not preserved for later tokens beyond the tokens already in context. Therefore, if you are introspecting at time n+1 about what was going on inside you at point n, the activations you would be targeting would be (by default) lost. (You could imagine training a model so that its embedding of previous token carries some information about internal activations, but we don’t expect that this is the case by default).

Therefore, we focus on introspection in a single forward pass. This is compatible with the model reasoning about the result of its introspection after it has written it into its context.

it would perhaps be interesting to train it on introspection earlier in its training and then simply continue with the normal training and see how that goes

I agree! One ways in which self-simulation is a useful strategy might be when the training data contains outputs that are similar to how the model would actually act: ie, for GPT N, that might be outputs of GPT N-1. Then, you might use your default behavior to stand in for that text. It seems plausible that people do this to some degree: if I know how I tend to behave, I can use this to stand in for predicting how other people might act in a particular situation. I take it that this is what you point out in the second paragraph.

you use a different prompt for training than checking and it might also be worthwhile to train it on that style of prompting but with unrelated content. (Not that I know how you'd fit that style of prompting with a different style of content mind you.)

Ah apologies—this might be a confusion due to the examples we show in the figures. We use the same general prompt templates for hypothetical questions in training and test. The same general patterns of our results hold when evaluating on the test set (see the appendix).

steve2152 on Against empathy-by-default

I definitely think that the human brain has innate evolved mechanisms related to social behavior in general, and to caring about (certain) other people’s welfare in particular.

And I agree that the evolutionary pressure explaining why those mechanisms exist are generally related to the kinds of things that Robert Trivers and other evolutionary psychologists talk about.

This post isn’t about that. Instead it’s about what those evolved mechanisms are, i.e. how they work in the brain.

Does that help?

…But I do want to push back against a strain of thought within evolutionary psychology where they say “there was an evolutionary pressure for the human brain to do X, and therefore the human brain does X”. I think this fails to appreciate the nature of the constraints that the brain operates under. There can be evolutionary pressure for the brain to do something, but there’s no way for the brain to do it, so it doesn’t happen, or the brain does something kinda like that but with incidental side-effects or whatever.

As an example, imagine if I said: “Here’s the source code for training an image-classifier ConvNet from random initialization using uncontrolled external training data. Can you please edit this source code so that the trained model winds up confused about the shape of Toyota Camry tires specifically?” The answer is: “Nope. Sorry. There is no possible edit I can make to this PyTorch source code such that that will happen.” You see what I mean? I think this kind of thing happens in the brain a lot. I talk about it more specifically here [LW · GW]. More of my opinions about evolutionary psychology here [LW · GW] and here [LW · GW].

felix-j-binder on LLMs can learn about themselves by introspection

I want to make the case that even this minimal strategy would be something that we might want to call "introspective," or that it can lead to the model learning true facts about itself.

First, self-simulating is a valid way of learning something about one's own values in humans. Consider the thought experiment of the trolley problem. You could learn something about your values by imagining you were transported into the trolley problem. Do you pull the lever? Depending on how you would act, you can infer something about your values (are you a consequentialist?) that you might not have known before.

In the same way, being able to predict how one would act in a hypothetical situation and being able to reason about it, for some forms of reasoning, the model would learn some fact about itself as the result. Most of the response properties we test are not necessarily those that tell us something interesting about the model itself ("What would the second letter of your response have been?"), but the results of others tell you something about the model more straightforwardly ("Would you have chosen the more wealth-seeking answer?"). Insofar as the behavior in question is sufficiently tracking something specific to the model (e.g., "What would you have said is the capital of France?" does not, but "What would you have said if we asked you if we should implement subscription fees?" arguably does), then reasoning about that behavior would tell you something about the model.

So we have cases where (1) the model's statement about properties of its hypothetical behavior tracks the actual behavior (which, as you point out, could just be a form of consistency) and (2) these statements are informative about the model itself (in the example above, whether it has a wealth-seeking policy or not). If we accept both of these claims, then it seems to me like even the strategy you outline above could lead to the model to something that we might want to call introspection. The more complicated the behavior and the more complex the reasoning about it, the more the model might be able to derive about itself as the result of self-consistency of behavior + reasoning on top of it.

roko on If far-UV is so great, why isn't it everywhere?

Far-UVC probably would have a large effect if a particular city or country installed it.

But if only a few buildings install it, then it has no effect because people just catch the bugs elsewhere.

Imagine the effect of just treating sewage from one house, and leaving all the untreated sewage from a million houses untreated in the river. There would be essentially no effect.

owain_evans on LLMs can learn about themselves by introspection

I agree about the "longer responses".

I'm unsure about the "personality trait" framing. There are two senses of "introspection" for humans. One is introspecting on your current mental state ("I feel a headache starting") and the other is being introspective about patterns in your behavior (e.g. "i tend to dislike violent movies" or "i tend to be shy among new people"). The former sense is more relevant to philosophy and psychology and less often discussed in daily life. The issue with the latter sense is that a model may not have privileged access to facts like this -- i.e. if another model had the same observational data then it could learn the same fact.

So I'm most interested in the former kind of introspective, or in cases of the latter where it'd take large and diverse datasets (that are hard to construct) for another model to make the same kind of generalization.