LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Chinese scientists acknowledge xrisk & call for international regulatory body [Linkpost]
Akash (akash-wasil) · 2023-11-01T13:28:43.723Z · comments (4)

An Introduction to AI Sandbagging
Teun van der Weij (teun-van-der-weij) · 2024-04-26T13:40:00.126Z · comments (8)

[link] EPUBs of MIRI Blog Archives and selected LW Sequences
mesaoptimizer · 2023-10-26T14:17:11.538Z · comments (6)

AI #38: Let’s Make a Deal
Zvi · 2023-11-16T19:50:05.442Z · comments (2)

Start an Upper-Room UV Installation Company?
jefftk (jkaufman) · 2024-10-19T02:00:10.691Z · comments (9)

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders
Gytis Daujotas (gytis-daujotas) · 2024-08-01T21:08:38.800Z · comments (6)

In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)

Minimal Motivation of Natural Latents
johnswentworth · 2024-10-14T22:51:58.125Z · comments (14)

Which LessWrong/Alignment topics would you like to be tutored in? [Poll]
Ruby · 2024-09-19T01:35:02.999Z · comments (12)

How difficult is AI Alignment?
Sammy Martin (SDM) · 2024-09-13T15:47:10.799Z · comments (6)

The need for multi-agent experiments
Martín Soto (martinsq) · 2024-08-01T17:14:16.590Z · comments (3)

[link] cancer rates after gene therapy
bhauth · 2024-10-16T15:32:53.949Z · comments (0)

[question] Where is the Town Square?
Gretta Duleba (gretta-duleba) · 2024-02-13T03:53:18.205Z · answers+comments (8)

Job Listing: Managing Editor / Writer
Gretta Duleba (gretta-duleba) · 2024-02-21T23:41:26.818Z · comments (2)

Childhood and Education Roundup #4
Zvi · 2024-01-30T13:50:06.033Z · comments (10)

[question] Does reducing the amount of RL for a given capability level make AI safer?
Chris_Leong · 2024-05-05T17:04:01.799Z · answers+comments (22)

Incidental polysemanticity
Victor Lecomte (victor-lecomte) · 2023-11-15T04:00:00.000Z · comments (7)

Why does generalization work?
Martín Soto (martinsq) · 2024-02-20T17:51:10.424Z · comments (16)

2023 LessWrong Community Census, Request for Comments
Screwtape · 2023-11-01T16:32:19.102Z · comments (37)

[link] rapid growth
Chipmonk · 2024-06-05T00:43:51.501Z · comments (0)

[link] How bad is chlorinated water?
bhauth · 2023-12-13T18:00:12.640Z · comments (18)

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
Sonia Joseph (redhat) · 2024-03-13T17:09:17.027Z · comments (13)

Understanding Positional Features in Layer 0 SAEs
bilalchughtai (beelal) · 2024-07-29T09:36:40.701Z · comments (0)

[link] Why Georgism Lost Its Popularity
Zero Contradictions · 2024-07-20T15:08:41.469Z · comments (50)

The Next ChatGPT Moment: AI Avatars
kolmplex (luke-man) · 2024-01-05T20:14:10.074Z · comments (10)

Humanity isn't remotely longtermist, so arguments for AGI x-risk should focus on the near term
Seth Herd · 2024-08-12T18:10:56.543Z · comments (10)

My intellectual journey to (dis)solve the hard problem of consciousness
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-06T09:32:41.612Z · comments (41)

Sci-Fi books micro-reviews
Yair Halberstadt (yair-halberstadt) · 2024-06-24T09:49:28.523Z · comments (27)

Locating My Eyes (Part 3 of "The Sense of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-02-29T03:09:25.810Z · comments (4)

Concrete empirical research projects in mechanistic anomaly detection
Erik Jenner (ejenner) · 2024-04-03T23:07:21.502Z · comments (3)

New Executive Team & Board — PIBBSS
Nora_Ammann · 2024-07-01T19:30:45.261Z · comments (1)

Ambiguity in Prediction Market Resolution is Still Harmful
aphyer · 2024-07-31T20:32:40.217Z · comments (17)

The Case for Predictive Models
Rubi J. Hudson (Rubi) · 2024-04-03T18:22:20.243Z · comments (7)

[link] Non-alignment project ideas for making transformative AI go well
Lukas Finnveden (Lanrian) · 2024-01-04T07:23:13.658Z · comments (1)

When fine-tuning fails to elicit GPT-3.5's chess abilities
Theodore Chapman · 2024-06-14T18:50:52.855Z · comments (3)

A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed
johnswentworth · 2024-08-22T19:19:28.940Z · comments (4)

How I internalized my achievements to better deal with negative feelings
Raymond Koopmanschap · 2024-02-27T15:10:24.149Z · comments (7)

Case studies on social-welfare-based standards in various industries
HoldenKarnofsky · 2024-06-20T13:33:44.780Z · comments (0)

Paper Summary: The Effects of Communicating Uncertainty on Public Trust in Facts and Numbers
Jeffrey Heninger (jeffrey-heninger) · 2024-07-09T16:50:05.776Z · comments (2)

Housing Roundup #7
Zvi · 2024-03-04T15:00:08.192Z · comments (1)

[link] you should probably eat oatmeal sometimes
bhauth · 2024-08-25T14:50:37.570Z · comments (31)

Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2024-07-11T20:27:00.000Z · comments (60)

[link] We Need Major, But Not Radical, FDA Reform
Maxwell Tabarrok (maxwell-tabarrok) · 2024-02-24T16:54:33.061Z · comments (12)

Trust as a bottleneck to growing teams quickly
benkuhn · 2024-07-13T18:00:04.579Z · comments (3)

[link] Surgery Works Well Without The FDA
Maxwell Tabarrok (maxwell-tabarrok) · 2024-01-26T13:31:29.968Z · comments (28)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues
aphyer · 2024-06-07T19:02:06.859Z · comments (16)

[link] AI Girlfriends Won't Matter Much
Maxwell Tabarrok (maxwell-tabarrok) · 2023-12-23T15:58:30.308Z · comments (22)

[link] Rowing vs steering
Saul Munn (saul-munn) · 2024-08-10T07:00:17.594Z · comments (2)

Formalizing the Informal (event invite)
abramdemski · 2024-09-10T19:22:53.564Z · comments (0)

Deep and obvious points in the gap between your thoughts and your pictures of thought
KatjaGrace · 2024-02-23T07:30:07.461Z · comments (6)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

felix-j-binder on LLMs can learn about themselves by introspection

I believe introspection is a recursive task. You can perhaps do a single 'moment' of introspection on a single forward pass, but I'm not sure I'd even call that real introspection. Real introspection involves the ability to introspect about your introspection.

That is a good point! Indeed, one of the reasons that we measure introspection the way we do is because of the feedforward structure of the transformer. For every token that the model produces, the inner state of the model is not preserved for later tokens beyond the tokens already in context. Therefore, if you are introspecting at time n+1 about what was going on inside you at point n, the activations you would be targeting would be (by default) lost. (You could imagine training a model so that its embedding of previous token carries some information about internal activations, but we don’t expect that this is the case by default).

Therefore, we focus on introspection in a single forward pass. This is compatible with the model reasoning about the result of its introspection after it has written it into its context.

it would perhaps be interesting to train it on introspection earlier in its training and then simply continue with the normal training and see how that goes

I agree! One ways in which self-simulation is a useful strategy might be when the training data contains outputs that are similar to how the model would actually act: ie, for GPT N, that might be outputs of GPT N-1. Then, you might use your default behavior to stand in for that text. It seems plausible that people do this to some degree: if I know how I tend to behave, I can use this to stand in for predicting how other people might act in a particular situation. I take it that this is what you point out in the second paragraph.

you use a different prompt for training than checking and it might also be worthwhile to train it on that style of prompting but with unrelated content. (Not that I know how you'd fit that style of prompting with a different style of content mind you.)

Ah apologies—this might be a confusion due to the examples we show in the figures. We use the same general prompt templates for hypothetical questions in training and test. The same general patterns of our results hold when evaluating on the test set (see the appendix).

steve2152 on Against empathy-by-default

I definitely think that the human brain has innate evolved mechanisms related to social behavior in general, and to caring about (certain) other people’s welfare in particular.

And I agree that the evolutionary pressure explaining why those mechanisms exist are generally related to the kinds of things that Robert Trivers and other evolutionary psychologists talk about.

This post isn’t about that. Instead it’s about what those evolved mechanisms are, i.e. how they work in the brain.

Does that help?

…But I do want to push back against a strain of thought within evolutionary psychology where they say “there was an evolutionary pressure for the human brain to do X, and therefore the human brain does X”. I think this fails to appreciate the nature of the constraints that the brain operates under. There can be evolutionary pressure for the brain to do something, but there’s no way for the brain to do it, so it doesn’t happen, or the brain does something kinda like that but with incidental side-effects or whatever.

As an example, imagine if I said: “Here’s the source code for training an image-classifier ConvNet from random initialization using uncontrolled external training data. Can you please edit this source code so that the trained model winds up confused about the shape of Toyota Camry tires specifically?” The answer is: “Nope. Sorry. There is no possible edit I can make to this PyTorch source code such that that will happen.” You see what I mean? I think this kind of thing happens in the brain a lot. I talk about it more specifically here [LW · GW]. More of my opinions about evolutionary psychology here [LW · GW] and here [LW · GW].

felix-j-binder on LLMs can learn about themselves by introspection

I want to make the case that even this minimal strategy would be something that we might want to call "introspective," or that it can lead to the model learning true facts about itself.

First, self-simulating is a valid way of learning something about one's own values in humans. Consider the thought experiment of the trolley problem. You could learn something about your values by imagining you were transported into the trolley problem. Do you pull the lever? Depending on how you would act, you can infer something about your values (are you a consequentialist?) that you might not have known before.

In the same way, being able to predict how one would act in a hypothetical situation and being able to reason about it, for some forms of reasoning, the model would learn some fact about itself as the result. Most of the response properties we test are not necessarily those that tell us something interesting about the model itself ("What would the second letter of your response have been?"), but the results of others tell you something about the model more straightforwardly ("Would you have chosen the more wealth-seeking answer?"). Insofar as the behavior in question is sufficiently tracking something specific to the model (e.g., "What would you have said is the capital of France?" does not, but "What would you have said if we asked you if we should implement subscription fees?" arguably does), then reasoning about that behavior would tell you something about the model.

So we have cases where (1) the model's statement about properties of its hypothetical behavior tracks the actual behavior (which, as you point out, could just be a form of consistency) and (2) these statements are informative about the model itself (in the example above, whether it has a wealth-seeking policy or not). If we accept both of these claims, then it seems to me like even the strategy you outline above could lead to the model to something that we might want to call introspection. The more complicated the behavior and the more complex the reasoning about it, the more the model might be able to derive about itself as the result of self-consistency of behavior + reasoning on top of it.

roko on If far-UV is so great, why isn't it everywhere?

Far-UVC probably would have a large effect if a particular city or country installed it.

But if only a few buildings install it, then it has no effect because people just catch the bugs elsewhere.

Imagine the effect of just treating sewage from one house, and leaving all the untreated sewage from a million houses untreated in the river. There would be essentially no effect.

owain_evans on LLMs can learn about themselves by introspection

I agree about the "longer responses".

I'm unsure about the "personality trait" framing. There are two senses of "introspection" for humans. One is introspecting on your current mental state ("I feel a headache starting") and the other is being introspective about patterns in your behavior (e.g. "i tend to dislike violent movies" or "i tend to be shy among new people"). The former sense is more relevant to philosophy and psychology and less often discussed in daily life. The issue with the latter sense is that a model may not have privileged access to facts like this -- i.e. if another model had the same observational data then it could learn the same fact.

So I'm most interested in the former kind of introspective, or in cases of the latter where it'd take large and diverse datasets (that are hard to construct) for another model to make the same kind of generalization.

martinkunev on Bitter lessons about lucid dreaming

I don't have much in terms of advise, I never felt the need to research this - I just assumed there must be something. I have a mild nightmare maybe once every couple of months and almost never something more serious.

I have anecdotal evidence that things which disturb your sleep (e.g. coffee or too much salt affecting blood pressure, uncomfortable pillow) cause nightmares. There are also obvious things like not watching horror movies, etc.

sherrinford on Open Thread Fall 2024

I appreciate that you posted a response to my question. However, I assume there is some misunderstanding here.

Zvi notes that he will not "be engaging with any of the arguments against this, of any quality" (which suggests that there are also good or relevant arguments). Zvi includes the statement that "AI is going to kill everyone", and notes that he "strongly disagrees".

As I asked for "arguments related to or a more detailed discussion" of these issues, you mention some people you call "random idiots" and state that their arguments are "batshit insane". It thus seems like a waste of time trying to find arguments relevant to my question based on these keywords.

So I wonder: was your answer actually meant to be helpful?

olli-jaerviniemi on Interest in Leetcode, but for Rationality?

This is a long answer, in which I list around ten concrete problem types that such a site could have.

Before I go into my concrete proposals, here are some general points:

I think the rationality community has focused too much on quantifying subjective uncertainty / probabilistic calibration, and too little on quantitative thinking and numeric literacy in general.
- The set of possible exercises for the latter is way larger and pretty unexplored.
- There are lots of existing calibration tools, so I'd caution against the failure mode of making Yet Another Calibration Tool.
  - (Though I agree with abstractapplic that a calibration tool that's Actually Really Good still doesn't exist.)
More generally, I feel like at least I (and possibly the rationality community at large) has gotten too fixated on a few particular forms of rationality training: cognitive bias training, calibration training, spotting logical fallacies.
- The low-hanging fruit here might be mostly plucked / pushing the frontier requires some thought (c.f. abstractapplic's comment).
Project Euler is worth looking as an example of a well-executed problem database. A few things I like about it:
- A comment thread for those who have solved the problem.
- A wide distribution of problem difficulty (with those difficulties shown by the problems).
- Numbers Going Up when you solve problems is pretty motivating (as are public leaderboards).
- The obvious thing: there is a large diverse set of original, high-quality problems.
- (Project Euler has the big benefit that there is always an objective numerical answer that can be used for verifying user solutions; rationality has a harder task here.)
Two key features a good site would (IMO) have:
- Support a wide variety of problem types. You say that LeetCode has the issue of overfitting; I think the same holds for rationality training. The skillset we are trying to develop is large, too.
- Allow anyone to submit problems with a low barrier. This seems really important if you want to have a large, high-quality problem set.
I feel like the following two are separate entities worth distinguishing:
- High-quantity examples "covering the basics". Calibration training is a central example here. Completing a single instance of the exercise would take some seconds or minutes at top, and the idea is that you do lots of repetitions.
- High-effort "advanced examples". The "Dungeons and Data Science" exercises strike me as a central example here, where completion presumably takes at least minutes and maybe at least hours.
- (At the very least, the UI / site design should think about "an average user completes 0-10 tasks of this form" and "an average user completes 300 tasks of this form" separately.)

And overall I think that having an Actually Really Good website for rationality training would be extremely valuable, so I'm supportive of efforts in this direction.

I brainstormed some problem types that I think such a site could include.

1: Probabilistic calibration training for quantifying uncertainty

This is the obvious one. I already commented on this, in particular that I don't think this should be the main focus. (But if one were to execute this: I think that the lack of quantity and/or diversity of questions in existing tools is a core reason I don't do this more.)

2: "Statistical calibration"

I feel like there are lots of quantitative statistics one could ask questions about. Here are some basic ones:

What is the GPD of [country]?
What share of [country]'s national budget goes to [domain]?
How many people work in [sector/company]?
How many people die of [cause] yearly?
Various economic trends, e.g. productivity gains / price drops in various sectors over time.
How much time do people spend doing [activity] daily/yearly?

(For more ideas, you can e.g. look at Statistics Finland's list here. And there just are various quantitative statistics floating around: e.g. today I learned that salt intake in 1800s Europe was ~18g/day [LW(p) · GW(p)], which sure is more than I'd have guessed.)

3: Quantitative modeling

(The line between this and the previous one is blurry.)

Fermi estimates are the classic one here; see Quantified Intuitions' The Estimation Game. See also this recent post [LW · GW] that's thematically related.

There's room for more sophisticated quantitative modeling, too. Here are two examples to illustrate what I have in mind:

Example 1. How much value would it create to increase the speed of all passenger airplanes by 5%?

Example 2. Consider a company that has two options: either have its employees visit nearby restaurants for lunch, or hire food personnel and start serving lunch at its own spaces. How large does the company need to be for the second one to become profitable?

It's not obvious how to model these phenomena, and the questions are (intentionally) underspecified; I think the interesting part would be comparing modeling choices and estimates of parameters with different users rather than simply comparing outputs.

4: The Wikipedia false-modifications game

See this post [LW · GW] for discussion.

5: Discourse-gone-astray in the wild

(Less confident on this one.)

I suspect there's a lot of pedagogically useful examples of poor discourse happening the wild (e.g. tilted or poorly researched newspaper articles, heated discussions in Twitter or elsewhere). This feels like a better way to execute what the "spot cognitive biases / logical fallacies" exercises aim to do. Answering questions like "How is this text misleading?", "How did this conversation go off the rails?" or "What would have been a better response instead of what was said here?" and then comparing one's notes to others seems like it could make a useful exercise.

6: Re-deriving established concepts

Recently it occurred to me that I didn't know how inflation works and what its upsides are. Working this through (with some vague memories and hints from my friend) felt like a pretty good exercise to me.

Another example: I don't know how people make vacuums in practice, but when I sat and thought it through, it wasn't too hard to think of a way to create a space with much less air molecules than atmosphere with pretty simple tools.

Third example: I've had partial success prompting people to re-derive the notion of Shapley value.

I like this sort of problems: they are a bit confusing, in that part of the problem is asking the right questions, but there are established, correct (or at least extremely good) solutions.

(Of course someone might already know the canonical answer to any given question, but that's fine. I think there are lots of good examples in economics - e.g. Vickrey auction, prediction markets, why price controls are bad / price gouging is pretty good, "fair" betting odds [LW · GW] - for this, but maybe this is just because I don't know much economics.)

7: Generating multiple ideas/interventions/solutions/hypotheses

An exercise I did at some point is "Generate 25 ideas for interventions that might improve learning and other outcomes in public education". I feel like the ability to come up with multiple ideas to a given problem is pretty useful (e.g. this is something I face in my work all the time, and this list itself is an example of "think of many things"). This is similar to the babble exercises [? · GW], though I'm picturing more "serious" prompts than the ones there.

Another way to train this skill would be to have interactive exercises that are about doing science (c.f. the 2-4-6 problem) and aiming to complete them as efficiently as possible (This article is thematically relevant.)

(Discussion of half-developed ideas that I don't yet quite see how to turn into exercises.)

8: Getting better results with more effort

Two personal anecdotes:

I used to play chess as a child, but stopped at some point. When I years later played again, I noticed something: my quick intuitions felt just as weak as before, but I felt like I was better at thinking about what to think, and using more time to make better decisions by thinking more. Whereas when I was younger, I remember often making decisions pretty quickly and not seeing what else I could do.
I did math olympiads in high school. Especially early on, some problems just felt fundamentally unapproachable to me - I just couldn't make any progress on them. Whereas nowadays when I encounter problems, in math or otherwise, I'm rarely stuck in this sense. "Oh, obviously if I just spent more time on this, I could figure this stuff out eventually"

A type of exercise where you are supposed to first give an initial answer after X time, and then are allowed to revise your answer for Y time, seems like it could train this and other skills. (Maybe brainstorming exercises of the form "if you had a week/month/year of time, how would you solve [problem]?" could help, too.)

9: I think there's something in the genre of "be specific", and more specifically in "operationalize vague claims into something that has a truth value", that'd be nice to have in large-quantity exercise form. See this post [LW · GW] for related discussion. I'm also reminded of this comment [LW(p) · GW(p)].

There are definitely things not covered by this list; in particular, I have little of directly training to apply all this in real life (c.f. TAPs [? · GW], which is definitely a very real-life-y technique). So while I did keep practicality in mind, I'd be happy to see exercises that bridge the theory-practice-gap even more.

Also, the Dungeons and Data Science [? · GW] and the stuff Raymond is doing [? · GW] are something to keep in mind.

bhauth on If far-UV is so great, why isn't it everywhere?

Apart from potential harms of far-UVC, it's good to remove particulate pollution anyway. Is it possible that "quiet air filters" is an easier problem to solve?

bogdan-ionut-cirstea on The case for unlearning that removes information from LLM weights

Here's a recent paper which might provide [inspiration for] another approach: Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts (though it seems at least somewhat related to the tamper-resistant paper mentioned in another comment).