LessWrong 2.0 Reader

Secret Collusion: Will We Know When to Unplug AI?
schroederdewitt · 2024-09-16T16:07:01.119Z · comments (7)

A Path out of Insufficient Views
Unreal · 2024-09-24T20:00:27.332Z · comments (46)

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs
Michaël Trazzi (mtrazzi) · 2024-08-24T04:30:11.807Z · comments (0)

[link] Unlocking Solutions—By Understanding Coordination Problems
James Stephen Brown (james-brown) · 2024-07-27T04:52:13.435Z · comments (4)

Thiel on AI & Racing with China
Ben Pace (Benito) · 2024-08-20T03:19:18.966Z · comments (10)

[link] On the Role of Proto-Languages
adamShimi · 2024-09-22T16:50:34.720Z · comments (1)

Safe Predictive Agents with Joint Scoring Rules
Rubi J. Hudson (Rubi) · 2024-10-09T16:38:16.535Z · comments (10)

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs
Patrick Leask (patrickleask) · 2024-08-17T01:16:53.764Z · comments (0)

AI #76: Six Shorts Stories About OpenAI
Zvi · 2024-08-08T13:50:04.659Z · comments (10)

The Geometry of Feelings and Nonsense in Large Language Models
7vik (satvik-golechha) · 2024-09-27T17:49:27.420Z · comments (10)

Model evals for dangerous capabilities
Zach Stein-Perlman · 2024-09-23T11:00:00.866Z · comments (9)

Parental Writing Selection Bias
jefftk (jkaufman) · 2024-10-13T14:00:03.225Z · comments (3)

Provably Safe AI: Worldview and Projects
bgold · 2024-08-09T23:21:02.763Z · comments (43)

[link] Anthropic's updated Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · comments (3)

Rewilding the Gut VS the Autoimmune Epidemic
GGD · 2024-08-16T18:00:46.239Z · comments (0)

How to Give in to Threats (without incentivizing them)
Mikhail Samin (mikhail-samin) · 2024-09-12T15:55:50.384Z · comments (25)

Llama Llama-3-405B?
Zvi · 2024-07-24T19:40:07.565Z · comments (9)

Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.
Andrew_Critch · 2024-09-11T04:41:24.872Z · comments (7)

[link] Peak Human Capital
PeterMcCluskey · 2024-09-30T21:13:30.421Z · comments (2)

Unlearning via RMU is mostly shallow
Andy Arditi (andy-arditi) · 2024-07-23T16:07:52.223Z · comments (3)

[link] Prices are Bounties
Maxwell Tabarrok (maxwell-tabarrok) · 2024-10-12T14:51:40.689Z · comments (12)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

Applications of Chaos: Saying No (with Hastings Greer)
Elizabeth (pktechgirl) · 2024-09-21T16:30:07.415Z · comments (16)

Interoperable High Level Structures: Early Thoughts on Adjectives
johnswentworth · 2024-08-22T21:12:38.223Z · comments (1)

[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)

Please do not use AI to write for you
Richard_Kennaway · 2024-08-21T09:53:34.425Z · comments (34)

Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)

Evaluating the truth of statements in a world of ambiguous language.
Hastings (hastings-greer) · 2024-10-07T18:08:09.920Z · comments (19)

[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)

The Fragility of Life Hypothesis and the Evolution of Cooperation
KristianRonn · 2024-09-04T21:04:49.878Z · comments (6)

SRE's review of Democracy
Martin Sustrik (sustrik) · 2024-08-03T07:20:01.483Z · comments (2)

AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)

Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes
Anna Gajdova (anna-gajdova) · 2024-10-09T12:56:24.856Z · comments (14)

Extended Interview with Zhukeepa on Religion
Ben Pace (Benito) · 2024-08-18T03:19:05.625Z · comments (58)

Caring about excellence
owencb · 2024-07-22T14:24:37.892Z · comments (4)

[link] Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
ChristianWilliams · 2024-10-10T18:58:46.041Z · comments (2)

[Intuitive self-models] 5. Dissociative Identity (Multiple Personality) Disorder
Steven Byrnes (steve2152) · 2024-10-15T13:31:46.157Z · comments (6)

I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)

AI #75: Math is Easier
Zvi · 2024-08-01T13:40:05.539Z · comments (25)

All The Latest Human tFUS Studies
sarahconstantin · 2024-08-09T22:20:04.561Z · comments (2)

Bounty for Evidence on Some of Palisade Research's Beliefs
benwr · 2024-09-23T20:01:20.917Z · comments (4)

Decision Theory in Space
lsusr · 2024-08-18T07:02:11.847Z · comments (18)

[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (6)

How to hire somebody better than yourself
lukehmiles (lcmgcd) · 2024-08-28T08:12:53.450Z · comments (5)

Forecasting One-Shot Games
Raemon · 2024-08-31T23:10:05.475Z · comments (0)

Untrustworthy models: a frame for scheming evaluations
Olli Järviniemi (jarviniemi) · 2024-08-19T16:27:11.088Z · comments (3)

Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)

[link] Michael Dickens' Caffeine Tolerance Research
niplav · 2024-09-04T15:41:53.343Z · comments (3)

[link] MIRI's September 2024 newsletter
Harlan · 2024-09-16T18:15:40.785Z · comments (0)

Work with me on agent foundations: independent fellowship
Alex_Altair · 2024-09-21T13:59:16.706Z · comments (5)


Recent comments

felix-j-binder on LLMs can learn about themselves by introspection

I want to make the case that even this minimal strategy would be something that we might want to call "introspective," or that it can lead to the model learning true facts about itself.

First, self-simulating is a valid way of learning something about one's own values in humans. Consider the thought experiment of the trolley problem. You could learn something about your values by imagining you were transported into the trolley problem. Do you pull the lever? Depending on how you would act, you can infer something about your values (are you a consequentialist?) that you might not have known before.

In the same way, if the model can predict how it would act in a hypothetical situation and can reason about that prediction, then, for some forms of reasoning, it would learn some fact about itself as a result. Most of the response properties we test are not necessarily those that tell us something interesting about the model itself ("What would the second letter of your response have been?"), but the results of others tell you something about the model more straightforwardly ("Would you have chosen the more wealth-seeking answer?"). Insofar as the behavior in question sufficiently tracks something specific to the model (e.g., "What would you have said is the capital of France?" does not, but "What would you have said if we asked you if we should implement subscription fees?" arguably does), reasoning about that behavior would tell you something about the model.

So we have cases where (1) the model's statement about properties of its hypothetical behavior tracks the actual behavior (which, as you point out, could just be a form of consistency) and (2) these statements are informative about the model itself (in the example above, whether it has a wealth-seeking policy or not). If we accept both of these claims, then it seems to me like even the strategy you outline above could lead the model to something that we might want to call introspection. The more complicated the behavior and the more complex the reasoning about it, the more the model might be able to derive about itself as the result of self-consistency of behavior + reasoning on top of it.

roko on If far-UV is so great, why isn't it everywhere?

Far-UVC probably would have a large effect if a particular city or country installed it.

But if only a few buildings install it, then it has no effect because people just catch the bugs elsewhere.

Imagine the effect of treating the sewage from just one house while leaving the sewage from a million other houses untreated in the river. There would be essentially no effect.

owain_evans on LLMs can learn about themselves by introspection

I agree about the "longer responses".

I'm unsure about the "personality trait" framing. There are two senses of "introspection" for humans. One is introspecting on your current mental state ("I feel a headache starting") and the other is being introspective about patterns in your behavior (e.g. "I tend to dislike violent movies" or "I tend to be shy among new people"). The former sense is more relevant to philosophy and psychology and less often discussed in daily life. The issue with the latter sense is that a model may not have privileged access to facts like this -- i.e. if another model had the same observational data then it could learn the same fact.

So I'm most interested in the former kind of introspection, or in cases of the latter where it'd take large and diverse datasets (that are hard to construct) for another model to make the same kind of generalization.

martinkunev on Bitter lessons about lucid dreaming

I don't have much in terms of advice, I never felt the need to research this - I just assumed there must be something. I have a mild nightmare maybe once every couple of months and almost never something more serious.

I have anecdotal evidence that things which disturb your sleep (e.g. coffee or too much salt affecting blood pressure, uncomfortable pillow) cause nightmares. There are also obvious things like not watching horror movies, etc.

sherrinford on Open Thread Fall 2024

I appreciate that you posted a response to my question. However, I assume there is some misunderstanding here.

Zvi notes that he will not "be engaging with any of the arguments against this, of any quality" (which suggests that there are also good or relevant arguments). Zvi includes the statement that "AI is going to kill everyone", and notes that he "strongly disagrees".

As I asked for "arguments related to or a more detailed discussion" of these issues, you mention some people you call "random idiots" and state that their arguments are "batshit insane". It thus seems like a waste of time trying to find arguments relevant to my question based on these keywords.

So I wonder: was your answer actually meant to be helpful?

olli-jaerviniemi on Interest in Leetcode, but for Rationality?

This is a long answer, in which I list around ten concrete problem types that such a site could have.

Before I go into my concrete proposals, here are some general points:

- I think the rationality community has focused too much on quantifying subjective uncertainty / probabilistic calibration, and too little on quantitative thinking and numeric literacy in general.
  - The set of possible exercises for the latter is way larger and pretty unexplored.
  - There are lots of existing calibration tools, so I'd caution against the failure mode of making Yet Another Calibration Tool.
    - (Though I agree with abstractapplic that a calibration tool that's Actually Really Good still doesn't exist.)
- More generally, I feel like at least I (and possibly the rationality community at large) have gotten too fixated on a few particular forms of rationality training: cognitive bias training, calibration training, spotting logical fallacies.
  - The low-hanging fruit here might be mostly plucked / pushing the frontier requires some thought (cf. abstractapplic's comment).
- Project Euler is worth looking at as an example of a well-executed problem database. A few things I like about it:
  - A comment thread for those who have solved the problem.
  - A wide distribution of problem difficulty (with those difficulties shown by the problems).
  - Numbers Going Up when you solve problems is pretty motivating (as are public leaderboards).
  - The obvious thing: there is a large diverse set of original, high-quality problems.
  - (Project Euler has the big benefit that there is always an objective numerical answer that can be used for verifying user solutions; rationality has a harder task here.)
- Two key features a good site would (IMO) have:
  - Support a wide variety of problem types. You say that LeetCode has the issue of overfitting; I think the same holds for rationality training. The skillset we are trying to develop is large, too.
  - Allow anyone to submit problems with a low barrier. This seems really important if you want to have a large, high-quality problem set.
- I feel like the following two are separate entities worth distinguishing:
  - High-quantity examples "covering the basics". Calibration training is a central example here. Completing a single instance of the exercise would take some seconds or minutes at most, and the idea is that you do lots of repetitions.
  - High-effort "advanced examples". The "Dungeons and Data Science" exercises strike me as a central example here, where completion presumably takes at least minutes and maybe at least hours.
  - (At the very least, the UI / site design should think about "an average user completes 0-10 tasks of this form" and "an average user completes 300 tasks of this form" separately.)

And overall I think that having an Actually Really Good website for rationality training would be extremely valuable, so I'm supportive of efforts in this direction.

I brainstormed some problem types that I think such a site could include.

1: Probabilistic calibration training for quantifying uncertainty

This is the obvious one. I already commented on this, in particular that I don't think this should be the main focus. (But if one were to execute this: I think that the lack of quantity and/or diversity of questions in existing tools is a core reason I don't do this more.)
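For concreteness, here is a minimal sketch of how a calibration exercise could be scored. It assumes each answer is recorded as a (stated probability, outcome) pair; the function names `brier_score` and `calibration_table` are made up for illustration and do not come from any existing tool.

```python
from collections import defaultdict


def brier_score(forecasts):
    """Mean squared error between stated probabilities and 0/1 outcomes.

    `forecasts` is a list of (probability, outcome) pairs, where outcome
    is True if the statement turned out to be correct. Lower is better.
    """
    return sum((p - float(o)) ** 2 for p, o in forecasts) / len(forecasts)


def calibration_table(forecasts, n_bins=10):
    """Group forecasts into probability bins and compare stated vs. observed.

    Returns a list of (mean stated probability, observed hit rate, count)
    tuples, one per non-empty bin, ordered from low to high probability.
    """
    bins = defaultdict(list)
    for p, o in forecasts:
        # A probability of exactly 1.0 goes into the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        hit_rate = sum(1 for _, o in items if o) / len(items)
        table.append((mean_p, hit_rate, len(items)))
    return table
```

A well-calibrated user would see the hit rate in each bin land close to the mean stated probability of that bin; the Brier score compresses the same information into a single number, which is convenient for leaderboards but hides where the miscalibration is.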

2: "Statistical calibration"

I feel like there are lots of quantitative statistics one could ask questions about. Here are some basic ones:

- What is the GDP of [country]?
- What share of [country]'s national budget goes to [domain]?
- How many people work in [sector/company]?
- How many people die of [cause] yearly?
- Various economic trends, e.g. productivity gains / price drops in various sectors over time.
- How much time do people spend doing [activity] daily/yearly?

(For more ideas, you can e.g. look at Statistics Finland's list here. And there are just various quantitative statistics floating around: e.g. today I learned that salt intake in 1800s Europe was ~18g/day, which sure is more than I'd have guessed.)

3: Quantitative modeling

(The line between this and the previous one is blurry.)

Fermi estimates are the classic one here; see Quantified Intuitions' The Estimation Game. See also this recent post that's thematically related.

There's room for more sophisticated quantitative modeling, too. Here are two examples to illustrate what I have in mind:

Example 1. How much value would it create to increase the speed of all passenger airplanes by 5%?

Example 2. Consider a company that has two options: either have its employees visit nearby restaurants for lunch, or hire food personnel and start serving lunch on its own premises. How large does the company need to be for the second one to become profitable?

It's not obvious how to model these phenomena, and the questions are (intentionally) underspecified; I think the interesting part would be comparing modeling choices and estimates of parameters with different users rather than simply comparing outputs.
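As a rough sketch of what "comparing modeling choices and parameter estimates" could look like for a Fermi estimate, one option is to have users enter a range for each factor instead of a point value and propagate those ranges by sampling. Everything below is an assumption made for illustration: the helper names, the choice of log-uniform sampling, and especially the placeholder ranges in the example, which are not real data.

```python
import math
import random
import statistics


def sample_log_uniform(rng, low, high):
    """Draw a value whose order of magnitude is uniform between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def fermi_estimate(factors, n_samples=10_000, seed=0):
    """Monte Carlo Fermi estimate.

    `factors` is a list of (low, high, exponent) tuples. Each sample
    multiplies together one draw per factor, raised to its exponent
    (use exponent -1 for quantities that divide the result).
    Returns the 5th percentile, median, and 95th percentile.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        value = 1.0
        for low, high, exponent in factors:
            value *= sample_log_uniform(rng, low, high) ** exponent
        samples.append(value)
    samples.sort()
    return {
        "p05": samples[int(0.05 * n_samples)],
        "median": statistics.median(samples),
        "p95": samples[int(0.95 * n_samples)],
    }


# Hypothetical example: piano tuners in a city. All ranges are placeholders.
piano_tuner_factors = [
    (5e5, 2e6, 1),     # inhabitants
    (0.01, 0.05, 1),   # pianos per inhabitant
    (0.5, 2.0, 1),     # tunings per piano per year
    (500, 1500, -1),   # tunings one tuner can do per year
]
estimate = fermi_estimate(piano_tuner_factors)
```

The per-factor ranges are the interesting part to compare between users: two people can reach similar medians through very different decompositions, and the width of the p05-p95 interval shows which factors dominate the uncertainty.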

4: The Wikipedia false-modifications game

See this post for discussion.

5: Discourse-gone-astray in the wild

(Less confident on this one.)

I suspect there are a lot of pedagogically useful examples of poor discourse happening in the wild (e.g. tilted or poorly researched newspaper articles, heated discussions on Twitter or elsewhere). This feels like a better way to execute what the "spot cognitive biases / logical fallacies" exercises aim to do. Answering questions like "How is this text misleading?", "How did this conversation go off the rails?" or "What would have been a better response instead of what was said here?" and then comparing one's notes to others seems like it could make a useful exercise.

6: Re-deriving established concepts

Recently it occurred to me that I didn't know how inflation works and what its upsides are. Working this through (with some vague memories and hints from my friend) felt like a pretty good exercise to me.

Another example: I don't know how people make vacuums in practice, but when I sat and thought it through, it wasn't too hard to think of a way to create a space with far fewer air molecules than the atmosphere using pretty simple tools.

Third example: I've had partial success prompting people to re-derive the notion of Shapley value.

I like this sort of problem: they are a bit confusing, in that part of the problem is asking the right questions, but there are established, correct (or at least extremely good) solutions.

(Of course someone might already know the canonical answer to any given question, but that's fine. I think there are lots of good examples in economics - e.g. Vickrey auction, prediction markets, why price controls are bad / price gouging is pretty good, "fair" betting odds - for this, but maybe this is just because I don't know much economics.)
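For checking a re-derivation of the Shapley value, a small reference implementation helps. The sketch below uses the standard definition (a player's marginal contribution averaged over all orderings in which players could join); the three-player "glove game" is a toy example chosen here for illustration, not something from the exercises above.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders.

    `value` maps a frozenset of players to that coalition's worth.
    Brute force over all orderings, so only suitable for small games.
    """
    orders = list(permutations(players))
    totals = {p: Fraction(0) for p in players}
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += Fraction(value(with_p)) - Fraction(value(coalition))
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


def glove_game(coalition):
    """Toy game: A and B each hold a left glove, C holds the only right glove.

    A coalition is worth 1 if it can form a matching pair, else 0.
    """
    has_right = "C" in coalition
    has_left = "A" in coalition or "B" in coalition
    return 1 if has_right and has_left else 0


values = shapley_values(["A", "B", "C"], glove_game)
```

In this toy game C ends up with 2/3 and A and B with 1/6 each: C is needed for any pair, while A and B are substitutes for one another. Working out why by hand is a reasonable warm-up before attempting the general derivation.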

7: Generating multiple ideas/interventions/solutions/hypotheses

An exercise I did at some point is "Generate 25 ideas for interventions that might improve learning and other outcomes in public education". I feel like the ability to come up with multiple ideas for a given problem is pretty useful (e.g. this is something I face in my work all the time, and this list itself is an example of "think of many things"). This is similar to the babble exercises, though I'm picturing more "serious" prompts than the ones there.

Another way to train this skill would be to have interactive exercises that are about doing science (cf. the 2-4-6 problem), with the aim of completing them as efficiently as possible. (This article is thematically relevant.)

(Discussion of half-developed ideas that I don't yet quite see how to turn into exercises.)

8: Getting better results with more effort

Two personal anecdotes:

- I used to play chess as a child, but stopped at some point. When I played again years later, I noticed something: my quick intuitions felt just as weak as before, but I felt like I was better at thinking about what to think, and at using more time to make better decisions by thinking more. Whereas when I was younger, I remember often making decisions pretty quickly and not seeing what else I could do.
- I did math olympiads in high school. Especially early on, some problems just felt fundamentally unapproachable to me - I just couldn't make any progress on them. Whereas nowadays when I encounter problems, in math or otherwise, I'm rarely stuck in this sense. "Oh, obviously if I just spent more time on this, I could figure this stuff out eventually."

A type of exercise where you are supposed to first give an initial answer after X time, and then are allowed to revise your answer for Y time, seems like it could train this and other skills. (Maybe brainstorming exercises of the form "if you had a week/month/year of time, how would you solve [problem]?" could help, too.)

9: I think there's something in the genre of "be specific", and more specifically in "operationalize vague claims into something that has a truth value", that'd be nice to have in large-quantity exercise form. See this post for related discussion. I'm also reminded of this comment.

There are definitely things not covered by this list; in particular, I have little on directly training people to apply all this in real life (cf. TAPs, which is definitely a very real-life-y technique). So while I did keep practicality in mind, I'd be happy to see exercises that bridge the theory-practice gap even more.

Also, the Dungeons and Data Science exercises and the stuff Raymond is doing are worth keeping in mind.

bhauth on If far-UV is so great, why isn't it everywhere?

Apart from potential harms of far-UVC, it's good to remove particulate pollution anyway. Is it possible that "quiet air filters" is an easier problem to solve?

bogdan-ionut-cirstea on The case for unlearning that removes information from LLM weights

Here's a recent paper which might provide [inspiration for] another approach: Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts (though it seems at least somewhat related to the tamper-resistant paper mentioned in another comment).

cubefox on Taking nonlogical concepts seriously

But if $p$ is "It's a cat" and $q$ is "It has four legs", and $P$ describes our beliefs (or more precisely, say, my beliefs at 5 pm UTC October 20, 2024), then $P(q \mid p) > P(q)$. Which surely means $p$ is a materially good reason for $q$. But $p \nvdash q$, so the inference from $p$ to $q$ is still logically bad. So we don't have logicism about reasons in probability theory. Moreover, probability expressions are not invariant under substituting non-logical vocabulary. For example, if $r$ is "It has two legs", and we substitute $q$ with $r$, then $P(r \mid p) < P(r)$. Which can only mean the inference from $p$ to $r$ is materially bad.
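To make the two inequalities concrete, here is a toy sketch with a made-up belief distribution over four kinds of animal. The weights in `WORLDS` are purely illustrative assumptions, not anyone's actual credences; the point is only that conditioning on "cat" raises the probability of "four legs" and lowers that of "two legs", even though neither follows logically.

```python
from fractions import Fraction

# Hypothetical credences over (animal, number of legs); they sum to 1.
WORLDS = [
    ("cat", 4, Fraction(29, 100)),
    ("cat", 3, Fraction(1, 100)),   # a three-legged cat: p does not entail q
    ("dog", 4, Fraction(30, 100)),
    ("bird", 2, Fraction(40, 100)),
]


def always(world):
    return True


def prob(event, given=always):
    """P(event | given), computed by summing weights over WORLDS."""
    denom = sum(w[2] for w in WORLDS if given(w))
    numer = sum(w[2] for w in WORLDS if given(w) and event(w))
    return numer / denom


def is_cat(world):      # p: "It's a cat"
    return world[0] == "cat"


def four_legs(world):   # q: "It has four legs"
    return world[1] == 4


def two_legs(world):    # r: "It has two legs"
    return world[1] == 2


p_raises_q = prob(four_legs, given=is_cat) > prob(four_legs)
p_lowers_r = prob(two_legs, given=is_cat) < prob(two_legs)
```

Swapping `four_legs` for `two_legs` flips the direction of the inequality, which is the non-invariance under substitution of non-logical vocabulary described above.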

Laws of probability theory still impose a structure on relations between material concepts (there are still forms of monotonicity and transitivity), whereas the logical-expressivist order of explanation argues that the theoretician isn't entitled to a priori impose such a structure on all material concepts: rather, their job is to describe them.

I think the axioms of probability can be thought of as being relative to material conceptual relations. Specifically, the additivity axiom says that the probabilities of "mutually exclusive" statements can be added together to yield the probability of their disjunction. What does "mutually exclusive" mean? Logically inconsistent? Not necessarily. It could simply mean materially inconsistent. For example, "Bob is married" and "Bob is a bachelor" are (materially, though not logically) mutually exclusive. So their probabilities can be added to yield the probability of their disjunction. (This arguably also solves the problem of logical omniscience; see here.)

rogerdearnaley on Interpreting the Learning of Deceit

A great paper highly relevant to this. That suggests that lying is localized just under a third of the way into the layer stack, significantly earlier than I had proposed. My only question is whether the lie is created before (at an earlier layer than) the decision whether to say it, or after, and whether their approach located one or both of those steps. They're probing yes-no questions of fact, where assembling the lie seems trivial (it's just a NOT gate), but lying is generally a good deal more complex than that.