LessWrong 2.0 Reader


[question] Lying to chess players for alignment
Zane · 2023-10-25T17:47:15.033Z · answers+comments (54)

[link] Ilya Sutskever created a new AGI startup
harfe · 2024-06-19T17:17:17.366Z · comments (35)

Kids or No kids
Kids or no kids (grosseholz.f@gmail.com) · 2023-11-14T18:37:02.799Z · comments (10)

[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)

[link] Explaining Impact Markets
Saul Munn (saul-munn) · 2024-01-31T09:51:27.587Z · comments (2)

Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)

[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)

I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)

On Claude 3.5 Sonnet
Zvi · 2024-06-24T12:00:05.719Z · comments (14)

[question] How to get nerds fascinated about mysterious chronic illness research?
riceissa · 2024-05-27T22:58:29.707Z · answers+comments (50)

[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (3)

[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (13)

[link] Ideological Bayesians
Kevin Dorst · 2024-02-25T14:17:25.070Z · comments (4)

Counting arguments provide no evidence for AI doom
Nora Belrose (nora-belrose) · 2024-02-27T23:03:49.296Z · comments (188)

Why comparative advantage does not help horses
Sherrinford · 2024-09-30T22:27:57.450Z · comments (10)

Announcing FAR Labs, an AI safety coworking space
bgold · 2023-09-29T16:52:37.753Z · comments (0)

OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)

Catching AIs red-handed
ryan_greenblatt · 2024-01-05T17:43:10.948Z · comments (20)

[link] Things You’re Allowed to Do: University Edition
Saul Munn (saul-munn) · 2024-02-06T00:36:11.690Z · comments (13)

[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)

[link] RAND report finds no effect of current LLMs on viability of bioterrorism attacks
StellaAthena · 2024-01-25T19:17:30.493Z · comments (14)

Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (22)

Symbol/Referent Confusions in Language Model Alignment Experiments
johnswentworth · 2023-10-26T19:49:00.718Z · comments (44)

Investigating the learning coefficient of modular addition: hackathon project
Nina Panickssery (NinaR) · 2023-10-17T19:51:29.720Z · comments (5)

What's A "Market"?
johnswentworth · 2023-08-08T23:29:24.722Z · comments (16)

[link] Logical Share Splitting
DaemonicSigil · 2023-09-11T04:08:32.350Z · comments (16)

It's time for a self-reproducing machine
Carl Feynman (carl-feynman) · 2024-08-07T21:52:22.819Z · comments (68)

Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper
Neel Nanda (neel-nanda-1) · 2023-10-23T22:38:33.951Z · comments (12)

Apollo Research 1-year update
Marius Hobbhahn (marius-hobbhahn) · 2024-05-29T17:44:32.484Z · comments (0)

Notes on Dwarkesh Patel’s Podcast with Demis Hassabis
Zvi · 2024-03-01T16:30:08.687Z · comments (0)

Towards a Less Bullshit Model of Semantics
johnswentworth · 2024-06-17T15:51:06.060Z · comments (44)

[link] Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity
GideonF · 2024-07-15T18:37:40.232Z · comments (17)

[link] Atoms to Agents Proto-Lectures
johnswentworth · 2023-09-22T06:22:05.456Z · comments (14)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (1)

OpenAI: The Board Expands
Zvi · 2024-03-12T14:00:04.110Z · comments (1)

Takeoff speeds presentation at Anthropic
Tom Davidson (tom-davidson-1) · 2024-06-04T22:46:35.448Z · comments (0)

On attunement
Joe Carlsmith (joekc) · 2024-03-25T12:47:34.856Z · comments (8)

[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (33)

Trying to understand John Wentworth's research agenda
johnswentworth · 2023-10-20T00:05:40.929Z · comments (13)

SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)

[link] The Soul Key
Richard_Ngo (ricraz) · 2023-11-04T17:51:53.176Z · comments (9)

Everything Wrong with Roko's Claims about an Engineered Pandemic
EZ97 · 2024-02-22T15:59:08.439Z · comments (10)

Just admit that you’ve zoned out
joec · 2024-06-04T02:51:27.594Z · comments (22)

How to train your own "Sleeper Agents"
evhub · 2024-02-07T00:31:42.653Z · comments (11)

Quotes from Leopold Aschenbrenner’s Situational Awareness Paper
Zvi · 2024-06-07T11:40:03.981Z · comments (10)

[link] Biological Anchors: The Trick that Might or Might Not Work
Scott Alexander (Yvain) · 2023-08-12T00:53:30.159Z · comments (3)

Circular Reasoning
abramdemski · 2024-08-05T18:10:32.736Z · comments (36)

[link] Linkpost: They Studied Dishonesty. Was Their Work a Lie?
Linch · 2023-10-02T08:10:51.857Z · comments (12)

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders
Johnny Lin (hijohnnylin) · 2024-03-25T21:17:58.421Z · comments (7)

New page: Integrity
Zach Stein-Perlman · 2024-07-10T15:00:41.050Z · comments (3)


Recent comments

lsusr on Hell is wasted on the evil

If you value doing good, then your values will be satisfied better by living in a horrible world than a utopia.

gwern on Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes

Also, note that the 'Blue LED' was not originally my example at all, someone else brought it up as an example.

Then maybe you shouldn't be trying to defend it (or your other two examples of engines and programming languages, for that matter), especially given that you still have not explained how 'the LED' could have been given a Nobel ever inasmuch as everyone involved was dead.

michaeldickens on leogao's Shortform

I don't think I understand what "learn to be visibly weird" means, and how it differs from not following social conventions because you fail to understand them correctly.

roman-malov on Hell is wasted on the evil

I'm not a native speaker, can someone please explain the meaning of "Hell is wasted on the evil" in simpler terms?

deepthoughtlife on Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes

The point is that the 'Blue LED' is not a sufficient advancement over the 'LED', not that it is a snub. I don't care about whether or not it is a snub. That's just not how I think about things like this. Also, note that the 'Blue LED' was not originally my example at all, someone else brought it up as an example.

I talked about 'inventing LEDs at all' since that is the minimum related thing where it might actually have been enough of a breakthrough in physics to matter. Blue LEDs are simply not a significant enough change from what we already had. Even just the switch to making white LEDs (from blue), which simply required a phosphor of the right kind (or multiple colors), was much more significant in terms of applications, if that is what you think is important.

steve2152 on Against empathy-by-default

Sorry for oversimplifying your views, thanks for clarifying. :)

Here’s a part I especially disagree with:

Over time, in societies with well-functioning social and legal systems, most people learn that hurting other people doesn't actually help them selfishly. This causes them to adopt a general presumption against committing violence, theft, and other anti-social acts themselves, as a general principle. This general principle seems to be internalized in most people's minds as not merely "it is not in your selfish interest to hurt other people" but rather "it is morally wrong to hurt other people". In other words, people internalize their presumption as a moral principle, rather than as a purely practical principle. This is what prevents people from stabbing each other in the backs immediately once the environment changes.

Just to be clear, I imagine we’ll both agree that if some behavior is always a good idea, it can turn into an unthinking habit. For example, today I didn’t take all the cash out of my wallet and shred it—not because I considered that idea and decided that it’s a bad idea, but rather because it never crossed my mind to do that in the first place. Ditto with my (non)-decision to not plan a coup this morning. But that’s very fragile (it relies on ideas not crossing my mind), and different from what you’re talking about.

My belief is: Neurotypical people have an innate drive to notice, internalize, endorse, and take pride in following social norms, especially behaviors that they imagine would impress the people whom they like and admire in turn. (And I have ideas about how this works in the brain! I think it’s mainly related to what I call the “drive to be liked / admired”, general discussion here, more neuroscience details coming soon I hope.)

The object-level content of these norms is different in different cultures and subcultures and times, for sure. But the special way that we relate to these norms has an innate aspect; it’s not just a logical consequence of existing and having goals etc. How do I know? Well, the hypothesis “if X is generally a good idea, then we’ll internalize X and consider not-X to be dreadfully wrong and condemnable” is easily falsified by considering any other aspect of life that doesn’t involve what other people will think of you. It’s usually a good idea to wear shoes that are comfortable, rather than too small. It’s usually a good idea to use a bookmark instead of losing your place every time you put your book down. It’s usually a good idea to sleep on your bed instead of on the floor next to it. Etc. But we just think of all those things as good ideas, not moral rules; and relatedly, if the situation changes such that those things become bad ideas after all for whatever reason, we’ll immediately stop doing them with no hesitation. (If this particular book is too fragile for me to use a bookmark, then that’s fine, I won’t use a bookmark, no worries!)

those moral principles encode facts about what type of conduct happens to be useful in the real world for achieving our largely selfish objectives

I’m not sure what “largely” means here. I hope we can agree that our objectives are selfish in some ways and unselfish in other ways.

Parents generally like their children, above and beyond the fact that their children might give them yummy food and shelter in old age. People generally form friendships, and want their friends to not get tortured, above and beyond the fact that having their friends not get tortured could lead to more yummy food and shelter later on. Etc. I do really think both of those examples centrally involve evolved innate drives. If we have innate drives to eat yummy food and avoid pain, why can’t we also have innate drives to care for children? Mice have innate drives to care for children—it’s really obvious, there are particular hormones and stereotyped cell groups in their hypothalamus and so on. Why not suppose that humans have such innate drives too? Likewise, mice have innate drives related to enjoying the company of conspecifics and conversely getting lonely without such company. Why not suppose that humans have such innate drives too?

yams on yams's Shortform

updated, thanks!

felix-j-binder on LLMs can learn about themselves by introspection

One way in which a LLM is not purely derived from its training data is noise in the training process. This includes the random initialization of the weights. If you were given the random initialization of the weights, it's true that with large amounts of time and computation (and assuming a deterministic world) you could perfectly simulate the resulting model.

Following this definition, we specify it with the following two clauses:

1. M1 correctly reports F when queried.
2. F is not reported by a stronger language model M2 that is provided with M1’s training data and given the same query as M1. Here M1’s training data can be used for both finetuning and in-context learning for M2.

Here, we use another language model as the external predictor, which might be considerably more powerful, but arguably falls well short of the above scenario. What we mean to illustrate is that introspective facts are those that are neither contained in the training data nor derivable from it (such as by asking "What would a reasonable person do in this situation?"); rather, they are those that can only be answered by reference to the model itself.
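To make the two clauses concrete, here is a minimal sketch of how one might check them for a single fact. Everything in it is an assumption for illustration: the `Model` type, the function names, and the toy lookup-table "models" are hypothetical stand-ins, not the paper's actual evaluation code.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical stand-in: a "model" is any callable from a query to an answer.
Model = Callable[[str], Optional[str]]


def is_introspective_fact(query: str, fact: str, m1: Model, m2: Model) -> bool:
    """Check both clauses for one fact F.

    Clause 1: M1 correctly reports F when queried.
    Clause 2: M2 (assumed already given M1's training data) does not report F.
    """
    return m1(query) == fact and m2(query) != fact


def introspection_rate(items: Iterable[Tuple[str, str]], m1: Model, m2: Model) -> float:
    """Fraction of (query, fact) pairs that satisfy both clauses."""
    pairs = list(items)
    if not pairs:
        return 0.0
    hits = sum(is_introspective_fact(q, f, m1, m2) for q, f in pairs)
    return hits / len(pairs)


# Toy lookup tables standing in for real models (illustrative only).
m1_answers = {"q1": "a", "q2": "b", "q3": "c"}
m2_answers = {"q1": "a", "q2": "x", "q3": "c"}
m1: Model = m1_answers.get
m2: Model = m2_answers.get

# (query, true fact about M1) pairs.
facts = [("q1", "a"), ("q2", "b"), ("q3", "z")]
rate = introspection_rate(facts, m1, m2)
```

In this toy setup only `q2` counts: M1 gets `q1` right but so does M2, and M1 gets `q3` wrong, so clause 1 fails. Note that the sketch treats "reports F" as exact string equality, which a real evaluation would likely need to relax.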

gwern on Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes

One of the problems with the Nobel Prize as a measurement or criterion is that it is not really suited for that by nature, especially given rules like no posthumous awards. This means that it is easy to critique awarding a Nobel Prize, but it is harder to critique not awarding one. You can't give a Nobel Prize to the inventor of the engine, because they probably died a long time ago; you could have for a recent kind of engine. Similarly, you could give a Turing Award to the inventors of C (and they probably did) but the first person who created a mnemonic shorthand over raw machine opcodes during WWII or whatever was probably dead before the Turing Award was even created.

Let's take your 'inventing the LED' for example. You seem keen on interpreting the absence of a Nobel Prize here as a relative judgment about 'inventing LEDs' vs 'inventing blue LEDs'. But you don't establish that there is any reason to think this is one of the cases where the lack of an award can be validly interpreted as a snub & a judgment by the relevant committee. Is it?

Well, let's take 5 seconds to check some of the historical context here, like who would you award a prize to? I open up Wikipedia and I check the first three names. (Why three? Because Nobel Prizes are arbitrarily limited to 3 awardees.)

All 3 of them, including Oleg Losev who is described as physically creating the first bona fide LED and so seems to be the closest to "the inventor of the LED", died before or around the first commercial LED being announced (October 1962). For about a decade, early weak expensive red LEDs "had little practical use", until finally they began to replace nixie tubes. Only then did they start to take off, and only then did they start to become a revolution. (And reading this WP history, it seems like blue LEDs have wound up being more important than the original red ones anyway.)

Oleg Losev in particular died in 1942, in obscurity, and given the year, you won't be too surprised why:

Losev died of starvation in 1942, at the age of 38, along with many other civilians, during the Siege of Leningrad by the Germans during World War 2.

You can't award Nobel Prizes to the dead - and by the time it was clear LEDs were a major revolution, many of the key players were well and thoroughly dead. That is, the committee could not have awarded a Nobel Prize for 'inventing the LED', without either being prescient or awarding it to later researchers, who were lucky enough to be long-lived but did not actually invent the LED, and that would be a travesty on its own and also crowd out meritorious alternative physics breakthroughs (of which there were many in the 20th century that they are still working their way through).

So, this is one reason to not put too much stress on the absence of a Nobel Prize. Not having a Nobel Prize for work in the early-to-mid 20th century means in considerable part things like "was not killed by Hitler or Stalin", things which are not particularly related to the quality or value of your scientific research but are related to whether you can survive for the 20 or 40 years it may take for your Nobel Prize to show up.

deepthoughtlife on LLMs can learn about themselves by introspection

I find the idea of determining the level of 'introspection' an AI can manage to be an intriguing one, and it seems like introspection is likely very important to generalizing intelligent behavior, and knowing what is going on inside the AI is obviously interesting for the reasons of interpretability mentioned, yet this seems oversold (to me). The actual success rate of self-prediction seems incredibly low considering the trivial/dominant strategy of 'just run the query' (which you do briefly mention) should be easy for the machine to discover during training if it actually has introspective access to itself. 'Ah, the training data always matched what is going on inside me'. If I was supposed to predict someone that always just said what I was thinking, it would be trivial for me to get extremely high scores. That doesn't mean it isn't evidence, just that the evidence is very weak. (If it can't realize this, that is a huge strike against it being introspective.)

You do mention the biggest issue with this showing introspection, "Models only exhibit introspection on simpler tasks", and yet the idea you are going for is clearly for its application to very complex tasks where we can't actually check its work. This flaw seems likely fatal, but who knows at this point? (The fact that GPT-4o and Llama 70B do better than GPT-3.5 does is evidence, but see my later problems with this...)

Additionally, it would make more sense if the models were predicted by a single model of a specific capability level that is well above all of them and trained on the same ground truth data, rather than by each other (as it stands, some are compared against stronger models and some against weaker ones). (This can of course be an additional test rather than a replacement.) This complicates the evaluation. Comparing your result predicting yourself to something much stronger than you predicting is very different than comparing it to something much weaker trying to predict you...

One big issue I have is that I completely disagree with your (admittedly speculative) claim that success at this kind of behavior prediction means we should believe its reports about things like internal suffering. This seems absurd to me for many reasons (for one thing, we know it isn't suffering because of how it is designed), but the key point is that for this to be true, you would need it to be able to predict its own internal process, not simply its own external behavior. (If you wanted to, you could change this experiment to become predicting patterns of its own internal representations, which would be much more convincing evidence, though I still wouldn't be convinced unless the accuracy was quite high.)

Another point is, if it had significant introspective access, it likely wouldn't need to be trained to use it, so this is actually evidence that it doesn't have introspective access by default at least as much as the idea that you can train it to have introspective access.

I have some other issues. First, the shown validation questions are all in second person. Were cross predictions prompted in exactly the same way as self predictions? This could skew results in favor of models it is true for if you really are prompting that way, and is a large change in prompt if you change it for accuracy. Perhaps you should train it to predict 'model X' even when that model is itself, and see how that changes results. Second, I wouldn't say the results seem well-calibrated just because they seem to go in the same basic direction (some seem close and some quite off). Third, it's weird that the training doesn't help at all until you get above a fairly high threshold of likelihood. GPT-4o for instance exactly matches untrained at 40% likelihood (and below, aside from one slightly different point). Fourth, how does its performance vary if you train it on an additional data set where you make sure to include the other parts of the prompt that are not content based, while not including the content you will test on? Fifth, finetuning is often a version of Goodharting that raises success on the metric without improving actual capabilities (or often even while making them worse), and this is not fully addressed just by having the verification set be different than the test set. If you could find a simple way of prompting that led to introspection, that would be much more likely to be evidence in favor of introspection than successful prediction after finetuning. Finally, Figure 17 seems obviously misleading. There should be a line for how it changed over its training for self-prediction, rather than requiring a careful reading of the words below the figure to see that you just put a mark at the final result for self-prediction.