LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (8)

[question] Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic?
Dalcy (Darcy) · 2024-08-16T04:16:23.159Z · answers+comments (6)

[link] The unreasonable effectiveness of plasmid sequencing as a service
Abhishaike Mahajan (abhishaike-mahajan) · 2024-10-08T02:02:55.352Z · comments (0)

[link] The Offense-Defense Balance of Gene Drives
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-27T16:47:25.976Z · comments (1)

GPT-3.5 judges can supervise GPT-4o debaters in capability asymmetric debates
Charlie George (charlie-george) · 2024-08-27T20:44:08.683Z · comments (7)

AXRP Episode 34 - AI Evaluations with Beth Barnes
DanielFilan · 2024-07-28T03:30:07.192Z · comments (0)

Apply to the Cooperative AI PhD Fellowship by October 14th!
Lewis Hammond (lewis-hammond-1) · 2024-10-05T12:41:24.093Z · comments (0)

[LDSL#2] Latent variable models, network models, and linear diffusion of sparse lognormals
tailcalled · 2024-08-09T19:57:56.122Z · comments (2)

Rashomon - A newsbetting site
ideasthete · 2024-10-15T18:15:02.476Z · comments (8)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (54)

Would you benefit from, or object to, a page with LW users' reacts?
Raemon · 2024-08-20T16:35:47.568Z · comments (6)

AI #77: A Few Upgrades
Zvi · 2024-08-20T00:20:09.717Z · comments (3)

[link] Foundations - Why Britain has stagnated [crosspost]
Nathan Young · 2024-09-23T10:43:20.411Z · comments (1)

[link] The Tech Industry is the Biggest Blocker to Meaningful AI Safety Regulations
garrison · 2024-08-16T19:37:28.416Z · comments (1)

Geoffrey Hinton on the Past, Present, and Future of AI
Stephen McAleese (stephen-mcaleese) · 2024-10-12T16:41:56.796Z · comments (5)

Can We Predict Persuasiveness Better Than Anthropic?
Lennart Finke (l-f) · 2024-08-04T14:05:33.668Z · comments (5)

[link] on Science Beakers and DDT
bhauth · 2024-09-05T03:21:21.382Z · comments (13)

[link] Hyperpolation
Gunnar_Zarncke · 2024-09-15T21:37:00.002Z · comments (6)

[LDSL#3] Information-orientation is in tension with magnitude-orientation
tailcalled · 2024-08-10T21:58:27.659Z · comments (2)

[link] Day Zero Antivirals for Future Pandemics
Niko_McCarty (niko-2) · 2024-08-26T15:18:33.858Z · comments (2)

August 2024 Time Tracking
jefftk (jkaufman) · 2024-08-24T13:50:04.676Z · comments (0)

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs
Winnie Yang (winnie-yang) · 2024-08-22T07:32:07.600Z · comments (0)

Monthly Roundup #21: August 2024
Zvi · 2024-08-20T00:20:08.178Z · comments (6)

[link] [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew (ymath) · 2024-09-25T14:52:48.263Z · comments (0)

AI Safety University Organizing: Early Takeaways from Thirteen Groups
agucova · 2024-10-02T15:14:00.137Z · comments (0)

Launching Adjacent News
Lucas Kohorst (lucas-kohorst) · 2024-10-16T17:58:10.289Z · comments (0)

[link] An ML paper on data stealing provides a construction for "gradient hacking"
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-07-30T21:44:37.310Z · comments (1)

[LDSL#5] Comparison and magnitude/diminishment
tailcalled · 2024-08-12T18:47:20.546Z · comments (0)

[link] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-19T16:13:55.835Z · comments (1)

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
DanielFilan · 2024-08-24T22:30:02.039Z · comments (0)

AXRP Episode 37 - Jaime Sevilla on Forecasting AI
DanielFilan · 2024-10-04T21:00:03.077Z · comments (3)

Alignment by default: the simulation hypothesis
gb (ghb) · 2024-09-25T16:26:00.552Z · comments (39)

The Bar for Contributing to AI Safety is Lower than You Think
Chris_Leong · 2024-08-16T15:20:19.055Z · comments (1)

[link] [Linkpost] 'The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery'
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-15T21:32:59.979Z · comments (1)

[link] Does natural selection favor AIs over humans?
cdkg · 2024-10-03T18:47:43.517Z · comments (1)

[link] AI Model Registries: A Foundational Tool for AI Governance
Elliot Mckernon (elliot) · 2024-10-07T19:27:43.466Z · comments (1)

My decomposition of the alignment problem
Daniel C (harper-owen) · 2024-09-02T00:21:08.359Z · comments (22)

How Often Does Taking Away Options Help?
niplav · 2024-09-21T21:52:40.822Z · comments (6)

[link] Anthropic is being sued for copying books to train Claude
Remmelt (remmelt-ellen) · 2024-08-31T02:57:27.092Z · comments (4)

Gell-Mann checks
Cleo Scrolls (cleo-scrolls) · 2024-09-26T22:45:43.569Z · comments (7)

[link] The Great Organism Theory of Evolution
rogersbacon · 2024-08-10T12:26:02.434Z · comments (0)

Musings on Text Data Wall (Oct 2024)
Vladimir_Nesov · 2024-10-05T19:00:21.286Z · comments (2)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

lsusr on Hell is wasted on the evil

If you value doing good, then your values will be satisfied better by living in a horrible world than a utopia.

gwern on Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes

Also, note that the 'Blue LED' was not originally my example at all, someone else brought it up as an example.

Then maybe you shouldn't be trying to defend it (or your other two examples of engines and programming languages, for that matter), especially given that you still have not explained how 'the LED' could have been given a Nobel ever inasmuch as everyone involved was dead.

michaeldickens on leogao's Shortform

I don't think I understand what "learn to be visibly weird" means, and how it differs from not following social conventions because you fail to understand them correctly.

roman-malov on Hell is wasted on the evil

I'm not a native speaker, can someone please explain the meaning of "Hell is wasted on the evil" in simpler terms?

deepthoughtlife on Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes

The point is that the 'Blue LED' is not a sufficient advancement over the 'LED' not that it is a snub. I don't care about whether or not it is a snub. That's just not how I think about things like this. Also, note that the 'Blue LED' was not originally my example at all, someone else brought it up as an example.

I talked about 'inventing LEDs at all' since that is the minimum related thing where it might actually have been enough of a breakthrough in physics to matter. Blue LEDs are simply not significant enough a change from what we already had. Even just the switch to making white LEDs (from blue) which simply required a phosphor (or required multiple colors) of the right kind were much more significant in terms of applications if that is what you think is important.

steve2152 on Against empathy-by-default

Sorry for oversimplifying your views, thanks for clarifying. :)

Here’s a part I especially disagree with:

Over time, in societies with well-functioning social and legal systems, most people learn that hurting other people doesn't actually help them selfishly. This causes them to adopt a general presumption against committing violence, theft, and other anti-social acts themselves, as a general principle. This general principle seems to be internalized in most people's minds as not merely "it is not in your selfish interest to hurt other people" but rather "it is morally wrong to hurt other people". In other words, people internalize their presumption as a moral principle, rather than as a purely practical principle. This is what prevents people from stabbing each other in the backs immediately once the environment changes.

Just to be clear, I imagine we’ll both agree that if some behavior is always a good idea, it can turn into an unthinking habit. For example, today I didn’t take all the cash out of my wallet and shred it—not because I considered that idea and decided that it’s a bad idea, but rather because it never crossed my mind to do that in the first place. Ditto with my (non)-decision to not plan a coup this morning. But that’s very fragile (it relies on ideas not crossing my mind), and different from what you’re talking about.

My belief is: Neurotypical people have an innate drive to notice, internalize, endorse, and take pride in following social norms, especially behaviors that they imagine would impress the people whom they like and admire in turn. (And I have ideas about how this works in the brain! I think it’s mainly related to what I call the “drive to be liked / admired”, general discussion here [LW · GW], more neuroscience details coming soon I hope.)

The object-level content of these norms is different in different cultures and subcultures and times, for sure. But the special way that we relate to these norms has an innate aspect; it’s not just a logical consequence of existing and having goals etc. How do I know? Well, the hypothesis “if X is generally a good idea, then we’ll internalize X and consider not-X to be dreadfully wrong and condemnable” is easily falsified by considering any other aspect of life that doesn’t involve what other people will think of you. It’s usually a good idea to wear shoes that are comfortable, rather than too small. It’s usually a good idea to use a bookmark instead of losing your place every time you put your book down. It’s usually a good idea to sleep on your bed instead of on the floor next to it. Etc. But we just think of all those things as good ideas, not moral rules; and relatedly, if the situation changes such that those things become bad ideas after all for whatever reason, we’ll immediately stop doing them with no hesitation. (If this particular book is too fragile for me to use a bookmark, then that’s fine, I won’t use a bookmark, no worries!)

those moral principles encode facts about what type of conduct happens to be useful in the real world for achieving our largely selfish objectives

I’m not sure what “largely” means here. I hope we can agree that our objectives are selfish in some ways and unselfish in other ways.

Parents generally like their children, above and beyond the fact that their children might give them yummy food and shelter in old age. People generally form friendships, and want their friends to not get tortured, above and beyond the fact that having their friends not get tortured could lead to more yummy food and shelter later on. Etc. I do really think both of those examples centrally involve evolved innate drives. If we have innate drives to eat yummy food and avoid pain, why can’t we also have innate drives to care for children? Mice have innate drives to care for children—it’s really obvious, there are particular hormones and stereotyped cell groups in their hypothalamus and so on. Why not suppose that humans have such innate drives too? Likewise, mice have innate drives related to enjoying the company of conspecifics and conversely getting lonely without such company. Why not suppose that humans have such innate drives too?

yams on yams's Shortform

updated, thanks!

felix-j-binder on LLMs can learn about themselves by introspection

One way in which a LLM is not purely derived from its training data is noise in the training process. This includes the random initialization of the weights. If you were given the random initialization of the weights, it's true that with large amounts of time and computation (and assuming a deterministic world) you could perfectly simulate the resulting model.

Following this definition, we specify it with the following two clauses:

1. M 1 correctly reports F when queried.
2. F is not reported by a stronger language model M 2 that is provided with M 1’s training data
and given the same query as M 1. Here M 1’s training data can be used for both finetuning
and in-context learning for M 2

Here, we use another language model as the external predictor, which might be considerably more powerful, but arguably falls well short of the above scenario. What we mean to illustrate is that introspective facts are those that are neither contained in the training data nor are they those that can be derived from it (such as by asking "What would a reasonable person do in this situation?")—rather, they are those that can only answered by reference to the model itself.

gwern on Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes

One of the problems with the Nobel Prize as a measurement or criteria is that it is not really suited for that by nature, especially given criteria like no posthumous awards. This means that it is easy to critique awarding a Nobel Prize, but it is harder to critique not awarding one. You can't give a Nobel Prize to the inventor of the engine, because they probably died a long time ago; you could have for a recent kind of engine. Similarly, you could give a Turing Award to the inventors of C (and they probably did) but the first person who created a mnemonic shorthand over raw machine opcodes during WWII or whatever was probably dead before the Turing Award was even created.

Let's take your 'inventing the LED' for example. You seem keen on interpreting the absence of a Nobel Prize here as a relative judgment about 'inventing LEDs' vs 'inventing blue LEDs'. But you don't establish that there is any reason to think this is one of the cases where the lack of an award can be validly interpreted as a snub & a judgment by the relevant committee. Is it?

Well, let's take 5 seconds to check some of the historical context here, like who would you award a prize to? I open up Wikipedia and I check the first three names. (Why three? Because Nobel Prizes are arbitrarily limited to 3 awardees.)

All 3 of them, including Oleg Losev who is described as physically creating the first bona fide LED and so seems to be the closest to "the inventor of the LED", died before or around the first commercial LED being announced (October 1962). For about a decade, early weak expensive red LEDs "had little practical use", until finally they began to replace nixie tubes. Only then did they start to take off, and only then did they start to become a revolution. (And reading this WP history, it seems like blue LEDs have wound up being more important than the original red ones anyway.)

Oleg Losev in particular died in 1942, in obscurity, and given the year, you won't be too surprised why:

Losev died of starvation in 1942, at the age of 38, along with many other civilians, during the Siege of Leningrad by the Germans during World War 2.

You can't award Nobel Prizes to the dead - and by the time it was clear LEDs were a major revolution, many of the key players were well and thoroughly dead. That is, the committee could not have awarded a Nobel Prize for 'inventing the LED', without either being prescient or awarding it to later researchers, who were lucky enough to be long-lived but did not actually invent the LED, and that would be a travesty on its own and also crowd out meritorious alternative physics breakthroughs (of which there were many in the 20th century that they are still working their way through).

So, this is one reason to not put too much stress on the absence of a Nobel Prize. Not having a Nobel Prize for work in the early-to-mid 20th century means in considerable part things like "was not killed by Hitler or Stalin", things which are not particularly related to the quality or value of your scientific research but are related to whether you can survive for the 20 or 40 years it may take for your Nobel Prize to show up.

deepthoughtlife on LLMs can learn about themselves by introspection

I find the idea of determining the level of 'introspection' an AI can manage to be an intriguing one, and it seems like introspection is likely very important to generalizing intelligent behavior, and knowing what is going on inside the AI is obviously interesting for the reasons of interpretability mentioned, yet this seems oversold (to me). The actual success rate of self-prediction seems incredibly low considering the trivial/dominant strategy of 'just run the query' (which you do briefly mention) should be easy for the machine to discover during training if it actually has introspective access to itself. 'Ah, the training data always matched what is going on inside me'. If I was supposed to predict someone that always just said what I was thinking, it would be trivial for me to get extremely high scores. That doesn't mean it isn't evidence, just that the evidence is very weak. (If it can't realize this, that is a huge strike against it being introspective.)

You do mention the biggest issue with this showing introspection, "Models only exhibit introspection on simpler tasks", and yet the idea you are going for is clearly for its application to very complex tasks where we can't actually check its work. This flaw seems likely fatal, but who knows at this point? (The fact that GPT-4o and Llama 70B do better than GPT-3.5 does is evidence, but see my later problems with this...)

Additionally, it would make more sense if the models were predicted by a specific capability level model that is well above all of them and trained on the same ground truth data rather than by each other (so you are incorrectly comparing the predictions of some to stronger models and some to weaker.) (This can of course be an additional test rather than instead of.) This complicates the evaluation of things. Comparing your result predicting yourself to something much stronger than you predicting is very different than comparing it to something much weaker trying to predict you...

One big issue I have is that I completely disagree with your (admittedly speculative) claim that success of this kind of predicting behavior means we should believe it on what is going on in reports of things like internal suffering. This seems absurd to me for many reasons (for one thing, we know it isn't suffering because of how it is designed), but the key point is that for this to be true, you would need it to be able to predict its own internal process, not simply its own external behavior. (If you wanted to, you could change this experiment to become predicting patterns of its own internal representations, which would be much more convincing evidence, though I still wouldn't be convinced unless the accuracy was quite high.)

Another point is, if it had significant introspective access, it likely wouldn't need to be trained to use it, so this is actually evidence that it doesn't have introspective access by default at least as much as the idea that you can train it to have introspective access.

I have some other issues. First, the shown validation questions are all in second person. Were cross predictions prompted in exactly the same way as self predictions? This could skew results in favor of models it is true for if you really are prompting that way, and is a large change in prompt if you change it for accuracy. Perhaps you should train it to predict 'model X' even when that model is itself, and see how that changes results. Second, I wouldn't say the results seem well-calibrated just because they seem to go in the same basic direction (some seem close and some quite off). Third, it's weird that the training doesn't even help at all until you get above a fairly high threshold of likelihood. GPT-4o for instance exactly matches untrained at 40% likelihood (and below aside from 1 minorly different point). Fourth, how does its performance vary if you train it on an additional data set where you make sure to include the other parts of the prompt that are not content based, while not including the content you will test on? Fifth, finetuning is often a version of Goodharting, that raises success on the metric without improving actual capabilities (or often even making them worse), and this is not fully addressed just by having the verification set be different than the test set. If you could find a simple way of prompting that lead to introspection that would be much more likely to be evidence in favor of introspection than that it successfully predicted after finetuning. Finally, Figure 17 seems obviously misleading. There should be a line for how it changed over its training for self-prediction and not require carefully reading the words below the figure to see that you just put a mark at the final result for self-prediction).