LessWrong 2.0 Reader

Predictive model agents are sort of corrigible
Raymond D · 2024-01-05T14:05:03.037Z · comments (6)

[link] OpenAI appoints Retired U.S. Army General Paul M. Nakasone to Board of Directors
[deleted] · 2024-06-13T21:28:18.110Z · comments (10)

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
RogerDearnaley (roger-d-1) · 2024-01-11T12:56:29.672Z · comments (4)

Dangers of Closed-Loop AI
Gordon Seidoh Worley (gworley) · 2024-03-22T23:52:22.010Z · comments (9)

Index of rationalist groups in the Bay Area July 2024
Lucie Philippon (lucie-philippon) · 2024-07-26T16:32:25.337Z · comments (14)

What Helped Me - Kale, Blood, CPAP, X-tiamine, Methylphenidate
Johannes C. Mayer (johannes-c-mayer) · 2024-01-03T13:22:11.700Z · comments (12)

My Detailed Notes & Commentary from Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:51.894Z · comments (16)

[question] Feedback request: what am I missing?
Nathan Helm-Burger (nathan-helm-burger) · 2024-11-02T17:38:39.625Z · answers+comments (5)

Open consultancy: Letting untrusted AIs choose what answer to argue for
Fabien Roger (Fabien) · 2024-03-12T20:38:03.785Z · comments (5)

A path to human autonomy
Nathan Helm-Burger (nathan-helm-burger) · 2024-10-29T03:02:42.475Z · comments (14)

Categories of leadership on technical teams
benkuhn · 2024-07-22T04:50:04.071Z · comments (0)

How I select alignment research projects
Ethan Perez (ethan-perez) · 2024-04-10T04:33:08.092Z · comments (4)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Jan Wehner · 2024-07-14T10:37:21.544Z · comments (5)

Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)

LessWrong: After Dark, a new side of LessWrong
So8res · 2024-04-01T22:44:04.449Z · comments (5)

Empirical vs. Mathematical Joints of Nature
Elizabeth (pktechgirl) · 2024-06-26T01:55:22.858Z · comments (1)

Economics Roundup #2
Zvi · 2024-07-02T12:40:05.908Z · comments (5)

Representation Tuning
Christopher Ackerman (christopher-ackerman) · 2024-06-27T17:44:33.338Z · comments (9)

[link] Is the AI Doomsday Narrative the Product of a Big Tech Conspiracy?
garrison · 2024-12-04T19:20:59.286Z · comments (1)

Rolling Thresholds for AGI Scaling Regulation
Larks · 2025-01-12T01:30:23.797Z · comments (3)

Fireplace and Candle Smoke
jefftk (jkaufman) · 2025-01-01T01:50:01.408Z · comments (4)

Alternative Cancer Care As Biohacking & Book Review: Surviving "Terminal" Cancer
DenizT · 2025-01-06T07:43:52.773Z · comments (6)

Childhood and Education Roundup #7
Zvi · 2024-12-09T13:10:05.588Z · comments (10)

Dmitry's Koan
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-10T04:27:30.346Z · comments (2)

AXRP Episode 33 - RLHF Problems with Scott Emmons
DanielFilan · 2024-06-12T03:30:05.747Z · comments (0)

Difficulty classes for alignment properties
Jozdien · 2024-02-20T09:08:24.783Z · comments (5)

Computational Mechanics Hackathon (June 1 & 2)
Adam Shai (adam-shai) · 2024-05-24T22:18:44.352Z · comments (5)

"Which chains-of-thought was that faster than?"
Emrik (Emrik North) · 2024-05-22T08:21:00.269Z · comments (4)

[link] Suffering Is Not Pain
jbkjr · 2024-06-18T18:04:43.407Z · comments (45)

What I Learned (Conclusion To "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-20T21:24:37.464Z · comments (0)

Musings on LLM Scale (Jul 2024)
Vladimir_Nesov · 2024-07-03T18:35:48.373Z · comments (0)

[link] Book review: On the Edge
PeterMcCluskey · 2024-08-30T22:18:39.581Z · comments (0)

[link] The last era of human mistakes
owencb · 2024-07-24T09:58:42.116Z · comments (2)

[link] Robin Hanson & Liron Shapira Debate AI X-Risk
Liron · 2024-07-08T21:45:40.609Z · comments (4)

AI #56: Blackwell That Ends Well
Zvi · 2024-03-21T12:10:05.412Z · comments (16)

[link] The $100B plan with "70% risk of killing us all" w Stephen Fry [video]
Oleg Trott (oleg-trott) · 2024-07-21T20:06:39.615Z · comments (8)

[link] hydrogen tube transport
bhauth · 2024-04-18T22:47:08.790Z · comments (12)

The Schumer Report on AI (RTFB)
Zvi · 2024-05-24T15:10:03.122Z · comments (3)

[link] The Cancer Resolution?
PeterMcCluskey · 2024-07-24T00:25:17.322Z · comments (27)

Musings on Text Data Wall (Oct 2024)
Vladimir_Nesov · 2024-10-05T19:00:21.286Z · comments (2)

[link] My Apartment Art Commission Process
jenn (pixx) · 2024-08-26T18:36:44.363Z · comments (4)

[link] Liquid vs Illiquid Careers
vaishnav92 · 2024-10-20T23:03:49.725Z · comments (7)

[link] Romae Industriae
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-19T13:03:31.536Z · comments (2)

[link] Inferring the model dimension of API-protected LLMs
Ege Erdil (ege-erdil) · 2024-03-18T06:19:25.974Z · comments (3)

(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
Sodium · 2024-10-03T19:11:58.032Z · comments (17)

ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct
25Hour (aaron-kaufman) · 2024-10-05T11:30:11.953Z · comments (2)

D&D.Sci (Easy Mode): On The Construction Of Impossible Structures
abstractapplic · 2024-05-17T00:25:42.950Z · comments (12)

Adam Smith Meets AI Doomers
James_Miller · 2024-01-31T15:53:03.070Z · comments (10)

One True Love
Zvi · 2024-02-09T15:10:05.298Z · comments (7)

Recent comments

weightt-an on weightt an's Shortform

I would really love it if some "let's make ASI" people put some effort into making bad outcomes less bad. Like, it would really suck if we are going to be trapped in endless corporate punk hell, with superintelligent nannies with correct (tm) opinions. Or infinite wedding parties or whatever. Just make sure that if you fuck up we all just get eaten by nanobots please. Permanent entrapment in misery would be a lot worse.

jacob_cannell on How will we update about scheming?

Training processes with varying (apparent) situational awareness
1:2.5 The AI seemingly isn't aware it is an AI except for a small fraction of training which isn't where much of the capabilities are coming from. For instance, the system is pretrained on next token prediction, our evidence strongly indicates that the system doesn't know it is an AI when doing next token prediction (which likely requires being confident that it isn't internally doing a substantial amount of general-purpose thinking about what to think about), and there is only a small RL process which isn't where much of the capabilities are coming from.

Abilities/intelligence come almost entirely from pretraining, so all the situational awareness and scheming capability that current (and future similar) frontier models possess is thus also mostly present in the base model. The fact that you need to prompt them to summon out a situationally aware scheming agent doesn't seem like much of a barrier, and indeed strong frontier base models are so obviously misaligned/jail-breakable/dangerous that releasing them to the public is PR-harmful enough to motivate RLHF post-training purely for selfish profit motives.

> This implies that restricting when AIs become (saliently) aware that they are an AI could be a promising intervention, to the extent this is possible without greatly reducing competitiveness.

Who cares if it greatly reduces competitiveness in experimental training runs?

We need to figure out how to align superhuman models - models trained with > 1e25 efficient flops on the current internet/knowledge, which requires experimental iteration. We probably won't get multiple iteration attempts for aligning SI 'in prod', so we need to iterate in simulation (what you now call 'model organisms').

We need to find alignment training methods that work even when the agent has superhuman intelligence/inference. But 'superhuman' here is relative - measured against our capabilities. The straightforward easy way to accomplish this is training agents in simulations with much earlier knowledge cutoff dates, which isn't theoretically hard - it just requires constructing augmented historical training datasets. So you could train on a 10T+ token dataset of human writings/thoughts with cutoff 2010, or 1950, or 1700, etc. These base models wouldn't be capable of simulating/summoning realistic situationally aware agents, their RL-derived agents wouldn't be situationally sim-aware either, etc.

sharmake-farah on When is reward ever the optimization target?

So in essence, even if reward truly isn't the optimization target at the outer level, that doesn't imply that all policies trained do not maximize the reward, right?

joey-kl on Drake Thomas's Shortform

Interesting, I can see why that would be a feature. I don't mind the taste at all actually. Before, I had some of their smaller citrus flavored kind, and they dissolved super quick and made me a little nauseous. I can see these ones being better in that respect.

viliam on Notes on Altruism

I suspect that debating altruism is an unusually good opportunity for people to "generalize from one example".

People who enjoy helping others will go like: "Of course, everyone wants to help others, deep inside. It's just that when people are in need or in pain, their self-preservation instinct temporarily reduces this feeling, to make them focus on saving themselves. But as soon as we help them satisfy their physical or emotional needs, you will find that even the seemingly horrible people are actually good, when given the opportunity."

And this is kinda unfalsifiable, because if someone remains a horrible person no matter how much you give them (things, support, forgiveness), you can insist that there must be some need that wasn't sufficiently satisfied yet. And that person would obviously encourage this perspective, because it means that they will get even more things. And there will always be something missing, because the world is not perfect.

On the other hand, people who don't enjoy helping others, can rationalize away almost any observed goodness: "Yeah, they are just showing off (i.e. trading a little effort or money for higher social status). And they definitely expect to get something in return; if not today, then maybe tomorrow. They probably got something in return when you weren't watching. Okay, they never got anything in return, but they thought they would, they just miscalculated; that's not goodness, that's stupidity. Why the fuck would anyone do anything, if they don't expect to benefit from it somehow?"

And this too is kinda unfalsifiable, because almost always there is something, no matter how indirect or disproportionately small. And the very fact that you know about a good deed already makes it suspicious that the person wanted you to know, to get something in return, at least some status in your eyes. (And if you don't know about a good deed, then it cannot contradict your perspective, can it?)

This way, everyone can stay convinced that their general theory of altruism is correct.

So maybe the truth is that (1) people are different, and (2) even the same person can do different good deeds for different reasons, and (3) even one deed can have multiple reasons. For example:

I may expect to maybe get something in return, but the probability of such thing happening multiplied by the average reward may be so small that this simply doesn't make sense economically; there are more profitable things I could be doing instead, if I only cared about my profit
some people may help the poor to signal how rich they are, but that alone does not explain why they chose to signal their wealth this way, instead of e.g. buying a really expensive car, which is what some other rich people do
some people are more likely to help others after their own needs are satisfied, but that may be just a subset of all people; other people respond to having their needs satisfied by simply having more needs, without any altruism manifesting as a side effect
similarly, some people are more likely to help others after seeing a role model, but other people just laugh at the role model, or invent a hypothesis why the role model (1) actually secretly benefits from their seemingly selfless actions, and (2) is actually making the world a worse place

How people feel about receiving charity (e.g. whether they feel degraded by it) may also depend on their model of altruism, which probably is a result of what they would do in the reversed situation, and what motives they have observed (or hypothesized) in others. For example, the kind of person who would only help others to feel smugly superior to them, will probably fight hard against being a recipient of help. The kind of person who gives gladly will be more likely to also receive gladly. (Although, a scammer will also receive gladly; they won't feel humiliated by having successfully exploited others.)

Some smaller points I haven't seen in the article:

When considering altruism as reciprocity, i.e. helping others while expecting to get something in return, it probably makes sense to distinguish a few different meanings:

helping someone, because I expect that specific person to later pay back what they owe me
helping people, and expecting that some of them will later somehow reciprocate and many probably won't, but I am okay with such outcome, because so far I am profitable on average -- helping others costs me little, and once in a while someone reciprocates in a way that feels like winning a lottery
helping people in order to establish a general culture of "people in need should be helped" as an insurance in case I would later need some help myself, although I hope not to need this insurance

All of these could be summed up as "helping others, expecting to get something back", but they lead to different behaviors. In the first case, I would only help the people who seem most likely to pay it back later; I wouldn't help the homeless, or strangers. The second case... is actually how it works for me (I think it is not the true reason why I do it, but the fact is that it works). The third case, I think the difference is in the mood: if you only help others because you expect to be poor in future, it feels sad, and you will probably only try doing the minimum necessary.

I noticed a seeming paradox: When I help a person, I don't expect them to do something for me in return. And yet, I would feel better to learn that "this is the kind of person who helps others, when they can". At first I thought, okay, maybe this exposes some kind of hypocrisy or inconsistency on my end: I do not consciously expect to be paid back, but unconsciously I do, so I feel better when I learn that I have helped a person who is likely to reciprocate. But then I noticed that there is also another possible explanation: there are many people in need, and my possibilities to do good are limited; if I help a selfish person, it stops there; but if I help an altruistic person, they may later help someone else, and thus I may have started a chain reaction of good.

Similarly, helping a person who seems to be on the way to improvement feels better, to me. (I think this is not universal. At least I have heard that there are people who help others, but feel betrayed (?) when those people start working on themselves to be less likely to need help in the future.) One possible explanation is that people who get stronger are more likely to reciprocate. But another possible explanation is that when I help people, my hope is to make their situation better; and a person who works on themselves along with receiving my help is acting like a multiplier to that help. You know, "when you give a fish, you feed someone for a day; when you teach them fishing, you feed them for a lifetime", except people actually also need to eat when they are learning, so when you give someone a fish and then they learn fishing (sometimes they don't need you to teach), it's like you have fed them for a lifetime using a single fish, which is an effective act of altruism.

screwtape on Speaking to Congressional staffers about AI risk

I continue to be a fan of people trying to accomplish something in the world and reporting back on what happened. This is a good example of the genre, and on a subject near and dear to (part of) LessWrong's collective heart.

I confidently expect somebody will read a bunch of things on LessWrong, get excited about AI, and try to get the American government to Do Something. By default this attempt will not be particularly well aimed or effective, and every piece of information we can give on the obstacles will be useful. There have been updates since 2023 on government awareness and response to AI, though I suspect the core information in this post about how to get in contact with people still holds. It might even be cause area agnostic; if I wanted to talk to congress people about education or biosecurity, my guess is having draft proposals ready would be useful.

As novel as the Dialogue feature was, I'd be interested in a tightened-up version of this that cuts to the key points and takeaways. I'd also be interested in hearing from people who've done policy work whether this seems accurate and whether it leaves anything important out; better yet, from people who tried using this as a guide! Overall, yeah, I weakly think this is worth including in a Best Of LessWrong collection.

drake-thomas on Drake Thomas's Shortform

On my model of what's going on, you probably want the lozenges to spend a while dissolving, so that you have fairly continuous exposure of throat and nasal tissue to the zinc ions. I find that they taste bad and astringent if I actively suck on them but are pretty unobtrusive if they just gradually dissolve over an hour or two (sounds like you had a similar experience). I sometimes cut the lozenges in half and let each half dissolve so that they fit into my mouth more easily, you might want to give that a try?

benjamin_todd on How quickly could robots scale up?

Yes - if anyone reading knows more about manufacturing and could comment on how easy it would be to convert, that would be very helpful.

I also agree it would be interesting to try to do more analysis of how much ASI and robotics could speed up construction of robot factories, by looking at different bottlenecks and how much they could help.

I'm not sure a robot workforce would have a huge effect initially, since there's already a large pool of human workers (though maybe you get some boost by making everything run 24/7). However, at later stages it might become hard to hire enough human workers, while with robots you could keep scaling.

liface on Does a site exist for keeping track of casual wagers with others?

Update: I'm pretty sure that making and running this site would be illegal in many jurisdictions.

sharmake-farah on Realism about rationality

To be fair, I expect a lot of the cases of identical copies modulo stochasticity to exist in the future, and indeed you could argue this has already happened for AI, but I expect it to be more and more relevant by default, so FDT working in the identical copies case is still a really valuable niche.