LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Managing risks while trying to do good
Wei Dai (Wei_Dai) · 2024-02-01T18:08:46.506Z · comments (26)

My hopes for alignment: Singular learning theory and whole brain emulation
Garrett Baker (D0TheMath) · 2023-10-25T18:31:14.407Z · comments (5)

[question] We might be dropping the ball on Autonomous Replication and Adaptation.
Charbel-Raphaël (charbel-raphael-segerie) · 2024-05-31T13:49:11.327Z · answers+comments (30)

Balancing Games
jefftk (jkaufman) · 2024-02-24T14:40:04.237Z · comments (18)

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

The proper response to mistakes that have harmed others?
Ruby · 2023-12-31T04:06:31.505Z · comments (12)

Balsa Update and General Thank You
Zvi · 2023-12-12T20:30:03.980Z · comments (8)

Natural Latents Are Not Robust To Tiny Mixtures
johnswentworth · 2024-06-07T18:53:36.643Z · comments (8)

Social status part 2/2: everything else
Steven Byrnes (steve2152) · 2024-03-05T16:29:19.072Z · comments (2)

Raemon's Deliberate (“Purposeful?”) Practice Club
Raemon · 2023-11-14T18:24:19.335Z · comments (11)

What is SB 1047 *for*?
Raemon · 2024-09-05T17:39:39.871Z · comments (8)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

[link] ‘We’re changing the clouds.’ An unforeseen test of geoengineering is fueling record ocean warmth
Annapurna (jorge-velez) · 2023-08-06T20:58:51.838Z · comments (6)

Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (3)

Base LLMs refuse too
Connor Kissane (ckkissane) · 2024-09-29T16:04:21.343Z · comments (20)

Interdictor Ship
lsusr · 2024-08-19T04:59:18.487Z · comments (9)

OpenAI’s new Preparedness team is hiring
leopold · 2023-10-26T20:42:35.966Z · comments (2)

What is "True Love"?
johnswentworth · 2024-08-18T16:05:47.358Z · comments (9)

[link] Results from an Adversarial Collaboration on AI Risk (FRI)
Josh Rosenberg (josh-rosenberg) · 2024-03-11T20:00:24.642Z · comments (3)

There Should Be More Alignment-Driven Startups
Vaniver · 2024-05-31T02:05:06.799Z · comments (14)

[link] Is Claude a mystic?
jessicata (jessica.liu.taylor) · 2024-06-07T04:27:09.118Z · comments (23)

[Intuitive self-models] 4. Trance
Steven Byrnes (steve2152) · 2024-10-08T13:30:41.446Z · comments (6)

My AI Predictions 2023 - 2026
HunterJay · 2023-10-16T00:50:52.968Z · comments (28)

On OpenAI Dev Day
Zvi · 2023-11-09T16:10:06.646Z · comments (0)

0th Person and 1st Person Logic
Adele Lopez (adele-lopez-1) · 2024-03-10T00:56:14.446Z · comments (28)

5 Physics Problems
DaemonicSigil · 2024-03-18T08:05:45.971Z · comments (0)

Originality vs. Correctness
alkjash · 2023-12-06T18:51:49.531Z · comments (16)

"Epistemic range of motion" and LessWrong moderation
habryka (habryka4) · 2023-11-27T21:58:40.834Z · comments (3)

An Actually Intuitive Explanation of the Oberth Effect
Isaac King (KingSupernova) · 2024-01-10T20:23:17.216Z · comments (33)

MATS Alumni Impact Analysis
utilistrutil · 2024-09-30T02:35:57.273Z · comments (6)

[question] What do we know about the AI knowledge and views, especially about existential risk, of the new OpenAI board members?
Zvi · 2024-03-11T14:55:05.128Z · answers+comments (2)

AI #25: Inflection Point
Zvi · 2023-08-17T14:40:06.940Z · comments (9)

The Sense Of Physical Necessity: A Naturalism Demo (Introduction)
LoganStrohl (BrienneYudkowsky) · 2024-02-24T02:56:31.458Z · comments (1)

LessOnline Festival Updates Thread
Ben Pace (Benito) · 2024-04-18T21:55:08.003Z · comments (26)

AI #81: Alpha Proteo
Zvi · 2024-09-12T13:00:07.958Z · comments (3)

[link] Linkpost: Surely you can be serious
kave · 2024-07-18T22:18:09.271Z · comments (7)

Thoughts on SB-1047
ryan_greenblatt · 2024-05-29T23:26:14.392Z · comments (1)

Measuring Coherence of Policies in Toy Environments
dx26 (dylan-xu) · 2024-03-18T17:59:08.118Z · comments (9)

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
Lidor Banuel Dabbah · 2024-07-19T20:32:15.095Z · comments (6)

[link] More people getting into AI safety should do a PhD
AdamGleave · 2024-03-14T22:14:48.855Z · comments (24)

On Frequentism and Bayesian Dogma
DanielFilan · 2023-10-15T22:23:10.747Z · comments (27)

[link] Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso (rhaps0dy) · 2024-07-25T22:00:55.398Z · comments (8)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (9)

Understanding SAE Features with the Logit Lens
Joseph Bloom (Jbloom) · 2024-03-11T00:16:57.429Z · comments (0)

[link] Towards shutdownable agents via stochastic choice
EJT (ElliottThornley) · 2024-07-08T10:14:24.452Z · comments (7)

[link] Linkpost for Jan Leike on Self-Exfiltration
Daniel Kokotajlo (daniel-kokotajlo) · 2023-09-13T21:23:09.239Z · comments (1)

AI #23: Fundamental Problems with RLHF
Zvi · 2023-08-03T12:50:11.852Z · comments (9)

[link] Are There Examples of Overhang for Other Technologies?
Jeffrey Heninger (jeffrey-heninger) · 2023-12-13T21:48:08.954Z · comments (50)

What's next for the field of Agent Foundations?
Nora_Ammann · 2023-11-30T17:55:13.982Z · comments (23)

Self-explaining SAE features
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-05T22:20:36.041Z · comments (13)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

The question of what BPA wants to do to Steve, seems to me to be far more important for Steve's safety, than the question of what set of rules will constrain the actions of BPA.

BPA shouldn't be allowed to want anything for Steve. There shouldn't be a term in its world-model for Steve. This is the goal of cosmic blocking. The BPA can't even know that Steve exists.

I think the difficult part is when BPA looks at Bob's preferences (excluding, of course, references to most specific people) and sees preferences for inflicting harm on people-in-general that can be bent just enough to fit into the "not-torture" bucket, and so it synthetically generates some new people and starts inflicting some kind of marginal harm on them.

And I think that this will in fact be a binding constraint on utopia, because most humans will (given the resources) want to make a personal utopia full of other humans that forms a status hierarchy with them at the top. And 'being forced to participate in a status hierarchy that you are not at the top of' is a type of 'generalized consensual harm'.

Even the good old Reedspacer's Lower Bound fits this model. Reedspacer wants a volcano lair full of catgirls, but the catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer. Arguably, that is a form of consensual harm to the catgirls.

So I don't think there is a neat boundary here. The neatest boundary is informed consent, perhaps backed up by some lower-level tests about what proportion of an entity's existence is actually miserable.

If Reedspacer beats his catgirls, makes them feel sad all the time, that matters. But maybe if one of them feels a little bit sad for a short moment that is acceptable.

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Steve will never become aware of what Bob is doing to OldSteve

But how would Bob know that he wanted to create OldSteve, if Steve has been deleted from his memory via a cosmic block?

I suppose perhaps Bob could create OldEve. Eve is in a similar but not identical point in personality space to Steve and the desire to harm people who are like Eve is really the same desire as the desire to harm people like Steve. So Bob's Extrapolated Volition could create OldEve, who somehow consents to being mistreated in a way that doesn't trigger your torture detection test.

This kind of 'marginal case of consensual torture' has popped up in other similar discussions. E.g. In Yvain's (Scott Alexander's) article on Archipelago there's this section:

"""A child who is abused may be too young to know that escape is an option, or may be brainwashed into thinking they are evil, or guilted into believing they are betraying their families to opt out. And although there is no perfect, elegant solution here, the practical solution is that UniGov enforces some pretty strict laws on child-rearing, and every child, no matter what other education they receive, also has to receive a class taught by a UniGov representative in which they learn about the other communities in the Archipelago, receive a basic non-brainwashed view of the world, and are given directions to their nearest UniGov representative who they can give their opt-out request to"""

So Scott Alexander's solution to OldSteve is that OldSteve must get a non-brainwashed education about how ELYSIUM/Archipelago works and be given the option to opt out.

I think the issue here is that "people who unwisely consent to torture even after being told about it" and "people who are willing and consenting submissives" is not actually a hard boundary.

radford-neal-1 on Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong

But the whole point of using probability to express uncertainty about the world is that the probabilities do not depend on the purpose.

If there are N possible observations, and M binary choices that you need to make, then a direct strategy for how to respond to an observation requires a table of size NxM, giving the actions to take for each possible observation. And you somehow have to learn this table.

In contrast, if the M choices all depend on one binary state of the world, you just need to have a table of probabilities of that state for each of the N observations, and a table of the utilities for the four action/state combinations for the M decisions - which have size proportional to N+M, much smaller than NxM for large N and M. You only need to learn the N probabilities (perhaps the utilities are givens).

And in reality, trying to make decisions without probabilities is even worse than it seems from this, since the set of decisions you may need to make is indefinitely large, and the number of possible observations is enormous. But avoiding having to make decisions by a direct observation->action table requires that probabilities have meaning independent of what decision you're considering at the moment. You can't just say that it could be 1/2, or could be 1/3...

austin-chen on Start an Upper-Room UV Installation Company?

Another similar company I want someone to start is one that produces inexpensive, self-installable far UV lamps. My understanding is that far UV is safe to shine directly on humans (as opposed to standard UV), meaning that you don't need high ceilings or special technicians to install the lamp. However, it's a much newer technology with not very much adoption or testing, I think because of a combination of principal/agent problems and price; see this post on blockers to Far UV adoption [LW · GW].

Beacon does produce these $800 lamps, which are consumer friendly-ish. I bought one for the Manifold office, but due to a variety of trivial inconveniences (figuring out where to mount it; the mobile app not syncing with my phone) it's still not active. I think a competent operator in this space could make a device that's somewhat cheaper & easier to use, and hit a tipping point for widespread/viral adoption.

(If you or someone you know is interested in doing this and is looking for funding, reach out to me at austin@manifund.org!)

deepthoughtlife on LLMs can learn about themselves by introspection

I obviously tend to go on at length about things when I analyze them. I'm glad when that's useful.

I had heard that OpenAI models aren't deterministic even at the lowest randomness, which I believe is probably due to optimizations for speed like how in image generation models (which I am more familiar with) the use of optimizers like xformers throws away a little correctness and determinism for significant improvements in resource usage. I don't know what OpenAI uses to run these models (I assume they have their own custom hardware?), but I'm pretty sure that it is the same reason. I definitely agree that randomness causes a cap on how well it could possibly do. On that point, could you determine the amount of indeterminacy in the system and put the maximum possible on your graphs for their models?

One thing I don't know if I got across in my comment based on the response is that I think if a model truly had introspective abilities to a high degree, it would notice that the basis of the result to such a question should be the same as its own process for the non-hypothetical comes up with. If it had introspection, it would probably use introspection as its default guess for both its own hypothetical behavior and that of any model (in people introspection is constantly used as a minor or sometimes major component of problem solving). Thus it would notice when its introspection got perfect scores and become very heavily dependent on it for this type of task, which is why I would expect its results to really just be 'run the query' for the hypothetical too.

Important point I perhaps should have mentioned originally, I think that the 'single forward pass' thing is in fact a huge problem for the idea of real introspection, since I believe introspection is a recursive task. You can perhaps do a single 'moment' of introspection on a single forward pass, but I'm not sure I'd even call that real introspection. Real introspection involves the ability to introspect about your introspection. Much like consciousness, it is very meta. Of course, the actual recursive depth of introspection at high fidelity usually isn't very deep, but we tell ourselves stories about our stories in an an almost infinitely deep manner (for instance, people have a 'life story' they think about and alter throughout their lives, and use their current story as an input basically always).

There are, of course, other architectures where that isn't a limitation, but we hardly use them at all (talking for the world at large, I'm sure there are still AI researchers working on such architectures). Honestly, I don't understand why they don't just use transformers in a loop with either its own ability to say when it has reached the end or with counters like 'pass 1 of 7'. (If computation is the concern, they could obviously just make it smaller.) They obviously used to have such recursive architectures and everyone in the field would be familiar with them (as are many laymen who are just kind of interested in how things work). I assume that means that people have tried and didn't find it useful enough to focus on, but I think it could help a lot with this kind of thing. (Image generation models actually kind of do this with diffusion, but they have a little extra unnecessary code in between, which are the actual diffusion parts.) I don't actually know why these architectures were abandoned besides there being a new shiny (transformers) though so there may be an obvious reason.

I would agree with you that these results do make it a more interesting research direction than other results would have, and it certainly seems worth "someone's" time to find out how it goes. I think a lot of what you are hoping to get out of it will fail, hopefully in ways that will be obvious to people, but it might fail in interesting ways.

I would agree that it is possible that introspection training is simply eliciting a latent capability that simply wasn't used in the initial training (though it would perhaps be interesting to train it on introspection earlier in its training and then simply continue with the normal training and see how that goes), I just think that finding a way to elicit it without retraining would be much better proof of its existence as a capability rather than as an artifact of goodharting things. I am often pretty skeptical about results across most/all fields where you can't do logical proofs due to this. Of course, even just finding the right prompt is vulnerable to this issue.

I think I don't agree that their being a cap on how much training is helpful necessarily indicates it is elicitation, but I don't really have a coherent argument on the matter. It just doesn't sound right to me.

The point you said you didn't understand was meant to point out (apparently unsuccessfully) that you use a different prompt for training than checking and it might also be worthwhile to train it on that style of prompting but with unrelated content. (Not that I know how you'd fit that style of prompting with a different style of content mind you.)

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority.

If there is an agent that controls 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone else, then assuming that ELYSIUM forbids them to do that, their rational move is to use their resources to prevent ELYSIUM from being built.

And since they control 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone who was trying to actually create ELYSIUM, they would likely succeed and ELYSIUM wouldn't happen.

Re:threats, see my other comment.

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefers death to compliance. Doing this costs them basically nothing.

This assumes that threats are allowed. If you allow threats within your system you are losing out on most of the value of trying to create an artificial utopia because you will recreate most of the bad dynamics of real history which ultimately revolve around threats of force in order to acquire resources. So, the ability to prevent entities from issuing threats that they then do not follow through on is crucial.

Improving the equilibria of a game is often about removing strategic options; in this case the goal is to remove the option of running what is essentially organized crime.

In the real world there are various mechanisms that prevent organized crime and protection rackets. If you threaten to use force on someone in exchange for resources, the mere threat of force is itself illegal at least within most countries and is punished by a loss of resources far greater than the threat could win.

People can still engage in various forms of protest that are mutually destructive of resources (AKA civil disobedience).

The ability to have civil disobedience without protection rackets does seem kind of crucial.

zach-stein-perlman on LLMs can learn about themselves by introspection

I'm confused/skeptical about this being relevant, I thought honesty is orthogonal to whether the model has access to its mental states.

eliezer_yudkowsky on The Hidden Complexity of Wishes

Your distinction between "outer alignment" and "inner alignment" is both ahistorical and unYudkowskian. It was invented years after this post was written, by someone who wasn't me; and though I've sometimes used the terms in occasions where they seem to fit unambiguously, it's not something I see as a clear ontological division, especially if you're talking about questions like "If we own the following kind of blackbox, would alignment get any easier?" which on my view breaks that ontology. So I strongly reject your frame that this post was "clearly portraying an outer alignment problem" and can be criticized on those grounds by you; that is anachronistic.

You are now dragging in a very large number of further inferences about "what I meant", and other implications that you think this post has, which are about Christiano-style proposals that were developed years after this post. I have disagreements with those, many disagreements. But it is definitely not what this post is about, one way or another, because this post predates Christiano being on the scene.

What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want. This point is true! If you then want to take in a bunch of anachronistic ideas developed later, and claim (wrongly imo) that this renders irrelevant the simple truth of what this post actually literally says, that would be a separate conversation. But if you're doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.

michael-roe on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

https://www.bbc.co.uk/news/technology-67012224