LessWrong 2.0 Reader

State of Generally Available Self-Driving
jefftk (jkaufman) · 2023-08-22T18:50:01.166Z · comments (6)
Scaffolding for "Noticing Metacognition"
Raemon · 2024-10-09T17:54:13.657Z · comments (4)
[link] Open Problems and Fundamental Limitations of RLHF
scasper · 2023-07-31T15:31:28.916Z · comments (6)
Stephen Wolfram on AI Alignment
Bill Benzon (bill-benzon) · 2023-08-20T19:49:28.953Z · comments (15)
How to be an amateur polyglot
arisAlexis (arisalexis) · 2024-05-08T15:08:11.404Z · comments (16)
Managing risks of our own work
Beth Barnes (beth-barnes) · 2023-08-18T00:41:30.832Z · comments (0)
Preventing model exfiltration with upload limits
ryan_greenblatt · 2024-02-06T16:29:33.999Z · comments (21)
AI #69: Nice
Zvi · 2024-06-20T12:40:02.566Z · comments (9)
[link] AI Safety Hub Serbia Soft Launch
DusanDNesic · 2023-10-20T07:11:48.389Z · comments (1)
[link] How LDT helps reduce the AI arms race
Tamsin Leake (carado-1) · 2023-12-10T16:21:44.409Z · comments (13)
[question] Will quantum randomness affect the 2028 election?
Thomas Kwa (thomas-kwa) · 2024-01-24T22:54:30.800Z · answers+comments (52)
Schelling game evaluations for AI control
Olli Järviniemi (jarviniemi) · 2024-10-08T12:01:24.389Z · comments (4)
METR is hiring!
Beth Barnes (beth-barnes) · 2023-12-26T21:00:50.625Z · comments (1)
Implementing activation steering
Annah (annah) · 2024-02-05T17:51:55.851Z · comments (7)
How a chip is designed
YM (Yannick_Muehlhaeuser) · 2024-06-28T08:04:27.392Z · comments (4)
List of how people have become more hard-working
Chi Nguyen · 2023-09-29T11:30:38.802Z · comments (7)
Do Not Mess With Scarlett Johansson
Zvi · 2024-05-22T15:10:03.215Z · comments (7)
AI #29: Take a Deep Breath
Zvi · 2023-09-14T12:00:03.818Z · comments (21)
[link] Static Analysis As A Lifestyle
adamShimi · 2024-07-03T18:29:37.384Z · comments (11)
AI Regulation May Be More Important Than AI Alignment For Existential Safety
otto.barten (otto-barten) · 2023-08-24T11:41:54.690Z · comments (39)
[link] The Perceptron Controversy
Yuxi_Liu · 2024-01-10T23:07:23.341Z · comments (18)
[question] What's with all the bans recently?
[deleted] · 2024-04-04T06:16:49.062Z · answers+comments (83)
2. Corrigibility Intuition
Max Harms (max-harms) · 2024-06-08T15:52:29.971Z · comments (10)
Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
Seth Herd · 2024-08-05T15:38:09.682Z · comments (22)
Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark (hojmax) · 2024-07-22T16:17:07.665Z · comments (0)
Interpreting and Steering Features in Images
Gytis Daujotas (gytis-daujotas) · 2024-06-20T18:33:59.512Z · comments (6)
“Dirty concepts” in AI alignment discourses, and some guesses for how to deal with them
Nora_Ammann · 2023-08-20T09:13:34.225Z · comments (4)
[link] So you want to save the world? An account in paladinhood
Tamsin Leake (carado-1) · 2023-11-22T17:40:33.048Z · comments (19)
Aumann-agreement is common
tailcalled · 2023-08-26T20:22:03.738Z · comments (31)
[link] DeepMind: Frontier Safety Framework
Zach Stein-Perlman · 2024-05-17T17:30:02.504Z · comments (0)
How to Control an LLM's Behavior (why my P(DOOM) went down)
RogerDearnaley (roger-d-1) · 2023-11-28T19:56:49.679Z · comments (30)
[link] The Gods of Straight Lines
Richard_Ngo (ricraz) · 2023-10-14T04:10:50.020Z · comments (13)
[link] GPT-4 for personal productivity: online distraction blocker
Sergii (sergey-kharagorgiev) · 2023-09-26T17:41:31.031Z · comments (12)
Book Review: On the Edge: The Fundamentals
Zvi · 2024-09-23T13:40:11.058Z · comments (3)
[link] Understanding strategic deception and deceptive alignment
Marius Hobbhahn (marius-hobbhahn) · 2023-09-25T16:27:47.357Z · comments (16)
Complex systems research as a field (and its relevance to AI Alignment)
Nora_Ammann · 2023-12-01T22:10:25.801Z · comments (11)
[link] A free to enter, 240 character, open-source iterated prisoner's dilemma tournament
Isaac King (KingSupernova) · 2023-11-09T08:24:43.277Z · comments (19)
Interpretability Externalities Case Study - Hungry Hungry Hippos
Magdalena Wache · 2023-09-20T14:42:44.371Z · comments (22)
[Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex (Stefan42) · 2024-07-05T17:05:25.631Z · comments (2)
A Social History of Truth
Vaniver · 2023-07-31T22:49:23.209Z · comments (2)
[link] What Does a Marginal Grant at LTFF Look Like? Funding Priorities and Grantmaking Thresholds at the Long-Term Future Fund
Linch · 2023-08-11T03:59:51.757Z · comments (0)
A to Z of things
KatjaGrace · 2023-11-17T05:20:03.134Z · comments (6)
Advice to junior AI governance researchers
Akash (akash-wasil) · 2024-07-08T19:19:07.316Z · comments (1)
Announcing New Beginner-friendly Book on AI Safety and Risk
Darren McKee · 2023-11-25T15:57:08.078Z · comments (2)
a rant on politician-engineer coalitional conflict
bhauth · 2023-09-04T17:15:25.765Z · comments (12)
On the Debate Between Jezos and Leahy
Zvi · 2024-02-06T14:40:05.487Z · comments (6)
Ideas for improving epistemics in AI safety outreach
mic (michael-chen) · 2023-08-21T19:55:45.654Z · comments (6)
"Is There Anything That's Worth More"
Zack_M_Davis · 2023-08-02T03:28:16.116Z · comments (6)
Superposition is not "just" neuron polysemanticity
LawrenceC (LawChan) · 2024-04-26T23:22:06.066Z · comments (4)
On the Gladstone Report
Zvi · 2024-03-20T19:50:05.186Z · comments (11)