Posts

Can anybody give/reference an ELI5 or ELI15 explanation of this example? How can we use the models without creating them? I know that gradient descent is used to update neural networks, but how can you get the predictions of those NNs without having them?

Comment by Q Home on Clarifying the Agent-Like Structure Problem · 2025-04-21T10:34:37.470Z · LW · GW

I feel very confused about the problem. Would appreciate anyone's help with the questions below.

Why doesn't the Gooder Regulator theorem solve the Agent-Like Structure Problem?
The separation between the "world model", "search process" and "problem specification" should be in space (not in time)? We should be able to carve the system into those parts, physically?
Why would problem specification nessecerily be outside of the world model??? I imagine it could be encoded as an extra object in the world model. Any intuition for why keeping them separate is good for the agent? (I'll propose one myself, see 5.)
Why are the "world model" and "search process" two different entities, what does each of them do? What is the fundamental difference between "modeling the world" and "searching"? Like, imagine I have different types of heuristics (A, B, C) for predicting the world, but I also can use them for search.
Doesn't the inner alignment problem resolve the Agent-Like Structure Problem? Let me explain. Take a human, e.g. me. I have a big, changing brain. Parts of my brain can be said to want different things. That's an instance of the inner alignment problem. And that's a reason why having my goals completely entangled with all other parts of my brain could be dangerous (in such case it could be easier for any minor misalignment to blow up and overwrite my entire personality).
As I understand, the arguments from here would at least partially solve the problem, right? If they were formalized.

Comment by Q Home on Q Home's Shortform · 2025-03-17T09:26:11.143Z · LW · GW

I have a couple of silly, absurd questions related to mesa-optimizers and mesa-controllers. I'm asking them to get a fresh look on the problem of inner alignment. I want to get a better grip on what basic properties of a model make it safe.

Question 1. How do we know that Quantum Mechanics theory is not plotting to kill humanity?

It's a model, so it could be unsafe just like an AI.

QM is not an agent, but its predictions strongly affect humanity. Oracles can be dangerous.
QM is highly interpretable, so we can check that it's not doing internal search. Or can we? Maybe it does search in some implicit way? Eliezer brought up this possibility: if you prohibit an AI from modeling its programmers' psychology, the AI might start modelling something seemingly irrelevant which is actually equivalent to modeling the programmers' psychology.

Maybe the AI reasons about certain very complicated properties of the material object on the pedestal… in fact, these properties are so complicated that they turn out to contain implicit models of User2′s psychology

Even if QM doesn't do search in any way... maybe it still was optimized to steer humanity towards disaster?

Or maybe QM is "grounded" in some special way (e.g. it's easy to split into parts and verify that each part is correct), so we're very confident that it does physics and only physics?

Question 2. Crazier version of the previous question: how do we know that Peano arithmetic isn't plotting to destroy humanity? how do we know that the game of chess isn't plotting to end humanity?

Maybe Peano arithmetic contains theorems trying to prove which steers the real world towards disaster. How can we know and when do we care?

Question 3. Imagine you came up with a plan to achieve your goals. You did it yourself. How do you know that this plan is not optimizing for your ruin?

Humans do go insane and fall into addictions. But not always. So why are our thoughts relatively safe to us? Why doesn't every new thought / experience turn into addiction which wipes out all of your previous personality?

Question 4. You're the Telepath. You can read the mind of the Killer. The Killer can reason about some things which aren't comprehensible to you, but otherwise your cognition is very similar. Can you always tell if the Killer is planning to kill you?

Here are some thoughts the Killer might think:

"I need to do <something incomprehensible> so the Telepath dies."
"I need to get the Telepath to eat this food with <something incomprehensible> in it."
"I need to do <something incomprehensible> without any comprehensible reason."

With 1 we can understand the outcome and that's all that matters. With 2 we can still tell that something dodgy is going on. Even in 3 we see that the Killer tries to make his reasoning illegible. Maybe the Killer can never deceive us if the incomprehensible concepts he's thinking about are "embedded" into the comprehensible concepts?

Comment by Q Home on Half-baked idea: a straightforward method for learning environmental goals? · 2025-02-12T01:33:59.164Z · LW · GW

So at first I though this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according the the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."

I assume we get an easily interpretable model where the difference between "real strawberries" and "pictures of strawberries" and "things sometimes correlated with strawberries" is easy to define, so we can use the model to directly pick the physical things AI should care about. I'm trying to address the problem of environmental goals, not the problem of teaching AI morals. Or maybe I'm misunderstanding your point?

The object level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields ("What they want is to obey the laws of physics. What they believe is their local state."), or your individual cells, etc.

If you're talking about AI learning morals, my idea is not about that. Not about modeling desires and beliefs.

The meta level problem is that trying to get the AI to assign properties in a human-approved way is a complicated problem that you can only do so well without communicating with humans. (John Wentworth disagrees more or less, check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)

I disagree too, but in a slightly different way. IIRC, John says approximately the following:

All reasoning systems converge on the same space of abstractions. This space of abstractions is the best way to model the universe.
In this space of abstractions it's easy to find the abstraction corresponding to e.g. real diamonds.

I think (1) doesn't need to be true. I say:

By default, humans only care about things they can easily interact with in humanly comprehensible ways. "Things which are easy to interact with in humanly comprehensible ways" should have a simple definition.
Among all "things which are easy to interact with in humanly comprehensible ways", it's easy to find the abstraction corresponding to e.g. real diamonds.

Comment by Q Home on Half-baked idea: a straightforward method for learning environmental goals? · 2025-02-08T09:53:16.077Z · LW · GW

The subproblem of environmental goals is just to make AI care about natural enough (from the human perspective) "causes" of sensory data, not to align AI to the entirety of human values. Fundamental variables have no (direct) relation to the latter problem.

However, fundamental variables would be helpful for defining impact measures if we had a principled way to differentiate "times when it's OK to sidestep fundamental variables" from "times when it's NOT OK to sidestep fundamental variables". That's where the things you're talking about definitely become a problem. Or maybe I'm confused about your point.

Comment by Q Home on Half-baked idea: a straightforward method for learning environmental goals? · 2025-02-07T11:01:31.104Z · LW · GW

Thank you for actually engaging with the idea (pointing out problems and whatnot) rather than just suggesting reading material.

Btw, would you count a data packet as an object you move through space?

A couple of points:

I only assume AI models the world as "objects" moving through space and time, without restricting what those objects could be. So yes, a data packet might count.
"Fundamental variables" don't have to capture all typical effects of humans on the world, they only need to capture typical human actions which humans themselves can easily perceive and comprehend. So the fact that a human can send an Internet message at 2/3 speed of light doesn't mean that "2/3 speed of light" should be included in the range of fundamental variables, since humans can't move and react at such speeds.
Conclusion: data packets can be seen as objects, but there are many other objects which are much easier for humans to interact with.
Also note that fundamental variables are not meant to be some kind of "moral speed limits", prohibiting humans or AIs from acting at certain speeds. Fundamental variables are only needed to figure out what physical things humans can most easily interact with (because those are the objects humans are most likely to care about).

This range is quite huge. In certain contexts, you'd want to be moving through space at high fractions of the speed of light, rather than walking speed. Same goes for moving other objects through space.

What contexts do you mean? Maybe my point about "moral speed limits" addresses this.

Hopefully the AI knows you mean moving in sync with Earth's movement through space.

Yes, relativity of motion is a problem which needs to be analyzed. Fundamental variables should refer to relative speeds/displacements or something.

The paper is surely at least partially relevant, but what's your own opinion on it? I'm confused about this part: (4.2 Defining Utility Functions in Terms of Learned Models)

For example a person may be specified by textual name and address, by textual physical description, and by images and other recordings. There is very active research on recognizing people and objects by such specifications (Bishop, 2006; Koutroumbas and Theodoris, 2008; Russell and Norvig, 2010). This paper will not discuss the details of how specifications can be matched to structures in learned environment models, but assumes that algorithms for doing this are included in the utility function implementation.

Does it just completely ignore the main problem?

I know Abram Demski wrote about Model-based Utility Functions, but I couldn't fully understand his post too.

(Disclaimer: I'm almost mathematically illiterate, except knowing a lot of mathematical concepts from popular materials. Halting problem, Godel, uncountability, ordinals vs. cardinals, etc.)

Comment by Q Home on Q Home's Shortform · 2025-01-28T07:56:40.996Z · LW · GW

Epistemic status: Draft of a post. I want to propose a method of learning environmental goals (a super big, super important subproblem in Alignment). It's informal, so has a lot of gaps. I worry I missed something obvious, rendering my argument completely meaningless. I asked LessWrong feedback team, but they couldn't get someone knowledgeable enough to take a look.

Can you tell me the biggest conceptual problems of my method? Can you tell me if agent foundations researchers are aware of this method or not?

If you're not familiar with the problem, here's the context: Environmental goals; identifying causal goal concepts from sensory data; ontology identification problem; Pointers Problem; Eliciting Latent Knowledge.

Explanation 1

One naive solution

Imagine we have a room full of animals. AI sees the room through a camera. How can AI learn to care about the real animals in the room rather than their images on the camera?

Assumption 1. Let's assume AI models the world as a bunch of objects interacting in space and time. I don't know how critical or problematic this assumption is.

Idea 1. Animals in the video are objects with certain properties (they move continuously, they move with certain relative speeds, they have certain sizes, etc). Let's make the AI search for the best world-model which contains objects with similar properties (P properties).

Problem 1. Ideally, AI will find clouds of atoms which move similarly to the animals on the video. However, AI might just find a world-model (X) which contains the screen of the camera. So it'll end up caring about "movement" of the pixels on the screen. Fail.

Observation 1. Our world contains many objects with P properties which don't show up on the camera. So, X is not the best world-model containing the biggest number of objects with P properties.

Idea 2. Let's make the AI search for the best world-model containing the biggest number of objects with P properties.

Question 1. For "Idea 2" to make practical sense, we need to find a smart way to limit the complexity of the models. Otherwise AI might just make any model contain arbitrary amounts of any objects. Can we find the right complexity prior?

Question 2. Assume we resolved the previous question positively. What if "Idea 2" still produces an alien ontology humans don't care about? Can it happen?

Question 3. Assume everything works out. How do we know that this is a general method of solving the problem? We have an object in sense data (A), we care about the physical thing corresponding to it (B): how do we know B always behaves similarly to A and there are always more instances of B than of A?

One philosophical argument

I think there's a philosophical argument which allows to resolve Questions 2 & 3 (giving evidence that Question 1 should be resolvable too).

By default, we only care about objects with which we can "meaningfully" interact with in our daily life. This guarantees that B always has to behave similarly to A, in some technical sense (otherwise we wouldn't be able to meaningfully interact with B). Also, sense data is a part of reality, so B includes A, therefore there are always more instances of B than of A, in some technical sense. This resolves Question 3.
By default, we only care about objects with which we can "meaningfully" interact with in our daily life. This guarantees that models of the world based on such objects are interpretable. This resolves Question 2.
Can we define what "meaningfully" means? I think that should be relatively easy, at least in theory. There doesn't have to be One True Definition Which Covers All Cases.

If the argument is true, the pointers problem should be solvable without Natural Abstraction hypothesis being true.

Anyway, I'll add a toy example which hopefully helps to better understand what's this all about.

One toy example

You're inside a 3D video game. 1st person view. The game contains landscapes and objects, both made of small balls (the size of tennis balls) of different colors. Also a character you control.

The character can push objects. Objects can break into pieces. Physics is Newtonian. Balls are held together by some force. Balls can have dramatically different weights.

Light is modeled by particles. Sun emits particles, they bounce off of surfaces.

The most unusual thing: as you move, your coordinates are fed into a pseudorandom number generator. The numbers from the generator are then used to swap places of arbitrary balls.

You care about pushing boxes (as everything, they're made of balls too) into a certain location.

...

So, the reality of the game has roughly 5 levels:

The level of sense data (2D screen of the 1st person view).
A. The level of ball structures. B. The level of individual balls.
A. The level of waves of light particles. B. The level of individual light particles.

I think AI should be able to figure out that it needs to care about 2A level of reality. Because ball structures are much simpler to control (by doing normal activities with the game's character) than individual balls. And light particles are harder to interact with than ball structures, due to their speed and nature.

Explanation 2

An alternative explanation of my argument:

Imagine activities which are crucial for a normal human life. For example: moving yourself in space (in a certain speed range); moving other things in space (in a certain speed range); staying in a single spot (for a certain time range); moving in a single direction (for a certain time range); having varied visual experiences (changing in a certain frequency range); etc. Those activities can be abstracted into mathematical properties of certain variables (speed of movement, continuity of movement, etc). Let's call them "fundamental variables". Fundamental variables are defined using sensory data or abstractions over sensory data.
Some variables can be optimized (for a long enough period of time) by fundamental variables. Other variables can't be optimized (for a long enough period of time) by fundamental variables. For example: proximity of my body to my bed is an optimizable variable (I can walk towards the bed — walking is a normal activity); the amount of things I see is an optimizable variable (I can close my eyes or hide some things — both actions are normal activities); closeness of two particular oxygen molecules might be a non-optimizable variable (it might be impossible to control their positions without doing something weird).
By default, people only care about optimizable variables. Unless there are special philosophical reasons to care about some obscure non-optimizable variable which doesn't have any significant effect on optimizable variables.
You can have a model which describes typical changes of an optimizable variable. Models of different optimizable variables have different predictive power. For example, "positions & shapes of chairs" and "positions & shapes of clouds of atoms" are both optimizable variables, but models of the latter have much greater predictive power. Complexity of the models needs to be limited, by the way, otherwise all models will have the same predictive power.
Collateral conclusions: typical changes of any optimizable variable are easily understandable by a human (since it can be optimized by fundamental variables, based on typical human activities); all optimizable variables are "similar" to each other, in some sense (since they all can be optimized by the same fundamental variables); there's a natural hierarchy of optimizable variables (based on predictive power). Main conclusion: while the true model of the world might be infinitely complex, physical things which ground humans' high-level concepts (such as "chairs", "cars", "trees", etc.) always have to have a simple model (which works most of the time, where "most" has a technical meaning determined by fundamental variables).

Formalization

So, the core of my idea is this:

AI is given "P properties" which a variable of its world-model might have. (Let's call a variable with P properties P-variable.)
AI searches for a world-model with the biggest amount of P-variables. AI makes sure it doesn't introduce useless P-variables. We also need to be careful with how we measure the "amount" of P-variables: we need to measure something like "density" rather than "amount" (i.e. the amount of P-variables contributing to a particular relevant situation, rather than the amount of P-variables overall?).
AI gets an interpretable world-model (because P-variables are highly interpretable), adequate for defining what we care about (because by default, humans only care about P-variables).

How far are we from being able to do something like this? Are agent foundations researches pursuing this or something else?

Comment by Q Home on Q Home's Shortform · 2025-01-18T03:43:34.316Z · LW · GW

Sorry if it's not appropriate for this site. But is anybody interested in chess research? I've seen that people here might be interested in chess. For example, here's a chess post barely related to AI.

Intro

In chess, what positions have the longest forced wins? "Mate in N" positions can be split into 3 types:

Positions which use "tricks" to get a big number of moves before checkmate. Such as cycles of repeating moves. For example, this manmade mate in 415 (see the last position) uses obvious cycles. Not to mention mates in omega.
Tablebase checkmates, discovered by brute force, showing absolutely incomprehensible play with no discernible logic. See this mate in 549 moves. One should assume it's based on some hidden cycles or something?
Positions which are similar to immortal games. Where the winning variation requires a combination without any cycles. For example: Kasparov's Immortal (14 moves long combination), Stoofvlees vs. Igel (down a rook for 21 moves) - the examples lack optimal play tho.

Surprisingly, nobody seems to look for the longest mates of Type 3. Well, I did look for them and discovered some. Down below I'll explain multiple ways to define what exactly I did. Won't go into too much detail. If you want more detail - Research idea: the longest non-trivial middlegames. There you also can see the puzzles I've created.

My longest puzzle is 42 moves: https://lichess.org/study/sTon08Mb/JG4YGbcP Overall, I've created 7 unique puzzles. Worked a lot on 1 more (mate in 52 moves), but couldn't make it work.

Among other things, I made this absurd mate in 34 puzzle. Almost the entire board is filled with pieces (62 pieces on the board!), only two squares are empty. And despite that the position has deep content. It's kinda a miracle. I think it deserves recognition.

Definition 1

Unlike Type 1 and Type 2 mates, my mates involve many sacrifices of material. So my mates can be defined as "the longest sacrificial combinations".

Definition 2

We can come up with important metrics which make a long mate more special, harder to find, more rare. Material disbalance, amount of non-check moves, amount of freedom of pieces, etc. Then we can search for the longest mates compatible with high enough values of those metrics.

Well, that's what I did.

Definition 3

This is an idea of a definition rather than a definition. But it might be important.

Take a sequential game with perfect information.
Take positions with the longest forced wins.
Out of those positions, choose positions where the defending side has the greatest control over the attacking side's optimal strategy.

My mates are an example of positions where the defending side has especially great control over the flow of the game.

Deeper meaning?

Can there be any deep meaning behind researching my type of mates? I think yes. There are two relevant things.

First thing is hard to explain, because I'm not a mathematician. But I'll try. Math can often be seen as skipping stuff which is the most interesting to humans. For example, math can prove theorems about games in general, without explaining why a specific game is interesting or why a specific position is interesting. However, here it seems like we can define something very closely related to subjective "interestingness".
Hardness of defining valuable things is relevant to Alignment. The definitions above reveal that maybe sometimes valuable things are easier to define than it seems.

Reception

How did chess community receive my work?

On Reddit, some posts got a moderate amount of upvotes (enough to get into daily top). A silly middlegame position. With checkmate in 50-80 moves? (110+); Does this position set any record? (60+). Sadly the pattern didn't continue: New long non-trivial middlegame mate found. Nobody asked for this. (1).
On a computer chess forum, people mostly ignored it. I hoped they could help me find the longest attacks in computer games.
On the Discord of chess composers, a bunch of people complimented my project. But nobody showed any proactive interest (e.g. "hey, I'd like to preserve your work"). One person reacted like ~"I'm not a specialist on that type of thing, I don't know with whom you could talk about that"
On Reddit communities where you can ask mathematicians things, people told that game theory is too abstract for tackling such things.

Comment by Q Home on Making a conservative case for alignment · 2024-11-30T04:56:12.733Z · LW · GW

Agree that neopronouns are dumb. Wikipedia says they're used by 4% LGBTQ people and criticized both within and outside the community.

But for people struggling with normal pronouns (he/she/they), I have the following thoughts:

Contorting language to avoid words associated with beliefs... is not easier than using the words. Don't project beliefs onto words too hard.
Contorting language to avoid words associated with beliefs... is still a violation of free speech (if we have such a strong notion of free speech). So what is the motivation to propose that? It's a bit like a dog in the manger. "I'd rather cripple myself than help you, let's suffer together".
Don't maximize free speech (in a negligible way) while ignoring every other human value.
In an imperfect society, truly passive tolerance (tolerance which doesn't require any words/actions) is impossible. For example, in a perfect society, if my school has bigoted teachers, it immediately gets outcompeted by a non-bigoted school. In an imperfect society it might not happen. So we get enforceable norms.

Employees get paid, which kinda automatically reduces their free speech, because saying the wrong words can make them stop getting paid. (...) Employment is really a different situation. You get laws, and recommendations of your legal department; there is not much anyone can do about that.

I'm not familiar with your model of free speech (i.e. how you imagine free speech working if laws and power balances were optimal). People who value free speech usually believe that free speech should have power above money and property, to a reasonable degree. What's "reasonable" is the crux.

I think in situations where people work together on something unrelated to their beliefs, prohibiting to enforce a code of conduct is unreasonable. Because respect is crucial for the work environment and protecting marginalized groups. I assume people who propose to "call everyone they" or "call everyone by proper name" realize some of that.

If I let people use my house as a school, but find out that a teacher openly doesn't respect minority students (by rejecting to do the smallest thing for them), I'm justified to not let the teacher into my house.

I do not talk about people's past for no good reason, and definitely not just to annoy someone else. But if I have a good reason to point out that someone did something in the past, and the only way to do that is to reveal their previous name, then I don't care about the taboo.

I just think "disliking deadnaming under most circumstances = magical thinking, like calling Italy Rome" was a very strong, barely argued/explained opinion. In tandem with mentioning delusion (Napoleon) and hysteria. If you want to write something insulting, maybe bother to clarify your opinions a little bit more? Like you did in our conversation.

Comment by Q Home on Making a conservative case for alignment · 2024-11-29T08:38:36.413Z · LW · GW

I think there should be more spaces where controversial ideas can be debated. I'm not against spaces without pronoun rules, just don't think every place should be like this. Also, if we create a space for political debate, we need to really make sure that the norms don't punish everyone who opposes centrism & the right. (Over-sensitive norms like "if you said that some opinion is transphobic you're uncivil/shaming/manipulative and should get banned" might do this.) Otherwise it's not free speech either. Will just produce another Grey or Red Tribe instead of Red/Blue/Grey debate platform.

I do think progressives underestimate free speech damage. To me it's the biggest issue with the Left. Though I don't think they're entirely wrong about free speech.

For example, imagine I have trans employees. Another employee (X) refuses to use pronouns, in principle (using pronouns is not the same as accepting progressive gender theories). Why? Maybe X thinks my trans employees live such a great lie that using pronouns is already an unacceptable concession. Or maybe X thinks that even trying to switch "he" & "she" is too much work, and I'm not justified in asking to do that work because of absolute free speech. Those opinions seem unnecessarily strong and they're at odds with the well-being of my employees, my work environment. So what now? Also, if pronouns are an unacceptable concession, why isn't calling a trans woman by her female name an unacceptable concession?

Imagine I don't believe something about a minority, so I start avoiding words which might suggest otherwise. If I don't believe that gay love can be as true as straight love, I avoid the word "love" (in reference to gay people or to anybody) at work. If I don't believe that women are as smart as men, I avoid the word "master" / "genius" (in reference to women or anybody) at work. It can get pretty silly. Will predictably cost me certain jobs.

Comment by Q Home on Making a conservative case for alignment · 2024-11-28T11:18:14.793Z · LW · GW

I'll describe my general thoughts, like you did.

I think about transness in a similar way to how I think about homo/bisexuality.

If homo/bisexuality is outlawed, people are gonna suffer. Bad.
If I could erase homo/bisexuality from existence without creating suffering, I wouldn't anyway. Would be a big violation of people's freedom to choose their identity and actions (even if in practice most people don't actually "choose" to be homo/bisexual).
Different people have homo/bisexuality of different "strength" and form. One man might fall in love with another man, but dislike sex or even kissing. Maybe he isn't a real homosexual, if he doesn't need to prove it physically? Another man might identify as a bisexual, but be in a relationship with a woman... he doesn't get to prove his bisexuality (sexually or romantically). Maybe we shouldn't trust him unless he walks the talk? As a result of all such situations, we might have certain "inconsistencies": some people identifying as straight have done more "gay" things than people identifying as gay. My opinion on this? I think all of this is OK. Pushing for an "objective gay test" would be dystopian and suffering-inducing. I don't think it's an empirical matter (unless we choose it to be, which is a value-laden choice). Even if it was, we might be very far away from resolving it. So just respecting people's self-identification in the meantime is best, I believe. Moreover, a lot of this is very private information anyway. Less reason to try measuring it "objectively".

My thoughts about transness specifically:

We strive for gender equality (I hope). Which makes the concept of gender less important for society as a whole.
The concept of gender is additionally damaged by all the things a person can decide to do in their social/sexual life. For example, take an "assigned male at birth" (AMAB) person. AMAB can appear and behave very feminine without taking hormones. Or vice-versa (take hormones, get a pair of boobs, but present masculine). Additionally there are different degrees of medical transition and different types of sexual preferences.
A lot of things which make someone more or less similar to a man/woman (behavior with friends, behavior with romantic partners, behavior with sexual partners, thoughts) are private. Less reason to try measuring those "objectively".
I have a choice to respect people's self-identified genders or not. I decide to respect them. Not just because I care about people's feelings, but also because of points 1 & 2 & 3 and because of my general values (I show similar respect to homo/bisexuals). So I respect pronouns, but on top of that I also respect if someone identifies as a man/woman/nonbinary. I believe respect is optimal in terms of reducing suffering and adhering to human values.

When I compare your opinion to mine, most of my confusion is about two things: what exactly do you see as an empirical question? how does the answer (or its absence) affect our actions?

Zack insists that Blanchard is right, and that I fail at rationality if I disagree with him. People on Twitter and Reddit insist that Blanchard is wrong, and that I fail at being a decent human if I disagree with them. My opinion is that I have no comparative advantage at figuring out who is right and who is wrong on this topic, or maybe everyone is wrong, anyway it is an empirical question and I don't have the data. I hope that people who have more data and better education will one day sort it out, but until that happens, my position firmly remains "I don't know (and most likely neither do you), stop bothering me".

I think we need to be careful to not make a false equivalence here:

Trans people want us to respect their pronouns and genders.
I'm not very familiar with Blanchard, so far it seems to me like Blanchard's work is (a) just a typology for predicting certain correlations and (b) this work is sometimes used to argue that trans people are mistaken about their identities/motivations.

2A is kinda tangential to 1. So is this really a case of competing theories? I think uncertainty should make one skeptical of Blanchard work's implications rather than make one skeptical about respecting trans people.

(Note that this is about the representatives, not the people being represented. Two trans people can have different opinions, but you are required to believe the woke one and oppose the non-woke one.) Otherwise, you are transphobic. I completely reject that.

Two homo/bisexuals can have different opinions on what's "true homo/bisexuality" is too. Some opinions can be pretty negative. Yes, that's inconvenient, but that's just an expected course of events.

Shortly: disagreement is not hate. But it often gets conflated, especially in environments that overwhelmingly contain people of one political tribe.

I feel it's just the nature of some political questions. Not in all questions, not in all spaces you can treat disagreement as something benign.

But if there is a person who actually feels dysphoria from not being addressed as "ve" (someone who would be triggered by calling them any of: "he", "she", or "they"), then I believe that this is between them and their psychiatrist, and I want to be left out of this game.

Agree. Also agree that lynching for accidental misgendering is bad.

(That's when you get the "attack helicopters" as an attempt to point out the absurdity of the system.)

I'm pretty sure the helicopter argument began as an argument against trans people, not as an argument against weird-ass novel pronouns.

Comment by Q Home on Q Home's Shortform · 2024-11-27T08:44:01.944Z · LW · GW

Draft of a future post, any feedback is welcome. Continuation of a thought from this shortform post.

(picture: https://en.wikipedia.org/wiki/Drawing_Hands)

The problem

There's an alignment-related problem: how do we make an AI care about causes of a particular sensory pattern? What are "causes" of a particular sensory pattern in the first place? You want the AI to differentiate between "putting a real strawberry on a plate" and "creating a perfect illusion of a strawberry on a plate", but what's the difference between doing real things and creating perfect illusions, in general?

(Relevant topics: environmental goals; identifying causal goal concepts from sensory data; "look where I'm pointing, not at my finger"; Pointers Problem; Eliciting Latent Knowledge; symbol grounding problem; ontology identification problem.)

I have a general answer to those questions. My answer is very unfinished. Also it isn't mathematical, it's philosophical in nature. But I believe it's important anyway. Because there's not a lot of philosophical or non-philosophical ideas about the questions above. With questions like these you don't know where to even start thinking, so it's hard to imagine even a bad answer.

Obvious observations

Observation 1. Imagine you come up with a model which perfectly predicts your sensory experience (Predictor). Just having this model is not enough to understand causes of a particular sensory pattern, i.e. differentiate between stuff like "putting a real strawberry on a plate" and "creating a perfect illusion of a strawberry on a plate".

Observation 2. Not every Predictor has variables which correspond to causes of a particular sensory pattern. Not every Predictor can be used to easily derive something corresponding to causes of a particular sensory pattern. For example, some Predictors might make predictions by simulating a large universe with a superintelligent civilization inside which predicts your sensory experiences. See "Transparent priors".

The solution

So, what are causes of a particular sensory pattern?

"Recursive Sensory Models" (RSMs).

I'll explain what an RSM is and provide various examples.

What is a Recursive Sensory Model?

An RSM is a sequence of N models (Model 1, Model 2, ..., Model N) for which the following two conditions hold true:

Model (K + 1) is good at predicting more aspects of sensory experience than Model (K). Model (K + 2) is good at predicting more aspects than Model (K + 1). And so on.
Model 1 can be transformed into any of the other models according to special transformation rules. Those rules are supposed to be simple. But I can't give a fully general description of those rules. That's one of the biggest unfinished parts of my idea.

The second bullet point is kinda the most important one, but it's very underspecified. So you can only get a feel for it through looking at specific examples.

Core claim: when the two conditions hold true, the RSM contains easily identifiable "causes" of particular sensory patterns. The two conditions are necessary and sufficient for the existence of such "causes". The universe contains "causes" of particular sensory patterns to the extent to which statistical laws describing the patterns also describe deeper laws of the universe.

Example: object permanence

Imagine you're looking at a landscape with trees, lakes and mountains. You notice that none of those objects disappear.

It seems like a good model: "most objects in the 2D space of my vision don't disappear". (Model 1)

But it's not perfect. When you close your eyes, the landscape does disappear. When you look at your feet, the landscape does disappear.

So you come up with a new model: "there is some 3D space with objects; the space and the objects are independent from my sensory experience; most of the objects don't disappear". (Model 2)

Model 2 is better at predicting the whole of your sensory experience.

However, note that the "mathematical ontology" of both models is almost identical. (Both models describe spaces whose points can be occupied by something.) They're just applied to slightly different things. That's why "recursion" is in the name of Recursive Sensory Models: an RSM reveals similarities between different layers of reality. As if reality is a fractal.

Intuitively, Model 2 describes "causes" (real trees, lakes and mountains) of sensory patterns (visions of trees, lakes and mountains).

Example: reductionism

You notice that most visible objects move smoothly (don't disappear, don't teleport).

"Most visible objects move smoothly in a 2D/3D space" is a good model for predicting sensory experience. (Model 1)

But there's a model which is even better: "visible objects consist of smaller and invisible/less visible objects (cells, molecules, atoms) which move smoothly in a 2D/3D space". (Model 2)

However, note that the mathematical ontology of both models is almost identical.

Intuitively, Model 2 describes "causes" (atoms) of sensory patterns (visible objects).

Example: a scale model

Imagine you're alone in a field with rocks of different size and a scale model of the whole environment. You've already learned object permanence.

"Objects don't move in space unless I push them" is a good model for predicting sensory experience. (Model 1)

But it has a little flaw. When you push a rock, the corresponding rock in the scale model moves too. And vice-versa.

"Objects don't move in space unless I push them; there's a simple correspondence between objects in the field and objects in the scale model" is a better model for predicting sensory experience. (Model 2)

However, note that the mathematical ontology of both models is identical.

Intuitively, Model 2 describes a "cause" (the scale model) of sensory patterns (rocks of different size being at certain positions). Though you can reverse the cause and effect here.

Example: empathy

If you put your hand on a hot stove, you quickly move the hand away. Because it's painful and you don't like pain. This is a great model (Model 1) for predicting your own movements near a hot stove.

But why do other people avoid hot stoves? If another person touches a hot stove, pain isn't instantiated in your sensory experience.

Behavior of other people can be predicted with this model: "people have similar sensory experience and preferences, inaccessible to each other". (Model 2)

However, note that the mathematical ontology of both models is identical.

Intuitively, Model 2 describes a "cause" (inaccessible sensory experience) of sensory patterns (other people avoiding hot stoves).

Counterexample: a chaotic universe

Imagine yourself in a universe where your sensory experience is produced by very simple, but very chaotic laws. Despite the chaos, your sensory experience contains some simple, relatively stable patterns. Purely by accident.

In such universe, RSMs might not find any "causes" underlying particular sensory patterns (except the simple chaotic laws).

But in such case there are probably no "causes".

Comment by Q Home on Making a conservative case for alignment · 2024-11-27T05:58:36.938Z · LW · GW

Napoleon is merely an argument for "just because you strongly believe it, even if it is a statement about you, does not necessarily make it true".

When people make arguments, they often don't list all of the premises. That's not unique to trans discourse. Informal reasoning is hard to make fully explicit. "Your argument doesn't explicitly exclude every counterexample" is a pretty cheap counter-argument. What people experience is important evidence and an important factor, it's rational to bring up instead of stopping yourself with "wait, I'm not allowed to bring that up unless I make an analytically bulletproof argument". For example, if you trust someone that they feel strongly about being a woman, there's no reason to suspect them of being a cosplayer who chases Twitter popularity.

I expect that you will disagree with a lot of this, and that's okay; I am not trying to convince you, just explaining my position.

I think I still don't understand the main conflict which bothers you. I thought it was "I'm not sure if trans people are deluded in some way (like Napoleons, but milder) or not". But now it seems like "I think some people really suffer and others just cosplay, the cosplayers take something away from true sufferers". What is taken away?

Comment by Q Home on Making a conservative case for alignment · 2024-11-26T06:06:33.338Z · LW · GW

Even if we assume that there should be a crisp physical cause of "transness" (which is already a value-laden choice), we need to make a couple of value-laden choices before concluding if "being trans" is similar to "believing you're Napoleon" or not. Without more context it's not clear why you bring up Napoleon. I assume the idea is "if gender = hormones (gender essentialism), and trans people have the right hormones, then they're not deluded". But you can arrive at the same conclusion ("trans people are not deluded") by means other than gender essentialism.

I assume that for trans people being trans is something more than mere "choice"

There doesn't need to be a crisp physical cause of "transness" for "transness" to be more than mere choice. There's a big spectrum between "immutable physical features" and "things which can be decided on a whim".

If you introduce yourself as "Jane" today, I will refer to you as "Jane". But if 50 years ago you introduced yourself as "John", that is a fact about the past. I am not saying that "you were John" as some kind of metaphysical statement, but that "everyone, including you, referred to you as John" 50 years ago, which is a statement of fact.

This just explains your word usage, but doesn't make a case that disliking deadnaming is magical thinking.

I've decided to comment because bringing up Napoleon, hysteria and magical thinking all at once is egregiously bad faith. I think it's not a good epistemic norm to imply something like "the arguments of the outgroup are completely inconsistent trash" without elaborating.

Comment by Q Home on Making a conservative case for alignment · 2024-11-25T05:59:13.577Z · LW · GW

There are people who feel strongly that they are Napoleon. If you want to convince me, you need to make a stronger case than that.

It's confusing to me that you go to "I identify as an attack helicopter" argument after treating biological sex as private information & respecting pronouns out of politeness. I thought you already realize that "choosing your gender identity" and "being deluded you're another person" are different categories.

If someone presented as male for 50 years, then changed to female, it makes sense to use "he" to refer to their first 50 years, especially if this is the pronoun everyone used at that time. Also, I will refer to them using the name they actually used at that time. (If I talk about the Ancient Rome, I don't call it Italian Republic either.) Anything else feels like magical thinking to me.

The alternative (using new pronouns / name) makes perfect sense too, due to trivial reasons, such as respecting a person's wishes. You went too far calling it magical thinking. A piece of land is different from a person in two important ways: (1) it doesn't feel anything no matter how you call it, (2) there's less strong reasons to treat it as a single entity across time.

Comment by Q Home on Evolution's selection target depends on your weighting · 2024-11-20T02:49:49.380Z · LW · GW

Meta-level comment: I don't think it's good to dismiss original arguments immediately and completely.

Object-level comment:

Neither of those claims has anything to do with humans being the “winners” of evolution.

I think it might be more complicated than that:

We need to define what "a model produced by a reward function" means, otherwise the claims are meaningless. Like, if you made just a single update to the model (based on the reward function), calling it "a model produced by the reward function" is meaningless ('cause no real optimization pressure was applied). So we do need to define some goal of optimization (which determines who's a winner and who's a loser).
We need to argue that the goal is sensible. I.e. somewhat similar to a goal we might use while training our AIs.

Here's some things we can try:

We can try defining all currently living species as winners. But is it sensible? Is it similar to a goal we would use while training our AIs? "Let's optimize our models for N timesteps and then use all surviving models regardless of any other metrics" <- I think that's not sensible, especially if you use an algorithm which can introduce random mutations into the model.
We can try defining species which avoided substantial changes for the longest time as winners. This seems somewhat sensible, because those species experienced the longest optimization pressure. But then humans are not the winners.
We can define any species which gained general intelligence as winners. Then humans are the only winners. This is sensible because of two reasons. First, with general intelligence deceptive alignment is possible: if humans knew that Simulation Gods optimize organisms for some goal, humans could focus on that goal or kill all competing organisms. Second, many humans (in our reality) value creating AGI more than solving any particular problem.

I think the later is the strongest counter-argument to "humans are not the winners".

Comment by Q Home on Q Home's Shortform · 2024-11-19T06:26:58.321Z · LW · GW

My point is that chairs and humans can be considered in a similar way.

Please explain how your point connects to my original message: are you arguing with it or supporting it or want to learn how my idea applies to something?

Comment by Q Home on Q Home's Shortform · 2024-11-19T02:23:02.566Z · LW · GW

I see. But I'm not talking about figuring out human preferences, I'm talking about finding world-models in which real objects (such as "strawberries" or "chairs") can be identified. Sorry if it wasn't clear in my original message because I mentioned "caring".

Models or real objects or things capture something that is not literally present in the world. The world contains shadows of these things, and the most straightforward way of finding models is by looking at the shadows and learning from them.

You might need to specify what you mean a little bit.

The most straightforward way of finding a world-model is just predicting your sensory input. But then you're not guaranteed to get a model in which something corresponding to "real objects" can be easily identified. That's one of the main reasons why ELK is hard, I believe: in an arbitrary world-model, "Human Simulator" can be much simpler than "Direct Translator".

So how do humans get world-models in which something corresponding to "real objects" can be easily identified? My theory is in the original message. Note that the idea is not just "predict sensory input", it has an additional twist.

Comment by Q Home on Q Home's Shortform · 2024-11-18T08:05:04.878Z · LW · GW

Creating an inhumanly good model of a human is related to formulating their preferences.

How does this relate to my idea? I'm not talking about figuring out human preferences.

Thus it's a step towards eliminating path-dependence of particular life stories

What is "path-dependence of particular life stories"?

I think things (minds, physical objects, social phenomena) should be characterized by computations that they could simulate/incarnate.

Are there other ways to characterize objects? Feels like a very general (or even fully general) framework. I believe my idea can be framed like this, too.

Comment by Q Home on Q Home's Shortform · 2024-11-17T08:35:53.238Z · LW · GW

There's an alignment-related problem, the problem of defining real objects. Relevant topics: environmental goals; task identification problem; "look where I'm pointing, not at my finger"; The Pointers Problem; Eliciting Latent Knowledge.

I think I realized how people go from caring about sensory data to caring about real objects. But I need help with figuring out how to capitalize on the idea.

So... how do humans do it?

Humans create very small models for predicting very small/basic aspects of sensory input (mini-models).
Humans use mini-models as puzzle pieces for building models for predicting ALL of sensory input.
As a result, humans get models in which it's easy to identify "real objects" corresponding to sensory input.

For example, imagine you're just looking at ducks swimming in a lake. You notice that ducks don't suddenly disappear from your vision (permanence), their movement is continuous (continuity) and they seem to move in a 3D space (3D space). All those patterns ("permanence", "continuity" and "3D space") are useful for predicting aspects of immediate sensory input. But all those patterns are also useful for developing deeper theories of reality, such as atomic theory of matter. Because you can imagine that atoms are small things which continuously move in 3D space, similar to ducks. (This image stops working as well when you get to Quantum Mechanics, but then aspects of QM feel less "real" and less relevant for defining object.) As a result, it's easy to see how the deeper model relates to surface-level patterns.

In other words: reality contains "real objects" to the extent to which deep models of reality are similar to (models of) basic patterns in our sensory input.

Comment by Q Home on Stable Pointers to Value II: Environmental Goals · 2024-11-06T07:46:41.715Z · LW · GW

I don't understand Model-Utility Learning (MUL) section, what pathological behavior does AI do?

Since humans (or something) must be labeling the original training examples, the hypothesis that building bridges means “what humans label as building bridges” will always be at least as accurate as the intended classifier. I don’t mean “whatever humans would label”. I mean they hypothesis that “build a bridge” means specifically the physical situations which were recorded as training examples for this system in particular, and labeled by humans as such.

So it's like overfitting? If I train MUL AI to play piano in a green room, MUL AI learns that "playing piano" means "playing piano in a green room" or "playing piano in a room which would be chosen for training me in the past"?

Now, we might reasonably expect that if the AI considers a novel way of “fooling itself” which hasn’t been given in a training example, it will reject such things for the right reasons: the plan does not involve physically building a bridge.

But "sensory data being a certain way" is a physical event which happens in reality, so MUL AI might still learn to be a solipsist? MUL doesn't guarantee to solve misgeneralization in any way?

If the answer to my questions is "yes", what did we even hope for with MUL?

Comment by Q Home on Being nicer than Clippy · 2024-04-30T08:35:37.616Z · LW · GW

I'm noticing two things:

It's suspicious to me that values of humans-who-like-paperclips are inherently tied to acquiring an unlimited amount of resources (no matter in which way). Maybe I don't treat such values as 100% innocent, so I'm OK keeping them in check. Though we can come up with thought experiments where the urge to get more resources is justified by something. Like, maybe instead of producing paperclips those people want to calculate Busy Beaver numbers, so they want more and more computronium for that.
How consensual were the trades if their outcome is predictable and other groups of people don't agree with the outcome? Looks like coercion.

Comment by Q Home on Examples of Highly Counterfactual Discoveries? · 2024-04-24T06:00:46.888Z · LW · GW

Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me.

The most important thing, I think, is not even hitting the nail on the head, but knowing (i.e. really acknowledging) that a nail can be hit in multiple places. If you know that, the rest is just a matter of testing.

Comment by Q Home on Why I no longer identify as transhumanist · 2024-02-06T09:50:24.619Z · LW · GW

But avoidance of value drift or of unendorsed long term instability of one's personality is less obvious.

What if endorsed long term instability leads to negation of personal identity too? (That's something I thought about.)

Comment by Q Home on AI #27: Portents of Gemini · 2023-12-05T23:33:34.947Z · LW · GW

I think corrigibility is the ability to change a value/goal system. That the literal meaning of the term... "Correctable". If an AI were fully aligned, there would be no need to correct it.

Perhaps I should make a better argument:

It's possible that AGI is correctable, but (a) we don't know what needs to be corrected or (b) we cause new, less noticeable problems, while correcting AGI.

So, I think there's not two assumptions "alignment/interpretability is not solved + AGI is incorrigible", but only one — "alignment/interpretability is not solved". (A strong version of corrigibility counts as alignment/interpretability being solved.)

Yes, and that's the specific argument I am addressing,not AI risk in general. Except that if it's many many times smarter, it's ASI, not AGI.

I disagree that "doom" and "AGI going ASI very fast" are certain (> 90%) too.

Comment by Q Home on AI #27: Portents of Gemini · 2023-12-04T22:41:26.433Z · LW · GW

It's not aligned at every possible point in time.

I think corrigibility is "AGI doesn't try to kill everyone and doesn't try to prevent/manipulate its modification". Therefore, in some global sense such AGI is aligned at every point in time. Even if it causes a local disaster.

Over 90% , as I said

Then I agree, thank you for re-explaining your opinion. But I think other probabilities count as high too.

To me, the ingredients of danger (but not "> 90%") are those:

1st. AGI can be built without Alignment/Interpretability being solved. If that's true, building AGI slowly or being able to fix visible problems may not matter that much.
2nd and 3rd. AGI can have planning ability. AGI can come up with the goal pursuing which would kill everyone.
2nd (alternative). AIs and AGIs can kill most humans without real intention of doing so, by destabilizing the world/amplifying already existing risks.

If I remember correctly, Eliezer also believes in "intelligence explosion" (AGI won't be just smarter than humanity, but many-many times smarter than humanity: like humanity is smarter than ants/rats/chimps). Haven't you forgot to add that assumption?

Comment by Q Home on AI #27: Portents of Gemini · 2023-12-04T02:15:15.674Z · LW · GW

why is “superintelligence + misalignment” highly conjunctive?

In the sense that matters, it needs to be fast, surreptitious, incorrigible, etc.

What opinion are you currently arguing? That the risk is below 90% or something else? What counts as "high probability" for you?

Incorrigible misalignment is at least one extra assumption.

I think "corrigible misalignment" doesn't exist, corrigble AGI is already aligned (unless AGI can kill everyone very fast by pure accident). But we can have differently defined terms. To avoid confusion, please give examples of scenarios you're thinking about. The examples can be very abstract.

If AGI is AGI, there won’t be any problems to notice

Huh?

I mean, you haven't explained what "problems" you're talking about. AGI suddenly declaring "I think killing humans is good, actually" after looking aligned for 1 year? If you didn't understand my response, a more respectful answer than "Huh?" would be to clarify your own statement. What noticeable problems did you talk about in the first place?

Please, proactively describe your opinions. Is it too hard to do? Conversation takes two people.

Comment by Q Home on AI #27: Portents of Gemini · 2023-12-03T00:10:05.933Z · LW · GW

I've confused you with people who deny that a misaligned AGI is even capable of killing most humans. Glad to be wrong about you.

But I am not saying that the doom is unlikely given superintelligence and misalignment, I am saying the argument that gets there -- superintelligence + misalignment -- is highly conjunctive. The final step., the execution as it were, is no highly conjunctive.

But I don't agree that it's highly conjunctive.

If AGI is possible, then its superintelligence is a given. Superintelligence isn't given only if AGI stops at human level of intelligence + can't think much faster than humans + can't integrate abilities of narrow AIs naturally. (I.e. if AGI is basically just a simulation of a human and has no natural advantages.) I think most people don't believe in such AGI.
I don't think misalignment is highly conjunctive.

I agree that hard takeoff is highly conjunctive, but why is "superintelligence + misalignment" highly conjunctive?

I think its needed for the "likely". Slow takeoff gives humans more time to notice and fix problems, so the likelihood of bad outcomes goes down. Wasn't that obvious?

If AGI is AGI, there won't be any problems to notice. That's why I think probability doesn't decrease enough.

...

I hope that Alignment is much easier to solve than it seems. But I'm not sure (a) how much weight to put into my own opinion and (b) how much my probability of being right decreases the risk.

Comment by Q Home on AI #27: Portents of Gemini · 2023-12-01T22:29:28.858Z · LW · GW

Yes, I probably mean something other than ">90%".

[lists of various catastrophes. many of which have nothing to do with AI]

Why are you doing this? I did not say there is zero risk of anything. (...) Are you using "risk" to mean the probability of the outcome , or the impact of the outcome?

My argument is based on comparing the phenomenon of AGI to other dangerous phenomena. The argument is intended to show that bad outcome is likely (if AGI wants to do a bad thing, it can achieve it) and that impact of the outcome can kill most humans.

I think its needed for the "likely". Slow takeoff gives humans more time to notice and fix problems, so the likelihood of bad outcomes goes down. Wasn't that obvious?

To me the likelihood doesn't go down enough (to the tolerable levels).

Comment by Q Home on AI #27: Portents of Gemini · 2023-11-27T05:01:51.965Z · LW · GW

Informal logic is more holistic than not, I think, because it relies on implicit assumptions.

It's not black and white. I don't think they are zero risk, and I don't think it is Certain Doom, so it's not what I am talking about. Why are you bringing it up? Do you think there is a simpler argument for Certain Doom?

Could you proactively describe your opinion? Or re-describe it, by adding relevant details. You seemed to say "if hard takeoff, then likely doom; but hard takeoff is unlikely, because hard takeoff requires a conjunction of things to be true". I answered that I don't think hard takeoff is required. You didn't explain that part of your opinion. Now it seems your opinion is more general (not focused on hard takeoff), but you refuse to clarify it. So, what is the actual opinion I'm supposed to argue with? I won't try to use every word against you, so feel free to write more.

Doom meaning what? It's obvious that there is some level of risk, but some level of risk isn't Certain Doom. Certain Doom is an extraordinary claim,and the burden of proof therefore is on (certain) doomers. But you seem to be switching between different definitions.

I think "AGI is possible" or "AGI can achieve extraordinary things" is the extraordinary claim. The worry about its possible extraordinary danger is natural. Therefore, I think AGI optimists bear the burden of proving that a) likely risk of AGI is bounded by something and b) AGI can't amplify already existing dangers.

By "likely doom" I mean likely (near-)extinction. "Likely" doesn't have to be 90%.

Saying “the most dangerous technology with the worst safety and the worst potential to control it” doesn't actually imply a high level of doom (p>9) or a high level of risk (> 90% dead)-- it's only a relative statement.

I think it does imply so, modulo "p > 90%". Here's a list of the most dangerous phenomena: (L1)

Nuclear warfare. World wars.
An evil and/or suicidal world-leader.
Deadly pandemics.
Crazy ideologies, e.g. fascism. Misinformation. Addictions. People being divided on everything. (Problems of people's minds.)

And a list of the most dangerous qualities: (L2)

Being superintelligent.
Wanting, planning to kill everyone.
Having a cult-following. Humanity being dependent on you.
Having direct killing power (like a deadly pandemic or a set of atomic bombs).
Multiplicity/simultaneity. E.g. if we had TWO suicidal world-leaders at the same time.

Things from L1 can barely scrap two points from L2, yet they can cause mass disruptions and claim many victims and also trigger each other. Narrow AI could secure three points from the list (narrow superintelligence + cult-following, dependency + multiplicity/simultaneity) — weakly, but potentially better than a powerful human ever could. However, AGI can easily secure three points from L3 in full. Four points, if AGI is developed more than in a single place. And I expect you to grant that general superintelligence presents a special, unpredictable danger.

Given that, I don't see what should bound the risk from AGI or prevent it from amplifying already existing dangers.

Comment by Q Home on AI #27: Portents of Gemini · 2023-11-26T02:11:32.707Z · LW · GW

Why ? I'm saying p(doom) is not high. I didn't mention P(otherstuff).

To be able to argue something (/decide how to go about arguing something), I need to have an idea about your overall beliefs.

That doesn't imply a high probability of mass extinction.

Could you clarify what your own opinion even is? You seem to agree that rapid self-improvement would mean likely doom. But you aren't worried about gradual self-improvement or AGI being dangerously smart without much (self-)improvement?

Comment by Q Home on AI #27: Portents of Gemini · 2023-11-25T01:09:52.357Z · LW · GW

I think I have already answered that: I don't think anyone is going to deliberately build something they can't control at all. So the probability of mass extinction depends on creating an uncontrollable superintelligence accidentally-- for instance, by rapid recursive self improvement. And RRSI , AKA Foom Doom, is a conjunction of claims, all of which are p<1, so it is not high probability.

I agree that probability mostly depends on accidental AGI. I don't agree that probability mostly depends on (very) hard takeoff. I believe probability mostly depends on just "AGI being smarter than all of humanity". If you have a kill-switch or whatever, an AGI without Alignment theory being solved is still "the most dangerous technology with the worst safety and the worst potential to control it".

So, could you go into more cruxes of your beliefs, more context? (More or less full context of my own beliefs is captured by the previous comment. But I'm ready to provide more if needed.) To provide more context to your beliefs, you could try answering "what's the worst disaster (below everyone being dead) an AGI is likely to cause" or "what's the best benefit an AGI is likely to give". To make sure you aren't treating an AGI as impotent in negative scenarios and as a messiah in positive scenarios. Or not treating humans as incapable of sinking even a safe non-sentient boat and refusing to vaccinate from viruses.

Comment by Q Home on AI #27: Portents of Gemini · 2023-11-24T02:30:01.406Z · LW · GW

I want to discuss this topic with you iff you're ready to proactively describe the cruxes of your own beliefs. I believe in likely doom and I don't think the burden of proof is on "doomers".

Maybe there just isn't a good argument for Certain Doom (or at least high probability near-extinction). I haven't seen one

What do you expect to happen when you're building uninterpretable technology without safety guarantees, smarter than all of humanity? Looks like the most dangerous technology with the worst safety and the worst potential to control it.

To me, those abstract considerations are enough a) to conclude likely doom and b) to justify common folk in blocking AI capability research — if common folk could do so.

I believe experts should have accountability (even before a disaster happens) and owe some explanation of what they're doing. If an expert is saying "I'm building the most impactful technology without safety but that's suddenly OK this time around because... ... I can't say, you need to be an expert to understand", I think it's OK to not accept the answer and block the research.

Comment by Q Home on [Bias] Restricting freedom is more harmful than it seems · 2023-11-23T07:43:38.899Z · LW · GW

You are correct that critical thinkers may want to censor uncritical thinkers. However, independent-minded thinkers do not want to censor conventional-minded thinkers.

I still don't see it. Don't see a causal mechanism that would cause it. Even if we replace "independent-minded" with "independent-minded and valuing independent-mindedness for everyone". I have the same problems with it as Ninety-Three and Raphael Harth.

To give my own example. Algorithms in social media could be a little too good at radicalizing and connecting people with crazy opinions, such as flat earth. A person censoring such algorithms/their output could be motivated by the desire to make people more independent-minded.

I deliberately avoided examples for the same reason Paul Graham's What You Can't Say deliberately avoids giving any specific examples: because either my examples would be mild and weak (and therefore poor illustrations) or they'd be so shocking (to most people) they'd derail the whole conversation. (comment)

I think the value of a general point can only stem from re-evaluating specific opinions. Therefore, sooner or later the conversation has to tackle specific opinions.

If "derailment" is impossible to avoid, then "derailment" is a part of the general point. Or there are more important points to be discussed. For example, if you can't explain to cave people General Relativity, maybe you should explain "science" and "language" first — and maybe those tangents are actually more valuable than General Relativity.

I dislike Graham's essay for the same reason: when Graham does introduce some general opinions ("morality is like fashion", "censuring is motivated by the fear of free-thinking", "there's no prize for figuring out quickly", "a statement can't be worse than false"), they're not discussed critically, with examples. Re:say looks weird to me. Invisible opponents are allowed to say only one sentence and each sentence gets a lengthy "answer" with more opinions.

Comment by Q Home on [Bias] Restricting freedom is more harmful than it seems · 2023-11-22T10:47:25.365Z · LW · GW

We only censor other people more-independent-minded than ourselves. (...) Independent-minded people do not censor conventional-minded people.

I'm not sure that's true. Not sure I can interpret the "independent/dependent" distinction.

In "weirdos/normies" case, a weirdo can want to censor ideas of normies. For example, some weirdos in my country want to censor LGBTQ+ stuff. They already do.
In "critical thinkers/uncritical thinkers" case, people with more critical thinking may want to censor uncritical thinkers. (I believe so.) For example, LW in particular has a couple of ways to censor someone, direct and indirect.

In general, I like your approach of writing this post like an "informal theorem".

Comment by Q Home on It's OK to be biased towards humans · 2023-11-20T10:16:49.738Z · LW · GW

I tried to describe necessary conditions which are needed for society and culture to exist. Do you agree that what I've described are necessary conditions?

I realize I'm pretty unusual in the regard, which may be biasing my views. However, I think I am possibly evidence against the notion that a desire to leave a mark on the culture is fundamental to human identity

Relevant part of my argument was "if your personality gets limitlessly copied and modified, your personality doesn't exist (in the cultural sense)". You're talking about something different, you're talking about ambitions and desire of fame.

My thesis (to not lose the thread of the conversation):

If human culture and society are natural, then the rights about information are natural too, because culture/society can't exist without them.

Comment by Q Home on It's OK to be biased towards humans · 2023-11-20T08:38:37.190Z · LW · GW

I think we can just judge by the consequences (here "consequences" don't have to refer to utility calculus). If some way of "injecting" art into culture is too disruptive, we can decide to not allow it. Doesn't matter who or how makes the injection.

Comment by Q Home on It's OK to be biased towards humans · 2023-11-20T08:28:54.118Z · LW · GW

To exist — not only for itself, but for others — a consciousness needs a way to leave an imprint on the world. An imprint which could be recognized as conscious. Similar thing with personality. For any kind of personality to exist, that personality should be able to leave an imprint on the world. An imprint which could be recognized as belonging to an individual.

Uncontrollable content generation can, in principle, undermine the possibility of consciousness to be "visible" and undermine the possibility of any kind of personality/individuality. And without those things we can't have any culture or society expect a hivemind.

Are you OK with such disintegration of culture and society?

In general, I think people have a right to hear other people, but not a right to be heard.

To me that's very repugnant, if taken to the absolute. What emotions and values motivate this conclusion? My own conclusions are motivated by caring about culture and society.

Alternatively, it could be the case that the artist has more to say that isn't or can't be expressed by the imitations- other ideas, interesting self expression, and so on- but the imitations prevent people from finding that new work. I think that case is a failure of whatever means people are using to filter and find art. A good social media algorithm or friend group who recommend content to each other should recognize that the inventor of an good idea might invent other good ideas in the future, and should keep an eye out for and platform those ideas if they do.

I was going for something slightly more subtle. Self-expression is about making a choice. If all choices are realized before you have a chance to make them, your ability to express yourself is undermined.

Comment by Q Home on It's OK to be biased towards humans · 2023-11-20T00:56:35.991Z · LW · GW

Thank you for the answer, clarifies your opinion a lot!

Artistic expression, of course, is something very different. I'm definitely going to keep making art in my spare time for the rest of my life, for the sake of fun and because there are ideas I really want to get out. That's not threatened at all by AI.

I think there are some threats, at least hypothetical. For example, the "spam attack". People see that a painter starts to explore some very niche topic — and thousands of people start to generate thousands of paintings about the same very niche topic. And the very niche topic gets "pruned" in a matter of days, long before the painter has said at least 30% of what they have to say. The painter has to fade into obscurity or radically reinvent themselves after every couple of paintings. (Pre-AI the "spam attack" is not really possible even if you have zero copyright laws.)

In general, I believe for culture to exist we need to respect the idea "there's a certain kind of output I can get only from a certain person, even if it means waiting or not having every single of my desires fulfilled" in some way. For example, maybe you shouldn't use AI to "steal" a face of an actor and make them play whatever you want.

Do you think that unethical ways to produce content exist at least in principle? Would you consider any boundary for content production, codified or not, to be a zero-sum competition?

Comment by Q Home on It's OK to be biased towards humans · 2023-11-19T10:03:00.041Z · LW · GW

Maybe I've misunderstood your reply, but I wanted to say that hypothetically even humans can produce art in non-cooperative and disruptive ways, without breaking existing laws.

Imagine a silly hypothetical: one of the best human artists gets a time machine and starts offering their art for free. That artist functions like an image generator. Is such an artist doing something morally questionable? I would say yes.

Comment by Q Home on It's OK to be biased towards humans · 2023-11-19T09:31:13.686Z · LW · GW

Could you explain your attitudes towards art and art culture more in depth and explain how exactly your opinions on AI art follow from those attitudes? For example, how much do you enjoy making art and how conditional is that enjoyment? How much do you care about self-expression, in what way? I'm asking because this analogy jumped out at me as a little suspicious:

And as terrible as this could be for my career, spending my life working in a job that could be automated but isn't would be as soul-crushing as being paid to dig holes and fill them in again. It would be an insultingly transparent facsimile of useful work.

But creative work is not mechanical work, it can't be automated that way, AI doesn't replace you that way. AI doesn't have the model of your brain, it can't make the choices you would make. It replaces you by making something cheaper and on the same level of "quality". It doesn't automate your self-expression. If you care about self-expression, the possibility of AI doesn't have to feel soul-crushing.

I apologize for sounding confrontational. You're free to disagree with everything above. I just wanted to show that the question has a lot of potential nuances.

Comment by Q Home on It's OK to be biased towards humans · 2023-11-19T08:51:22.433Z · LW · GW

I like the angle you've explored. Humans are allowed to care about humans — and propagate that caring beyond its most direct implications. We're allowed to care not only about humans' survival, but also about human art and human communication and so on.

But I think another angle is also relevant: there are just cooperative and non-cooperative ways to create art (or any other output). If AI creates art in non-cooperative ways, it doesn't matter how the algorithm works or if it's sentient or not.

Comment by Q Home on It's OK to be biased towards humans · 2023-11-12T07:21:47.626Z · LW · GW

Thus, it doesn't matter in the least if it stifles human output, because the overwhelming majority of us who don't rely on our artistic talent to make a living will benefit from a post-scarcity situation for good art, as customized and niche as we care to demand.

How do you know that? Art is one of the biggest outlets of human potential; one of the biggest forces behind human culture and human communities; one of the biggest communication channels between people.

One doesn't need to be a professional artist to care about all that.

Comment by Q Home on Open Thread – Autumn 2023 · 2023-11-06T11:10:52.819Z · LW · GW

I think you're going for the most trivial interpretation instead of trying to explore interesting/unique aspects of the setup. (Not implying any blame. And those "interesting" aspects may not actually exist.) I'm not good at math, but not that bad to not know the most basic 101 idea of multiplying utilities by probabilities.

I'm trying to construct a situation (X) where the normal logic of probability breaks down, because each possibility is embodied by a real person and all those persons are in a conflict with each other.

Maybe it's impossible to construct such situation, for example because any normal situation can be modeled the same way (different people in different worlds who don't care about each other or even hate each other). But the possibility of such situation is an interesting topic we could explore.

Here's another attempt to construct "situation X":

We have 100 persons.
1 person has 99% chance to get big reward and 1% chance to get nothing. If they drink.
99 persons each have 0.0001% chance to get big punishment and 99.9999% chance to get nothing.

Should a person drink? The answer "yes" is a policy which will always lead to exploiting 99 persons for the sake of 1 person. If all those persons hate each other, their implicit agreement to such policy seems strange.

Here's an explanation of what I'd like to explore from another angle.

Imagine I have a 99% chance to get reward and 1% chance to get punishment. If I take a pill. I'll take the pill. If we imagine that each possibility is a separate person, this decision can be interpreted in two ways:

1 person altruistically sacrifices their well-being for the sake of 99 other persons.
100 persons each think, egoistically, "I can get lucky". Only 1 person is mistaken.

And the same is true for other situations involving probability. But is there any situation (X) which could differentiate between "altruistic" and "egoistic" interpretations?

Comment by Q Home on Open Thread – Autumn 2023 · 2023-11-05T23:01:07.767Z · LW · GW

For all intents and purposes it's equivalent to say "you have only one shot" and after memory erasure it's not you anymore, but a person equivalent to other version of you next room.

Let's assume "it's not you anymore" is false. At least for a moment (even if it goes against LDT or something else).

Yes, you have a 0.1 chance of being punished. But who cares if they will erase your memory anyway.

Let's assume that the persons do care.

Comment by Q Home on Petrov Day Retrospective, 2023 (re: the most important virtue of Petrov Day & unilaterally promoting it) · 2023-09-28T06:52:16.776Z · LW · GW

To me, the initial poll options make no sense without each other. For example, "avoid danger" and "communicate beliefs" don't make sense without each other [in context of society].

If people can't communicate (report epistemic state), "avoid danger" may not help or be based on 100% biased opinions on what's dangerous.

If some people solve Alignment, but don't communicate, humanity may perish due to not building a safe AGI.
If nobody solves Alignment, but nobody communicates about Alignment, humanity may perish because careless actors build an unsafe AGI without even knowing they do something dangerous.

I like communication, so I chose the second option. Even though "communicating without avoiding danger" doesn't make sense either.

Since the poll options didn't make much sense to me, I didn't see myself as "facing alien values" or "fighting off babyeaters". I didn't press the link, because I thought it may "blow up" the site (similar to the previous Petrov's Day) + I wasn't sure it's OK to click, I didn't think my unilateralism would be analogous to Petrov's unilateralism (did Petrov cure anyone's values, by the way?). I decided it's more Petrov-like to not click.

But is AGI (or anything else) related to the lessons of Petrov's Day? That's another can of worms. I think we should update the lessons of the past to fit the future situations. I think it doesn't make much sense to take away from Petrov's Day only lessons about "how to deal with launching nukes".

Another consideration: Petrov did accurately report his epistemic state. Or would have, if it were needed (if it were needed, he would lie to accurately report his epistemic state - "there are no launches"). Or "he accurately non-reported the non-presence of nuclear missiles".

Comment by Q Home on A Case for AI Safety via Law · 2023-09-22T02:31:46.640Z · LW · GW

Maybe you should edit the post to add something like this:

My proposal is not about the hardest parts of the Alignment problem. My proposal is not trying to solve theoretical problems with Inner Alignment or Outer Alignment (Goodhart, loopholes). I'm just assuming those problems won't be relevant enough. Or humanity simply won't create anything AGI-like (see CAIS).

Instead of discussing the usual problems in Alignment theory, I merely argue X. X is not a universally accepted claim, here's evidence that it's not universally accepted: [write the evidence here].

...

By focusing on the external legal system, many key problems associated with alignment (as recited in the Summary of Argument) are addressed. One worth highlighting is 4.4, which suggests AISVL can assure alignment in perpetuity despite changes in values, environmental conditions, and technologies, i.e., a practical implementation of Yudkowsky's CEV.

I think the key problems are not "addressed", you just assume they won't exist. And laws are not a "practical implementation of CEV".

Comment by Q Home on A Case for AI Safety via Law · 2023-09-20T22:19:58.930Z · LW · GW

Maybe there's a misunderstanding. Premise (1) makes sure that your proposal is different from any other proposal. It's impossible to reject premise (1) without losing the proposal's meaning.

Premise (1) is possible to reject only if you're not solving Alignment but solving some other problem.

I'm arguing for open, external, effective legal systems as the key to AI alignment and safety. I see the implementation/instilling details as secondary. My usage refers to specifying rules/laws/ethics externally so they are available and usable by all intelligent systems.

If an AI can be Aligned externally, then it's already safe enough. It feels like...

You're not talking about solving Alignment, but talking about some different problem. And I'm not sure what that problem is.
For your proposal to work, the problem needs to be already solved. All the hard/interesting parts need to be already solved.

Comment by Q Home on A Case for AI Safety via Law · 2023-09-20T09:37:30.894Z · LW · GW

Perhaps the most important and (hopefully) actionable recommendation of the proposal is in the conclusion:

"For the future safety and wellbeing of all sentient systems, work should occur in earnest to improve legal processes and laws so they are more robust, fair, nimble, efficient, consistent, understandable, accepted, and complied with." (comment)

Sorry for sounding harsh. But to say something meaningful, I believe you have to argue two things:

Laws are distinct enough from human values (1), but following laws / caring about laws / reporting about predicted law violations prevents the violation of human values (2).

I think the post fails to argue both points. I see no argument that instilling laws is distinct enough from instilling values/corrigibility/human semantics in general (1) and that laws actually prevent misalignment (2).

Later I write, "Suggested improvements to law and legal process are mostly beyond the scope of this brief. It is possible, however, that significant technological advances will not be needed for implementing some key capabilities. For example, current Large Language Models are nearly capable of understanding vast legal corpora and making appropriate legal decisions for humans and AI systems (Katz et al., 2023). Thus, a wholesale switch to novel legal encodings (e.g., computational and smart contracts) may not be necessary."

If AI can be just asked to follow your clever idea, then AI is already safe enough without your clever idea. "Asking AI to follow something" is not what Bostrom means by direct specification, as far as I understand.

Comment by Q Home on Which Questions Are Anthropic Questions? · 2023-09-18T13:58:51.719Z · LW · GW

I like how you explain your opinion, very clear and short, basically contained in a single bit of information: "you're not a random sample" or "this equivalence between 2 classes of problems can be wrong".

But I think you should focus on describing the opinion of others (in simple/new ways) too. Otherwise you're just repeating yourself over and over.

If you're interested, I could try helping to write a simplified guide to ideas about anthropics.

User info