Posts

What are the weirdest things a human may want for their own sake? 2024-03-20T11:15:09.791Z
Three Types of Constraints in the Space of Agents 2024-01-15T17:27:27.560Z
'Theories of Values' and 'Theories of Agents': confusions, musings and desiderata 2023-11-15T16:00:48.926Z
Charbel-Raphaël and Lucius discuss Interpretability 2023-10-30T05:50:34.589Z
"Wanting" and "liking" 2023-08-30T14:52:04.571Z
GPTs' ability to keep a secret is weirdly prompt-dependent 2023-07-22T12:21:26.175Z
How do you manage your inputs? 2023-03-28T18:26:36.979Z
Mateusz Bagiński's Shortform 2022-12-26T15:16:17.970Z
Kraków, Poland – ACX Meetups Everywhere 2022 2022-08-24T23:07:07.542Z

Comments

Comment by Mateusz Bagiński (mateusz-baginski) on Ten Levels of AI Alignment Difficulty · 2024-04-26T09:59:33.850Z · LW · GW

Behavioural Safety is Insufficient

Past this point, we assume following Ajeya Cotra that a strategically aware system which performs well enough to receive perfect human-provided external feedback has probably learned a deceptive human simulating model instead of the intended goal. The later techniques have the potential to address this failure mode. (It is possible that this system would still under-perform on sufficiently superhuman behavioral evaluations)

There are (IMO) plausible threat models in which alignment is very difficult but we don't need to encounter deceptive alignment. Consider the following scenario:

Our alignment techniques (whatever they are) scale pretty well, as far as we can measure, even up to well-beyond-human-level AGI. However, in the year (say) 2100, the tails come apart. It gradually becomes pretty clear that what we want our powerful AIs to do and what they actually do don't generalize that well outside of the distribution on which we have been testing them so far. At this point, it is too late to roll them back, e.g. because the AIs have become incorrigible and/or power-seeking. The scenario may also have a more systemic character, with AI having already been so tightly integrated into the economy that there is no "undo button".

This doesn't assume either the sharp left turn or deceptive alignment, but I'd put it at least at level 8 in your taxonomy.

I'd put the scenario from Karl von Wendt's novel VIRTUA into this category.

Comment by Mateusz Bagiński (mateusz-baginski) on Ten Levels of AI Alignment Difficulty · 2024-04-26T09:58:46.902Z · LW · GW
Comment by Mateusz Bagiński (mateusz-baginski) on Examples of Highly Counterfactual Discoveries? · 2024-04-24T12:00:33.340Z · LW · GW

Maybe Hanson et al.'s Grabby aliens model? @Anders_Sandberg said that some N years before that (I think more or less at the time of working on Dissolving the Fermi Paradox), he "had all of the components [of the model] on the table" and it just didn't occur to him that they could be composed in this way (personal communication, so I may be misremembering some details). Although it's less than 10 years, so...

Speaking of Hanson, prediction markets seem like a more central example. I don't think the idea was [inconceivable in principle] 100 years ago.

ETA: I think Dissolving the Fermi Paradox may actually be a good example. Nothing in principle prohibited people puzzling over "the great silence" from using probability distributions instead of point estimates in the Drake equation. Maybe it was infeasible to compute this back in the 1950s/60s, but I guess it should have been doable in the 2000s, and still, the paper was published only in 2017.
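For concreteness, a minimal sketch of the point-estimate-vs-distribution contrast. The parameter ranges below are made up for illustration and are not the ones from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def loguniform(low, high, size):
    # Sample log-uniformly between low and high.
    return np.exp(rng.uniform(np.log(low), np.log(high), size))

# Made-up illustrative ranges for the Drake-equation factors (NOT the paper's).
R_star = loguniform(1, 100, n)       # star formation rate (per year)
f_p    = loguniform(0.1, 1, n)       # fraction of stars with planets
n_e    = loguniform(0.1, 10, n)      # habitable planets per such star
f_l    = loguniform(1e-30, 1, n)     # fraction of those that develop life
f_i    = loguniform(1e-10, 1, n)     # fraction of those that develop intelligence
f_c    = loguniform(1e-2, 1, n)      # fraction that become detectable
L      = loguniform(1e2, 1e9, n)     # years a civilization stays detectable

N = R_star * f_p * n_e * f_l * f_i * f_c * L

# Point-estimate style: plug one representative value (here, the mean) in for each factor.
N_point = R_star.mean() * f_p.mean() * n_e.mean() * f_l.mean() * f_i.mean() * f_c.mean() * L.mean()
print(f"N with a point estimate per parameter: {N_point:.2e}")

# Distribution style: propagate the uncertainty and ask how much mass sits on "we're alone".
print(f"P(N < 1): {(N < 1).mean():.2f}")
```

With wide, heavy-tailed uncertainty on the biological factors, the point-estimate calculation can suggest a galaxy full of civilizations while most of the probability mass still sits on N < 1, which is the qualitative shape of the paper's result.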

Comment by Mateusz Bagiński (mateusz-baginski) on Tamsin Leake's Shortform · 2024-04-22T16:26:24.174Z · LW · GW

Taboo "evil" (locally, in contexts like this one)?

Comment by Mateusz Bagiński (mateusz-baginski) on shortplav · 2024-04-21T13:47:43.051Z · LW · GW

If you want to use it for ECL, then it's not clear to me why internal computational states would matter.

Comment by Mateusz Bagiński (mateusz-baginski) on FHI (Future of Humanity Institute) has shut down (2005–2024) · 2024-04-19T04:14:36.848Z · LW · GW

Why did FHI get closed down? In the end, because it did not fit in with the surrounding administrative culture. I often described Oxford like a coral reef of calcified institutions built on top of each other, a hard structure that had emerged organically and haphazardly and hence had many little nooks and crannies where colorful fish could hide and thrive. FHI was one such fish but grew too big for its hole. At that point it became either vulnerable to predators, or had to enlarge the hole, upsetting the neighbors. When an organization grows in size or influence, it needs to scale in the right way to function well internally – but it also needs to scale its relationships to the environment to match what it is.

Comment by Mateusz Bagiński (mateusz-baginski) on An ethical framework to supersede Utilitarianism · 2024-04-17T18:19:22.185Z · LW · GW

Reminds me of https://atheistethicist.blogspot.com/2011/12/basic-review-of-desirism.html?m=1

Comment by Mateusz Bagiński (mateusz-baginski) on Generalized Stat Mech: The Boltzmann Approach · 2024-04-13T15:28:59.698Z · LW · GW

I don't quite get what actions are available in the heat engine example.

Is it just choosing a random bit from H or C (in which case we can't see whether it's 0 or 1) OR a specific bit from W (in which case we know whether it's 0 or 1) and moving it to another pool?
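For concreteness, here is a toy formalization of the two action types as I'm reading them (purely my own sketch of the question, not anything from the post):

```python
import random

class BitPools:
    """Toy model of the two candidate action types (my reading, not from the post)."""

    def __init__(self, hot, cold, work):
        self.pools = {"H": list(hot), "C": list(cold), "W": list(work)}

    def move_random(self, src, dst):
        """Move a uniformly random bit from H or C; its value is never revealed."""
        assert src in ("H", "C")
        bit = self.pools[src].pop(random.randrange(len(self.pools[src])))
        self.pools[dst].append(bit)
        return None  # the agent doesn't get to see the bit

    def move_known(self, src, index, dst):
        """Move a specific bit from W; the agent knows (and may condition on) its value."""
        assert src == "W"
        bit = self.pools[src].pop(index)
        self.pools[dst].append(bit)
        return bit

# Example: a high-entropy hot pool, a cold pool, and an all-zeros work pool.
engine = BitPools(hot=[0, 1, 1, 0, 1], cold=[0, 0, 0, 1], work=[0, 0, 0])
engine.move_random("H", "C")            # returns None: the bit's value stays hidden
known = engine.move_known("W", 0, "C")  # returns the bit, whose value the agent knows
```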

Comment by Mateusz Bagiński (mateusz-baginski) on Open Thread Spring 2024 · 2024-04-10T05:52:44.125Z · LW · GW

Any thoughts on Symbolica? (or "categorical deep learning" more broadly?)

All current state of the art large language models such as ChatGPT, Claude, and Gemini, are based on the same core architecture. As a result, they all suffer from the same limitations.

Extant models are expensive to train, complex to deploy, difficult to validate, and infamously prone to hallucination. Symbolica is redesigning how machines learn from the ground up. 

We use the powerfully expressive language of category theory to develop models capable of learning algebraic structure. This enables our models to have a robust and structured model of the world; one that is explainable and verifiable.

It’s time for machines, like humans, to think symbolically.

  1. How likely is it that Symbolica [or sth similar] produces a commercially viable product?
  2. How likely is it that Symbolica creates a viable alternative to current/classical DL?
    1. I don't think it's that different from the intentions behind Conjecture's CoEms proposal. (And it looks like Symbolica have more theory and experimental results backing up their ideas.)
      1. Symbolica don't use the framing of AI [safety/alignment/X-risk], but many people behind the project are associated with the Topos Institute that hosted some talks from e.g. Scott Garrabrant or Andrew Critch.
  3. What is the expected value of their research for safety/verifiability/etc?
    1. Sounds relevant to @davidad's plan, so I'd be especially curious to know his take.
  4. How likely is it that whatever Symbolica produces meaningfully contributes to doom (e.g. by advancing capabilities research without at the same time sufficiently/differentially advancing interpretability/verifiability of AI systems)?

(There's also PlantingSpace but their shtick seems to be more "use probabilistic programming and category theory to build a cool Narrow AI-ish product" whereas Symbolica want to use category theory to revolutionize deep learning.)

Comment by Mateusz Bagiński (mateusz-baginski) on Ontological Crisis in Humans · 2024-04-08T04:46:00.106Z · LW · GW

I'm not aware of any, but you may call it "hybrid ontologies" or "ontological interfacing".

Comment by Mateusz Bagiński (mateusz-baginski) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-06T17:12:52.241Z · LW · GW

There is an unsolved meta-problem but the meta-problem is an easy problem.

Comment by Mateusz Bagiński (mateusz-baginski) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-06T17:12:00.850Z · LW · GW

(I skipped straight to ch7, according to your advice, so I may be missing relevant parts from the previous chapters if there are any.)

I probably agree with you on the object level regarding phenomenal consciousness.

That being said, I think it's "more" than a meme. I witnessed at least two people not exposed to the scientific/philosophical literature on phenomenal consciousness reinvent/rediscover the concept on their own.

It seems to me that the first-person perspective we necessarily adopt inclines us to ascribe to sensations/experiences some ineffable, seemingly irreducible quality. My guess is that we (re)perceive our perception as a meta-modality different from ordinary modalities like vision, hearing, etc., and that causes the illusion. It's plausible that being raised in a WEIRD culture contributes to that inclination.

A butterfly conjecture: While phenomenal consciousness is an illusion, there is something to be said about the first-person perspective being an interesting feature of some minds (sufficiently sophisticated? capable of self-reflection?). It can be viewed as a computational heuristic that makes you "vulnerable" to certain illusions or biases, such as phenomenal consciousness, but also:

  • the difficulty of accepting one-boxing in Newcomb's problem
  • mind-body dualism
  • the naive version of the free will illusion, and the difficulty of accepting physicalism/determinism
  • (maybe) the illusion of being in control over your mind (various sources say that meditation-naive people are often surprised to discover how little control they have over their own mind when they first try meditation)

A catchy term for this line of investigation could be "computational phenomenology".

Comment by Mateusz Bagiński (mateusz-baginski) on The lattice of partial updatelessness · 2024-04-05T16:55:10.046Z · LW · GW

Wouldn't total updatelessness amount to constantly taking one action? If not, I'm missing something important and would appreciate an explanation of what it is that I'm missing.

Comment by Mateusz Bagiński (mateusz-baginski) on What is the purpose and application of AI Debate? · 2024-04-04T04:01:30.250Z · LW · GW

My impression was that people stopped seriously working on debate a few years ago

ETA: I was wrong

Comment by Mateusz Bagiński (mateusz-baginski) on Co-Proofs · 2024-04-02T13:23:37.415Z · LW · GW

I think a proof of not-not-X[1] is more apt to be called "a co-proof of X" (which implies X if you [locally] assume the law of the excluded middle).


  1. Or, weaker, very strong [evidence of]/[argument for] not-not-X. ↩︎
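For reference, the asymmetry in a minimal Lean sketch (standard facts, nothing specific to the post): a proof of X always yields a co-proof constructively, while recovering X from a co-proof needs a classical principle such as the law of the excluded middle.

```lean
-- Double-negation introduction is constructively valid: a proof of X is,
-- in particular, a "co-proof" of X in the sense above.
theorem dn_intro (X : Prop) (x : X) : ¬¬X :=
  fun nx => nx x

-- The converse (co-proof ⇒ proof) needs a classical principle; here Lean's
-- `Classical.byContradiction`, which rests on the law of the excluded middle.
theorem dn_elim (X : Prop) (nnx : ¬¬X) : X :=
  Classical.byContradiction nnx
```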

Comment by Mateusz Bagiński (mateusz-baginski) on Is requires ought · 2024-04-02T09:43:10.264Z · LW · GW

If a reasonable agent expects itself to perform some function satisfactorily, then according to that agent, that agent ought to perform that function satisfactorily.

 

[this] is somewhat subtle. If I use a fork as a tool, then I am applying an "ought" to the fork; I expect it ought to function as an eating utensil. Similar to using another person as a tool (alternatively "employee" or "service worker"), giving them commands and expecting that they ought to follow them.

Can you taboo ought? I think I could rephrase these as:

  • I am trying to use a fork as an eating utensil because I expect that if I do, it will function like I expect eating utensils to function.
  • I am giving a person commands because I expect that if I do, they will follow my commands. (Which is what I want.)

More generally, there's probably a difference between oughts like "I ought to do X" and oughts that could be rephrased in terms of conditionals, e.g.

"I believe there's a plate in front of me because my visual system is a reliable producer of visual knowledge about the world."

to

"Conditional on my visual system being a reliable producer of visual knowledge about the world, I believe there's a plate in front of me and because I believe a very high credence in the latter, I have a similarly high credence in the former."

Comment by Mateusz Bagiński (mateusz-baginski) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T12:09:53.667Z · LW · GW

Maybe it's a good time to make something like a semi-official/curated LW playlist? Do we have enough material for that? Aside from this album and the foreign aid song, I only recall a song about killing the dragon (as an analogy for defeating death), but I can't find it right now.

Comment by Mateusz Bagiński (mateusz-baginski) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T09:20:59.739Z · LW · GW

The Litany of Tarrrrski is beyond wholesome!

Thank you for doing this!

Comment by Mateusz Bagiński (mateusz-baginski) on [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate · 2024-03-31T09:18:42.223Z · LW · GW

Ultimately was performing

missing subject, who was performing? I guess WIV?

Comment by Mateusz Bagiński (mateusz-baginski) on Timaeus's First Four Months · 2024-03-30T09:15:13.714Z · LW · GW

Generalization, from thermodynamics to statistical physics (Hoogland 2023) — A review of and introduction to generalization theory from classical to contemporary. 

FYI this link is broken

Comment by Mateusz Bagiński (mateusz-baginski) on AI Alignment Metastrategy · 2024-03-29T07:46:25.436Z · LW · GW

For people who (like me immediately after reading this reply) are still confused about the meaning of "humane/acc", the header photo of Critch's X profile is reasonably informative

 

Comment by Mateusz Bagiński (mateusz-baginski) on Open Thread Spring 2024 · 2024-03-29T07:40:39.722Z · LW · GW

I have the mild impression that Jacqueline Carey's Kushiel trilogy is somewhat popular in the community?[1] Is it true and if so, why?

  1. ^

    E.g. Scott Alexander references Elua in Meditations on Moloch and I know of at least one prominent LWer who was a big enough fan of it to reference Elua in their Discord handle.

Comment by Mateusz Bagiński (mateusz-baginski) on Linda Linsefors's Shortform · 2024-03-28T10:36:09.736Z · LW · GW

Moreover, legal texts are not super strict (much is left to interpretation) and we are often selective about "whether it makes sense to apply this law in this context" for reasons not very different from religious people being very selective about following the laws of their holy books.

Comment by Mateusz Bagiński (mateusz-baginski) on Have we really forsaken natural selection? · 2024-03-27T07:00:21.382Z · LW · GW

I fully agree that something like persistence/[continued existence in ~roughly the same shape] is the most natural/appropriate/joint-carving way to think about whatever-natural-selection-is-selecting-for in its full generality. (At least that's the best concept I know at the moment.)

(Although there is still some sloppiness in what it means for a thing at time t0 to be "the same" as some other thing at time t1.)

This view is not entirely novel, see e.g., Bouchard's PhD thesis (from 2004) or the SEP entry on "Fitness" (ctrl+F "persistence").

I also agree that [humans are]/[humanity is] obviously massively successful on that criterion.

I'm very uncertain as to what implications this has for AI alignment.

Comment by Mateusz Bagiński (mateusz-baginski) on otto.barten's Shortform · 2024-03-26T11:43:31.550Z · LW · GW

I think it might have been kinda the other way around. We wanted to systematize (put on a firm, principled grounding) a bunch of related stuff like care-based ethics, individuality, identity (and the void left by the abandonment of the concept of "soul"), etc, and for that purpose, we coined the concept of (phenomenal) consciousness.

Comment by Mateusz Bagiński (mateusz-baginski) on What's Hard About The Shutdown Problem · 2024-03-26T09:17:48.397Z · LW · GW

My best and only guess is https://www.philiptrammell.com/

Comment by Mateusz Bagiński (mateusz-baginski) on Some Rules for an Algebra of Bayes Nets · 2024-03-25T08:28:05.445Z · LW · GW

I think the parentheses are off here. IIUC you want to express the equality of divergences, not divergences multiplied by probabilities (which wouldn't make sense I think).

Typo:  -> 

Comment by Mateusz Bagiński (mateusz-baginski) on Natural Latents: The Concepts · 2024-03-24T09:12:11.048Z · LW · GW

In the Alice and Bob example, suppose there is a part (call it X) of the image that was initially painted green but where both Alice and Bob painted purple. Would that mean that the part of the natural latent over images A and B corresponding to X should be purple?

Comment by Mateusz Bagiński (mateusz-baginski) on Open Thread Spring 2024 · 2024-03-22T19:36:27.566Z · LW · GW

Bug report: I got notified about Joe Carlsmith's most recent post twice, the second time after ~4 hours

Comment by Mateusz Bagiński (mateusz-baginski) on AI Alignment Metastrategy · 2024-03-22T15:30:10.231Z · LW · GW

Can you link to what "h/acc" is about/stands for?

Comment by Mateusz Bagiński (mateusz-baginski) on Richard Ngo's Shortform · 2024-03-22T11:40:35.400Z · LW · GW

Can new traders be "spawned"?

Comment by Mateusz Bagiński (mateusz-baginski) on Counting arguments provide no evidence for AI doom · 2024-03-22T09:49:30.592Z · LW · GW

I'd actually love to read a dialogue on this topic between the two of you.

Comment by Mateusz Bagiński (mateusz-baginski) on Counting arguments provide no evidence for AI doom · 2024-03-22T09:07:10.842Z · LW · GW

The principle fails even in these simple cases if we carve up the space of outcomes in a more fine-grained way. As a coin or a die falls through the air, it rotates along all three of its axes, landing in a random 3D orientation. The indifference principle suggests that the resting states of coins and dice should be uniformly distributed between zero and 360 degrees for each of the three axes of rotation. But this prediction is clearly false: dice almost never land standing up on one of their corners, for example.

 

The only way I can parse this is that you are conflating (1) the position of a die/coin when it makes contact with the ground and (2) its position when it stabilizes/[comes to rest]. A die/coin can be in any position when it touches the ground, but the vast majority of those positions are unstable, so it doesn't remain in them for long.
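A toy simulation of the distinction (my own sketch; "relax to the face whose normal points most upward" is a crude stand-in for the real settling dynamics):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation():
    # Uniform random rotation: QR of a Gaussian matrix, with sign fixes.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # ensure a proper rotation (det = +1)
    return q

# The six face normals of a die.
faces = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]])
up = np.array([0.0, 0.0, 1.0])
landed_on_face = np.zeros(6, dtype=int)

for _ in range(10_000):
    R = random_rotation()
    # Contact orientation: uniform over all 3D orientations (indifference is fine here).
    rotated_normals = faces @ R.T
    # Crude settling: the die relaxes to whichever face normal is most aligned with "up".
    landed_on_face[np.argmax(rotated_normals @ up)] += 1

print(landed_on_face / landed_on_face.sum())  # roughly uniform over the 6 faces; never "on a corner"
```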

Comment by Mateusz Bagiński (mateusz-baginski) on Counting arguments provide no evidence for AI doom · 2024-03-22T08:59:51.370Z · LW · GW

More generally, John Miller and colleagues have found training performance is an excellent predictor of test performance, even when the test set looks fairly different from the training set, across a wide variety of tasks and architectures.

 

Counterdatapoint to [training performance being an excellent predictor of test performance]: in this paper, GPT-3 was fine-tuned to multiply "small" (e.g., 3-digit by 3-digit) numbers, which didn't generalize to multiplying bigger numbers.

Comment by Mateusz Bagiński (mateusz-baginski) on What are the weirdest things a human may want for their own sake? · 2024-03-21T04:35:10.314Z · LW · GW

Yeah, that's interesting... unlike fetishes and math, this is something other animals should (?) in principle be capable of but apparently it's a uniquely human thing.

Comment by Mateusz Bagiński (mateusz-baginski) on What are the weirdest things a human may want for their own sake? · 2024-03-20T18:58:27.342Z · LW · GW

Nah, IMO it's a straightforward extrapolation of some subset of normal human values; not that different from what I would do

Comment by Mateusz Bagiński (mateusz-baginski) on What are the weirdest things a human may want for their own sake? · 2024-03-20T18:25:19.954Z · LW · GW

How do you define/measure "weird" (and strength of "want", for that matter)?

I don't have anything more concrete than "seemingly not in the category of things humans tend to intrinsically want". ¯\_(ツ)_/¯

To try to give a concrete answer, I'd say suicide by an otherwise-healthy human is the weirdest desire I know of.

Yeah, that's a good example and it brought to my mind the obvious-in-retrospect [body integrity dysphoria]/xenomelia, where (otherwise seemingly psychologically normal?) people want to get rid of some part of their body. (I haven't looked into it that much, but AFAIR it's probably something going wrong in a fairly specific way with the body schema?)

Comment by Mateusz Bagiński (mateusz-baginski) on What are the weirdest things a human may want for their own sake? · 2024-03-20T17:45:09.444Z · LW · GW

I tentatively agree with this view.

This still leaves open the question: "What are some uncommon/peculiar attractor states corresponding to [people seemingly terminally valuing 'weird' things]?".

Comment by Mateusz Bagiński (mateusz-baginski) on What are the weirdest things a human may want for their own sake? · 2024-03-20T12:04:45.231Z · LW · GW

Yeah, I linked a new version of this plot in the OP.

Comment by Mateusz Bagiński (mateusz-baginski) on What are the weirdest things a human may want for their own sake? · 2024-03-20T12:03:58.454Z · LW · GW

Something like "terminal"/"intrinsic", i.e. not in service of any other desire.

ETA: the terminal/instrumental distinction probably doesn't apply cleanly to humans but think of the difference between Alice who reads book XYZ because she really likes XYZ (in the typical ways humans like books) and Bob who reads the same book only to impress Charlie.

Comment by Mateusz Bagiński (mateusz-baginski) on Neuroscience and Alignment · 2024-03-19T14:51:59.920Z · LW · GW

because in those worlds all computations in the brain are necessary to do a "human mortality"

I think you meant "human morality"

Comment by Mateusz Bagiński (mateusz-baginski) on Hints about where values come from · 2024-03-18T16:32:50.595Z · LW · GW

Can you give pointers to where Quine and Nozick talk about this?

Comment by Mateusz Bagiński (mateusz-baginski) on Hints about where values come from · 2024-03-18T16:30:50.617Z · LW · GW

"All my values are implicit, explicit labels are just me attempting to name a feeling. The ground truth is the feeling."

Can you elaborate? (I don't have a specific question, just double-clicking, asking for more detail or rephrasing that uses other concepts.)

Comment by Mateusz Bagiński (mateusz-baginski) on Celiefs · 2024-03-17T07:04:02.513Z · LW · GW

Alief - I don't really "believe X in my head" but act as if I believed X.

Anti-alief - I really "believe X in my head" but don't act as if I believed X. (That's at least one way of reversing it.) (Although this could also be seen as "alieving not-X".)

Celief (AFAIU) - I have good reasons to have high credence in X but can't see the gears/am not sufficiently capable of assessing the claims on my own, so I have a substantial model uncertainty. This may be accompanied by different degrees of "acting as if believing X/not-X (with some probability)".

Comment by Mateusz Bagiński (mateusz-baginski) on Open Thread Spring 2024 · 2024-03-17T04:19:43.090Z · LW · GW

Has anybody tried to estimate how prevalent sexual abuse is in EA circles/orgs compared to the general population?

Comment by Mateusz Bagiński (mateusz-baginski) on Open Thread Spring 2024 · 2024-03-15T13:22:43.444Z · LW · GW

Does anybody know what happened to Julia Galef?

Comment by Mateusz Bagiński (mateusz-baginski) on Insulate your ideas · 2024-03-15T09:37:21.603Z · LW · GW

“The first thing I do is run faster on the uphills. Everyone else slows down because of how difficult it is, and this is a great time to gain ground. The second thing I do is run faster on the downhills. Everyone else slows down because of how fast you go anyways on the slope down, and this is a great time to gain ground.”

 

Conjecture from personal life: one reason I almost never (maybe once a year?) get noticeably/symptomatically sick may be that when I do get sick, I don't perceive it as a reason to give myself a break, work less, skip exercise, etc. Maybe "the body learns that getting sick doesn't pay off, so why get sick?". There is some evidence that the onset of sickness symptoms is partially neurally mediated, so this kind of learning is at least plausible (ref).

(obviously, the law of equal and opposite advice applies here)

Comment by Mateusz Bagiński (mateusz-baginski) on What could a policy banning AGI look like? · 2024-03-14T19:15:19.778Z · LW · GW

I expect AGI within 5 years. I give it a 95% chance that if an AGI is built, it will self-improve and wipe out humanity. In my view, the remaining 5% depends very little on who builds it. Someone who builds AGI while actively trying to end the world has almost exactly as much chance of doing so as someone who builds AGI for any other reason.

There is no "good guy with an AGI" or "marginally safer frontier lab." There is only "oops, all entity smarter than us that we never figured out how to align or control."

So what do you allocate the remaining 5% to? No matter who builds the AGI, there's a 5% chance that it doesn't wipe out humanity because... what? (Or is it just model uncertainty?)

Comment by Mateusz Bagiński (mateusz-baginski) on Fixing The Good Regulator Theorem · 2024-03-13T10:26:16.891Z · LW · GW

I asked Abram about this live here https://youtu.be/zIC_YfLuzJ4?si=l5xbTCyXK9UhofIH&t=5546

Comment by Mateusz Bagiński (mateusz-baginski) on AISC team report: Soft-optimization, Bayes and Goodhart · 2024-03-11T09:08:46.374Z · LW · GW

People have already done a fair bit of work on this in RL in terms of 'cautious' RL which tries to take into account uncertainty in the world model to avoid accidentally falling into traps in the environment.

I would appreciate some pointers to resources.