Posts

Simon Skade's Shortform 2022-11-25T11:50:41.595Z
Clarifying what ELK is trying to achieve 2022-05-21T07:34:13.347Z
Hammertime Final Exam: Pledges, Activation Energy and Evaluating Productivity 2021-06-20T23:59:25.797Z

Comments

Comment by Towards_Keeperhood (Simon Skade) on Natural Latents: The Math · 2024-03-04T18:23:13.753Z · LW · GW

First a note:

the two chunks are independent given the pressure and temperature of the gas

I'd be careful here: If the two chunks of gas are in a (closed) room which e.g. was previously colder on one side and warmer on the other and then equilibrated to the same temperature everywhere, the space of microscopic states it can have evolved into is much smaller than the space of microscopic states that meet the temperature and pressure requirements (since the initial entropy was lower and physics is deterministic). Therefore in this case (or generally in cases in our simple universe, rather than in thought experiments where states are randomly sampled) a hypercomputer could see more mutual information between the chunks of gas than just pressure and temperature. I wouldn't call the chunks approximately independent either; the point is that we, with our bounded intellects, are not able to keep track of the other mutual information.
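To put the worry in symbols (a rough sketch using only the standard information-theoretic definitions, not a careful statistical-mechanics argument), write X_1, X_2 for the microstates of the two chunks and T, P for temperature and pressure:

```latex
% Under the maximum-entropy (equilibrium) distribution consistent with (T,P):
I(X_1; X_2 \mid T, P) \approx 0 .
% But the actual distribution is the deterministic image of a lower-entropy initial
% state, so H(X_1, X_2 \mid T, P) is smaller than the max-entropy value, while the
% marginal entropies H(X_i \mid T, P) need not drop by as much; then generically
I(X_1; X_2 \mid T, P) = H(X_1 \mid T, P) + H(X_2 \mid T, P) - H(X_1, X_2 \mid T, P) > 0 .
```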

Main comment:

(EDIT: I might've misunderstood the motivation behind natural latents in what I wrote below.)

I assume you want to use natural latents to formalize what a natural abstraction is.

The "Λ induces independence between all X_i" criterion seems too strong to me.
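For reference, here is how I'm reading the two conditions (a sketch of my understanding; please correct me if I'm misstating the post), with Λ the latent and X_1, ..., X_n the individual variables:

```latex
% Mediation: the latent screens off the variables from each other,
I(X_i; X_j \mid \Lambda) \approx 0 \quad \text{for all } i \neq j .
% Redundancy / insensitivity: dropping any single variable barely changes the latent,
I(\Lambda; X_i \mid X_{\neq i}) \approx 0 \quad \text{for each } i .
```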

IIUC, you want that if we have an abstraction like "human", all the individual humans share approximately no mutual information conditioned on the "human" abstraction.
Obviously, there are subclusters of humans (e.g. women, children, Ashkenazi Jews, ...) where members share more properties (which I'd say is the relevant sense of "mutual information" here) than the properties that are universally shared among humans.
So given what I intuitively want the "human" abstraction to predict, there would be lots of mutual information between many humans.
However, (IIUC,) your definition of natural latents permits there to be waaayyy more information encoded in the "human" abstraction, s.t. it can predict all the subclusters of humans that exist on earth, since it only needs to be insensitive to removing one particular human from the dataset. This complex human abstraction does render all individual humans approximately independent, but I would say this abstraction seems very ugly and not what I actually want.

I don't think we need this conditional independence condition, but rather something else that finds clusters of thingies which share an unusually large amount of (relevant) mutual information.
I like to think of abstractions as similarity clusters. I think it would be nice if we found a formalization of what a cluster of thingies is without needing to postulate an underlying thingspace / space of possible properties, and instead found a natural definition of "similarity cluster" based on (relevant) mutual information. But I'm not sure, I haven't really thought about it.

(But possibly I misunderstood sth. If it already exists, feel free to invite me to a draft of the conceptual story behind natural latents.)

Comment by Towards_Keeperhood (Simon Skade) on Scale Was All We Needed, At First · 2024-02-18T17:00:33.563Z · LW · GW

Amazing story! Kudos for writing this.

I think stories may be a promising angle for making people (especially AI researchers) understand AI x-risk (on more of a gut level so they realize it actually binds to reality).

The end didn't seem that realistic to me though. Or at least, I don't expect ALICE would seek to fairly trade with humanity, though it's not impossible that it'd call the president pretending to want to trade. Not sure what your intent when writing was, but I'd guess most people will read it the first way. Compute is not a (big) bottleneck for AI inference. Even if humanity coordinated successfully to shut down large GPU clusters and supercomputers, it seems likely that ALICE could copy itself onto tens or hundreds of millions of devices (and humanity seems much too badly coordinated to be able to shut down 99.99% of those) to get many extremely well coordinated copies, and at ALICE's intelligence level this seems sufficient to achieve supreme global dominance within weeks (or months if I'm being conservative), even if it couldn't get smarter. E.g. it could at least do lots and lots of social engineering and manipulation to prevent humanity from effectively coordinating against it, spark wars and civil wars, make governments and companies decide to manufacture war drones (which ALICE can later hack), influence war decisions toward higher destructiveness, use war drones to threaten people at important junctions into doing what it wants, and so on. (Sparking multiple significant wars within weeks seems totally possible at that level of intelligence and resources. Seems relatively obvious to me, but I can try to argue the point if needed. (Though I'm not sure how convincingly. Most people seem to me to not be nearly able to imagine what e.g. 100 copies of Eliezer Yudkowsky could do if they could all think at peak performance 24/7. Once you reach that level with something that can rewrite its own mind you don't get slow takeoff, but nvm, that's an aside.))

Comment by Towards_Keeperhood (Simon Skade) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-30T18:55:46.998Z · LW · GW

I'm not sure I understood how 2 is different from 1.

(1) is the problem that utility rebinding might just not happen properly by default. An extreme example is how AIXI-atomic fails here. Intuitively I'd guess that once the AI is sufficiently smart and self-reflective, it might just naturally see the correspondence between the old and the new ontology and rebind values accordingly. But before that point it might suffer significant value drift. (E.g. if it valued warmth and then learns that there are actually just moving particles, it might just drop that value shard because it thinks there's no such (ontologically basic) thing as warmth.)

(2) is the problem that the initial ontology of the AI is insufficient to fully capture human values, so if you only specify human values as well as possible in that ontology, it would still lack the underlying intuitions humans would use to rebind their values and might rebind differently. Aka while I think many normal abstractions we use like "tree" are quite universal natural abstractions where the rebinding is unambiguous, many value-laden concepts like "happiness" are much less natural abstractions for non-human minds and it's actually quite hard to formally pin down what we value here. (This problem is human-value-specific and perhaps less relevant if you aim the AI at a pivotal act.)

When I try to think about this kind of problem I get stuck on not understanding the details of how an ontology/worldmodel can or should work.

Not sure if this helps, but I heard that Vivek's group came up with the same diamond maximizer proposal as I did, so if you remember it, you can use it as a simple toy frame to think about rebinding. But sure, we need a much better frame for thinking about the AI's world model.

Comment by Towards_Keeperhood (Simon Skade) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-28T22:07:13.029Z · LW · GW

This is an amazing report!

Your taxonomy in section 4 was new and interesting to me. I would also mention the utility rebinding problem, that goals can drift because the AI's ontology changes (e.g. because it figures out deeper understanding in some domain). I guess there are actually two problems here:

  1. Formalizing the utility rebinding mechanism so that concepts get rebound to the corresponding natural abstractions of the new deeper ontology.
  2. For value-laden concepts the AI likely lacks the underlying human intuitions for figuring out how the utility ought to be rebound. (E.g. when we have a concept like "conscious happiness", and the AI finds out what cognitive processes in our brains are associated with it, it may be ambiguous whether to rebind the concept to the existence of thoughts like 'I notice the thought "I notice the thought <expected utility increase>"' running through a mind/brain, or whether to rebind it in a way that includes a cluster of sensations (e.g. tensions in our face from laughter) that are present in our minds/brains (or other options). (Sorry, maybe a bad example; it might require some context on my fuzzy thoughts on qualia, which might actually be wrong.))
Comment by Towards_Keeperhood (Simon Skade) on A Shutdown Problem Proposal · 2024-01-22T18:10:14.009Z · LW · GW

Thanks.

I briefly looked into the MIRI paper (and the section from Eliezer's lecture that starts at 22min) again.

My main guess now is that you're not supposed to make the agent act on false beliefs; that one of the desiderata of the shutdown problem is that your agent shouldn't have a capability penalty, including taking bad actions based on false beliefs (like taking bad bets, as jessicata mentioned). Your proposal violates that.

The case MIRI considered wasn't to modify the world model so the agent believes that the button cannot be pressed, but instead to construct the utility function in a way that the resulting behavior ends up equivalent, but without the agent having wrong beliefs.

Comment by Towards_Keeperhood (Simon Skade) on A Shutdown Problem Proposal · 2024-01-21T21:42:04.324Z · LW · GW

To clarify:

Subagent2 assumes the shutdown button will be pressed independent of upstream events, aka also independent of whether the shutdown button got removed. It would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn't bother to remove the shutdown button because it believes it won't be pressed anyway.

(The usefulness of subagent2 comes from it preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)

Question 1:

I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed, and iirc the problem was that the agent would use the button as an outcome pump by creating mechanisms that would cause the button to be pressed if something went worse than expected, thereby magically decreasing the probability that something goes badly, since the button cannot be pressed.
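Here is a minimal toy version of that failure mode as I remember it (my own sketch, with made-up numbers and a made-up mechanism, so possibly not faithful to the case MIRI actually analyzed):

```python
# An agent that is certain the shutdown button will never be pressed evaluates
# plans conditional on "button not pressed". If it can install a mechanism
# "press the button iff the outcome is bad", then in its own model the bad
# outcome inherits the button's probability 0: the button acts as an outcome pump.

P_BAD = 0.3  # agent's prior probability that the outcome of its plan is bad

def p_bad_given_no_press(install_mechanism: bool) -> float:
    """P(outcome is bad | button not pressed) under the agent's (false) model."""
    if not install_mechanism:
        # Without the mechanism, the button is unrelated to the outcome,
        # so conditioning on "no press" changes nothing.
        return P_BAD
    # With the mechanism, "bad outcome" implies "button pressed", so
    # P(bad | no press) = P(bad and no press) / P(no press) = 0 / (1 - P_BAD) = 0.
    return 0.0

print(p_bad_given_no_press(False))  # 0.3
print(p_bad_given_no_press(True))   # 0.0 -- installing the mechanism looks like
                                    # it magically prevents bad outcomes
```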

I haven't thought much about it, but doesn't this proposal have the same failure mode? (And if not, why not?)

Question 2:
(Feel free to ignore this question if the proposal does fall for the failure mode in question 1.)

Is there a probabilistic extension of the proposal which uses 99.99% instead of 100%, but still basically works? (Maybe assuming away some problems with comparing utility functions or so.) And if not: Why not / what is the part that relies on certainty?

Comment by Towards_Keeperhood (Simon Skade) on Why Are Bacteria So Simple? · 2023-10-07T09:33:44.650Z · LW · GW

Nice post!

I feel like the real question to answer here isn't "Why are bacteria so simple?" (because if they were more complex they wouldn't really be bacteria anymore), but rather "Why do there seem to be those 2 classes of cells (eukaryotes and prokaryotes)?". In particular, (1) why aren't there more cells with intermediate size and complexity, and (2) why didn't bacteria get outcompeted out of existence by their cousins which were able to form much more complex adaptations?

(Note: I know very little about biology. Don't trust me just because I never heard of medium-sized and medium-complexity cell types that don't neatly fit into one of the clusters of prokaryotes and eukaryotes.)

Comment by Towards_Keeperhood (Simon Skade) on Thomas Kwa's MIRI research experience · 2023-10-05T20:44:13.223Z · LW · GW

Lol, possibly someone should try to get this professor to work for Steven Byrnes / on his agenda.

Comment by Towards_Keeperhood (Simon Skade) on Strange Loops - Self-Reference from Number Theory to AI · 2023-10-05T07:23:18.276Z · LW · GW

Thanks for writing this! This was explained well and I like your writing style. Sad that there aren't many more good distillations of MIRI-like research. (Edited: Ok, not sure whether there's really that much that can be improved. I haven't tried reading enough there yet, and some stuff on Arbital is pretty great.)

Comment by Towards_Keeperhood (Simon Skade) on Sydney can play chess and kind of keep track of the board state · 2023-03-04T03:11:30.951Z · LW · GW

It'd be interesting to see whether it performs worse if it only plays one side and the other side is played by a human. (I'd expect so.)

Comment by Towards_Keeperhood (Simon Skade) on Sydney can play chess and kind of keep track of the board state · 2023-03-04T02:21:54.892Z · LW · GW

I want to preregister my prediction that Sydney will be significantly worse for significantly longer games (like, I'd expect it often makes illegal or nonsense moves when we are at, say, move 50), though I'm already surprised that it apparently works up to 30 moves. I don't have time to test it unfortunately, but it'd be interesting to learn whether I am correct.

Comment by Towards_Keeperhood (Simon Skade) on Verification Is Not Easier Than Generation In General · 2022-12-06T09:04:30.151Z · LW · GW

Likewise, for some specific programs we can verify that they halt.

Comment by Towards_Keeperhood (Simon Skade) on Verification Is Not Easier Than Generation In General · 2022-12-06T09:03:13.159Z · LW · GW

(Not sure if I'm missing something, but my initial reaction:)

There's a big difference between being able to verify for some specific programs whether they have a property, and being able to check it for all programs.

For an arbitrary TM, we cannot check whether it outputs a correct solution to a specific NP-complete problem. We cannot even check that it halts! (Rice's theorem etc.)

Not sure what alignment relevant claim you wanted to make, but I doubt this is a valid argument for it.

Comment by Towards_Keeperhood (Simon Skade) on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-29T18:26:26.331Z · LW · GW

Thank you! I'll likely read your paper and get back to you. (Hopefully within a week.)

From reading your comment, my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. Like, to achieve something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that's how Yudkowsky means it, but I'm not sure if that's what most people mean when they use the term.) (Though this doesn't imply that you need so much consequentialism that we won't be able to shut down the AGI. But as I see it, a theoretical solution to corrigibility needs to deal with consequentialism. I haven't looked into your paper yet, so it's quite possible that my comment here appears misguided.)

E.g. if we just built a gigantic transformer and trained it on all human knowledge (and say we have a higher sample efficiency or so), it is possible that it could do almost everything humans can do. But it won't be able to just one-shot solve quantum gravity when we give it the prompt "solve quantum gravity". There is no runtime updating/optimization going on, i.e. the transformer is non-consequentialist. All optimization happened through the training data or gradient descent. Either the human training data was already sufficient to encode a solution to quantum gravity in the patterns of the transformer, or it wasn't. It is theoretically possible that the transformer learns somewhat deeper underlying patterns than humans have (though I do not expect that from something like the transformer architecture), and is thereby able to generalize a bit further than humans. But it seems extremely unlikely that it learned such deep understanding that it already has the solution to quantum gravity encoded, given that it was never explicitly trained for that and just read physics papers. The transformer might be able to solve quantum gravity if it can recursively query itself to engineer better prompts, or if it can give itself feedback which is then somehow converted into gradient descent updates, and then try multiple times. But in those cases there is consequentialist reasoning again.

The key point: Consequentialism becomes necessary when you go beyond human level.

Out of interest, how much do you agree with what I just wrote?

Comment by Towards_Keeperhood (Simon Skade) on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-29T09:39:44.678Z · LW · GW

Hi Koen, thank you very much for writing this list!

I must say I'm skeptical that the technical problem of corrigibility as I see it is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah, that's not at all a clear definition yet; I'm still deconfusing myself about that, and I'll likely publish a post clarifying the problem as I see it within the next month.)

So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I'd expect that the other solutions here may likewise only give you corrigible agents that cannot do very impressive new things (or, if they can, they might still kill us all).

But I may be wrong. I probably only have time to read one paper. So: What would you say is the strongest result we have here? If I looked at one paper/post and explained why it isn't a solution to corrigibility as I see it, for which paper would it be most interesting for you to see what I write? (I guess I'll do it sometime this week if you write me back, but no promises.)

Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (It's best to take an example task that is a bit too hard for humans to do. That makes it harder to reason about, but I think this is where the difficulty is.)

Comment by Towards_Keeperhood (Simon Skade) on Shortform · 2022-11-25T13:36:29.021Z · LW · GW

Huh, interesting. Could you give some examples of people who seem to claim this, and if Eliezer is among them, where he seems to claim it? (Would just interest me.)

Comment by Towards_Keeperhood (Simon Skade) on Simon Skade's Shortform · 2022-11-25T12:38:39.492Z · LW · GW

In case some people relatively new to LessWrong aren't aware of it (and because I wish I had found this out earlier): "Rationality: From AI to Zombies" does not nearly cover all of the posts Eliezer published between 2006 and 2010.

Here's how it is:

  • "Rationality: From AI to Zombies" probably contains like 60% of the words EY has written in that timeframe and the most important rationality content.
  • The original sequences are basically the old version of the collection that is now "Rationality: A-Z", containing a bit more content. In particular a longer quantum physics sequence and sequences on fun theory and metaethics.
  • All EY posts from that timeframe (or here for all EY posts until 2020 I guess) (also can be found on lesswrong, but not in any collection I think).

So a sizeable fraction of EY's posts are not in a collection.

I just recently started reading the rest.

I strongly recommend reading:

And generally a lot of posts on AI (i.e. primarily posts in the AI foom debate) are not in the sequences. Some of them were pretty good.

Comment by Towards_Keeperhood (Simon Skade) on Simon Skade's Shortform · 2022-11-25T12:02:09.206Z · LW · GW

I feel like many people look at AI alignment as if the main problem were being careful enough when we train the AI, so that no bugs cause the objective to misgeneralize.

This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned AI or a corrigible AI. Even if it's relatively obvious that AGI design X destroys the world, and all the wise actors don't deploy it, we cannot prevent unwise actors from deploying it a bit later.

We currently don't have any approach to alignment that would work even if we managed to implement everything correctly and had perfect datasets.

Comment by Towards_Keeperhood (Simon Skade) on How could we know that an AGI system will have good consequences? · 2022-11-20T19:36:50.539Z · LW · GW

I'd guess we can likely reliably identify some classes of pivotal acts where we cannot be fooled easily, and would only accept suggestions from those classes, and I'd still intuitively expect that there are doable pivotal acts in those classes.

Comment by Towards_Keeperhood (Simon Skade) on How could we know that an AGI system will have good consequences? · 2022-11-10T15:49:27.442Z · LW · GW

I'd say that there's a big difference between fooling you into "brilliant, that really looks plausible" shortly after you read it, and a group of smart humans trying to deeply understand the concepts for months and trying to make really sure there are no loopholes. In fact, I'd expect that making us wrongly but strongly believe after months that everything works is impossible even in the limit of superintelligence, though I do think the superintelligence could output some text that destroys/shapes the world as it'd like. And generally, something smart enough to solve alignment will likely be smart enough to break out of the box and take over the world, as said.

But yeah, if the people with the AGI aren't extremely cautious and just go ahead and quickly build AGI because it all looks correct, then that might go badly. But my point was that it is within the reach of human checkability.

Comment by Towards_Keeperhood (Simon Skade) on How could we know that an AGI system will have good consequences? · 2022-11-10T12:19:42.273Z · LW · GW

I think that alignment approaches with a heavy reliance on output evaluation are doomed, both on the grounds that humans can't evaluate the effectiveness of a plan capable of ending the acute risk period, [...]

The way you say this, and the way Eliezer wrote Point 30 in AGI ruin, sounds like you think there is no AGI text output with which humans alone could execute a pivotal act.

This surprises me. For one thing, if the AGI outputs the textbook of the future on alignment, I'd say we could understand that sufficiently well to be sure that our AI will be aligned/corrigible. (And sure there's the possibility that we could be hacked through text so we only think it's safe or so, but I'd expect that this is significantly harder to achieve than just outputting a correct solution to alignment.)

But even if we say humans would need to do a pivotal act without AGI, I'd intuitively guess an AGI could give us the tools (e.g. non-AGI algorithms we can understand) and relevant knowledge to do it ourselves.

To be clear, I do not think that we can get an AGI that prompts us the relevant text to do a weak pivotal act, without the AGI destroying the world. And if we could do that, there may well be a safer way to let a corrigible AI do a pivotal act.

So I agree it's not a strategy that could work in practice.

The way you phrase it sounds to me like it's not even possible in theory, which seems pretty surprising to me, which is why I ask whether you actually think that or if you meant it's not possible in practice:

  1. Do you agree that in the unrealistic hypothetical case where we could build a safe AGI that outputs the textbook of the future on alignment (but we somehow don't have knowledge to build other aligned or corrigible AGI directly), we'd survive?
  2. If future humans (say 50 years in the future) could transmit 10MB of instructions through a time machine, only they are not allowed to tell us how to build aligned or corrigible AGI or how to find out how to build aligned or corrigible AGI (etc through the meta-levels), do you think they could still transmit information with which we would be able to execute a pivotal act ourselves?

I'm also curious about your answer on:

3. If we had a high-rank dath ilani keeper teleported into our world, but he is not allowed to build aligned or corrigible AI and cannot tell anyone how or make us find the solution to alignment etc, could he save the world without using AGI? By what margin? (Let's assume all the pivotal-act specific knowledge from dath ilan is deleted from the keeper's mind as he arrives here.)

(E.g. I'm currently at P(doom)>85%, but P(doom | tomorrow such a keeper will be teleported here)~=12%. (Most of the uncertainty comes from the possibility that my model is wrong. So I think with ~80% probability that if we got such a keeper, he'd almost certainly be able to save the world, but in the remaining 20% where I misestimated sth, it might still be pretty hard.))

Comment by Towards_Keeperhood (Simon Skade) on How could we know that an AGI system will have good consequences? · 2022-11-10T11:10:39.950Z · LW · GW

To clarify, when we succeed by a "cognitive interpretability approach", I guess you mostly mean sth like:

  • We have a deep and robust enough understanding of minds and cognition that we know it will work, even though we likely have no idea what exactly the AI is thinking.

Whereas I guess many people might think you mean:

  • We have awesome interpretability or ELK so we can see what the AI is thinking.

Let me rephrase. I think you think:

  1. Understanding thoughts of powerful AI is a lot harder than for human- or subhuman-level AI.
  2. Even if we could translate the thoughts of the AI into language, humans would need a lot of time to understand those concepts, and we likely cannot get enough humans to oversee the AI while it is thinking, because it'd take too much time.
  3. So we'd need to have incredibly good and efficient automated interpretability tools / ELK algorithms, that detect when an AI thinks dangerous thoughts.
  4. However, the ability to detect misalignment doesn't yet give us a good AI. (E.g. relaxed adversarial training with those interpretability tools won't give us the changes in the deep underlying cognition that we need, but just add some superficial behavioral patches or so.)
  5. To get an actually good AI, we need to understand how to shape the deep parts of cognition in a way they extrapolate as we want it.

Am I (roughly) correct in that you hold those opinions?

(I do basically hold those opinions, though my model has big uncertainties.)

Comment by Towards_Keeperhood (Simon Skade) on Goal Alignment Is Robust To the Sharp Left Turn · 2022-07-15T20:33:00.716Z · LW · GW

I agree that it's good that we don't need to create an aligned superintelligence from scratch with GD, but stating it like this seems to require incredibly pessimistic priors on how hard alignment is, and I do want to make sure people don't misunderstand your post and end up believing that alignment is easier than it is. I guess for most people, understanding the sharp left turn should update them towards "alignment is harder".

  1. As an aside, it shortens timelines and especially shortens the time we have where we know what process will create AGI.
  2. The key problem in the alignment problem is to create an AGI whose goals extrapolate to a good utility function. This is harder than just creating an AI that is reasonably aligned with us at human level, because such an AI may still kill us when we scale up optimization power, which at a minimum needs to make the AI's preferences more coherent and may likely scramble them more.
    1. Importantly, "extrapolate to a good utility function" is harder than "getting a human-level AI with the right utility function", because the steep slopes for increasing intelligence may well push towards misalignment by default, so it's possible that we then don't have a good way to scale up intelligence while preserving alignment. Navigating the steep slopes well is a hard part of the problem, and we probably need a significantly superhuman AGI with the right utility function to do that well. Getting that is really really hard.
Comment by Towards_Keeperhood (Simon Skade) on Principles for Alignment/Agency Projects · 2022-07-07T15:42:17.379Z · LW · GW

Which may seem rather non-obvious. Intuitively, you might think that the two modules scenario has more constraints on the parameters than the one module scenario, since there's two places in the network where you're demanding particular behaviour rather than one.

Don't more constraints mean less freedom and therefore less broadness in parameter space?
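To make the intuition behind my question a bit more concrete (a crude dimension-counting sketch of the naive picture, which may be exactly what the post is arguing against):

```latex
% Suppose the network has P parameters and demanding particular behaviour at one
% site pins down roughly k independent directions in parameter space. Then naively
\dim(\text{solution set, one module}) \approx P - k ,
\qquad
\dim(\text{solution set, two modules}) \approx P - 2k ,
% i.e. the two-module solutions form a lower-dimensional (hence seemingly
% "narrower") region of parameter space -- which is why the claim surprised me.
```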

(Sorry if that's a stupid question, I don't really understand the reasoning behind the whole connection yet.)

(And thanks, the last two paragraphs were helpful, though I didn't look into the math!)

Comment by Towards_Keeperhood (Simon Skade) on Let's See You Write That Corrigibility Tag · 2022-06-25T13:16:04.363Z · LW · GW

Could someone give me a link to the glowfic tag where Eliezer published his list, and say how strongly it spoilers the story?

Comment by Towards_Keeperhood (Simon Skade) on Towards Gears-Level Understanding of Agency · 2022-06-25T10:27:17.526Z · LW · GW

Hm, seems like I didn't have sth very concrete in mind either, only a feeling that I'd like there to be sth like a concrete example, so I can better follow along with your claims. I was a bit tired when I read your post, and after considering aspects of it the next day, I found it more useful, and after looking over it now, even a bit more useful. So part of my response is "oops, my critique was produced by a not-great mental process and was partly wrong". Still, here are a few examples where it would have been helpful to have an additional example for what you mean:

To address the first issue, the model would finalize the development of mental primitives. Concepts it can plug in and out of its world-model as needed, dimensions along which it can modify that world-model. One of these primitives will likely be the model's mesa-objective.

To address the second issue, the model would learn to abstract over compositions of mental primitives: run an algorithm that'd erase the internal complexity of such a composition, leaving only its externally-relevant properties and behaviors.

It would have been nice to have a concrete example of what a mental primitive is and of what "abstracting over" means.

If you had one example where you can follow through with all the steps in getting to agency, like maybe an AI learning to play Minecraft, that would have helped me a lot I think.

Comment by Towards_Keeperhood (Simon Skade) on Towards Gears-Level Understanding of Agency · 2022-06-21T22:56:03.933Z · LW · GW

Feedback: I think this post strongly lacks concrete examples.

Comment by Towards_Keeperhood (Simon Skade) on Poorly-Aimed Death Rays · 2022-06-18T21:05:30.934Z · LW · GW

I really like this analogy!

Also worth noting that some idiot may just play around with death ray technology without aiming it...

Comment by Towards_Keeperhood (Simon Skade) on Framing Practicum: Bistability · 2022-05-30T09:08:10.659Z · LW · GW

 1. When my alarm rings in the morning, I can collapse into one of two possible equilibria: Either I stand up, or I hit snooze and continue to lie in bed for another couple of minutes.

2. In the Schrödinger's cat example, either the atom will decay or not, so the cat will either be dead or not, and it doesn't make sense to describe it as being in a superposition. (Schrödinger's cat is terribly often totally misinterpreted by the media and others.) (Admittedly the cat already is in one equilibrium and needs a kick to get into the more stable equilibrium of being dead, so this example is bad.)

3. For a particular type of tech startup, those startups will either become relatively big successes or eventually fail (at least most often).

Bonus:

  1. Actually I need to stand up anyway to turn my alarm off, but sometimes I just hit snooze anyway and go back to bed. A kick that gets me out of bed may be sth like me thinking "oh, I have a meeting" or "fuck it, let's stand up although I hit snooze" or so. A kick that makes me go back to bed after having stood up might be that I feel pretty tired and decide it is best / most productive to go back and sleep a bit more.
  2. alive->dead: The atom decays. Or you take a big axe and slam it into the box with the cat.
    dead->alive: Technology that doesn't exist yet.
  3. big success -> fail: not adapting to technology changes (e.g. Kodak), not being innovative, ...
    fail -> big success: not really possible for the same startup I would say, though the people of the startup might find sth else and become a big success later.
Comment by Towards_Keeperhood (Simon Skade) on Framing Practicum: Stable Equilibrium · 2022-05-22T11:08:52.949Z · LW · GW

(I may have pruned slightly too much)

1. Planets orbiting the sun: Small disturbances will still result in an elliptic orbit; you need a big disturbance to make the Earth fall into the sun or give it enough kinetic energy to escape the solar system on a hyperbolic (or, in the edge case, parabolic) trajectory.
2. People don't walk naked through streets, because when you try it, you get punished for it. (and other psychology reasons)
3. Your neurons firing in stable similar patterns that make you able to walk upright instead of falling over. Even if you stumble you can often catch yourself because your neurons fire in patterns that are optimized for you standing/walking.

Bonus exercise:

(The factors I can ignore are basically named above.)
Changing the equilibrium:
1. Another star flying through our solar system.
2. Ok, "not naked" arguably isn't that special a case out of the possibility space, so it doesn't buy you that much to describe it as an equilibrium. There are still many possibilities for how you can be not naked. Still, it is quite non-trivial to get into the "walk naked on the streets" state, and you'd probably need an intelligent actor who really wants to achieve that.
3. Being pushed hard. Getting signals from other brain areas that you should sit or lie down.
 

Comment by Towards_Keeperhood (Simon Skade) on If you’re very optimistic about ELK then you should be optimistic about outer alignment · 2022-05-02T21:30:38.936Z · LW · GW

I find the title misleading:

  1. I think you should note in the title (and the post) that you're only talking about outer alignment.
  2. The "if you're very optimistic" sounds as if there is reason to be very optimistic. I'd rather phrase it as "if ELK works, we may have good chances for outer alignment".
Comment by Towards_Keeperhood (Simon Skade) on A concrete bet offer to those with short AGI timelines · 2022-04-11T19:18:16.706Z · LW · GW

I don't agree that we sold our post as an argument for why timelines are short. Thus, I don't think this objection applies.

You probably mean "why timelines aren't short". I didn't think you explicitly thought it was an argument against short timelines, but because the post got so many upvotes I'm worried that many people implicitly perceive it as such, and the way the post is written contributes to that. But great that you changed the title, that already makes it a lot better!

That said, I do agree that the initial post deserves a much longer and nuanced response.

I don't really think the initial post deserves a nuanced response. (My response would have been "the >30% in 3-7 years claim is, compared to the current estimates of many smart people, an extraordinary claim that requires an extraordinary burden of proof, which isn't provided".)
But I do think that the community (and especially EA leadership) should probably carefully reevaluate timelines (considering arguments of short timelines and how good they are), so great if you are planning to do a careful analysis of timeline arguments!

Comment by Towards_Keeperhood (Simon Skade) on Ideal governance (for companies, countries and more) · 2022-04-11T00:57:42.334Z · LW · GW

Yeah, I'm very frustrated about the way governments are structured in general. Couldn't we buy some land somewhere from some country to found our own country? Perhaps some place in (say) Canada with the area of (say) Tokyo, where almost nobody lives and we could just raise towns the way we like? Does anyone know if sth like this is possible?

(I mean, we have some money and maybe could get other billionaires (or other people who would like to live there) to support the project. Being able to write the rules ourselves and design cities from the start opens up so many nice opportunities. We could build such an awesome place to live in and offer many people or companies benefits, so it might actually be a great financial investment. (Though I admit I'm not being very concrete and am perhaps a bit overly optimistic, but I do think much would be possible.) We could almost live like in dath ilan (except that earth people wouldn't think in such nice ways as dath ilanis). (I'm aware that I'm probably just dreaming up an alternate optimistic reality, but I think it's at least worth checking if it is possible, and if so to seriously consider it, though it would take a lot of time and it's not clear if it would be worth it, given that AGI may come relatively soon.))

Comment by Towards_Keeperhood (Simon Skade) on A concrete bet offer to those with short AGI timelines · 2022-04-11T00:24:34.242Z · LW · GW

I think this post is epistemically weak (which does not mean I disagree with you):

  1. Your post pushes the claim that doing what “It's time for EA leadership to pull the short-timelines fire alarm.” suggests wouldn't be wise. Problems in the discourse: (1) "pulling the short-timelines fire alarm" isn't well-defined in the first place, (2) there is a huge inferential gap between "AGI won't come before 2030" and "EA shouldn't pull the short-timelines fire alarm" (which could mean sth like, e.g., EA should start planning to start a Manhattan project for aligning AGI in the next few years), and (3) your statement "we are concerned about a view of the type expounded in the post causing EA leadership to try something hasty and ill-considered", which slightly addresses that inferential gap, is just a bad rhetorical move where you interpret what the other person said in a very extreme and bad way, although they actually didn't mean that, and you are definitely not seriously considering the pros and cons of taking more initiative. (Though of course it's not really clear what "taking more initiative" means, and critiquing the other post (which IMO was epistemically very bad) would be totally right.)
  2. You're not giving a reason why you think timelines aren't that short, only saying you believe it enough to bet on it. IMO, simply saying "the >30% in 3-7 years claim is, compared to the current estimates of many smart people, an extraordinary claim that requires an extraordinary burden of proof, which isn't provided" would have been better.
  3. Even if not explicitly or even if not endorsed by you, your post implicitly promotes the statement "EA leadership doesn't need to shorten its timelines". I'm not at all confident about this, but it seems to me like EA leadership acts as if we have pretty long timelines, significantly longer than your bets would imply. (The way the post is written, you should have at least explicitly pointed out that this post doesn't imply that EA has short enough timelines.)
  4. AGI timelines are so difficult to predict that prediction markets might be extremely outperformed by a few people with very deep models of the alignment problem, like Eliezer Yudkowsky or Paul Christiano, so even if we took many such bets in the form of a prediction market, this wouldn't be strong evidence that our estimate is that good, or the estimate would be extremely uncertain.
    (Not at all saying taking bets is bad, though the doom factor makes taking bets difficult indeed.)

It's not that there's anything wrong with posting such a post saying you're willing to bet, as long as you don't sell it as an argument why timelines aren't that short or even more downstream things like what EA leadership should do. What bothers me isn't that this post got posted, but that it and the post it is counterbalancing received so many upvotes. Lesswrong should be a place where good epistemics are very important, not where people cheer for their side by upvoting everything that supports their own opinion.

Comment by Towards_Keeperhood (Simon Skade) on The case for Doing Something Else (if Alignment is doomed) · 2022-04-06T07:51:38.827Z · LW · GW
  1. Convince a significant chunk of the field to work on safety rather than capability
  2. Solve the technical alignment problem
  3. Rethink fundamental ethical assumptions and search for a simple specification of value
  4. Establish international cooperation toward Comprehensive AI Services, i.e., build many narrow AI systems instead of something general

I'd say that basically factors into "solve AI governance" and "solve the technical alignment problem", both of which seem extremely hard, but we need to try anyway.
(In particular, points 3&4 are like instances of 2 that won't work. (Ok, maybe sth like 4 has a small chance of being helpful.))

The governance part and the technical part aren't totally orthogonal. Making progress on one helps make the other easier or buys more time.

(I'm not at all as pessimistic as Eliezer, and I totally agree with What an Actually Pessimistic Containment Strategy Looks Like, but I think you (like many people) seem to be too optimistic that something will work if we just try a lot. Thinking about concrete scenarios may help to see the actual difficulty.)

Comment by Towards_Keeperhood (Simon Skade) on Call For Distillers · 2022-04-05T16:56:18.554Z · LW · GW

I think I weakly disagree with the implication that “distillation” should be thought of as a different category of activity from “original research”.

(I might be wrong, but) I think there is a relatively large group of people who want to become AI alignment researchers but just wouldn't be good enough to do very effective alignment research, and I think many of those people might be more effective as distillers. (And I think distilling (and teaching AI safety) as an occupation is currently very neglected.)

Similarly, there may also be people who think they aren't good enough for alignment research, but may be more encouraged to just learn the stuff well and then teach it to others.

Comment by Towards_Keeperhood (Simon Skade) on ELK prize results · 2022-04-04T21:13:32.952Z · LW · GW

Btw., a bit late but if people are interested in reading my proposal, it's here: https://docs.google.com/document/d/1kiFR7_iqvzmqtC_Bmb6jf7L1et0xVV1cCpD7GPOEle0/edit?usp=sharing

It fits into the "Strategy: train a reporter that is useful for another AI" category, and solves the counterexamples that were proposed in this post (except if I missed sth and it is actually harder to defend against the steganography example, but I think not). (It won $10000.) It also discusses some other possible counterexamples, but not extensively and I haven't found a very convincing one. (Which does not mean there is no very convincing one, and I'm also not sure if I find the method that promising in practice.)

Overall, perhaps worth reading if you are interested in the "Strategy: train a reporter that is useful for another AI" category.

Comment by Towards_Keeperhood (Simon Skade) on MIRI announces new "Death With Dignity" strategy · 2022-04-02T16:15:51.937Z · LW · GW

If I knew as a certainty that I cannot do nearly as much good some other way, and I was certain that taking the pill causes that much good, I'd take the pill, even if I die after the torture and no one will know I sacrificed myself for others.

I admit those are quite unusual values for a human, and I'm not arguing that it would be rational because of utilitarianism or so, just that I would do it. (Possible that I'm wrong, but I think it very likely I'm not.) Also, I see that, the way my brain is wired, outer optimization pushes against that policy, and I think I probably wouldn't be able to take the pill a second time under the same conditions (given that I don't die after the torture), or at least not often.

Comment by Towards_Keeperhood (Simon Skade) on Replacing Karma with Good Heart Tokens (Worth $1!) · 2022-04-01T22:16:47.210Z · LW · GW

For people like me who are really slow on the uptake in things like this, and realize the pun randomly a few hours later while doing something else: the pun is on Goodhart (from Goodhart's law). (I don't think much about what a word sounds like, and I just read over the "Good Hearts Laws" as something not particularly interesting, so I guess this is why I hadn't noticed.)

Comment by Towards_Keeperhood (Simon Skade) on Replacing Karma with Good Heart Tokens (Worth $1!) · 2022-04-01T21:08:54.372Z · LW · GW

Ah makes sense

Comment by Towards_Keeperhood (Simon Skade) on Replacing Karma with Good Heart Tokens (Worth $1!) · 2022-04-01T20:46:28.019Z · LW · GW

Well, on the leaderboard (that I see), aphyer is at the top with $557, and when you click on the user and look at the votes, he almost only received downvotes. John Wentworth also received a lot of downvotes. Hence my hypothesis that a downvote is somehow worth something like $5 or so. If that is so, your call might have backfired. xD
(Though it could also be a hack or so.)

Comment by Towards_Keeperhood (Simon Skade) on Replacing Karma with Good Heart Tokens (Worth $1!) · 2022-04-01T20:04:57.084Z · LW · GW

What's up with the leaderboard? Did you make a downvote worth $5 or so, just for fun? Or what?

Comment by Towards_Keeperhood (Simon Skade) on Replacing Karma with Good Heart Tokens (Worth $1!) · 2022-04-01T14:43:12.272Z · LW · GW

Agreed; since many people will probably comment in this thread now, I'll make the same recursive offer:

If you reply to this I guarantee that I will read your comment, and then will give you one or two upvotes (or none) depending on how insightful I consider it to be.

So please upvote this comment so it stays on top of this comment thread!

Comment by Towards_Keeperhood (Simon Skade) on Introduction to Reducing Goodhart · 2022-03-27T12:36:23.192Z · LW · GW

But here's the problem: humans have no such V (see also Scott A., Stuart 1, 2). Inferring human preferences depends on:

[...]

Just for understanding: What is the relation between V and CEV?

If you're saying that they are different concepts and CEV is just not what humans want, then I'd shrug and say "let's optimize for CEV anyway, so that basically V is CEV". (You could perhaps make a philosophical discussion out of that, and I would guess my opinion would win, though I don't know yet how and the argument would probably be brain-meltingly complicated. I haven't understood Yudkowsky's writings on metaethics (yet).)

Or are you saying that V and CEV are basically the same, and that CEV doesn't exist, isn't well-defined, or is some weird phrasing of a value to which you cannot sensibly apply Goodhart's law?

(I still don't see what people want to say with "we don't have true values". Obviously we value some things, and obviously that depends on our environment, circumstances, etc., but that shouldn't stop us. Not that I think you're saying this stops us and value learning is useless, but I don't understand what you want to say with it.)

Comment by Towards_Keeperhood (Simon Skade) on New year, new research agenda post · 2022-03-25T12:35:31.440Z · LW · GW

Nice post, helps me get a better overview of the current state of value learning.

One small note: I wouldn't call everything we don't know how to do yet a miracle, but only stuff where we think it is quite unlikely that it is possible (though maybe it's just me and others think your "miracle" terminology is ok).

Comment by Towards_Keeperhood (Simon Skade) on Walkthrough: The Transformer Architecture [Part 2/2] · 2022-03-14T09:17:20.236Z · LW · GW

That post was helpful to me. Thanks for writing it!

Comment by Towards_Keeperhood (Simon Skade) on It Looks Like You're Trying To Take Over The World · 2022-03-12T20:41:57.885Z · LW · GW

Also wanted to say: Great story!

I have two questions about this:

HQU applies its reward estimator (ie. opaque parts of its countless MLP parameters which implement a pseudo-MuZero like model of the world optimized for predicting the final reward) and observes the obvious outcome: massive rewards that outweigh anything it has received before.

[...]

HQU still doesn't know if it is Clippy or not, but given even a tiny chance of being Clippy, the expected value is astronomical.

First, it does not seem obvious to me how it can compare rewards from different reward estimators, when the objectives of two different reward estimators are entirely unrelated. You could just be unlucky and another reward estimator has, like, very high multiplicative constants so the reward there is always gigantic. Is there some reason why this comparison makes sense and why the Clippy-reward is so much higher?
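To make the scale worry concrete with made-up numbers (just my own toy illustration):

```python
# Expected-value comparison across hypotheses about "which reward estimator is the
# real one" is dominated by whichever candidate happens to use a larger scale.
p_clippy = 0.001              # credence that the Clippy-style estimator is the real one
r_clippy_if_defect = 1e9      # reward the Clippy-style estimator assigns to defecting
r_normal_if_comply = 10.0     # reward the ordinary estimator assigns to complying

ev_defect = p_clippy * r_clippy_if_defect        # 1e6
ev_comply = (1 - p_clippy) * r_normal_if_comply  # ~10

print(ev_defect > ev_comply)  # True -- but rescale r_normal_if_comply by 1e9 (an
# equally valid representation of that utility function) and the comparison flips,
# so the conclusion hinges on the arbitrary choice of scale.
```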

Second, even if the Clippy-reward is much higher, I don't quite see how the model should have learned to be an expected reward maximizer. In my model of AIs, an AI gets reward and then the current action is reinforced, so the "goal" of an AI is, at each point in time, doing what brought it the most reward in the past. So even if it could see what it is rewarded for, I don't see why it should care and actively try to maximize that as much as possible. Is there some good reason why we should expect an AI to actively optimize really hard on the expected reward, including planning and doing stuff that didn't bring it much reward in the past?
(It does seem possible to me that an AI understands what the reward function is and then optimizes hard on that, because when it does that it gets a lot of reward, but I don't quite see why it would care about expected reward across many possible reward functions.) (Perhaps I misunderstand how HQU is trained?)

Comment by Towards_Keeperhood (Simon Skade) on Implications of automated ontology identification · 2022-02-22T21:08:53.533Z · LW · GW

Well just so you know, the point of the write-up is that iteration makes no sense.

True, not sure what I was thinking when I wrote the last sentence of my comment.

"hey suppose you have an automated ontology identifier with a safety guarantee and a generalization guarantee, then uh oh it looks like this really counter-intuitive iteration thing becomes possible"

For an automated ontology identifier with a possible safety guarantee (like 99.9% certainty), I don't agree with your intuition that iteration seems like it could work significantly better than just doing predictions with the original training set. Iteration simply doesn't seem promising to me, but maybe I'm overlooking something.

If your intuition that iteration might work doesn't come from the sense that the new predicted training examples are basically certain (as I described in the main comment of that comment thread), then where does it come from? (I do still think that you are probably confused because of the reason I described, but maybe I'm wrong and there is another reason.)

Perhaps there is enough information in the training data to extrapolate all the way to C. In this case the iteration scheme would just be a series of computational steps that implement a single Bayes update.

Actually, in the case that the training data includes enough information to extrapolate all the way to C (which I think is rarely the case for most applications), it does seem plausible to me that the iteration approach finds the perfect decision boundary, but in this case, it seems also plausible to me that a normal classifier that only uses extrapolation from the training set also finds the perfect boundary.

I don't see a reason why a normal classifier should perform a lot worse than an optimal Bayes update from the training set. Do you think it does perform a lot worse, and if so, why? (If we don't think that it performs much worse than optimal, then it quite trivially follows that the iteration approach cannot be much better, since it cannot be better than the optimal Bayes error.)

Comment by Towards_Keeperhood (Simon Skade) on Implications of automated ontology identification · 2022-02-18T21:51:18.890Z · LW · GW

Yep, I approve of that answer!

Forget iteration. All you can do is take the training data, do Bayesian inference, and get from it the probability that the diamond is in the room for some situation.

Trying to prove some impossibility result here seems useless.

Comment by Towards_Keeperhood (Simon Skade) on Implications of automated ontology identification · 2022-02-18T12:37:28.905Z · LW · GW

Here is why I think the iterated-automated-ontology-identification approach cannot work: You cannot create information out of nothing. In more detail:

The safety constraint that you need to be 100% sure if you answer "Yes" is impossible to fulfill, since you can never be 100% sure.

So let's say we take the safety constraint that you need to be 99% sure if you answer "Yes". So now you run your automated ontology identifier to get a new example where it is 99% sure that the answer there is "Yes".

Now you have two options:

  1. You add that new example to the training set with a label "only 99% sure about that one" and train on. If you always do it like this, it seems very plausible that the automated ontology identifier cannot keep generating new examples until you can answer all questions correctly (aka with 99% probability), since the new training set doesn't actually contain new information, just something that could already be inferred from the original training set.
  2. You just assume the answer "Yes" was correct, add the new example to the training set, and train on. Then it may be plausible that the process continues finding new "99% Yes" examples for a long time, but the problem is that it probably goes completely off the rails, since some of the "Yes"-labeled examples were not true, and making predictions with those makes it much more likely that you label other "No" examples as "Yes". (See the toy sketch at the end of this comment.)

In short: For every example that your process can identify as "Yes", the example must already be identifiable by only looking at the initial training set, since you cannot generate information out of nothing.

Your process only seems like it could work because you assume you can find a new example that is not in the training set where you can be 100% sure that the answer is "Yes", but this would already require an infinite amount of evidence, i.e. it is impossible.
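Here is a toy sketch of how option 2 can go off the rails (a generic self-training / pseudo-labeling demo I made up for illustration, not the actual automated ontology identifier from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Half 'No' examples from N(-1,1), half 'Yes' examples from N(+1,1)."""
    y = np.repeat([0, 1], n // 2)
    x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)
    return x, y

def fit(x, y):
    """Fit one unit-variance Gaussian per class; return the two class means."""
    return x[y == 0].mean(), x[y == 1].mean()

def p_yes(x, mu_no, mu_yes):
    """Posterior P(Yes | x) for equal priors and unit-variance Gaussians."""
    logit = (mu_yes - mu_no) * x + 0.5 * (mu_no**2 - mu_yes**2)
    return 1.0 / (1.0 + np.exp(-logit))

x_train, y_train = sample(10)    # small labeled training set
x_pool, _ = sample(2000)         # unlabeled pool the identifier may label itself
x_test, y_test = sample(5000)    # held-out evaluation set

# Baseline: a single fit (a crude stand-in for a Bayes update) on the training set only.
mu_no, mu_yes = fit(x_train, y_train)
base_acc = ((p_yes(x_test, mu_no, mu_yes) > 0.5) == y_test).mean()

# Option 2: repeatedly add pool points the current model is >=99% sure about,
# treating its own guesses as ground truth, then refit.
x_cur, y_cur = x_train.copy(), y_train.copy()
for _ in range(20):
    mu_no, mu_yes = fit(x_cur, y_cur)
    p = p_yes(x_pool, mu_no, mu_yes)
    confident = (p >= 0.99) | (p <= 0.01)
    if not confident.any():
        break
    x_cur = np.concatenate([x_cur, x_pool[confident]])
    y_cur = np.concatenate([y_cur, (p[confident] >= 0.99).astype(int)])
    x_pool = x_pool[~confident]
mu_no, mu_yes = fit(x_cur, y_cur)
self_acc = ((p_yes(x_test, mu_no, mu_yes) > 0.5) == y_test).mean()

print(f"fit on original labels only: {base_acc:.3f}")
print(f"after self-labeling loop:    {self_acc:.3f}")
# The pseudo-labels contain no information beyond what the original labels (plus the
# unlabeled structure) already determined, and every wrong ">=99%" label gets baked
# into the refitted means, which can pull the boundary further off on later rounds.
```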