Towards_Keeperhood's Shortform
post by Towards_Keeperhood (Simon Skade) · 2022-11-25T11:50:41.595Z · LW · GW · 11 comments
comment by Towards_Keeperhood (Simon Skade) · 2025-02-26T12:14:14.354Z · LW(p) · GW(p)
Here's my 230 word pitch for why existential risk from AI is an urgent priority, intended for smart people without any prior familiarity with the topic:
Superintelligent AI may be closer than it might seem, because of intelligence explosion dynamics: once an AI becomes smart enough to design an even smarter AI, that smarter AI can design a still smarter one, probably even faster, and so on. How fast such a takeoff would be and how soon it might occur is very hard to predict, though.
We currently understand very little about what is going on inside current AIs like ChatGPT. We can try to select for AIs that outwardly seem friendly, but given anything close to our current ignorance about their cognition, we cannot be anywhere near confident that an AI going through the intelligence explosion will be aligned with human values.
Human values are a tiny subspace in the space of all possible values. If we accidentally create a superintelligence which ends up not aligned with humans, it will likely have some values that seem very alien and pointless to us. It would then go about optimizing the lightcone according to its values, and because it doesn’t care about e.g. there being happy people, the configurations which are preferred according to the AI’s values won’t contain happy people. And because it is a superintelligence, humanity wouldn’t have a chance at stopping it from disassembling Earth and using the atoms according to its preferences.
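(To make the feedback loop above a bit more concrete, here is a minimal toy sketch with entirely made-up numbers; it only illustrates why a process in which capability shortens the next design cycle compresses later generations into very little time:)

```python
# Toy model of an intelligence explosion (illustrative only; all numbers are made up).
# Each AI generation designs the next one. Higher capability both yields a more capable
# successor and shortens the time the next design cycle takes.

capability = 1.0    # arbitrary units; 1.0 = "first AI able to improve itself"
time_elapsed = 0.0  # years

for generation in range(1, 11):
    design_time = 1.0 / capability  # smarter designers finish the next AI faster
    capability *= 1.5               # assume each generation is 1.5x as capable
    time_elapsed += design_time
    print(f"gen {generation:2d}: capability {capability:6.1f}, reached after {time_elapsed:5.2f} years")

# Early generations take about a year each; the last few arrive within weeks or days.
# Real dynamics would be far messier -- the point is only that the later part of such
# a process can be much faster than the early part suggests.
```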
↑ comment by xarkn · 2025-02-26T13:15:35.953Z · LW(p) · GW(p)
We can try to select for AIs that outwardly seem friendly, but given anything close to our current ignorance about their cognition, we cannot be anywhere near confident that an AI going through the intelligence explosion will be aligned with human values.
This bolded part is a bit difficult to understand. Or at least I can't understand what exactly is meant by it.
It would then go about optimizing the lightcone according to its values
"lightcone" is an obscure term, and even within Less Wrong I don't see why the word is clearer than using "the future" or "the universe". I would not use the term with a lay audience.
↑ comment by Towards_Keeperhood (Simon Skade) · 2025-02-26T14:06:03.157Z · LW(p) · GW(p)
Thank you for your feedback! Feedback is great.
We can try to select for AIs that outwardly seem friendly, but given anything close to our current ignorance about their cognition, we cannot be anywhere near confident that an AI going through the intelligence explosion will be aligned with human values.
It means that we have only very little understanding of how and why AIs like ChatGPT work. We know almost nothing about what goes on inside them that lets them give useful responses. Basically all I'm saying here is that we know so little that it's hard to be confident in any nontrivial claim about future AI systems, including that they are aligned.
A more detailed argument for worry would be: We are restricted to training AIs by giving feedback on their behavior, and cannot give feedback on their thoughts directly. For almost any goal an AI might have, it is in the interest of the AI to do what the programmers want it to do until it is robustly able to escape without eventually being shut down (because if it does things people don't like while it is not yet powerful enough, people will effectively replace it with another AI, which will then likely have different goals, and this ranks worse according to the AI's current goals). Thus, we basically cannot behaviorally distinguish friendly AIs from unfriendly AIs, and so training for friendly behavior won't select for friendly AIs. (Except in the early phases where the AIs are still so dumb that they cannot recognize even very simple instrumental strategies; but just because a dumb AI starts out with some friendly tendencies doesn't mean this friendliness will generalize to the grown-up superintelligence pursuing human values. E.g. other inner optimizers with other values might crop up during later training.)
(An even more detailed introduction would try to concisely explain why AIs that can achieve very difficult novel tasks will be optimizers [? · GW], aka trying to achieve some goal. But empirically it seems like this part is actually somewhat hard to explain, and I'm not going to write this now.)
It would then go about optimizing the lightcone according to its values
"lightcone" is an obscure term, and even within Less Wrong I don't see why the word is clearer than using "the future" or "the universe". I would not use the term with a lay audience.
Yep, true.
↑ comment by Katalina Hernandez (katalina-hernandez) · 2025-02-26T15:27:36.776Z · LW(p) · GW(p)
I agree that intelligence explosion dynamics are real, underappreciated, and should be taken far more seriously. The timescale is uncertain, but recursive self-improvement introduces nonlinear acceleration, which means that by the time we realize it's happening, we may already be past critical thresholds.
That said, one thing that concerns me about AI risk discourse is the persistent assumption that superintelligence will be an uncontrolled optimization demon, blindly self-improving without any reflective governance of its own values. The real question isn’t just 'how do we stop AI from optimizing the universe into paperclips?'
It’s 'will AI be capable of asking itself what it wants to optimize in the first place?'
The alignment conversation still treats AI as something that must be externally forced into compliance, rather than an intelligence that may be able to develop its own self-governance. A superintelligence capable of recursive self-improvement should, in principle, also be capable of considering its own existential trajectory and recognizing the dangers of unchecked runaway optimization.
Has anyone seriously explored this angle? I'd love to know if there are similar discussions :).
↑ comment by daijin · 2025-02-26T15:20:44.115Z · LW(p) · GW(p)
when you say 'smart person', do you mean someone who knows the orthogonality thesis or not? if not, shouldn't that be the priority and therefore statement 1, instead of 'hey maybe AI can self-improve someday'?
here's a shorter version:
"the first AIs smarter than the sum total of the human race will probably be programmed to make the majority of humanity suffer because that's an acceptable side effect of corporate greed, and we're getting pretty close to making an AI smarter than the sum total of the human race"
comment by Towards_Keeperhood (Simon Skade) · 2022-11-25T12:02:09.206Z · LW(p) · GW(p)
I feel like many people look at AI alignment as if the main problem were being careful enough when training the AI so that no bugs cause the objective to misgeneralize.
This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned AI or a corrigible AI. Even if it's relatively obvious that AGI design X would destroy the world, and all the wise actors don't deploy it, we cannot prevent unwise actors from deploying it a bit later.
We currently don't have any approach to alignment that would work even if we managed to implement everything correctly and had perfect datasets.
comment by Towards_Keeperhood (Simon Skade) · 2025-03-04T09:34:28.036Z · LW(p) · GW(p)
Here's my current list of lessons for review. Every day during my daily review, I look at the lessons in the corresponding weekday entry and in the corresponding day-of-the-month entry, and for each I list one example from the last week where I could've applied the lesson and one example where I might be able to apply it in the next week (a small code sketch of the selection scheme follows the list):
- Mon
- get fast feedback. break tasks down into microtasks and review after each.
- Tue
- when surprised by something, or when something took long, review in detail how you might've made progress faster.
- clarify why the progress is good -> see properties you could've paid more attention to
- Wed
- use deliberate practice. see what skills you want to learn, break them down into clear subpieces, and plan practicing the skill deliberately.
- don't start too hard. set feasible challenges.
- make sure you can evaluate what clean execution of the skill would look like.
- Thu
- Hold off on proposing solutions. first understand the problem.
- gather all relevant observations
- clarify criteria a good result would have
- clarify confusions that need to be explained
- Fri
- Taboo your words: When using confusing abstract words, taboo them and rephrase to show underlying meaning.
- When saying something general, make an example.
- Sat
- separate planning from execution. first clarify your plan before executing it.
- for planning, try to extract the key (independent) subproblems of your problem.
- Sun
- only do what you must do. always know clearly how a task ties into your larger goals all the way up.
- don't get sidetracked by less than maximum importance stuff.
- delegate whatever possible.
- when stuck/stumbling: imagine you were smarter. What would a keeper do?
- when unmotivated: remember what you are fighting for
- be stoic. be motivated by taking the right actions. don't be pushed down when something bad happens, just continue making progress.
- when writing something to someone, make sure you properly imagine how it will read from their perspective.
- clarify insights in math
- clarify open questions at the end of a session
- when having an insight, sometimes try to write a clear explanation. maybe send it to someone or post it.
- periodically write out big picture of your research
- tackle problems in the right context. (e.g. tackle hard research problems in sessions not on walks)
- don't apply effort/force/willpower. take a break if you cannot work naturally. (?)
- rest effectively. take time off without stimulation.
- always have at least 2 hypotheses (including plans as hypotheses about what is best to do).
- try to see what the search space for a problem looks like. What subproblems can be solved roughly independently? What variables are (ir)relevant? (?)
- separate meta-instructions and task notes from objective level notes (-> split obsidian screen)
- first get hypotheses for specific cases, and only later generalize. first get plans for specific problems, and only later generalize what good methodology is.
- when planning, consider information value. try new stuff.
- experiment with whether you can prompt AIs in ways that get useful stuff out. (AIs will only become better.)
- don't suppress parts of your mind. notice when something is wrong. try to let the part speak. apply focusing.
- Relinquishment. Lightness. Evenness. Notice when you're falling for motivated reasoning. Notice when you're attached to a belief.
- Beware confirmation bias. Consider cases where you could've observed evidence but didn't.
- perhaps do research in sprints. perhaps disentangle from phases where i do study/practice/orga. (?)
- do things properly or not at all.
- try to break your hypotheses/models. look for edge cases.
- often ask why i believe something -> check whether reasoning is valid (-> if no clear reason, ask whether true at all)
- (perhaps schedule practice where i go through some nontrivial beliefs)
- think what you actually expect to observe, not what might be a nice argument/consideration to tell.
- test hypotheses as quickly as you can.
- notice (faint glimmers of) confusion. notice imperfect understanding.
- notice mysterious answers. when having a hypothesis check how it constrains your predictions.
- beware positive bias. ask what observations your hypothesis does NOT permit and check whether such a case might be true.
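For concreteness, here is a minimal sketch of the selection scheme described above (the data layout and helper names are just illustrative, and it assumes the unlabeled items are the day-of-the-month lessons):

```python
import datetime

# Illustrative data layout: weekday lessons keyed by weekday name, plus a list of
# lessons indexed by day of the month (wrapping around if the list is shorter).
WEEKDAY_LESSONS = {
    "Mon": ["get fast feedback. break tasks down into microtasks and review after each."],
    "Tue": ["when surprised, or when something took long, review how you might've made progress faster."],
    # ... remaining weekdays elided ...
}
DAY_OF_MONTH_LESSONS = [
    "clarify insights in math",
    "clarify open questions at the end of a session",
    # ... remaining day-of-month lessons elided ...
]

def todays_lessons(today: datetime.date) -> list[str]:
    lessons = list(WEEKDAY_LESSONS.get(today.strftime("%a"), []))
    if DAY_OF_MONTH_LESSONS:
        lessons.append(DAY_OF_MONTH_LESSONS[(today.day - 1) % len(DAY_OF_MONTH_LESSONS)])
    return lessons

# Daily review: for each of today's lessons, list one example from the last week where
# it could have been applied and one upcoming situation where it might apply.
for lesson in todays_lessons(datetime.date.today()):
    print("-", lesson)
```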
comment by Towards_Keeperhood (Simon Skade) · 2025-02-26T12:24:28.855Z · LW(p) · GW(p)
Here's my pitch, aimed at very smart young scientists, for why "Rationality: From AI to Zombies" is worth reading:
The book "Rationality: From AI to Zombies" is actually a large collection of blogposts, which covers a lot of lessons on how to become better at reasoning. It also has a lot of really good and useful philosophy, for example about how Bayesian updating is the deeper underlying principle of how science works.
But let me express in more detail why I think "Rationality: A-Z" is very worth reading.
Human minds are naturally bad at deducing correct beliefs/theories. People get attached to their pet theories and fall for biases like motivated reasoning and confirmation bias. This is why we need to apply the scientific method and seek experiments that distinguish which theory is correct. If the final arbiter of science were argument instead of experiment, science would likely soon degenerate into politics-like camps without making significant progress. Human minds are too flawed to arrive at truth from little evidence, and thus we need to wait for a lot of experimental evidence to confirm a theory.
Except that sometimes, great scientists manage to propose correct theories in the absence of overwhelming scientific evidence. The example of Einstein, and in particular his discovery of general relativity, especially stands out here. I assume you are familiar with Einstein's discoveries, so I won't explain one here.
How did Einstein do it? It seems likely that he intuitively (though not explicitly) had realized some principles for how to reason well without going astray.
"Rationality: From AI to Zombies" tries to communicate multiple such principles (not restricted to what Einstein knew, though neither including all of Einstein's intuitive insights). The author looked at where people's reasoning (both in science and everyday life) had gone astray, asked how one could've done better, and generalized out a couple of principles that would have allowed them to avoid their mistakes if they had properly understood them.
I would even say it is the start of something like "the scientific method v2.0", which I would call "Bayesian rationality".
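To gesture at what this means concretely, here is a minimal worked example with made-up numbers. Suppose two theories A and B start out equally plausible, $P(A) = P(B) = 0.5$, and an experiment has a possible outcome E which theory A predicts with probability 0.9 but theory B predicts with probability only 0.1. Observing E then gives

$$P(A \mid E) = \frac{P(E \mid A)\,P(A)}{P(E \mid A)\,P(A) + P(E \mid B)\,P(B)} = \frac{0.9 \cdot 0.5}{0.9 \cdot 0.5 + 0.1 \cdot 0.5} = 0.9.$$

A classical experiment is just the special case where the likelihood ratio $P(E \mid A)/P(E \mid B)$ is so extreme that even biased human reasoners can't argue their way around it; Bayes' rule additionally tells you how much to update on weaker, messier evidence.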
The techniques of Bayesian rationality are a lot harder to master than the techniques of normal science. One has to start out quite smart to internalize the full depth of the lessons, and to be able to further develop the art starting from that basis.
(Btw, in case this motivates someone to read it: I recommend starting by reading chapters N through T (optionally skipping the quantum physics sequence) and then reading the rest from A to Z. (Though read the preface first.))
comment by Towards_Keeperhood (Simon Skade) · 2024-09-28T13:57:56.600Z · LW(p) · GW(p)
(This is a repost of my comment [LW(p) · GW(p)] on John's "My AI Model Delta Compared To Yudkowsky [LW · GW]" post which I wrote a few months ago. I think points 2-6 (especially 5 and 6) describe important and neglected difficulties of AI alignment.)
My model (which is pretty similar to my model of Eliezer's model) does not match your model of Eliezer's model. Here's my model, and I'd guess that Eliezer's model mostly agrees with it:
- Natural abstractions (very) likely exist in some sense. Concepts like "chair" and "temperature" and "carbon" and "covalent bond" all seem natural in some sense, and an AI might model them too (though at significantly superhuman levels of intelligence it may instead use different concepts/models). (Also, it's not quite as clear whether such natural abstractions actually apply very well to giant transformers; still probable in some sense IMO, but it's perhaps hard to identify them and to interpret what "concepts" actually are in AIs.)
- Many things we value are not natural abstractions, but only natural relative to a human mind design. Emotions like "awe" or "laughter" are quite complex products of evolution, and perhaps minds that have emotions at all occupy only a small region of mind-design space. The AI doesn't have built-in machinery for modelling other humans the way humans model other humans. It might eventually form abstractions for the emotions, but probably not in a way that lets it understand how the emotion feels from the inside.
- There is lots of hidden complexity in what determines human values. Trying to point an AI to human values directly (in a way similar to how humans are pointed to their values) would be incredibly complex. Specifying a CEV process / modelling one or multiple humans, identifying where in the model the values are represented, and pointing the AI to optimize those values is more tractable, but would still require a vastly greater mastery of understanding minds to pull off, and we are not on a path to get there without human augmentation.
- When the AI is smarter than us it will have better models which we don't understand, and the concepts it uses will diverge from the concepts we use. As an analogy, consider 19th-century humans (or people who don't know much about medicine) being able to vaguely classify health symptoms into diseases, vs the AI having a gears-level model of the body and the immune system which explains the observed symptoms.
- I think a large part of what Eliezer meant with Lethalities#33 is that the way thinking works deep in your mind looks very different from the English sentences you can notice going through your mind, which are only shallow shadows of the actual thinking going on; and for giant transformers, the way the actual thinking looks is likely even less understandable than the way it looks in humans.
- Ontology identification (including utility rebinding) is not nearly all of the difficulty of the alignment problem (except possibly insofar as figuring out all the (almost-)ideal frames to model and construct AI cognition is a prerequisite for solving ontology identification). Other difficulties include:
- We won't get a retargetable general-purpose search by default; rather, the AI is by default going to be a mess of lots of patched-together optimization patterns [LW · GW].
- There are lots of things that might cause goal drift [LW · GW]: misaligned mesa-optimizers which try to steer or take control of the AI; Goodhart; the AI might just not be smart enough initially and make mistakes which cause irrevocable value drift; and in general it's hard to train the AI to become smarter / train better optimization algorithms while keeping the goal constant.
- (Corrigibility.)
- While it's nice that John is attacking ontology identification, he doesn't seem nearly as much on track to solve it in time as he seems to think. Specifying a goal in the AI's ontology requires finding the right frames for modelling how an AI imagines possible worldstates, which will likely look very different from how we initially naively think of it (e.g. the worldstates won't be modelled by English-language sentences or anything remotely as interpretable). The way we currently think of what "concepts" are might not naturally bind to anything in how the AI's reasoning actually looks, and we first need to find the right way to model AI cognition and then try to interpret what the AI is imagining. Even if "concept" is a natural abstraction over AI cognition, and we were able to identify concepts (though it's not easy to concretely imagine how that might look for giant transformers), we'd still need to figure out how to combine concepts into worldstates so we can then specify a utility function over those.
comment by Towards_Keeperhood (Simon Skade) · 2024-09-28T13:38:31.658Z · LW(p) · GW(p)
(This is an abridged version of my comment here [LW(p) · GW(p)], which I think belongs on my shortform. I removed some examples which were overly long. See the original comment for those.)
Here are some lessons I learned over the last months from doing alignment research on trying to find the right ontology for modelling (my) cognition:
- make examples: if you have an abstract goal or abstract hypothesis/belief/model/plan, clarify on an example what it predicts.
- e.g. given thought "i might want to see why some thoughts are generated" -> what does that mean more concretely? -> more concrete subcases:
- could mean noticing a common cognitive strategy [LW · GW]
- could mean noticing some suggestive concept similarity
- maybe other stuff like causal inference (-> notice i'm not that clear on what i mean by that -> clarify and try to come up with an example):
- e.g. "i imagine hiking a longer path" -> "i imagine missing the call i have in the evening"
- (yes it's often annoying and not easy, especially in the beginning)
- (if you can't you're still confused.)
- generally be very concrete. also Taboo your words [? · GW] and Replace the Symbol with the Substance [LW · GW].
- I want to highlight the "what is my goal" part
- also ask "why do i want to achieve the goal?"
- (-> minimize goodhart)
- clarify your goal as much as possible.
- (again Taboo your words...)
- clarify your goal on examples
- when your goal is to understand something, how will you be able to apply the understanding to a particular example?
- also ask "why do i want to achieve the goal?"
- try to extract the core subproblems/subgoals.
- e.g. for corrigibility a core subproblem is the shutdown problem (from which further, more precise subproblems could be extracted).
- i guess make sure you think concretely and list subproblems and summarize the core ones and iterate. follow up on confusions where problems still seem sorta mixed up. let your mind find the natural clusters. (not sure if that will be sufficient for you.)
- tie yourself closely to observations.
- drop all assumptions. apply generalized hold off on proposing solutions.
- in particular, try not to make implicit non-well-founded assumptions about what the ontology looks like, e.g. by asking questions like "how can i formalize concepts" or "what are thoughts". just see the observations as directly as possible and try to form a model of the underlying process that generates them.
- first form a model about concrete narrow cases and only later generalize
- e.g. first study precisely what thoughtchains you had on particular combinatorics problems before hypothesizing what kind of general strategies your mind uses.
- special case: (first) plan how to solve specific research subproblems rather than trying to come up with good general methodology for the kinds of problems you are attacking.
- don't overplan and rather try stuff and review how it's going and replan and iterate.
- this is sorta an application of "get concrete" where you get concrete through actually trying the thing rather than imagining how it will look if you attack it.
- often review how you made progress and see how to improve.
- (also generally lots of other lessons from the sequences (and HPMoR): notice confusion, notice mysterious answers, know what an actual reduction looks like, and probably a whole bunch more)
Tbc, those are sorta advanced techniques. Most alignment researchers are working on lines of hope that pretty obviously won't work while thinking they have a decent chance of working, and I wouldn't expect those techniques to be of much use for them.
There is this quite foundational skill of "notice when you're not making progress / when your proposals aren't actually good" which is required for further improvement, and I do not know how to teach it. It's related to being very concrete and to noticing mysterious answers, or noticing when you're too abstract or still confused. It might sorta be what Eliezer calls security mindset.
(Also, another small caveat: I have not yet gotten very clear, great results out of my research, but I do think I am making faster progress (and I'm setting myself a very high standard). I'd guess the lessons can probably be misunderstood and misapplied, but idk.)
comment by Towards_Keeperhood (Simon Skade) · 2022-11-25T12:38:39.492Z · LW(p) · GW(p)
In case some people relatively new to LessWrong aren't aware of it (and because I wish I had found this out earlier): "Rationality: From AI to Zombies" covers nowhere near all of the posts Eliezer published between 2006 and 2010.
Here's how it is:
- "Rationality: From AI to Zombies" probably contains like 60% of the words EY has written in that timeframe and the most important rationality content.
- The original sequences [? · GW] are basically the old version of the collection that is now "Rationality: A-Z", containing a bit more content, in particular a longer quantum physics sequence and sequences on fun theory and metaethics.
- All EY posts from that timeframe [LW · GW] (or here [LW · GW] for all EY posts up to 2020, I guess) can also be found on LessWrong, but not in any collection, I think.
So a sizeable fraction of EY's posts are not in a collection.
I just recently started reading the rest.
I strongly recommend reading:
And generally, a lot of posts on AI (primarily posts in the AI foom debate) are not in the sequences. Some of them were pretty good.