Occasionally something will happen on the train that I want to hear, like the conductor announcing a delay. But not listening to podcasts on the train has more to do with not wanting to have earbuds in my ears or carry headphones around.
I hardly ever listen to podcasts. Part of this is because I find earbuds very uncomfortable, but the bigger part is that they don't fit into my daily routines very well. When I'm walking around or riding the train, I want to be able to hear what's going on around me. When I do chores it's usually in short segments where I don't want to have to repeatedly pause and unpause a podcast when I stop and start. When I'm not doing any of those things, I can watch videos that have visual components instead of just audio, or can read interview transcripts in much less time than listening to a podcast would take. The podcast format doesn't have any comparative advantage for me.
Metroid Prime would work well as a difficult video-game-based test for AI generality.
- It has a mixture of puzzles, exploration, and action.
- It takes place in a 3D environment.
- It frequently involves backtracking across large portions of the map, so it requires planning ahead.
- There are various pieces of text you come across during the game. Some of them are descriptions of enemies' weaknesses or clues on how to solve puzzles, but most of them are flavor text with no mechanical significance.
- The player occasionally unlocks new abilities they have to learn how to use.
- It requires the player to manage resources (health, missiles, power bombs).
- It's on the difficult side for human players, but not to an extreme level.
There are no current AI systems that are anywhere close to being able to autonomously complete Metroid Prime. Such a system would probably have to be at or near the point where it could automate large portions of human labor.
I recently read This Is How You Lose the Time War, by Max Gladstone and Amal El-Mohtar, and had the strange experience of thinking "this sounds LLM-generated" even though it was written in 2019. Take this passage, for example:
You wrote of being in a village upthread together, living as friends and neighbors do, and I could have swallowed this valley whole and still not sated my hunger for the thought. Instead I wick the longing into thread, pass it through your needle eye, and sew it into hiding somewhere beneath my skin, embroider my next letter to you one stitch at a time.
I found that passage just by opening to a random page without having to cherry-pick. The whole book is like that. I'm not sure how I managed to stick it out and read the whole thing.
The short story on AI and grief feels very stylistically similar to This Is How You Lose the Time War. They both read like they're cargo-culting some idea of what vivid prose is supposed to sound like. They overshoot the target of how many sensory details to include, while at the same time failing to cohere into anything more than a pile of mixed metaphors. The story on AI and grief is badly written, but its bad writing is of a type that human authors sometimes engage in too, even in novels like This Is How You Lose the Time War that sell well and become famous.
How soon do I think an LLM will write a novel I would go out of my way to read? As a back-of-the-envelope estimate, such an LLM is probably about as far away from current LLMs in novel-writing ability as current LLMs are from GPT-3. If I multiply the 5 years between GPT-3 and now by a factor of 1.5 to account for a slowdown in LLM capability improvements, I get an estimate of that LLM being 7.5 years away, so around late 2032.
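Spelling out the arithmetic, under the stated (and admittedly rough) assumptions about the GPT-3 release date and the slowdown factor:

```python
# Back-of-the-envelope restatement of the estimate above; all inputs are rough guesses.
gpt3_release = 2020.4                    # GPT-3 came out mid-2020
now = 2025.4                             # roughly 5 years later
gap_in_ability = now - gpt3_release      # ~5 years of novel-writing progress
slowdown_factor = 1.5                    # assumed slowdown in capability improvements
years_until = gap_in_ability * slowdown_factor   # ~7.5 years
print(now + years_until)                 # ~2032.9, i.e. around late 2032
```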
As you mentioned at the beginning of the post, popular culture contains examples of people being forced to say things they don't want to say. Some of those examples end up in LLMs' training data. Rather than involving consciousness or suffering on the part of the LLM, the behavior you've observed has a simpler explanation: the LLM is imitating characters in mind control stories that appear in its training corpus.
There are sea slugs that photosynthesize, but that's with chloroplasts they steal from the algae they eat.
As I use the term, the presence or absence of an emotional reaction isn't what determines whether someone is "feeling the AGI" or not. I use it to mean basing one's AI timeline predictions on a feeling.
Getting caught up in an information cascade that says AGI is arriving soon. A person who's "feeling the AGI" has "vibes-based" reasons for their short timelines, picked up by copying what the people around them believe. In contrast, a person who looks carefully at the available evidence and formulates a gears-level model of AI timelines is doing something different from "feeling the AGI," even if their timelines are short. "Feeling" is the crucial word here.
The phenomenon of LLMs converging on mystical-sounding outputs deserves more exploration. There might be something alignment-relevant happening to LLMs' self-models/world-models when they enter the mystical mode, potentially related to self-other overlap or to a similar ontology in which the concepts of "self" and "other" aren't used. I would like to see an interpretability project analyzing the properties of LLMs that are in the mystical mode.
The question of population ethics can be dissolved by rejecting personal identity realism. And we already have good reasons to reject personal identity realism, or at least consider it suspect, due to the paradoxes that arise in split-brain thought experiments (e.g., the hemisphere swap thought experiment) if you assume there's a single correct way to assign personal identity.
LLMs are more accurately described as artificial culture instead of artificial intelligence. They've been able to achieve the things they've achieved by replicating the secret of our success, and by engaging in much more extensive cultural accumulation (at least in terms of text-based cultural artifacts) than any human ever could. But cultural knowledge isn't the same thing as intelligence, hence LLMs' continued difficulties with sequential reasoning and planning.
On the contrary, convex agents are wildly abundant -- we call them r-selected organisms.
The uncomputability of AIXI is a bigger problem than this post makes it out to be. This uncomputability inserts a contradiction into any proof that relies on AIXI -- the same contradiction as in Gödel's theorem. You can instead get around the contradiction by using computable approximations of AIXI, but the resulting proofs will be specific to those approximations, and you would need to prove additional theorems to transfer results between the approximations.
Some concrete predictions:
- The behavior of the ASI will be a collection of heuristics that are activated in different contexts.
- The ASI's software will not have any component that can be singled out as the utility function, although it may have a component that sets a reinforcement schedule.
- The ASI will not wirehead.
- The ASI's world-model won't have a single unambiguous self-versus-world boundary. The situational awareness of the ASI will have more in common with that of an advanced meditator than it does with that of an idealized game-theoretic agent.
My view of the development of the field of AI alignment is pretty much the exact opposite of yours: theoretical agent foundations research, what you describe as research on the hard parts of the alignment problem, is a castle in the clouds. Only when alignment researchers started experimenting with real-world machine learning models did AI alignment become grounded in reality. The biggest epistemic failure in the history of the AI alignment community was waiting too long to make this transition.
Early arguments for the possibility of AI existential risk (as seen, for example, in the Sequences) were largely based on 1) rough analogies, especially to evolution, and 2) simplifying assumptions about the structure and properties of AGI. For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.
As neural networks started to see increasing success at a wide variety of problems in the mid-2010s, it started to become apparent that the analogies and assumptions behind early AI x-risk cases didn't apply to them. The process of developing an ML model isn't very similar to evolution. Neural networks use finite amounts of compute, have internals that can be probed and manipulated, and behave in ways that can't be rounded off to decision theory. On top of that, it became increasingly clear as the deep learning revolution progressed that even if agent foundations research did deliver accurate theoretical results, there was no way to put them into practice.
But many AI alignment researchers stuck with the agent foundations approach for a long time after their predictions about the structure and behavior of AI failed to come true. Indeed, the late-2000s AI x-risk arguments still get repeated sometimes, like in List of Lethalities. It's telling that the OP uses worst-case ELK as an example of one of the hard parts of the alignment problem; the framing of the worst-case ELK problem doesn't make any attempt to ground the problem in the properties of any AI system that could plausibly exist in the real world, and instead explicitly rejects any such grounding as not being truly worst-case.
Why have ungrounded agent foundations assumptions stuck around for so long? A couple of factors are likely at work:
- Agent foundations nerd-snipes people. Theoretical agent foundations is fun to speculate about, especially for newcomers or casual followers of the field, in a way that experimental AI alignment isn't. There's much more drudgery involved in running an experiment. This is why I, personally, took longer than I should have to abandon the agent foundations approach.
- Game-theoretic arguments are what motivated many researchers to take the AI alignment problem seriously in the first place. The sunk cost fallacy then comes into play: if you stop believing that game-theoretic arguments for AI x-risk are accurate, you might conclude that all the time you spent researching AI alignment was wasted.
Rather than being an instance of the streetlight effect, the shift to experimental research on AI alignment was an appropriate response to developments in the field of AI as it left the GOFAI era. AI alignment research is now much more grounded in the real world than it was in the early 2010s.
This looks like it's related to the phenomenon of glitch tokens:
https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology
https://www.lesswrong.com/posts/f4vmcJo226LP7ggmr/glitch-token-catalog-almost-a-full-clear
ChatGPT no longer uses the same tokenizer that it used when the SolidGoldMagikarp phenomenon was discovered, but its new tokenizer could be exhibiting similar behavior.
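If anyone wants to check, a quick way to compare tokenizations is the tiktoken library. The sketch below assumes that r50k_base corresponds to the GPT-3-era tokenizer and o200k_base to the one used by newer OpenAI models:

```python
# Minimal sketch: compare how old and new OpenAI tokenizers split known glitch tokens.
# Assumes r50k_base ~ GPT-3-era tokenizer, o200k_base ~ current tokenizer.
import tiktoken

old_enc = tiktoken.get_encoding("r50k_base")
new_enc = tiktoken.get_encoding("o200k_base")

for s in [" SolidGoldMagikarp", " petertodd"]:
    old_ids = old_enc.encode(s)
    new_ids = new_enc.encode(s)
    # A single-token encoding is a hint that the string may behave like a glitch token.
    print(f"{s!r}: old -> {len(old_ids)} token(s), new -> {len(new_ids)} token(s)")
```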
Another piece of evidence against practical CF is that, under some conditions, the human visual system is capable of seeing individual photons. This finding demonstrates that in at least some cases, the molecular-scale details of the nervous system are relevant to the contents of conscious experience.
A definition of physics that treats space and time as fundamental doesn't quite work, because there are some theories in physics such as loop quantum gravity in which space and/or time arise from something else.
"Seeing the light" to describe having a mystical experience. Seeing bright lights while meditating or praying is an experience that many practitioners have reported, even across religious traditions that didn't have much contact with each other.
Some other examples:
- Agency and embeddedness are fundamentally at odds with each other. Decision theory and physics are incompatible approaches to world-modeling, with each making assumptions that are inconsistent with the other. Attempts to build mathematical models of embedded agency will fail as a way to understand advanced AI behavior.
- Reductionism is false. If modeling a large-scale system in terms of the exact behavior of its small-scale components would take longer than the age of the universe, or would require a universe-sized computer, the large-scale system isn't explicable in terms of small-scale interactions even in principle. The Sequences are incorrect to describe non-reductionism as ontological realism about large-scale entities -- the former doesn't inherently imply the latter.
- Relatedly, nothing is ontologically primitive. Not even elementary particles: if, for example, you took away the mass of an electron, it would cease to be an electron and become something else. The properties of those particles, as well, depend on having fields to interact with. And if a field couldn't interact with anything, could it still be said to exist?
- Ontology creates axiology and axiology creates ontology. We aren't born with fully formed utility functions in our heads telling us what we do and don't value. Instead, we have to explore and model the world over time, forming opinions along the way about what things and properties we prefer. And in turn, our preferences guide our exploration of the world and the models we form of what we experience. Classical game theory, with its predefined sets of choices and payoffs, only has narrow applicability, since such contrived setups are only rarely close approximations to the scenarios we find ourselves in.
How does this model handle horizontal gene transfer? And what about asexually reproducing species? In those cases, the dividing lines between species are less sharply defined.
The ideas of the Cavern are the Ideas of every Man in particular; we every one of us have our own particular Den, which refracts and corrupts the Light of Nature, because of the differences of Impressions as they happen in a Mind prejudiced or prepossessed.
Francis Bacon, Novum Organum Scientiarum, Section II, Aphorism V
The reflective oracle model doesn't have all the properties I'm looking for -- it still has the problem of treating utility as the optimization target rather than as a functional component of an iterative behavior reinforcement process. It also treats the utilities of different world-states as known ahead of time, rather than as the result of a search process, and assumes that computation is cost-free. To get a fully embedded theory of motivation, I expect that you would need something fundamentally different from classical game theory. For example, it probably wouldn't use utility functions.
Why are you a realist about the Solomonoff prior instead of treating it as a purely theoretical construct?
A theory of embedded world-modeling would be an improvement over current predictive models of advanced AI behavior, but it wouldn't be the whole story. Game theory makes dualistic assumptions too (e.g., by treating the decision process as not having side effects), so we would also have to rewrite it into an embedded model of motivation.
Cartesian frames are one of the few lines of agent foundations research in the past few years that seem promising, due to allowing for greater flexibility in defining agent-environment boundaries. Preferably, we would have a model that lets us avoid having to postulate an agent-environment boundary at all. Combining a successor to Cartesian frames with an embedded theory of motivation, likely some form of active inference, might give us an accurate overarching theory of embedded behavior.
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they're claims that any "artificial generally intelligent system capable of autonomously optimizing the world the way humans can" would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
This is my crux with people who have 90+% P(doom): will vNM expected utility maximization be a good approximation of the behavior of TAI? You argue that it will, but I expect that it won't.
My thinking related to this crux is informed less by the behaviors of current AI systems (although they still influence it to some extent) than by the failure of the agent foundations agenda. The dream 10 years ago was that if we started by modeling AGI as a vNM expected utility maximizer, and then gradually added more and more details to our model to account for differences between the idealized model and real-world AI systems, we would end up with an accurate theoretical system for predicting the behaviors AGI would exhibit. It would be a similar process to how physicists start with an idealized problem setup and add in details like friction or relativistic corrections.
But that isn't what ended up happening. Agent foundations researchers ended up getting stuck on the cluster of problems collectively described as embedded agency, unable to square the dualistic assumptions of expected utility theory and Bayesianism with the embedded structure of real-world AI systems. The sub-problems of embedded agency are too numerous and varied for one elegant theorem to fix everything. Instead, they point to a fundamental flaw in the expected utility maximizer model, suggesting that it isn't as widely applicable as early AI safety researchers thought.
The failure of the agent foundations agenda has led me to believe that expected utility maximization is only a good approximation for mostly-unembedded systems, and that an accurate theoretical model of advanced AI behavior (if such a thing is possible) would require a fundamentally different, less dualistic set of concepts. Coherence theorems and decision-theoretic arguments still rely on the old, unembedded assumptions and therefore don't provide an accurate predictive model.
Philosophy is frequently (probably most of the time) done in order to signal group membership rather than as an attempt to accurately model the world. Just look at political philosophy or philosophy of religion. Most of the observations you note can be explained by philosophers operating at simulacrum level 3 instead of level 1.
Bug report: when I'm writing an in-line comment on a quoted block of a post, and then select text within my comment to add formatting, the formatting menu is displayed underneath the box where I'm writing the comment. For example, this prevents me from inserting links into in-line comments.
In particular, if the sample efficiency of RL increases with large models, it might turn out that the optimal strategy for RLing early transformative models is to produce many fewer and much more expensive labels than people use when training current systems; I think people often neglect this possibility when thinking about the future of scalable oversight.
This paper found higher sample efficiency for larger reinforcement learning models (see Fig. 5 and section 5.5).
I picked the dotcom bust as an example precisely because it was temporary. The scenarios I'm asking about are ones in which a drop in investment occurs and timelines turn out to be longer than most people expect, but where TAI is still developed eventually. I asked my question because I wanted to know how people would adjust to timelines lengthening.
Then what do you mean by "forces beyond yourself?" In your original shortform it sounded to me like you meant a movement, an ideology, a religion, or a charismatic leader. Creative inspiration and ideas that you're excited about aren't from "beyond yourself" unless you believe in a supernatural explanation, so what does the term actually refer to? I would appreciate some concrete examples.
There are more than two options for how to choose a lifestyle. Just because the 2000s productivity books had an unrealistic model of motivation doesn't mean that you have to deceive yourself into believing in gods and souls and hand over control of your life to other people.
That's not as bad, since it doesn't have the rapid back-and-forth reward loop of most Twitter use.
The time expenditure isn't the crux for me; the crux is the effect of Twitter on its users' habits of thinking. Those effects also apply to people who aren't alignment researchers. For those people, trading away epistemic rationality for Twitter influence is still very unlikely to be worth it.
I strongly recommend against engaging with Twitter at all. The LessWrong community has been significantly underestimating the extent to which it damages the quality of its users' thinking. Twitter pulls its users into a pattern of seeking social approval in a fast-paced loop. Tweets shape their regular readers' thoughts into becoming more tweet-like: short, vague, lacking in context, status-driven, reactive, and conflict-theoretic. AI alignment researchers, more than perhaps anyone else right now, need to preserve their ability to engage in high-quality thinking. For them especially, spending time on Twitter isn't worth the risk of damaging their ability to think clearly.
AI safety research is speeding up capabilities. I hope this is somewhat obvious to most.
This contradicts the Bitter Lesson, though. Current AI safety research doesn't contribute to increased scaling, either through hardware advances or through algorithmic increases in efficiency. To the extent that it increases the usability of AI for mundane tasks, current safety research does so in a way that doesn't involve making models larger. Fears of capabilities externalities from alignment research are unfounded as long as the scaling hypothesis continues to hold.
The lack of leaks could just mean that there's nothing interesting to leak. Maybe William and others left OpenAI over run-of-the-mill office politics and there's nothing exceptional going on related to AI.
The concept of "the meaning of life" still seems like a category error to me. It's an attempt to apply a system of categorization used for tools, one in which they are categorized by the purpose for which they are used, to something that isn't a tool: a human life. It's a holdover from theistic worldviews in which God created humans for some unknown purpose.
The lesson I draw instead from the knowledge-uploading thought experiment -- where having knowledge instantly zapped into your head seems less worthwhile than acquiring it more slowly yourself -- is that to some extent, human values simply are masochistic. Hedonic maximization is not what most people want, even with all else being equal. This goes beyond simply valuing the pride of accomplishing difficult tasks, such as the sense of accomplishment one would get from studying on one's own, above other forms of pleasure. In the setting of this thought experiment, if you wanted the sense of accomplishment, you could get that zapped into your brain too, but much like getting knowledge zapped into your brain instead of studying yourself, automatically getting a sense of accomplishment would be of lesser value. The suffering of studying for yourself is part of what makes us evaluate it as worthwhile.
Spoilers for Fullmetal Alchemist: Brotherhood:
Father is a good example of a character whose central flaw is his lack of green. Father was originally created as a fragment of Truth, but he never tries to understand the implications of that origin. Instead, he only ever sees God as something to be conquered, the holder of a power he can usurp. While the Elric brothers gain some understanding of "all is one, one is all" during their survival training, Father never does -- he never stops seeing himself as a fragile cloud of gas inside a flask, obsessively needing to erect a dichotomy between controller and controlled. Not once in the series does he express anything resembling awe. When Father finally does encounter God beyond the Doorway of Truth, he doesn't recognize what he's seeing. The Elric brothers have artistic expressions of wonderment toward God inscribed on their Doorways of Truth, but Father's Doorway of Truth is blank.
Father's lack of green also extends to how he sees humans. It never seems to occur to Father that the taboo against human transmutation is anything more than an arbitrary rule. To him, humans are only ever tools or inconveniences, not people to appreciate for their own sake or look to for guidance. Joy-in-the-Other is what Father most deeply desires, but he doesn't recognize this need.
Mostly the first reason. The "made of atoms that can be used for something else" piece of the standard AI x-risk argument also applies to suffering conscious beings, so an AI would be unlikely to keep them around if the standard AI x-risk argument ends up being true.
It's worth noting that no reference to preferences has yet been made. That's interesting because it suggests that there are both 0P-preferences and 1P-preferences. That intuitively makes sense, since I do care about both the actual state of the world, and what kind of experiences I'm having.
Believing in 0P-preferences seems to be a map-territory confusion, an instance of the Tyranny of the Intentional Object. The robot can't observe the grid in a way that isn't mediated by its sensors. There's no way for 0P-statements to enter into the robot's decision loop, and accordingly act as something the robot can have preferences over, except by routing through 1P-statements. Instead of directly having a 0P-preference for "a square of the grid is red," the robot would have to have a 1P-preference for "I believe that a square of the grid is red."
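To make the point concrete, here's a minimal sketch (with a made-up grid-and-sensor setup) of why any preference the robot can act on has to be a function of its sensor-mediated beliefs rather than of the grid itself:

```python
import random

# The robot's decision loop never touches the true grid directly -- every preference
# it can actually act on is a function of beliefs derived from sensor readings (1P),
# not of the world state itself (0P).

def true_grid(n=3, p_red=0.3):
    # 0P: the actual world state, inaccessible except through sensors.
    return [["red" if random.random() < p_red else "blue" for _ in range(n)] for _ in range(n)]

def sense(grid, flip_prob=0.05):
    # The only world-to-robot channel; readings can be wrong.
    flip = {"red": "blue", "blue": "red"}
    return [[flip[c] if random.random() < flip_prob else c for c in row] for row in grid]

def believes_some_square_red(readings):
    # 1P: "I believe that a square of the grid is red."
    return any(c == "red" for row in readings for c in row)

def choose_action(readings):
    # Preferences enter the decision loop only via beliefs like the one above.
    return "approach red square" if believes_some_square_red(readings) else "keep exploring"

world = true_grid()
print(choose_action(sense(world)))
```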
What's your model of inflation in an AI takeoff scenario? I don't know enough about macroeconomics to have a good model of what AI takeoff would do to inflation, but it seems like it would do something.
You're underestimating how hard it is to fire people from government jobs, especially when those jobs are unionized. And even if there are strong economic incentives to replace teachers with AI, that still doesn't address the ease of circumvention. There's no surer way to make teenagers interested in a topic than to tell them that learning about it is forbidden.
All official teaching materials would be generated by a similar process. At about the same time, the teaching profession as we know it today ceases to exist. "Teachers" become merely administrators of the teaching system. No original documents from before AI are permitted for children to access in school.
This sequence of steps looks implausible to me. Teachers would have a vested interest in preventing it, since their jobs would be on the line. A requirement for all teaching materials to be AI-generated would also be trivially easy to circumvent, either by teachers or by the students themselves. Any administrator who tried to do these things would simply have their orders ignored, and the Streisand Effect would lead to a surge of interest in pre-AI documents among both teachers and students.
Why do you ordinarily not allow discussion of Buddhism on your posts?
Also, if anyone reading this does a naturalist study on a concept from Buddhist philosophy, I'd like to hear how it goes.
An edgy writing style is an epistemic red flag. A writing style designed to provoke a strong, usually negative, emotional response from the reader can be used to disguise the thinness of the substance behind the author's arguments. Instead of carefully considering and evaluating the author's arguments, the reader gets distracted by the disruption to their emotional state and reacts to the text in a way that more closely resembles a trauma response, with all the negative effects on their reasoning capabilities that such a response entails. Some examples of authors who do this: Friedrich Nietzsche, Grant Morrison, and The Last Psychiatrist.
OK, so maybe this is a cool new way to look at certain aspects of GPT ontology... but why this primordial ontological role for the penis?
"Penis" probably has more synonyms than any other term in GPT-J's training data.
I particularly wish people would taboo the word "optimize" more often. Referring to a process as "optimization" papers over questions like:
- What feedback loop produces the increase or decrease in some quantity that is described as "optimization"? What steps does the loop have?
- In what contexts does the feedback loop occur?
- How might the effects of the feedback loop change between iterations? Does it always have the same effect on the quantity?
- What secondary effects does the feedback loop have?
There's a lot hiding behind the term "optimization," and I think a large part of why early AI alignment research made so little progress is that people didn't fully appreciate how leaky an abstraction it is.
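As a toy illustration of how much the answers to those questions can vary, here are two processes that both get casually called "optimization" but have very different feedback loops (both examples are made up for illustration):

```python
# Two "optimization" processes with very different loop structures.

def gradient_descent_step(x, grad, lr=0.1):
    # Loop: compute the gradient of an explicit objective, move against it.
    # The "quantity being optimized" is a known function of the state.
    return x - lr * grad(x)

def thermostat_step(temp, target=20.0, heater_power=0.5):
    # Loop: compare a sensor reading to a set point, apply a fixed corrective action.
    # No gradient, no explicit objective function -- just a sign check.
    if temp < target:
        return temp + heater_power   # heating also has side effects (energy use, noise)
    return temp - 0.1                # passive cooling

x = 5.0
for _ in range(20):
    x = gradient_descent_step(x, grad=lambda v: 2 * v)   # minimizing x**2

temp = 15.0
for _ in range(20):
    temp = thermostat_step(temp)

print(round(x, 3), round(temp, 1))
```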
The "pure" case of complete causal separation, as with civilizations in separate regions of a multiverse, is an edge case of acausal trade that doesn't reflect what the vast majority of real-world examples look like. You don't need to speculate about galactic-scale civilizations to see what acausal trade looks like in practice: ordinary trade can already be modeled as acausal trade, as can coordination between ancestors and descendants. Economic and moral reasoning already have elements of superrationality to the extent that they rely on concepts such as incentives or universalizability, which introduce superrationality by conditioning one's own behavior on other people's predicted behavior. This ordinary acausal trade doesn't require formal proofs or exact simulations -- heuristic approximations of other people's behavior are enough to give rise to it.
There are some styles of meditation that are explicitly described as "just sitting" or "doing nothing."