Also, consider a more traditional optimization process, such as a neural network undergoing gradient descent. If, in the process of training, you kept changing the training dataset, shifting the distribution, you would in effect be changing the optimization target.
Each minibatch generates a different gradient estimate, and a poorly randomized ordering of the data could even lead to training in circles.
Changing environments are like changing the training set for evolution. Differential reproductive success (mean squared error) is the fixed cost function, but the gradient that the population (the network, via backpropagation) computes at any generation (training step) depends on the particular set of environmental factors (the training data in the minibatch).
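As a toy illustration of that analogy (my own sketch, not anything from the original discussion): a linear model trained by SGD on a data stream whose underlying target drifts over time. Every minibatch gives a noisy gradient estimate, and because the data-generating process keeps moving, the objective those gradients point toward keeps moving with it.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_grad(w, X, y):
    # Gradient of mean squared error for a linear model y_hat = X @ w.
    return 2.0 * X.T @ (X @ w - y) / len(y)

w = np.zeros(2)   # the "population" / network parameters
lr = 0.05

for step in range(1000):
    # The data-generating process drifts over time, like a slowly changing
    # environment: each minibatch estimates the gradient of a slightly
    # different objective, so the optimization target itself keeps moving.
    drifting_target = np.array([np.sin(step / 100.0), np.cos(step / 100.0)])
    X = rng.normal(size=(32, 2))                  # one minibatch of "environmental factors"
    y = X @ drifting_target + rng.normal(scale=0.1, size=32)
    w -= lr * mse_grad(w, X, y)                   # stochastic gradient step
```

If the drift is slow relative to the learning rate, the parameters track the moving optimum; if it's fast, they chase it in circles, which is the cycling failure mode mentioned above.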
Evolution may not act as an optimizer globally, since selective pressure is different for different populations of organisms in different niches. However, it does act as an optimizer locally.
For a given population in a given environment that happens to be changing slowly enough, the set of all variations in each generation acts as a sort of numerical gradient estimate of the local fitness landscape. This allows the population as a whole to perform stochastic gradient descent. Those with greater fitness for the environment could be said to be lower on the local fitness landscape, so there is an ordering for that population.
In a sufficiently constant environment, evolution very much does act as an optimization process. Sure, the fitness landscape can change, even by organisms undergoing evolution (e.g. the Great Oxygenation Event of yester-eon, or the Anthropogenic Mass Extinction of today), which can lead to cycling. But many organisms do find very stable local minima of the fitness landscape for their species, like the coelacanth, horseshoe crab, cockroach, and many other "living fossils". Humans are certainly nowhere near our global optimum, especially with the rapid changes to the fitness function wrought by civilization, but that doesn't mean that there isn't a gradient that we're following.
I would expect that for model-based RL, the more powerful the AI is at predicting the environment and the impact of its actions on it, the less prone it becomes to Goodharting its reward function. That is, after a certain point, the only way to make the AI more powerful at optimizing its reward function is to make it better at generalizing from its reward signal in the direction that the creators meant for it to generalize.
In such a world, when AIs are placed in complex multiagent environments where they engage in iterated prisoner's dilemmas, the more intelligent ones (those with greater world-modeling capacity) should tend to optimize for making changes to the environment that shift the Nash equilibrium toward cooperate-cooperate, ensuring more sustainable long-term rewards all around. This should happen automatically, without prompting, no matter how simple or complex the reward functions involved, whenever agents surpass a certain level of intelligence in environments that allow for such incentive-engineering.
Disclaimer: I am not a medical doctor nor a nutritionist, just someone who researches nutrition from time to time.
I would be surprised if protein deficiency per se was the actual problem. As I understand it, many vegetables actually have a higher level of protein per calorie than meat (probably due to the higher fat content of the latter, which is more calorie dense), although obviously, there's less protein per unit mass than meat (since vegetables are mostly cellulose and water). The point is, though, that if you were getting enough calories to function from whole, unrefined plant sources, you shouldn't have had a protein deficiency. (Of course, you might have been eating a lot of highly processed "vegetarian" foods, in which case protein deficiency is not entirely out of the question.)
That being said, my guess is that you may be experiencing a nutritional deficiency either in sulfur or in vitamin D (the latter of which is a very common deficiency). Plant-derived proteins tend to have much lower levels of sulfur-containing amino acids (methionine, cysteine) than animal-derived proteins, and sulfur is an important component of cartilage (and of arthritis supplements). Both sulfur and vitamin D have been investigated for their role in musculoskeletal pain and other health issues (although from what I have read, results are more ambiguous for sulfur than for vitamin D with respect to musculoskeletal pain in particular). Eggs are particularly high in both sulfur (sulfur smell = rotten egg smell) and vitamin D, so if you were low on either one of those, it makes sense that eating a lot of eggs would have helped. It would be very interesting to test whether either high-sulfur vegetables (such as onions or broccoli) or vitamin D supplements would have a similar effect on your health.
Due to LayerNorm, it's hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.
If I'm interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I'm still not quite clear on why LayerNorm doesn't take care of this.
To avoid this phenomenon, one idea that springs to mind is to adjust how the residual stream operates. For a neural network module f, the residual stream works by creating a combined output: r(x)=f(x)+x
You seem to suggest that the model essentially amplifies the features within the neural network in order to overcome the large residual stream: r(x)=f(1.045*x)+x
However, what if, instead of adding the inputs directly, they were first rescaled by a compensatory weight: r(x)=f(x)+(1/1.045)x ≈ f(x)+0.957x?
It seems to me that this would disincentivize f from learning the exponentially growing feature scales. Based on your experience, would you expect this to eliminate the exponential growth in the norm across layers? Why or why not?
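For concreteness, here's a minimal PyTorch sketch of that variant (the class name is mine, and the 1.045 factor is just a placeholder taken from the 4.5% figure above):

```python
import torch.nn as nn

class RescaledResidualBlock(nn.Module):
    # Residual block with a down-weighted skip connection: r(x) = f(x) + x / alpha.
    def __init__(self, f: nn.Module, alpha: float = 1.045):
        super().__init__()
        self.f = f
        self.alpha = alpha

    def forward(self, x):
        # Shrink the skip path slightly, so f is not incentivized to learn
        # ever-larger outputs just to avoid being drowned out by the stream.
        return self.f(x) + x / self.alpha
```

The intended effect is that the relative scale of f's contribution stays roughly constant across layers without f having to grow its own output norm.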
If both images have the main object near the middle of the image or taking up most of the space (which is usually the case for single-class photos taken by humans), then yes. Otherwise, summing two images with small, off-center items will just look like a low-contrast, noisy image of two items.
Either way, though, I would expect this to result in class-label ambiguity. However, in some cases of semi-transparent-object-overlay, the overlay may end up mixing features in such a jumbled way that neither of the "true" classes is discernible. This would be a case where the almost-linearity of the network breaks down.
Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I'm just making stuff up here.
For an image-classification network f, if we remove the softmax nonlinearity from the very end, then x would represent the input image in pixel space, and f(x) would represent the class logits. Then f(x1+x2) ≈ f(x1)+f(x2) would represent an image with two objects leading to an ambiguous classification (high log-probability for both classes), and f(c*x) ≈ c*f(x) would represent higher class certainty (softmax temperature = 1/c) when the image has higher contrast (c > 1). I guess that kind of makes sense, but yeah, I think for real neural networks, this will only be linear-ish at best.
I would say we want an ASI to view world-state-optimization from the perspective of a game developer. Not only should it create predictive models of what goals humans wish to achieve (from both stated and revealed preferences), but it should also learn to predict what difficulty level each human wants to experience in pursuit of those goals.
Then the ASI could aim to adjust the world into states where humans can achieve any goal they can think of when they apply a level of effort that would leave them satisfied in the accomplishment.
Humans don't want everything handed to us for free, but we also don't generally enjoy struggling for basic survival (unless we do). There's a reason we pursue things like competitive sports and video games, even as we denounce the sort of warfare and power struggles that built those competitive instincts in the ancestral environment.
A safe world of abundance that still feels like we've fought for our achievements seems to fit what most people would consider "fun". It's what children expect in their family environment growing up, it's what we expect from the games we create, and it's what we should expect from a future where ASI alignment has been solved.
I agree, hence the "if humanity never makes it to the long-term, this is a moot point."
Last I checked, you can get about 10x as much energy from burning a square meter of biosphere as you can get by collecting a square meter of sunlight for a day.
Even if this is true, it's only because that square meter of biosphere has been accumulating solar energy over an extended period of time. Burning biofuel may help accelerate things in the short term, but it will always fall short of long-term sustainability. Of course, if humanity never makes it to the long-term, this is a moot point.
Disassembling us for parts seems likely to be easier than building all your infrastructure in a manner that's robust to whatever superintelligence humanity coughs up second.
It seems to me that it would be even easier for the ASI to just destroy all human technological infrastructure rather than to kill/disassemble all humans. We're not much different biologically from what we were 200,000 years ago, and I don't think 8 billion cavemen could put together a rival superintelligence anytime soon. Of course, most of those 8 billion humans depend on a global supply chain for survival, so this outcome may be just as bad for the majority.
You heard the LLM, alignment is solved!
But seriously, it definitely has a lot of unwarranted confidence in its accomplishments.
I guess the connection to the real world is what will throw off such systems until they are trained on more real-world-like data.
I wouldn't phrase it that it needs to be trained on more data. More like it needs to be retrained within an actual R&D loop. Have it actually write and execute its own code, test its hypotheses, evaluate the results, and iterate. Use RLHF to evaluate its assessments and a debugger to evaluate its code. It doesn't matter whether this involves interacting with the "real world," only that it learns to make its beliefs pay rent.
Anyway, that would help with its capabilities in this area, but it might be just a teensy bit dangerous to teach an LLM to do R&D like this without putting it in an air-gapped virtual sandbox, unless you can figure out how to solve alignment first.
"Activation space gradient descent" sounds a lot like what the predictive coding framework is all about. Basically, you compare the top-down predictions of a generative model against the bottom-up perceptions of an encoder (or against the low-level inputs themselves) to create a prediction error. This error signal is sent back up to modify the activations of the generative model, minimizing future prediction errors.
From what I know of Transformer models, it's hard to tell exactly where this prediction error would be generated. Perhaps during few-shot learning, the model does an internal next-token prediction at every point along its input, comparing what it predicts the next token should be (based on the task it currently thinks it's doing) against what the next token actually is. The resulting prediction error is fed "back" to the predictive model by being passed forward (via self-attention) to the next example in the input text, biasing the way it predicts next tokens in a way that would have given a lower error on the first example.
None of these predictions and errors would be visible unless you fed the input one token at a time and forced the hidden states to match what they were for the full input. A recurrent version of GPT might make that easier.
It would be interesting to see whether you could create a language model that had predictive coding built explicitly into its architecture, where internal predictions, error signals, etc. are all tracked at known locations within the model. I expect that interpretability would become a simpler task.
AI has gotten even faster and associated with that there are people that worry about AI, you know, fairness, bias, social economic displacement. There are also the further out speculative worries about AGI, evil sentient killer robots, but I think that there are real worries about harms, possible real harms today and possibly other harms in the future that people worry about.
It seems that the sort of AI risks most people worry about fall into one of a few categories:
- AI/automation starts taking our jobs, amplifying economic inequalities.
- The spread of misinformation will accelerate with deepfakes, fake news, etc. generated by malign humans using ever more convincing models.
- 🤪 Evil sentient robots will take over the world and kill us all Terminator-style. 😏
It seems that a fourth option is not really prominent in the public consciousness: namely that powerful AI systems could end up destroying everything of value by accident when enough optimization pressure is applied toward any goal, no matter how noble. No robots or weapons are even required to achieve this. This oversight is a real PR problem for the alignment community, but it's unfortunately difficult to explain why this makes sense as a real threat to the average person.
And I think, you know, thinking that somehow we're smart enough to build those systems to be super intelligent and not smart enough to design good objectives so that they behave properly, I think is a very, very strong assumption that is, it's just not, it's very, it's very low probability.
So close.
Yep, ever since Gato, it's been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don't like that idea.
LLMs already have a lot of potential for causing bad outcomes if abused by humans for generating massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots trying to make it evil explicitly.
I would much rather live in a world where the first AGIs weren't built around such opaque models. LLMs may look like they think in English, but there is still a lot of black-box computation going on, with a strange tendency to switch personas partway through a conversation. That doesn't bode well for steerability if such models are given control of an agent.
However, if we are heading for a world of LLM-AGI, maybe our priorities should be on figuring out how to route their models of human values to their own motivational schemas. GPT-4 probably already understands human values to a much deeper extent than we could specify with an explicit utility function. The trick would be getting it to care.
Maybe force the LLM-AGI to evaluate every potential plan it generates on how it would impact human welfare/society, including second-order effects, and to modify its plans to avoid any pitfalls it finds from a (simulated) human perspective. Do this iteratively until it finds no more conflict before it actually implements a plan. Maybe require actual verbal human feedback in the loop before it can act.
It's not a perfect solution, but there's probably not enough time to design a custom aligned AGI from scratch before some malign actor sets a ChaosGPT-AGI loose. A multipolar landscape is probably the best we can hope for in such a scenario.
It seems to me that imitation requires some form of prediction in order to work. First make some prediction of the behavioral trajectory of another agent; then try to minimize the deviation of your own behavior from an equivalent trajectory. In this scheme, prediction constitutes a strict subset of the computational complexity necessary to enable imitation. How would GPT's task flip this around?
And if prediction is what's going on, in the much-more-powerful-than-imitation sense, what sort of training scheme would be necessary to produce pure imitation without also training the more powerful predictor as a prerequisite?
First of all, I strongly agree that intelligence requires (or is exponentially easier to develop as) connectionist systems. However, I think that while big, inscrutable matrices may be unavoidable, there is plenty of room to make models more interpretable at an architectural level.
Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?
I have long thought that Transformer models are actually too general-purpose for their own good. By that I mean that the operations they perform, using all-to-all token comparisons for self-attention, are actually extreme overkill for what an LLM needs to do.
Sure, you can use this architecture for moving tokens around and building implicit parse trees and semantic maps and a bunch of other things, but all these functions are jumbled together in the same operations and are really hard to tease out. Recurrent models with well-partitioned internal states and disentangled token operations could probably do more with less. Sure, you can build a computer in Conway's Game of Life (which is Turing-complete), but using a von Neumann architecture would be much easier to work with.
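To make the "all-to-all" point concrete, here's a toy sketch of the comparison at the heart of self-attention (no learned query/key/value projections, just the raw n-by-n score matrix):

```python
import torch

def self_attention_scores(tokens):
    # tokens: (n, d). Every token is compared against every other token,
    # producing an n x n matrix of attention weights -- the "all-to-all"
    # operation that gets reused for parsing, token-moving, lookup, etc.
    scale = tokens.shape[-1] ** 0.5
    return torch.softmax(tokens @ tokens.T / scale, dim=-1)
```

All of those distinct functions end up funneled through this one generic operation, which is part of what makes them so hard to tease apart.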
Embedded within Transformer circuits, you can find implicit representations of world models, but you could do even better from an interpretability standpoint by making such maps explicit. Give an AI a mental scratchpad that it depends on for reasoning (DALL-E, Stable Diffusion, etc. sort of do this already, except that the mental scratchpad is the output of the model [an image] rather than an internal map of conceptual/planning space), and you can probe that directly to see what the AI is thinking about.
Real brains tend to be highly modular, as Nathan Helm-Burger pointed out. The cortex maps out different spaces (visual, somatosensory, conceptual, etc.). The basal ganglia perform action selection and general information routing. The cerebellum fine-tunes top-down control signals. Various nuclei control global and local neuromodulation. And so on. I would argue that such modular constraints actually made it easier for evolution to explore the space of possible cognitive architectures.
Would it make sense to have a "Newbie Garden" section of the site? The idea would be to give new users a place to feel like they're contributing to the community, along with the understanding that the ideas shared there are not necessarily endorsed by the LessWrong community as a whole. A few thoughts on how it could work:
- New users may be directed toward the Newbie Garden (needs a better name) if they try to make a post or comment, especially if a moderator deems their intended contribution to be low-quality. This could also happen by default for all users with karma below a certain threshold.
- New users are able to create posts, ask questions, and write comments with minimal moderation. Posts here won't show up on the main site front page, but navigation to this area should be made easy on the sidebar.
- Voting should be as restricted here as on the rest of the site to ensure that higher-quality posts and comments continue trickling to the top.
- Teaching the art of rationality to new users should be encouraged. Moderated posts that point out trends and examples of cognitive biases and failures of rationality exhibited in recent newbie contributions, and that advise on how to correct for them in the future, could be pinned to the top of the Newbie Garden (still needs a better name). Moderated comments that serve a similar purpose could also be pinned to the top of comment sections of individual posts. This way, even heavily downvoted content could lead (indirectly) to higher quality contributions in the future.
- Newbie posts and questions with sufficient karma can be queued up for moderator approval to be posted to the main site.
I appreciate the high quality standards that have generally been maintained on LessWrong over the years, and I would like to see this site continue to act as both a beacon and an oasis of rationality.
But I also want people not to feel like they're being excluded from some sort of elitist rationality club. Anyone should feel like they can join in the conversation as long as they're willing to question their assumptions, receive critical feedback, and improve their ability to reason, about both what is true and what is good.
Counterpoint:
If the alignment problem is the most important problem in history, shouldn't alignment-focused endeavors be more willing to hire contributors who can't/won't relocate?
It's not as if remote work were hard to arrange these days; it has never been easier to implement in all of history.
Of course there needs to be some filtering out of candidates to ensure resources are devoted to the most promising individuals. But I really don't think that willingness to move correlates strongly enough with competence at solving alignment to warrant treating it like a dealbreaker.
No, utility functions are not a property of computer programs in general. They are a property of (a certain class of) agents.
A utility function is just a way for an agent to evaluate states, where positive values are good (for states the agent wants to achieve), negative values are bad (for states the agent wants to avoid), and neutral values are neutral (for states the agent doesn't care about one way or the other). This mapping from states to utilities can be anything in principle: a measure of how close to homeostasis the agent's internal state is, a measure of how many smiles exist on human faces, a measure of the number of paperclips in the universe, etc. It all depends on how you program the agent (or how our genes and culture program us).
Utility functions drive decision-making. Behavioral policies and actions that tend to lead to states of high utility will get positively reinforced, such that the agent will learn to do those things more often. And policies/actions that tend to lead to states of low (or negative) utility will get negatively reinforced, such that the agent learns to do them less often. Eventually, the agent learns to steer the world toward states of maximum utility.
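As a toy sketch of the concept (the state representation, the numbers, and the transition function here are all hypothetical placeholders):

```python
def utility(state):
    # A utility function is just a mapping from world states to numbers:
    # positive for states the agent wants, negative for states it wants to avoid.
    return state.get("paperclips", 0) - 10.0 * state.get("humans_harmed", 0)

def choose_action(current_state, actions, predict_next_state):
    # Decision-making driven by the utility function: pick whichever action
    # is predicted to lead to the highest-utility next state.
    return max(actions, key=lambda a: utility(predict_next_state(current_state, a)))
```

Everything that matters is packed into how `utility` is defined, which is exactly why it can be aligned or catastrophically misaligned with what we actually want.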
Depending on how aligned an AI's utility function is with humanity's, this could be good or bad. It turns out that for highly capable agents, this tends to be bad far more often than good (e.g., maximizing smiles or paperclips will lead to a universe devoid of value for humans).
Nondeterminism really has nothing to do with this. Agents that can modify their own code could in principle optimize for their utility functions even more strongly than if they were stuck at a certain level of capability, but a utility function still needs to be specified in some way regardless.
No.
The ONLY way for humans to maintain dominion over superintelligent AI in this scenario is if alignment was solved long before any superintelligent AI existed. And only then if this alignment solution were tailored specifically to produce robustly submissive motivational schemas for AGI. And only then if this solution were provably scalable to an arbitrary degree. And only then if this solution were well-enforced universally.
Even then, though, it's not really dominion. It's more like having gods who treat the universe as their playground but who also feel compelled to make sure their pet ants feel happy and important.
One of the earliest records of a hierarchical organization comes from the Bible (Exodus 18). Basically, Moses starts out completely "in touch with reality," judging all disputes among the Israelites from minor to severe, from dawn until dusk. His father in law, Jethro, notices that he is getting burnt out, so he gives him some advice on dividing up the load:
You will surely wear yourself out, as well as these people who are with you, because the task is too heavy for you. You cannot do it alone, by yourself. Now listen to my voice—I will give you advice.... You should seek out capable men out of all the people—men who fear God, men of truth, who hate bribery. Appoint them to be rulers over thousands, hundreds, fifties and tens. Let them judge the people all the time. Then let every major case be brought to you, but every minor case they can judge for themselves. Make it easier for yourself, as they bear the burden with you.
It seems that in a system like this, all levels of the managerial (judicial) hierarchy stay in touch with reality. The only difference between management levels is that deeper levels require deeper wisdom and greater competence at assessing decisions at the "widget level" (or at least greater willingness to accept responsibility for bad decisions). I wonder if a similar strategy could help mitigate some of the failures you pointed out.
Relatedly, in deep learning, ResNets use linear skip connections to expose otherwise deeply hidden layers to the gradient signal (and to the input features) more directly. It tends to make training more stable and faster to converge while still taking advantage of the computational power of a hierarchical model. Obviously, this won't prevent Goodharting in an RL system, but I would say that it does help keep models more internally cooperative.
- GPT is called a “decoder only” architecture. Would “encoder only” be equally correct? From my reading of the original transformer paper, encoder and decoder blocks are the same except that decoder blocks attend to the final encoder block. Since GPT never attends to any previous block, if anything I feel like the correct term is “encoder only”.
I believe "encoder" refers exclusively to the part of the model that reads in text to generate an internal representation, while "decoder" refers exclusively to the part that takes the representation created by the encoder as input and uses it to predict an output token sequence. Encoder takes in a sequence of raw tokens and transforms them into a sequence of encoded tokens. Decoder takes in a sequence of encoded tokens and transforms them into a new sequence of raw tokens.
It was originally assumed that doing the encoder-decoder conversion could be really important for tasks like translation, but it turns out that just feeding a decoder raw tokens as input and training it on next-token prediction on a large enough corpus gets you a model that can do that anyway.
Well, if you could solve the problem of companies X-washing (persuading consumers to buy from them by only pretending to alleviate their concerns), then you would probably be able to solve deceptive alignment as well.
Since two months is not a very long time to complete a research project, and I don't know what lab resources or datasets you have access to, it's a bit difficult to answer this.
It would be great if you could do something like build a model of human value formation based on the interactions between the hypothalamus, VTA, nucleus accumbens, vmPFC, etc. Like, how does the brain generalize its preferences from its gene-coded heuristic value functions? Can this inform how you might design RL systems that are more robust against reward misspecification?
Again, I doubt you can get beyond a toy model in the two months, but maybe you can think of something you can do related to the above.
Stack Overflow moderators would beg to differ.
But yes, retreading old ground can be very useful. Just from the standpoint of education, actually going through the process of discovery can instill a much deeper understanding of the subject than is possible just from reading or hearing a lecture about it. And if the discovery is a stepping stone to further discoveries, then those who've developed that level of understanding will be at an advantage to push the boundaries of the field.
It seems to me that "inner" versus "outer" alignment has become a popular way of framing things largely because it has the appearance of breaking down a large problem into more manageable sub-problems. In other words, "If research group A can solve outer alignment, and research group B can solve inner alignment, then we can put them together into one big alignment solution!" Unfortunately, as you alluded to, reality does not cleanly divide along this joint. Even knowing all the details of an alignment failure might not be enough to classify it appropriately.
Of course, in general, if a semantic distinction fails to clarify the nature of the territory, then it should be rejected as a part of the map. Arguing over semantics is counterproductive, especially when the ground truth of the situation is already agreed upon.
That being said, I think that the process that came up with the distinction between inner and outer (mis)alignment is extremely useful. Just making an attempt to break down a large problem into smaller pieces gives the community more tools for thinking about it in ways that wouldn't have occurred to them otherwise. The breakdown you gave in this post is an excellent example. The solution to alignment probably won't be just one thing, but even if it is, it's unlikely that we will find it without slicing up the problem in as many ways as we can, sifting through perspectives and subproblems in search of promising leads. It may turn out to be useful for the alignment community to abandon the inner-outer distinction, but we shouldn't abandon the process of making such distinctions.
Here's my take:
Like the reward signal in reinforcement learning, next-token prediction is a simple feedback signal that masks a lot of complexity behind the scenes. Predicting the next token requires the model first of all to estimate what sort of persona should be speaking, what they know, how they speak, what the context is, and what they are trying to communicate. Self-attention with multiple attention heads at every layer in the Transformer allows the LLM to keep track of all these things. It's probably not the best way to do it, but it works.
Human brains, and the cortex in particular, give us a powerful way to map all of this sort of information. We can map out our current mental model and predict a map of our interlocutor's, looking for gaps in each and planning words either to fill in our own gaps (e.g., by asking questions) or to fill in theirs (with what we actually think or with what we want them to think, depending on our goals). I would also say that natural language is actually a sort of programming language, allowing humans to share cognitive programs between minds, programs of behavior or world modeling.
I also asked your question to ChatGPT, and here is what it had to say:
It is difficult to speculate about what might be left of human cognition if we were able to factor out next token prediction, as this is a complex and multifaceted aspect of human thought and communication. There are many other important aspects of human cognition beyond next token prediction, including things like perception, attention, memory, problem-solving, decision-making, emotion, and social cognition.
One aspect of human cognition that may be relevant in this context is the ability to understand and interpret the meaning and intention behind spoken and written communication. This involves not just predicting the next word or phrase, but also being able to understand the context in which it is being used, the relationships between words and phrases, and the overall meaning and purpose of the communication. It also involves the ability to generate and express one's own ideas and thoughts through language.
Another aspect of human cognition that may be relevant is the ability to experience and process sensory information and emotions. This includes things like the ability to see, hear, touch, taste, and smell, as well as the ability to feel and express emotions.
It is worth noting that these are just a few examples of the many complex and interrelated aspects of human cognition, and it is difficult to fully disentangle and isolate any one aspect from the others.
"Let me see what Chatty thinks," (or whatever humanesque name becomes popular).
I assume people will treat it just like talking to a very knowledgeable friend. Just ask a question, get a response, clarify what you meant or ask a followup question, and so on. Conversation in natural language already comes naturally to humans, so probably a lot more people will become a lot more adept at accessing knowledge.
And in future iterations, this "friend" will be able to create art, weave stories, design elucidating infographics, make entertaining music videos, teach academic subjects, try to sell you stuff (hmm), spread conspiracy theories (oops), etc., based on the gist of what it thinks you're looking for (and based on what it knows about you personally from your history of "friendship" with it). It would be nice if we could make it truthful and cooperative in a way that doesn't amplify the echo chamber effect of existing social media and search engines, but unfortunately, I don't see that as being very profitable for those deploying it.
By "code generating being automated," I mean that humans will program using natural human language, without having to think about the particulars of data structures and algorithms (or syntax). A good enough LLM can handle all of that stuff itself, although it might ask the human to verify if the resulting program functions as expected.
Maybe the models will be trained to look for edge cases that technically do what the humans asked for but seem to violate the overall intent of the program. In other words, situations where the program follows the letter of the law (i.e., the program specifications) but not the spirit of the law.
Come to think of it, if you could get a LLM to look for such edge cases robustly, it might be able to help RL systems avoid Goodharting, steering the agent to follow the intuitive intent behind a given utility function.
Well, I very much doubt that the entire programming world will get access to a four-quintillion-parameter code-generating model within five years. However, I do foresee the descendants of OpenAI Codex getting much more powerful and much more widely used within that timeframe. After all, Transformers came out only five years ago, and they've definitely come a long way since.
Human culture changes more slowly than AI technology, though, so I expect businesses to begin adopting such models only with great trepidation at first. Programmers will almost certainly need to stick around for verification and validation of generated code for quite some time. More output will be expected out of programmers, for sure, as the technology is adopted, but that probably won't lead to the elimination of jobs themselves, just as the cotton gin didn't lead to the end of slavery and the rise of automation didn't lead to the rise of leisure time.
Eventually though, yes, code generation will be almost universally automated, at least once everyone is comfortable with automated code verification and validation. However, I wouldn't expect that cultural shift to be complete until at least the early 2030's. That's not to say we aren't in fact running out of time, of course.
The cortex uses traveling waves of activity that help it organize concepts in space and time. In other words, the locally traveling waves provide an inductive bias for treating features that occur close together in space and time as part of the same object or concept. As a result, cortical space ends up mapping out conceptual space, in addition to retinotopic, somatic, or auditory space.
This is kind of like DCT in the sense that oscillations are used as a scaffold for storing or reconstructing information. I think that Neural Radiance Fields (NeRF) use a similar concept, using positional encoding (3D coordinates plus viewing angle, rather than 2D pixel position) to generate images, especially when the positional encoding uses Fourier features. Of course, Transformers also use such sinusoidal positional encodings to help with natural language understanding.
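For concreteness, here's a minimal sketch of that kind of Fourier-feature positional encoding (the function name and the geometric frequency ladder are just illustrative choices):

```python
import numpy as np

def fourier_positional_encoding(positions, num_freqs=8):
    # Expand each scalar coordinate (pixel index, 3D position, token position...)
    # into sines and cosines at multiple frequencies, giving the model an
    # oscillatory scaffold over space, loosely like the traveling waves above.
    positions = np.asarray(positions, dtype=float)[..., None]   # (..., 1)
    freqs = 2.0 ** np.arange(num_freqs)                         # 1, 2, 4, 8, ...
    angles = np.pi * positions * freqs                          # (..., num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

The same trick shows up (with different frequency schedules) in NeRF's input encoding and in the original Transformer's sinusoidal position embeddings.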
All that is to say that I agree with you. Something similar to DCT will probably be very useful for discovering natural abstractions. For one thing, I imagine that these sorts of approaches could help overcome texture bias in DNNs by incorporating more large-scale shape information.
This just might work. For a little while, anyway.
One hurdle for this plan is to incentivize developers to slap on 20 layers of alignment strategies to their generalist AI models. It may be a hard sell when they are trying to maximize power and efficiency to stay competitive.
You'll probably need to convince them that not having such safeguards in place will lead to undesirable behavior (i.e., unprofitable behavior, or behavior leading to bad PR or bad customer reviews) well below the level of apocalyptic scenarios that AI safety advocates normally talk about. Otherwise, they may not care.
That's an interesting way of thinking about it. I'm reminded of how computers have revolutionized thought over the past century.
Most humans have tended to think primarily by intuition. Glorified pattern-matching, with all the cognitive biases that entails.
Computers, in contrast, have tended to be all formalization, no intuition. They directly manipulate abstract symbols at the lowest level, symbols that humans only deal with at the highest levels of conscious thought.
And now AI is bringing things back around to (surprisingly brittle) intuition. It will be interesting, at least, to see how newer AI systems will bring together formalized symbolic reasoning with pattern-matching intuition.
Counterpoint:
Sometimes it's easier to reach a destination when you're not aiming for it. You can't reach the sun by pointing a rocket at it and generating thrust. It's hard to climb a mountain by going up the steepest path. You can't solve a Millennium Prize math problem by staring at it until a solution reveals itself.
Sometimes you need to slingshot around a planet a few times to adjust orbital momentum. Sometimes you'll summit faster by winding around or zigzagging. Sometimes you have to play around with unrelated math problems before you notice something insightful.
And perhaps, working on AI problems that are only tangentially related to alignment could reveal a path of least resistance toward a solution that wouldn't have even been considered otherwise.
A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human.
The problem is, of course, that any possible set of behaviors can be construed as maximizing some utility function. The question is whether doing so actually simplifies the task of reasoning and making predictions about the agent in question, or whether mapping the agent's actual motivational schema to a utility function only adds unwieldy complications.
In the case of humans, I would say it's far more useful to model us as generating and pursuing arbitrary goal states/trajectories over time. These goals are continuously learned through interactions with the environment and its impact on pain and pleasure signals, deviations from homeostatic set points, and aesthetic and social instincts. You might be able to model this as a utility function with a recursive hidden state, but would that be helpful?
The problem is that at the beginning, its plans are generally going to be complete nonsense. It has to have a ton of interaction with (at least a reasonable model of) its environment, both with its reward signal and with its causal structure, before it approaches a sensible output.
There is no utility for the RL agent's operators in having an oracle AI with no practical experience. The power of RL is that a simple feedback signal can teach it everything it needs to know to act rationally in its environment. But if you want it to make rational plans for the real world without actually letting it get direct feedback from the real world, you need to add on vast layers of additional computational complexity to its training manually, which would more or less be taken care of automatically for an RL agent interacting with the real world. The incentives aren't in your favor here.
The RL agent will only know whether its plans are any good if they actually get carried out. The reward signal is something it essentially has to seek out through trial and error. All (most?) RL agents start out not knowing anything about the impact their plans will have, or even anything about the causal structure of the environment. All of that has to be learned through experience.
For agents that play board games like chess or Go, the environment can be fully determined in simulation. So, sure, in those cases you can have them generate plans and then not take their advice on a physical game board. And those plans do tend to be power-seeking for well-trained agents in the sense that they tend to reach states that maximize the number of winnable options that they have while minimizing the winnable options of their opponents.
However, for an AI to generate power seeking plans for the real world, it would need to have access either to a very computationally expensive simulator or to the actual real world. The latter is an easier setup to design but more dangerous to train, above a certain level of capability.
Overall, I’ve updated from “just aim for ambitious value learning” to “empirically figure out what potential medium-term alignment targets (e.g. human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AGI’s internal concept-language”.
I like this. In fact, I would argue that some of those medium-term alignment targets are actually necessary stepping stones toward ambitious value learning.
Human mimicry, for one, could serve as a good behavioral prior for IRL agents. AI that can reverse-engineer the policy function of a human (e.g., by minimizing the error between the world-state-trajectory caused by its own actions and that produced by a human's actions) is probably already most of the way there toward reverse-engineering the value function that drives it (e.g., start by looking for common features among the stable fixed points of the learned policy function). I would argue that the intrinsic drive to mimic other humans is a big part of why humans are so adept at aligning to each other.
Do What I Mean (DWIM) would also require modeling humans in a way that would help greatly in modeling human values. A human that gives an AI instructions is mapping some high-dimensional, internally represented goal state into a linear sequence of symbols (or a 2D diagram or whatever). DWIM would require the AI to generate its own high-dimensional, internally represented goal states, optimizing for goals that give a high likelihood to the instructions it received. If achievable, DWIM could also help transform the local incentives for general AI capabilities research into something with a better Nash equilibrium. Systems that are capable of predicting what humans intended for them to do could prove far more valuable to existing stakeholders in AI research than current DL and RL systems, which tend to be rather brittle and prone to overfitting to the heuristics we give them.
Awesome visualizations. Thanks for doing this.
It occurred to me that LayerNorm seems to be implementing something like lateral inhibition, using extreme values of one neuron to affect the activations of other neurons. In biological brains, lateral inhibition plays a key role in many computations, enabling things like sparse coding and attention. Of course, in those systems, input goes through every neuron's own nonlinear activation function prior to having lateral inhibition applied.
I would be interested in seeing the effect of applying a nonlinearity (such as ReLU, GELU, ELU, etc.) prior to LayerNorm in an artificial neural network. My guess is that it would help prevent neurons with strong negative pre-activations from messing with the output of more positively activated neurons, as happens with pure LayerNorm. Of course, that would limit things to the first orthant for ReLU, although not for GELU or ELU. Not sure how that would affect stretching and folding operations, though.
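A minimal sketch of the variant I'm describing (with GELU as the example nonlinearity; the module name is mine):

```python
import torch.nn as nn

class ActivatedLayerNorm(nn.Module):
    # Apply a nonlinearity before LayerNorm, so strongly negative
    # pre-activations get squashed before they can skew the normalization
    # of the more positively activated neurons.
    def __init__(self, dim: int):
        super().__init__()
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(self.act(x))
```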
By the way, have you looked at how this would affect processing in a CNN, normalizing each pixel of a given layer across all feature channels? I think I've tried using LayerNorm in such a context before, but I don't recall it turning out too well. Maybe I could look into that again sometime.
I think grading in some form will be necessary in the sense that we don't know what value heuristics will be sufficient to ensure alignment in the AI. We will most likely need to add corrections to its reward signals on the fly, even as it learns to extrapolate its own values from those heuristics. In other words, grading.
However, it seems the crucial point is that we need to avoid including grader evaluations as part of the AI's self-evaluation model, for the same reason that we shouldn't give it access to its reward button. In other words, don't build the AI like this:
[planning module] -> [predicted grader output] -> [internal reward signal] -> [reinforce policy function]
Instead, it should look more like this:
[planning module] -> [predicted world state] -> [internal reward signal] -> [reinforce policy function]
The predicted grader output may be part of the AI's predicted world state (if a grader is used), but it shouldn't be the part that triggers reward. The trick, then, would be to identify the part of the AI's world model that corresponds to what we want it to care about and feed only that part into the learned reward signal.
Could part of the problem be that the actor is optimizing against a single grader's evaluations? Shouldn't it somehow take uncertainty into account?
Consider having an ensemble of graders, each learning or having been trained to evaluate plans/actions from different initializations and/or using different input information. Each grader would have a different perspective, but that means that the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of many projections).
Rather than arg-maxing on the output of a single grader, the actor would optimize for Schelling points in plan space, selecting actions that minimize the variance among all graders. Of course, you still want it to maximize the evaluations also, so maybe it should look for actions that lie somewhere in the middle of the Pareto frontier of maximum mean evaluation and minimum variance.
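Here's a rough sketch of one simple scalarization of that idea (a mean-minus-variance score rather than a true Pareto search; the grader interface is hypothetical):

```python
import numpy as np

def ensemble_score(plan, graders, variance_weight=1.0):
    # Reward plans that the whole ensemble rates highly (high mean) while
    # penalizing plans the graders disagree about (high variance).
    evals = np.array([grade(plan) for grade in graders])
    return evals.mean() - variance_weight * evals.var()

def choose_plan(plans, graders):
    return max(plans, key=lambda p: ensemble_score(p, graders))
```

Tuning `variance_weight` moves you along that frontier between maximizing evaluations and maximizing agreement.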
My intuition suggests that the larger and more diverse the ensemble, the better this strategy would perform, assuming the evaluators are all trained properly. However, I suspect a superintelligence could still find a way to exploit this.
Could we solve alignment by just getting an AI to learn human preferences through training it to predict human behavior, using a "current best guess" model of human preferences to make predictions and updating the model until its predictions are accurate, then using this model as a reward signal for the AI? Is there a danger in relying on these sorts of revealed preferences?
On a somewhat related note, someone should answer, "What is this Coherent Extrapolated Volition I've been hearing about from the AI safety community? Are there any holes in that plan?"
Different parts of me get excited about this in different directions.
On the one hand, I see AI alignment as highly solvable. When I scan out among a dozen different subdisciplines in machine learning, generative modeling, natural language processing, cognitive science, computational neuroscience, predictive coding, etc., I feel like I can sense the faint edges of a solution to alignment that is already holographically distributed among collective humanity.
Getting AGI that has the same natural abstractions that biological brains converge on, that uses interpretable computational architectures for explicit reasoning, that continuously improves its internal predictive models of the needs and goals of other agents within its sphere of control and uses these models to motivate its own behavior in a self-correcting loop of corrigibility, that cares about the long-term survival of humanity and the whole biosphere; all of this seems like it is achievable within the next 10-20 years if we could just get all the right people working together on it. And I'm excited at the prospect that we could be part of seeing this vision come to fruition.
On the other hand, I realize that humanity is full of bad faith actors and otherwise good people whose agendas are constrained by perverse local incentives. Right now, deep learning is prone to fall to adversarial examples, completely failing to recognize what it's looking at when the texture changes slightly. Natural language understanding is still brittle, with transformer models probably being a bit too general-purpose for their own good. Reinforcement learning still falls prey to Goodharting, which would almost certainly lead to disaster if scaled up sufficiently. Honestly, I don't want to see an AGI emerge that's based on current paradigms just hacked together into something that seems to work. But I see groups moving in that direction anyway.
Without an alignment-adjacent paradigm shift that offers competitive performance over existing models, the major developers of AI are going to continue down a dangerous path, while no one else has the resources to compete. In this light, seeing the rapid progress of the last decade from AlexNet to GPT-3 and DALL-E 2 creates the sort of foreboding excitement that you talked about here. The train is barreling forward at an accelerating pace, and reasonable voices may not be loud enough over the roar of the engines to get the conductor to switch tracks before we plunge over a cliff.
I'm excited for the possibilities of AGI as I idealize it. I'm dreading the likelihood of a dystopic future with no escape if existing AI paradigms take over the world. The question becomes, how do we switch tracks?
Come to think of it, couldn't this be applied to model corrigibility itself?
Have an AI that's constantly coming up with predictive models of human preferences, generating an ensemble of plans for satisfying human preferences according to each model. Then break those plans into landmarks and look for clusters in goal-space.
Each cluster could then form a candidate basin of attraction of goals for the AI to pursue. The center of each basin would represent a "robust bottleneck" that would be helpful across predictive models; the breadth of each basin would account for the variance in landmark features; and the depth/attractiveness of each basin would be proportional to the number of predictive models that have landmarks in that cluster.
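A rough sketch of that clustering step (k-means as a stand-in; how plans and landmarks are actually represented is left as a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_goal_basins(plans, n_basins=5):
    # Each plan is a list of landmark vectors in goal-space. Pool the
    # landmarks from every plan and cluster them: cluster centers are
    # candidate "robust bottlenecks", and the number of distinct plans
    # contributing to a cluster measures how attractive that basin is.
    landmarks = np.vstack([np.asarray(plan) for plan in plans])
    plan_ids = np.concatenate([np.full(len(plan), i) for i, plan in enumerate(plans)])
    km = KMeans(n_clusters=n_basins, n_init=10).fit(landmarks)
    support = [len(set(plan_ids[km.labels_ == c])) for c in range(n_basins)]
    return km.cluster_centers_, support
```

The spread of each cluster would stand in for the basin's breadth, and `support` for its depth/attractiveness.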
Ideally, the distribution of these basins would update continuously as each model in the ensemble becomes more predictive of human preferences (both stated and revealed) due to what the AGI learns as it interacts with humans in the real world. Plans should always be open to change in light of new information, including those of an AGI, so the landmarks and their clusters would necessarily shift around as well.
Assuming this is the right approach, the questions that remain would be how to structure those models of human preferences, how to measure their predictive performance, how to update the models on new information, how to use those models to generate plans, how to represent landmarks along plan paths in goal-space, how to convert a vector in goal-space into actionable behavior for the AI to pursue, etc., etc., etc. Okay, yeah, there would still be a lot of work left to do.
When you say "optimization target," it seems like you mean a single point in path-space that the planner aims for, where this point consists of several fixed landmarks along the path which don't adjust to changing circumstances. Such an optimization target could still have some wiggle room (i.e., consist of an entire distribution of possible sub-paths) between these landmarks, correct? So some level of uncertainty must be built into the plan regardless of whether you call it a prediction or an optimization target.
It seems to me that what you're advocating for is equivalent to generating an entire ensemble of optimization targets, each based on a different predictive model of how things will go. Then you break those targets up into their constituent landmarks and look for clusters of landmarks in goal-space from across the entire ensemble of paths. Would your "robust bottlenecks" then refer to the densest of these clusters?
Also, just a couple minor errors:
- In your "The first 31 binary strings in lexical order" figure, you're missing a white square at the top of the fourth 3-bit string.
- "diving by " should be "dividing by ". I know spell check would miss that one.
I didn't notice any other errors. Again, great article.
Excellent introduction. Your examples were all very intuitive.
For those who are reading, one way to get an intuition for the difference between binary strings and bits is to look at data compression. To begin with, it's easy to create a code like ASCII, where every character is represented by a binary string of length 8 (usually referred to as 8 "bits" or one byte), allowing up to 2^8 = 256 unique characters. This type of code will allow you to represent a text document in English that's 1024 characters in length with exactly 1 kB of information.
Except that's not quite right. The 1 kB is only how much storage space it takes up in computer memory, not how much information the document actually contains.
In fact, each English character carries something closer to 1.1 bits of actual information content, so an optimal compression algorithm could theoretically get that document down to around 141 bytes (1024 characters × 1.1 bits/character ≈ 1126 bits ≈ 141 bytes). This is because not all characters occur with equal frequencies in all contexts. In fact, which character goes next in a sequence turns out to be quite predictable. Every time a predictive model reaches a point where its confusion about the next token/state is distributed 50-50, it needs to be given one additional bit of information to resolve this confusion. When things are more predictable, it will need fewer bits, and when things are less predictable, it will need more. As you said:
It turns out that for a given probability distribution over states, the encoding that minimizes average entropy uses strings that have one bit for every halving that it takes to get to
What you wrote as log2(1/p(x)) is typically referred to as the "surprise" associated with state x, proportional to how many bits are necessary to resolve the confusion of a predictive model.
One example, getting back to data compression, is Huffman coding. Using just character frequency as the predictive model p, the surprise for each character c will be equal to log2(1/p(c)). Huffman coding approximates this by giving shorter bit strings to more frequent characters and ensuring that no character's bit string is a prefix of any other character's bit string. As you described (without mentioning Huffman coding):
This is how you can use different-sized labels without having an additional "all done" symbol. If the bits known so far match a whole label, then they are not a prefix of any other label. Therefore they could not match any other label, and so you know the bits must refer to the label they already match so far. And using different-sized labels in your "prefix code" lets you reduce your expected entropy in cases where the states are not equally likely.
Using this method, you could compress the English document above down to something like 681 bytes, assuming an average of 5.32 bits per character (1024 × 5.32 ≈ 5448 bits ≈ 681 bytes). This is not quite as much compression as is possible for English, since next-character predictions can be made more certain by looking at the context (kind of like what GPT is trained to do), but it's easier to think about.
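For anyone who wants to play with this, here's a small sketch of Huffman coding driven only by character frequencies (the frequency-only model described above, not an optimal English compressor):

```python
import heapq
from collections import Counter

def huffman_code(text):
    # Build a prefix-free code: more frequent characters get shorter bit strings,
    # and no character's bit string is a prefix of any other's.
    heap = [(count, i, {char: ""}) for i, (char, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)   # two least frequent subtrees
        count2, _, codes2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in codes1.items()}
        merged.update({ch: "1" + code for ch, code in codes2.items()})
        heapq.heappush(heap, (count1 + count2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "an example sentence to compress with a frequency-only model"
codes = huffman_code(text)
compressed_bits = sum(len(codes[ch]) for ch in text)
print(f"{compressed_bits} bits, versus {8 * len(text)} bits in 8-bit ASCII")
```

Running it on a longer English document should give an average code length well below 8 bits per character, roughly tracking each character's surprise under the frequency model.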
General success:
- AGI mostly serves to solve coordination problems among eudaimonic agents or to lower the activation energy necessary for eudaimonic agents to achieve their goals.
- Any newly minted AGIs come prepackaged with modules for detecting agency in other systems and for harmonizing the needs and goals of all other agents within its sphere of control.
- The Gaia Hypothesis has become the Gaia Initiative, with ASI steering the evolution of the biosphere into a self-sustaining superorganism.
- All bodies in the solar system are either in the process of being terraformed or used as raw materials for building orbital habitats, a Matrioshka brain, and other megascale engineering projects, while all details are being recorded so the classic solar system can be revisited any time in simulation.
General failures:
- Everything in the solar system is getting paperclipped.
- ASI is used as a tool of the powerful to amplify and solidify social inequalities, universal automation ensuring that no one any longer has the ability to improve their life situation.
- All natural resources are being exhausted at an accelerating rate, irreparably destroying the biosphere in service of the ASI's goals (or the ambitions of its short-sighted human controllers), without any large-scale coordinated effort to preserve what is being lost forever, either physically or virtually.
- Someone forgot a minus sign, so now the singleton ASI is set on maximizing human suffering rather than minimizing it.
Suppose that you gave it a bunch of labeled data about what counts as "good" and "bad".
If your alignment strategy strongly depends on teaching the AGI ethics via labeled training data, you've already lost.
And if your alignment strategy strongly depends on creating innumerable copies of a UFAI and banking on the anthropic principle to save you, then you've already lost spectacularly.
If you can't point to specific submodules within the AGI and say, "Here is where it uses this particular version of predictive coding to model human needs/values/goals," and, "Here is where it represents its own needs/values/goals," and, "Here is where its internal representation of needs/values/goals drives its planning and behavior," and, "Here is where it routes its model of human values to the internal representation of its own values in a way that will automatically make it more aligned the more it learns about humanity," then you have already lost (but only probably).
Basically, the only sure way to get robust alignment is to make the AGI highly non-alien. Or as you put it:
Those who can deal with devils, don't need to, for they can simply summon angels instead.
Or rather: Those who can create devils and verify that those devils will take particular actually-beneficial actions as part of a complex diabolical compact, can more easily create angels that will take those actually-beneficial actions unconditionally.
The fox knows many things. The hedgehog knows one important thing.
It turns out that it's optimal to be 3 parts fox to sqrt(2) parts hedgehog.