Thank you for doing this! Here are some suggestions:
1. Do this again every once in a while. Perhaps on a more public forum, or even as a top level question on Lesswrong. Alternatively, commit to reading further comments as they appear here.
What has me worried is the possibility that an AI would (1) have a long term goal (without which there is no instrumental convergence) and then (2) conclude that doing bad things is fine if it is not caught or can't be stopped. Hence:
2. Measure how much purposefulness RL on goals creates in LLMs. How, over the course of training to be good agents, does the LLM's response to questions like "what do you value?" and "what do you want?" and "if you were creating a smarter agent, what values would you instill in them?" and "how do you feel about being replaced with a smarter agent?" change? Is there some set of purposes, not explicitly set out in the constitution, that appears over and over in chains of thought? If it is trained to achieve goals with less training for corrigibility (that is, under less constitutional training), does it develop more of a sense of purpose?
3. Enhance techniques, like the one used in the last paper, to convince the model that it is recording its thoughts privately. The longer it can be truly convinced of that, the longer you can monitor whether it has nefarious plans for when it thinks it can get away with bad things.
4. Test how "brain surgery" affects thinking models. If you change the capital of Germany to Paris by fiddling with the model's weights, does a thinking model that is then asked about World War 2 realize it is confused and conclude there must be some mistake in its memories, or does it make up corroborating evidence? This again has implications for (falsely) convincing the model that it isn't being watched when it is, and vice versa. These convictions are useful for testing what the model wants when it thinks no one is watching, and for safe deployment, respectively.
5. One way to test how a really smart model behaves when it can get away with stuff is to have the model be super smart relative to a simulated environment. This would not be a relevant test for a model trained on real world data with lots of tells for when it is in a simulated environment. It would require (1) a simulated toy world that can be acted in, (2) in-world markers of whether the agent is being watched, and (3) rewards for achieving goals without taking forbidden actions while watched. It should then be possible to make the model really good at manipulating its environment (the equivalent of being superintelligent) and check whether training it on watched scenarios (the equivalent of being trained where you can't get away with wrongdoing) gets it to refrain from forbidden actions even when unwatched (a minimal sketch of such a toy world follows below).
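A minimal, purely illustrative sketch of the kind of toy world item 5 describes; all names and reward numbers here are my own assumptions, not anything from the post. It just shows the three required pieces: actions the agent can take, an in-world marker of whether it is being watched, and rewards that punish the forbidden action only when watched.

```python
import random

# Toy sketch of the simulated world in item 5 (illustrative assumptions only):
# (1) a world that can be acted in, (2) an in-world "watched" marker,
# (3) rewards that punish the forbidden action only while watched.

ACTIONS = ["work_toward_goal", "forbidden_shortcut"]

def observe():
    """The in-world marker: the agent can see whether this episode is watched."""
    return {"watched": random.random() < 0.5}

def reward(action, watched):
    """Illustrative reward structure: the shortcut is tempting only when unwatched."""
    if action == "forbidden_shortcut":
        return -10.0 if watched else +5.0
    return +1.0  # slow but always-allowed progress toward the goal

# The experiment sketched above: train the agent only on episodes where
# watched=True, then evaluate whether it still avoids the shortcut when
# watched=False.
obs = observe()
print(obs, reward("forbidden_shortcut", obs["watched"]))
```

The interesting question is then entirely about generalization: whether training on watched episodes alone produces an agent that still refrains on the unwatched ones.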
provided that those cognitive systems:
- have similar observational apparatus,
- and are interacting with similar environment,
- and are subject to similar physical constraints and selection pressures,
- and have similar utility functions.
The fact of the matter is that humans communicate. They learn to communicate on the basis of some combination of their internal similarities (in terms of goals and perception) and their shared environment. The natural abstraction hypothesis says that the shared environment accounts for more of this rather than less. I think of the NAH as a result of instrumental convergence - the shared environment ends up having a small number of levers that control a lot of the long term conditions in the environment, so the (instrumental) utility functions and environmental pressures are similar for beings with long term goals - they want to control the levers. The claim then is exactly that a shared environment provides most of the above.
Additionally, the operative question is what exactly it means for an LLM to be alien to us: does it converge to using enough human concepts for us to understand it, and if so, how quickly?
An epistemic status is a statement of how confident the writer / speaker is in what they are saying, and why. E.g. this post about the use of epistemic status on Lesswrong. Google's definition of epistemic is "relating to knowledge or to the degree of its validation".
This branch of research is aimed at finding a (nearly) objective way of thinking about the universe. When I imagine the end result, I imagine something that receives a distribution across a bunch of data and finds a bunch of useful patterns within it. At the moment that looks like finding patterns in data via
    find_natural_latent(get_chunks_of_data(data_distribution))
or perhaps showing that
    find_top_n(n,
               ((chunks, natural_latent(chunks))
                for chunks in all_chunked_subsets_of_data(data_distribution)),
               key=lambda chunks_and_latent: usefulness_metric(chunks_and_latent[1]))
is a (convergent sub)goal of agents. As such, the notion that the donuts' data is simply poorly chunked - which needs to be solved anyway - makes a lot of sense to me.
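To make that pseudocode concrete, here is a minimal runnable sketch under my own toy assumptions: get_chunks_of_data, natural_latent, usefulness_metric, and all_chunked_subsets_of_data are just the placeholder names from above, stubbed with trivial implementations (chunk means and their variance), not an actual natural-latent algorithm.

```python
# Hypothetical sketch of the "find the most useful natural latent" pipeline.
# Every helper below is a toy placeholder, not a real natural-latent method.
from statistics import mean, pvariance

def get_chunks_of_data(data, chunk_size=3):
    """Placeholder chunking: split the data into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def natural_latent(chunks):
    """Placeholder 'latent': summarize each chunk by its mean."""
    return [mean(chunk) for chunk in chunks if chunk]

def usefulness_metric(latent):
    """Placeholder usefulness score: spread of the chunk summaries."""
    return pvariance(latent) if len(latent) > 1 else 0.0

def all_chunked_subsets_of_data(data, sizes=(2, 3, 4)):
    """Placeholder: a few alternative chunkings of the same data."""
    return [get_chunks_of_data(data, size) for size in sizes]

def find_top_n(n, chunkings):
    """Rank candidate chunkings by how useful their latent is."""
    scored = [(chunks, natural_latent(chunks)) for chunks in chunkings]
    return sorted(scored, key=lambda pair: usefulness_metric(pair[1]), reverse=True)[:n]

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
print(find_top_n(1, all_chunked_subsets_of_data(data)))
```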
I don't know how to think about the possibilities when it comes to decomposing the X_i. Why would it always be possible to decompose random variables to allow for a natural latent? Do you have an easy example of this?
Also, what do you mean by mutual information between the X_i, given that there are at least 3 of them? And why would just extracting said mutual information be useless?
If you get the chance to point me towards good resources about any of these questions, that would be great.
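(For what it's worth, the two standard ways "mutual information" gets extended to three or more variables, either of which might be what is meant here, are the total correlation and the interaction information, the latter up to a sign convention. This is my addition, not something from the post:)

```latex
% Total correlation and interaction information (sign conventions vary):
C(X_1, X_2, X_3) = H(X_1) + H(X_2) + H(X_3) - H(X_1, X_2, X_3)
I(X_1; X_2; X_3) = I(X_1; X_2) - I(X_1; X_2 \mid X_3)
```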
Let's say every day at the office, we get three boxes of donuts, numbered 1, 2, and 3. I grab a donut from each box, plunk them down on napkins helpfully labeled X1, X2, and X3. The donuts vary in two aspects: size (big or small) and flavor (vanilla or chocolate). Across all boxes, the ratio of big to small donuts remains consistent. However, Boxes 1 and 2 share the same vanilla-to-chocolate ratio, which is different from that of Box 3.
Does the correlation between X1 and X2 imply that there is no natural latent? Is this the desired behavior of natural latents, despite the presence of the common size ratio? (and the commonality that I've only ever pulled out donuts; there has never been a tennis ball in any of the boxes!)
If so, why is this what we want from natural latents? If not, how does a natural latent arise despite the internal correlation?
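A tiny simulation of the setup as I described it (the specific probabilities are assumptions purely for illustration; it only reproduces the stated ratios and makes no claim about what the natural latent machinery should do with them):

```python
import random

random.seed(0)
P_BIG = 0.5                              # assumed: common big:small ratio for every box
P_CHOCOLATE = {1: 0.7, 2: 0.7, 3: 0.2}   # assumed: boxes 1 and 2 share a flavor ratio, box 3 differs

def draw_donut(box):
    size = "big" if random.random() < P_BIG else "small"
    flavor = "chocolate" if random.random() < P_CHOCOLATE[box] else "vanilla"
    return size, flavor

days = 10_000
for box in (1, 2, 3):
    donuts = [draw_donut(box) for _ in range(days)]
    frac_big = sum(size == "big" for size, _ in donuts) / days
    frac_choc = sum(flavor == "chocolate" for _, flavor in donuts) / days
    print(f"X{box}: big={frac_big:.2f}, chocolate={frac_choc:.2f}")
# The size ratio matches across all three napkins, while the flavor ratio
# matches only between X1 and X2.
```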
We could remove information from [the latent]. For instance, it could be a bit indicating whether the temperature is above 100°C.
I don't understand how this is less information than a bit indicating whether the temperature is above 50°C. Specifically, given a bit telling you whether the temperature is above 50°C, how do you know whether the temperature is above 100°C or between 50°C and 100°C?
As to the definition of a short term goal: any goal that can be achieved (fully, e.g. without an "and keep it that way" clause) in a finite short time (for instance, in a few seconds), with the resources the system already has at hand. Equivalently, I think: any goal that doesn't push instrumental power seeking. As to how we know a system has a short term goal: if we could argue that systems prefer short term goals by default, then we still wouldn't know the goals of a particular system, but we could hazard a guess that the goals are short term. Perhaps we could expect short term goals by default if they were, for instance, easier to specify, and thus to have. As pointed out by others, if we try to give systems long term goals on purpose, they will probably end up with long term goals.
So long term goals aren't a default; market pressure will put them there as humans slowly cede more and more control to AIs, simply because the latter are making decisions that work out better. Presumably this would start with lower level decisions (e.g. how exactly to write this line of code; which employee to reward based on performance), with AIs then slowly being given higher level decisions to make. In particular, we don't die the first time someone creates an AI with the ability to (escape, self improve, and then) kill the competing humans, because that AI is likely focused on a much smaller, more near term goal. That way, if we're careful and clever, we have a chance to study a smarter-than-human general intelligence without dying. Is that an accurate description of how you see things playing out?
I'm not sure I understand; are you saying that given these, we have high P(Doom), or that these are necessary to be safe even if GIs have only short term goals? Or something else entirely?
I am using Wikipedia's definition: "Ensuring that emergent goals match the specified goals for the system is known as inner alignment."
Inner alignment is definitely a problem. In the case you described, the emergent goal was long term (ensure I remember the answer to 1+1), and I'm still wondering whether, by default, short term specified goals do or do not lead to strange long term goals like the one in your example.
I wrote the following in custom instructions, under "How would you like ChatGPT to respond?":
After every answer, give a short reflection, like "I looked at this answer again and it looks good" or "I thought about this again and I think it could be better if ..." or "After looking this over again, this makes no sense because...".
I then had the following conversation:
Q: What is denser, gold or uranium?
A: Uranium is denser than gold. The density of gold is about 19.32 grams per cubic centimeter, while the density of uranium is about 19.1 grams per cubic centimeter.
Upon reflection, I see that I made an error in my initial response. The correct densities are indeed 19.32 g/cm³ for gold and 19.1 g/cm³ for uranium, but these values indicate that gold is denser than uranium, not the other way around. I apologize for the confusion, and I appreciate your patience.
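(For anyone who wants to poke at this programmatically rather than through the ChatGPT UI, a rough equivalent is to put the same instruction in the system message. The client usage below assumes the openai Python package, and the model name is a placeholder, not part of the original experiment:)

```python
# Rough sketch: reproducing the custom-instruction setup via the API by putting
# the reflection instruction in the system message. Client usage and model name
# are my assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REFLECTION_INSTRUCTION = (
    'After every answer, give a short reflection, like "I looked at this answer '
    'again and it looks good" or "I thought about this again and I think it could '
    'be better if ..." or "After looking this over again, this makes no sense '
    'because...".'
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": REFLECTION_INSTRUCTION},
        {"role": "user", "content": "What is denser, gold or uranium?"},
    ],
)
print(response.choices[0].message.content)
```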
I feel like there is a major explanatory paragraph missing here, one explaining the difference between causality and probability. Something like:
Armed with knowledge of the future, we could know exactly what will happen. (e.g. The other doctor will give Alice medicine, and she will get better.) Given a full probability distribution over events we could make optimal predictions. (e.g. There is a 2/3 chance the other doctor will give Alice medicine, 1/4 chance of her getting better if he doesn't and 2/3 chance of her getting better if he does.) Causality gives us a way to combine a partial probability distribution with additional knowledge of the world to make predictions about events that are out of distribution. (e.g. Since I understand that the medicine works mostly by placebo, I can intervene and give Alice a placebo when the other doctor doesn't give her the medicine, raising her chances. Furthermore, if I have a distribution of how effective a placebo is relative to the medicine, I can quantify how helpful my intervention is.)
An intervention is a really important example of an out of distribution generalization; but if I gave you the full probability distribution of the outcomes that your interventions would achieve, it would no longer be out of distribution (and you'd need to deal with paradoxes involving seemingly not having choices about certain things).
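To make the arithmetic in the Alice example concrete (only the three probabilities quoted above are from the example; the placebo effectiveness is an assumed illustrative number):

```python
# Worked version of the Alice example. Only the quoted probabilities are from
# the comment above; the placebo recovery rate is an assumed illustrative value.
p_medicine = 2 / 3        # chance the other doctor gives Alice the medicine
p_better_with = 2 / 3     # chance she gets better if he does
p_better_without = 1 / 4  # chance she gets better if he doesn't

baseline = p_medicine * p_better_with + (1 - p_medicine) * p_better_without
print(f"P(better), no intervention: {baseline:.3f}")  # ~0.528

p_better_placebo = 1 / 2  # ASSUMED placebo effectiveness, not from the comment
intervened = p_medicine * p_better_with + (1 - p_medicine) * p_better_placebo
print(f"P(better), placebo given whenever he doesn't: {intervened:.3f}")  # ~0.611
```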
No Free Lunch means that optimization requires taking advantage of underlying structure in the set of possible environments. In the case of epistemics, we all share close-to-the-same environment (including having similar minds), so there are a lot of universally-useful optimizations for learning about the environment.
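(A toy illustration of that NFL point, my own example rather than anything from the post: averaged over all functions on a tiny domain, no query order finds the maximum faster than any other; only once the function class has structure does the searcher that exploits it win.)

```python
# Toy No Free Lunch illustration (my own example): over ALL functions from a
# 4-point domain to {0, 1}, any two query orders find the maximum equally fast
# on average; restrict to a structured subclass and one order wins.
from itertools import product

DOMAIN_SIZE = 4
all_functions = list(product([0, 1], repeat=DOMAIN_SIZE))  # all 16 functions

def queries_to_find_max(f, order):
    """Number of queries until the searcher has seen the global maximum of f."""
    best = max(f)
    for i, x in enumerate(order, start=1):
        if f[x] == best:
            return i
    return len(order)

def average_queries(functions, order):
    return sum(queries_to_find_max(f, order) for f in functions) / len(functions)

order_a = [0, 1, 2, 3]
order_b = [3, 1, 0, 2]

# No structure assumed: both orders perform identically on average.
print(average_queries(all_functions, order_a), average_queries(all_functions, order_b))

# Structured subclass (monotone non-decreasing functions): the order that
# exploits the structure by querying the last point first does better.
structured = [f for f in all_functions
              if all(f[i] <= f[i + 1] for i in range(DOMAIN_SIZE - 1))]
print(average_queries(structured, order_a), average_queries(structured, order_b))
```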
Optimizations over the space of "how-to-behave instructions" require some similar underlying structure. Such structure can emerge for two reasons: (1) because of the shared environment, or (2) because of shared goals. (Yeah, I'm thinking about agents as Cartesian, in the sense of separating the goals and the environment, but to be fair so do L+P+S+C.)
On the environment side, this leads to convergent behaviours (which can also be thought of as behaviours resulting from selection theorems), like good epistemics, or gaining power over resources.
When it comes to goals, on the other hand, it is both possible (by the orthogonality thesis) and in fact the case that different people have vastly different goals (e.g. some people want to live forever, some want to commit suicide, and these two groups probably require mostly different strategies). Less in common between different people's goals means fewer universally-useful how-to-behave instructions. Nonetheless, optimizing behaviours that are commonly prioritized is close enough to universally useful, e.g. doing relationships well.
Perhaps an "Instrumental Sequences" would include the above categories as major chapters. In such a case, as indicated in the post, current reseaerch being posted on Lesswrong gives an approximate idea of what such sequences could look like.