From just reading your excerpt (and not the whole paper), it is hard to determine how much alignment washing is going on here.
- what is aligned chain-of-thought? What would unaligned chain-of-thought look like?
- what exactly does alignment mean in the context of solving math problems?
But maybe these worries can be answered from reading the full paper...
I think overall this is a well-written blogpost. His previous blogpost already indicated that he took the arguments seriously, so this is not too much of a surprise. That previous blogpost was discussed and partially criticized on LessWrong. As for the current blogpost, I also find it noteworthy that active LW user David Scott Krueger is in the acknowledgements.
This blogpost might even be a good introduction to AI xrisk for some people.
I hope he engages further with the issues. For example, I feel like inner misalignment is still sort of missing from the arguments.
I googled "Zeitgeist Addendum" and it does not seem to be a thing that would be useful for AGI risk.
- it is a follow-up to a 9/11 conspiracy movie
- it has some naive economic ideas (like the idea that abolishing money would fix a lot of issues)
- the Venus Project appears to not be very successful
Do you claim the movie had any great positive impact or presented any new, true, and important ideas?
There is also another linkpost for the same blogpost: https://www.lesswrong.com/posts/EP92JhDm8kqtfATk8/yoshua-bengio-argues-for-tool-ai-and-to-ban-executive-ai
There is also some commentary here: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai
Overall this is still encouraging. It seems to take seriously that
- value alignment is hard
- executive-AI should be banned
- banning executive-AI would be hard
- alignment research and AI safety are worthwhile.
I feel like there are enough shared assumptions that collaboration or dialogue with AI notkilleveryoneists could be very useful.
That said, I wish there were more details about his Scientist AI idea:
- How exactly will the Scientist AI be used?
- Should we expect the Scientist AI to have situational awareness?
- Would the Scientist AI be allowed to write large-scale software projects that are likely to get executed after a brief review of the code by a human?
- Are there concerns about Mesa-optimization?
Also it is not clear to me whether the safety is supposed to come from:
- the AI cannot really take actions in the world (and even when there is a superhuman AI that wants to do large-scale harms, it will not succeed, because it cannot take actions that achieve these goals)
- the AI has no intrinsic motivation for large-scale harm (while its output bits could in principle create large-scale harm, such a string of bits is unlikely because there is no drive towards such strings of bits).
- a combination of these two.
Potentially relevant: Yoshua Bengio got funding from OpenPhil in 2017:
There is this documentary: https://en.wikipedia.org/wiki/Do_You_Trust_This_Computer%3F Probably not quite what you want. Maybe the existing videos of Robert Miles (on Mesa-Optimization and other things) would be better than a full documentary.
Maybe something like this can be extracted from stampy.ai (I am not that familiar with Stampy, fyi; its aims seem to be broader than what you want).
But why? And which user?
I fear that making karma more like a currency is not good for the culture on LW.
I think money would be preferable to karma bounties in most situations. An alternative for bounties could be a transfer of Mana on Manifold: Mana is already (kind of) a currency.
Certain kinds of "thinking ahead" are difficult to do within 1 forward pass. Not impossible, and GPT-4 likely does a lot of thinking ahead within 1 forward pass.
If you have lots of training data on a game, you often can do well without thinking ahead much. But for a novel game, you have to mentally simulate a lot of options for how the game could continue. For example, in Connect 4, if you consider all your moves and all possible responses, that is 49 possible game states you need to consider. But with experience in this game, you learn to only consider a few of these 49 options.
Maybe this is a reason why GPT-4 is not so good when playing mostly novel strategy games.
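As a tiny illustration of that count (my own sketch, nothing more than the arithmetic above):

columns = range(7)  # each player can drop a piece into one of 7 columns
continuations = [(my_move, reply) for my_move in columns for reply in columns]
print(len(continuations))  # 49 game states to consider one full move ahead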
The title "There are no coherence theorems" seems click-baity to me, when the claim relies on a very particular definition "coherence theorem". My thought upon reading the title (before reading the post) was something like "surely, VNM would count as a coherence theorem". I am also a bit bothered by the confident assertions that there are no coherence theorems in the Conclusion and Bottom-lines for similar reason.
What is the function evaluateAction supposed to do when human values contain non-consequentialist components? I assume ExpectedValue is a real number. Maybe there could be a way to build a utility function that corresponds to the code, but that is hard to judge since you have left the details out.
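To make my question more concrete, here is the kind of signature I am imagining (evaluateAction and ExpectedValue are the names from your post; the body and the other names are my own hypothetical filler, not your code):

from typing import Callable

ExpectedValue = float  # I am assuming this is a real number

# Hypothetical sketch: a purely consequentialist evaluation that scores an
# action by the utility of its predicted outcome.
def evaluateAction(action,
                   predict_outcome: Callable,  # some world model (assumed)
                   utility: Callable) -> ExpectedValue:
    return utility(predict_outcome(action))

A non-consequentialist component of human values (e.g. a rule about the action itself, such as "do not lie") does not obviously fit into a single real-valued score of the predicted outcome, which is what prompts my question.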
The post argues a lot against completeness. I have a hard time imagining an advanced AGI (which has the ability to self-reflect a lot) that has a lot of preferences, but whose preferences are not complete.
Your argument seems to be something like: "There can be outcomes A and B where neither A ≻ B nor B ≻ A. This property can be preserved if we sweeten A a little bit: then we have A+ ≻ A but neither A+ ≻ B nor B ≻ A+. If faced with a decision between A and B (or faced with a choice between A+ and B), the AGI can do something arbitrary, eg just flip a coin."
I expect advanced AGI systems capable of self-reflection to think about whether A or B seems to be more valuable (unless it thinks the situation is so low-stakes that it is not worth thinking about. But computation is cheap, and in AI safety we typically care about high-stakes situations anyway). To use your example: if A is a lottery that gives the agent a Fabergé egg for sure, and B is a lottery that returns to the agent their long-lost wedding album, then I would expect an advanced agent to invest a bit into figuring out which of those it deems more valuable.
Also, somewhere in the weights/code of the AGI there has to be some decision procedure that specifies what the AGI should do if faced with the choice between A and B. It would be possible to hardcode that the AGI should flip a coin when faced with a certain choice. But by default, I expect the choice between A and B to depend on some learned heuristics (+reflection) rather than be hardcoded. A plausible candidate here would be a mesa-optimizer, which might have a preference between A and B even when the outer training rules don't encourage a preference between A and B.
A-priori, the following outputs of an advanced AGI seem unlikely and unnatural to me (a toy choice rule producing exactly these outputs is sketched below):
- If faced with a choice between A and B, the AGI chooses each with probability 1/2.
- If faced with a choice between A+ and B, the AGI chooses each with probability 1/2.
- If faced with a choice between A+ and A, the AGI chooses A+ with probability 1.
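Here is that toy sketch (my own illustration, not from the post):

import random

# A choice rule with a preferential gap: A+ is strictly preferred to A,
# but neither A nor A+ is comparable to B.
strict_preferences = {("A+", "A")}  # (x, y) means x is strictly preferred to y

def choose(x, y):
    if (x, y) in strict_preferences:
        return x
    if (y, x) in strict_preferences:
        return y
    return random.choice([x, y])  # incomparable options: flip a coin

# choose("A", "B")  -> each option with probability 1/2
# choose("A+", "B") -> each option with probability 1/2
# choose("A+", "A") -> "A+" with probability 1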
It's ok, you don't have to republish it just for me. Looking forward to your finished post, it's an interesting and non-obvious topic.
As a commenter on that post, I wish you hadn't unpublished it. From what I remember, you had stated that it was written quickly and for that reason I am fine with it not being finished/polished. If you want to keep working on the post, maybe you can make a new post once you feel you are done with the long version.
Nice post! Here are some thoughts:
- We do not necessarily need fully formally proven theorems, other forms of high confidence in safety could be sufficient. For example, I would be comfortable with turning on an AI that is safe iff the Goldbach conjecture is true, even if we have no proof of the Goldbach conjecture.
- We currently don't have any idea what kind of theorems we want to prove. Formulating the right theorem is likely more difficult than proving it.
- Theorems can rely on imperfect assumptions (that are not exact as in the real world). In such a case, it is not clear that they give us the degree of high confidence that we would like to have.
- Theorems that rely on imperfect assumptions could still be very valuable and increase overall safety, nonetheless. For example, if we could prove something like "this software is corrigible, assuming we are in a world run by Newtonian physics" then this could (depending on the details) be high evidence for the software being corrigible in a Quantum world.
yes, sorry, I meant to say the opposite. I changed it now.
I have the impression that Neel Nanda means something different by the word "concrete" than agg does, given that agg does not consider problems of the type "explain something in a good way" to be concrete problems.
For example, I would think that "Hunt through Neuroscope for the toy models and look for interesting neurons to focus on." would not match agg's bar for concreteness. But maybe other problems from Neel Nanda might.
Nanotech industry-rebuilding comes earlier than von Neumann level? I doubt that. A lot of existing people are close to von Neumann level.
Maybe your argument is that there will be so many AGIs, that they can do Nanotech industry rebuilding while individually being very dumb. But I would then argue that the collective already exceeds von Neumann or large groups of humans in intelligence.
if you believe that financial markets are wrong, then you have the opportunity to (1) borrow cheaply today and use that money to e.g. fund AI safety work
How exactly would I go about doing that? A-priori this seems difficult: If there were opportunities to cheaply borrow money for eg 10 years, lots of people who have strong time discounting would take that option.
I wonder if someone could create a similar structured argument for the opposite viewpoint. (Disclaimer: I do not endorse a mirrored argument of OP's argument)
You could start with "People who believe there is a >50% possibility of humanity's survival in the next 50 years or so strike me as overconfident.", and then point out that for every plan of humanity's survival, there are a lot of things that could potentially go wrong.
The analogy is not perfect, but to a first approximation, we should expect that things can go wrong in both directions.
I dislike the framing of this post. Reading this post gave me the impression that
- You wrote a post with a big prediction ("AI will know about safety plans posted on the internet")
- Your post was controversial and did not receive a lot of net-upvotes
- Comments that disagree with you receive a lot of upvotes. Here you make me think that these upvoted comments disagree with the above prediction.
But actually reading the original post and the comments reveals a different picture:
- The "prediction" was not a prominent part of your post.
- The comments such as this imo excellent comment did not disagree with the "prediction", but other aspects of your post.
Overall, I think it's highly likely that the downvotes were not because people did not believe that future AI systems will know about safety plans posted on LW/EAF, but because of other reasons. I think people were well aware that AI systems will get to know about plans for AI safety, just as I think it is very likely that this comment itself will be found in the training data of future AI systems.
Yes, good catch. I edited. I made two mistakes in the above:
- I confused personal money with "altruistic money": in the beginning of the comment I assumed that all money would be donated, and none kept. By the end of my comment, my mental model had apparently shifted to also include personal money/"selfish money" (which would be justified for people to keep).
- I included a range of numbers for the possible bet size, and thought that lower bet amounts would be justified due to diminishing returns. Checking the numbers again, the diminishing returns are not that significant (at the scale of $1B likely far below 10x), and my opinion is now that you should bet everything.
An assumption that seems to be present in the betting framework here is that you frequently encounter bets which have positive EV.
I think in real life, that assumption is not particularly realistic. Most people do not encounter a lot of opportunities whose EV (in money) is significantly above ordinary things such as investing in the stock market.
Suppose you have $100k and are in the situation where you only win 10% of the time, but if you do, you get paid out 10,000x your bet size. But after the bet you do not expect to find similar opportunities again, and you also plan to donate everything to GiveDirectly.
If you were to rank-optimize, which, iiuc, means maximizing the probability of being "the richest person in the room", then you should bet nothing, because then you have a 90% probability of being richer than the counterfactual-you who bets a fraction of the wealth. But if you care a lot about the value your donations provide to the world, then you should probably bet $40k-$100k (depending on the diminishing returns of money to GiveDirectly, or maybe valuing having a bit of money for selfish reasons).
edit: But if you care a lot about the value your donations provide to the world, then you should probably bet all $100k (there are likely diminishing returns of money given to GiveDirectly, but I think the high upside of the bet outweighs the diminishing returns by a big margin. Also, by assumption of this thought experiment, you were not planning to keep any money for selfish purposes.)
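As a rough check of the numbers in this thought experiment (my own arithmetic, assuming the value of donations is roughly linear in money at this scale):

wealth = 100_000            # starting money, all earmarked for GiveDirectly
p_win, multiplier = 0.10, 10_000

ev_keep = wealth                         # don't bet: $100k donated for sure
ev_all_in = p_win * wealth * multiplier  # 0.1 * $1B = $100M expected donation
print(ev_keep, ev_all_in)
# Even substantial diminishing returns (far below 10x at the $1B scale, as
# noted above) do not close a ~1000x gap in expected donation value.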
This just feels like pretend, made-up research that they put math equations in to seem like it's formal and rigorous.
Can you elaborate which parts feel made-up to you? E.g.:
- modelling a superintelligent agent as a utility maximizer
- considering a 3-step toy model in which the agent first picks an action a1, then the shutdown button is pressed or not, then the agent picks an action a2
- assuming that a specification of the shutdown utility function U_S exists
At the end of all those questions, I feel no closer to knowing if a machine would stop you from pressing a button to shut it off.
The authors do not claim to have solved the problem and instead state that this is an open problem. So it is not surprising that there is not a satisfying answer.
I would also like to note that the paper has many more caveats.
Do you think it would still feel fake to you if the paper had a more positive answer to the problem described (eg a description of how to modify the utility function of an agent in a toy model such that it does not incentivize the agent to prevent/cause the pressing of the shutdown button)?
I fail to see how this changes the answer to the St Petersburg paradox. We have the option of 2 utility with 51% probability and 0 utility with 49% probability, and a second option of utility 1 with 100%. Removing the worst 0.5% of the probability distribution gives us a probability of 48.5% for utility 0, and removing the best 0.5% of the probability distribution gives us a probability of 50.5% for utility 2. Renormalizing so that the probabilities sum to 1 gives us a probability of 48.5/99 ≈ 49% for utility 0, and 50.5/99 ≈ 51% for utility 2. The expected value is then still greater than 1. Thus we should choose the option where we have a chance at doubling utility.
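A quick sketch of that calculation (my own check of the numbers above):

p_two, p_zero = 0.51, 0.49                         # option 1: utility 2 at 51%, utility 0 at 49%
p_two_t, p_zero_t = p_two - 0.005, p_zero - 0.005  # cut 0.5% from each tail
total = p_two_t + p_zero_t                         # 0.99
p_two_n = p_two_t / total                          # ~0.5101 after renormalizing
print(2 * p_two_n)                                 # expected utility ~1.02, still > 1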
P(misalignment x-risk | AGI that understands democratic law) < P(misalignment x-risk | AGI)
I don't think this is particularly compelling. While technically true, the difference between those probabilities is tiny. Any AGI is highly likely to understand democratic laws.
Summary of your argument: The training data can contain outputs of processes that have superhuman abilities (eg chess engines), therefore LLMs can exceed human performance.
More speculatively, there might be another source of (slight?) superhuman abilities: GPT-N could generalize/extrapolate from human abilities to superhuman abilities, if it was plausible that at some point in the future these superhuman abilities would be shown on the internet. For example, it is conceivable that GPT-N prompted with "Here is a proof of the Riemann hypothesis that has been verified extensively:" would actually produce a valid proof, even if a proof of the Riemann hypothesis was beyond the ability of humans in the training data.
But perhaps this is an assumption people often make about LLMs.
I think people often claim something along the lines of "GPT-8 cannot exceed human capacity" (which is technically false) to argue that a (naively) upscaled version of GPT-3 cannot reach AGI. I think we should expect that there are at least some limits to the intelligence we can obtain from GPT-8, if they just train it to predict text (and not do any amplification steps, or RL).
Because it was not trained using reinforcement learning and doesn't have a utility function, which means that it won't face problems like mesa-optimisation
I think this is at least a non-obvious claim. In principle, it is conceivable that mesa-optimisation can occur outside of RL. There could be an agent/optimizer in (highly advanced, future) predictive models, even if the system does not really have a base objective. In this case, it might be better to think in terms of training stories rather than inner+outer alignment. Furthermore, there could still be issues with gradient hacking.
Great post! I agree that academia is a resource that could be very useful for AI safety.
There are a lot of misunderstandings around AI safety, and I think the AIS community has failed to properly explain the core ideas to academics until fairly recently. Therefore, I often encountered confusions such as the belief that AI safety is about fairness, self-driving cars, and medical ML.
I think these misunderstandings are understandable based on the term "AI safety". Maybe it would be better to call the field AGI safety or AGI alignment? This seems to me like a more honest description of the field.
You also write that you find it easier to not talk about xrisk. If we avoid talking about xrisk while presenting AI safety, then some misunderstandings about AI safety will likely persist in the future.
(Copied partially from here)
My intuition is that preDCa falls short on the "extrapolated" part in "Coherent extrapolated volition". PreDCA would extract a utility function from the flawed algorithm implemented by a human brain. This utility function would be coherent, but might not be extrapolated: The extrapolated utility function (ie what humans would value if they would be much smarter) is probably more complicated to formulate than the un-extrapolated utility function.
For example, the policy implemented by an average human brain probably contributes more to total human happiness than most other policies. Let's say U_happy is a utility function that values human happiness as measured by certain chemical states in the brain, and U_ext is "extrapolated happiness" (where "putting all human brains in a vat to make them feel happy" would not be good for U_ext). Then it is plausible that K(U_happy) < K(U_ext). But the policy implemented by an average human brain would do approximately equally well on both utility functions. Thus, Pr[U_happy] > Pr[U_ext].
I think we can’t solve the problem of aligning AI systems with human values unless we have a very fine-grained, nitty-gritty, psychologically realistic description of the whole range and depth of human values we’re trying to align with.
This would make sense to me if we wanted to explicitly code human values into an AI. But (afaik) no alignment researcher advocates this approach. Instead, some alignment research directions aim to implicitly describe human values. This does not require us to understand human values on a fine-grained, nitty-gritty detailed level.
Even if the top-down approach seems to work, and we think we’ve solved the general problem of AI alignment for any possible human values, we can’t be sure we’ve done that until we test it on the whole range of relevant values, and demonstrate alignment success across that test set
If we test our possible alignment solution on a test set, then we run into the issue of deceptive alignment anyways. And if we are not sure about our alignment solution, but run our AGI for a real-world test, then there is the risk that all humans are killed.
PreDCA might not lead to CEV.
My summarized understanding of preDCA: preDCA has a bunch of hypotheses about what the universe might be like. For each hypothesis, it detects which computations are running in the universe, then figures out which of these computations is the "user", then figures out likely utility functions of the user. Then it takes actions to increase a combination of these utility functions (possibly using something like maximal lotteries, rather than averaging utility functions). There are also steps to ignore certain hypotheses which might be malign, but I will ignore this issue here.
Let's add more detail to the "figuring out the utility function of the user" step. The probability that an agent G has utility function U is proportional to 2^-K(U) divided by the probability that a random policy (according to a distribution over possible policies) is better than the policy of the agent G; here K(U) is the complexity of the utility function U.
So, a utility function is more likely if it is less complex, and if the agent's policy is better at satisfying than a random policy.
How would a human's utility function according to preDCA compare with CEV?
My intuition is that preDCa falls short on the "extrapolated" part in "Coherent extrapolated volition". PreDCA would extract a utility function from the flawed algorithm implemented by a human brain. This utility function would be coherent, but might not be extrapolated: The extrapolated utility function (ie what humans would value if they would be much smarter) is probably more complicated to formulate than the un-extrapolated utility function.
For example, the policy implemented by an average human brain probably contributes more to total human happiness than most other policies. Let's say U_happy is a utility function that values human happiness as measured by certain chemical states in the brain, and U_ext is "extrapolated happiness" (where "putting all human brains in a vat to make them feel happy" would not be good for U_ext). Then it is plausible that K(U_happy) < K(U_ext). But the policy implemented by an average human brain would do approximately equally well on both utility functions. Thus, Pr[U_happy] > Pr[U_ext].
The concerns might also apply to a similar proposal.
The picture links to https://www.codecogs.com/png.latex?2^{-L} which gives a 404 Error. The picture likely displayed the formula 2^{-L}.
In Pr[U] ≈ 2^-K(U) / Pr_{π∈ξ}[U(⌈G⌉,π) ≤ U(⌈G⌉,G*)]: shouldn't the inequality sign be the other way around? I am assuming that we want to maximize U and not minimize U.
As currently written, a good agent G with utility function U would be better than most random policies, and therefore Pr_{π∈ξ}[U(⌈G⌉,π) ≤ U(⌈G⌉,G*)] would be close to 1, and therefore Pr[U] would be rather small.
If the sign should indeed be the other way around, then a similar problem might be present in the definition of g (the agency measure), if you want g to be high for more agenty programs G.
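To illustrate the point about the inequality direction numerically, here is a small toy check (my own, with a made-up distribution of utilities for random policies):

import random

random.seed(0)
# Utilities of 100,000 random policies under some utility function U (made up).
random_policy_utilities = [random.gauss(0.0, 1.0) for _ in range(100_000)]
u_agent = 3.0  # the agent's policy G* scores well under U

p_leq = sum(u <= u_agent for u in random_policy_utilities) / 100_000
print(p_leq)        # ~0.999: with the <= sign the denominator is close to 1,
                    # so U gets essentially no boost from G* being good at it
print(1.0 - p_leq)  # ~0.001: with the sign flipped, dividing by this small
                    # number boosts exactly those U that G*'s policy is good at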
Also, don't submersions never have (local or global) minima?
I agree, and believe that the post should not have mentioned submersions.
Pretty nice-looking loss functions end up not having manifolds for their minimizing sets, like x^2 y^2. I have a hard time reasoning about whether this is typical or atypical though. I don't even have an intuition for why the global minimizer isn't (nearly always) just a point. Any explanation for that observation in practice?
I agree that the typical function has only one (or zero) global minimizer. But in the case of overparametrized smooth Neural Networks it can be typical that zero loss can be achieved. Then the set of weights that lead to zero loss is typically a manifold, and not a single point. Some intuition: Consider linear regression with more parameters than data. Then the typical outcome is a perfect match of the data, and the possible solution space is a (possibly high-dimensional) linear manifold. We should expect the same to be true for smooth nonlinear neural networks, because locally they are linear.
Note that the above does not hold when we e.g. add an l2 penalty for the weights: then I expect there to be a single global minimizer in the typical case.
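A small numerical illustration of the overparametrized zero-loss case (my own sketch):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10))   # 3 data points, 10 parameters
y = rng.normal(size=3)

# One zero-loss solution: the minimum-norm least-squares solution.
w0 = np.linalg.lstsq(X, y, rcond=None)[0]
# Another zero-loss solution: shift w0 along a direction in the null space of X.
null_dir = np.linalg.svd(X)[2][-1]      # X @ null_dir is (numerically) zero
w1 = w0 + 5.0 * null_dir

print(np.allclose(X @ w0, y), np.allclose(X @ w1, y))  # True True
# The zero-loss set here is a 7-dimensional affine subspace, not a single point.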
I suspect you use the word "opaque" in a different way than Eliezer Yudkowsky here. At least I fail to see from your summary, how this would contradict my interpretation of Eliezer's statement (and your title and introduction seems to imply that it is a contradiction).
Consider the hypothetical example, where GPT-3 states (incorrectly) that Geneva is the capital of Switzerland. Can we look at the weights of GPT-3 and see if it was just playing dumb or if it genuinely thinks that Geneva is the capital of Switzerland? If the weights/"matrices"/"giant wall of floating point numbers" are opaque (in the sense of Eliezer according to my guess), then we would look at it and shrug our shoulders. I fail to see from your summary, how the effective theories would help in this example. (Disclaimer: In this specific example or similar examples, I would not be surprised if it was actually possible to figure out if it was playing dumb or what caused GPT-3 to play dumb. Also I do not expect GPT-3 to actually believe that Geneva is the capital of Switzerland).
My guess of your meaning of "opaque" would be something like "we have no idea why deep learning works at all" or "we have no mathematical theory for the training of neural nets", which your summary disproves.
During my 2017 binge of LW, I recall Yudkowsky suggesting that a superintelligence could infer the laws of physics from a single frame of video showing a falling apple (Newton apparently came up with his idea of gravity from observing a falling apple).
I now think that's somewhere between deeply magical and utter nonsense.
Some details here: You are likely referring to That Alien Message. In my opinion Eliezer Yudkowsky made a weaker claim than you are implying:
A Bayesian superintelligence, hooked up to a webcam, would invent General Relativity as a hypothesis—perhaps not the dominant hypothesis, compared to Newtonian mechanics, but still a hypothesis under direct consideration—by the time it had seen the third frame of a falling apple. It might guess it from the first frame, if it saw the statics of a bent blade of grass.
To me it does not seem hard for a superintelligence (or 1000 years of Einstein-level thinking) to come up with Newtonian mechanics as a hypothesis from three frames of a falling apple. But I am not sure about the (weakly stated) suggestion that you could derive it from a picture of a bent blade of grass.
Solution:
#!/usr/bin/env python3
import struct

a = [1.3, 2.1, -0.5]  # initialize data
L = 2095  # total length
i = 0
while len(a) < L:
    if abs(a[i]) > 2.0 or abs(a[i+1]) > 2.0:
        a.append(a[i]/2)
        i += 1
    else:
        a.append(a[i] * a[i+1] + 1.0)
        a.append(a[i] - a[i+1])
        a.append(a[i+1] - a[i])
        i += 2
f = open("out_reproduced.bin", "wb")  # save in binary
f.write(struct.pack('d'*L, *a))  # use IEEE 754 double format
f.close()
Then one can check that the produced file is identical:
$ sha1sum *bin
70d32ee20ffa21e39acf04490f562552e11c6ab7 out.bin
70d32ee20ffa21e39acf04490f562552e11c6ab7 out_reproduced.bin
edit: How I found the solution: I found some of the other comments helpful, especially from gjm (although I did not read all). In particular, interpreting the data as a sequence of 64-bit floating point numbers saved me a lot of time. Also gjm's mention of the pattern a, -a, b, c, -c, d was an inspiration.
If you look at the first couple of numbers, you can see that they are sometimes half of an earlier number. Playing around further with the numbers I eventually found the patterns a[i] * a[i+1] + 1.0 and a[i] - a[i+1].
It remained to figure out when the a[i]/2 rule applies and when the a[i] * a[i+1] + 1.0 rule applies. Here it was a hint that the numbers do not grow too large in size. After trying out several rules that form bounds on a[i] and a[i+1], I eventually found the right one.
The EleutherAI Discord has two alignment channels with reasonable volume (#alignment-general and #alignment-beginners). These might be suitable for your needs.