Mid-2027 seems too late to me for such a candidate to start the official campaign.
For the 2020 presidential election, many Democratic candidates announced their campaigns in early 2019, and Yang already in 2017. Debates were already happening in June 2019. As a likely unknown candidate, you probably need a longer run-up time to accumulate a bit of fame.
Also Musk's regulatory plan is polling well
What plan are you referring to? Is this something AI safety specific?
I wouldn't say so, I don't think his campaign has made UBI advocacy more difficult.
But an AI notkilleveryoneism campaign seems more risky. It could end up making the worries look silly, for example.
Their platform would be whatever version and framing of AI notkilleveryoneism the candidates personally endorse, plus maybe some other smaller things. They should be open about the fact that they consider potential human disempowerment or extinction to be the main problem of our time.
As for the concrete policy proposals, I am not sure. The focus could be on international treaties, or on banning or heavily regulating AI models that were trained with more than a trillion quadrillion ($10^{27}$) operations. (I am not sure I understand the intent behind your question.)
A potentially impactful thing: someone competent runs as a candidate for the 2028 election on an AI notkilleveryoneism[1] platform. Maybe even two people should run, one in the Democratic primary and one in the Republican primary. While getting the nomination is rather unlikely, there could be lots of benefits even if you fail to gain the nomination (like other presidential candidates becoming sympathetic to AI notkilleveryoneism, or more popularity of AI notkilleveryoneism in the population, etc.)
On the other hand, attempting a presidential run can easily backfire.
A relevant previous example of this kind of approach is the 2020 campaign by Andrew Yang, which focused on universal basic income (and the downsides of automation). While the campaign attracted some attention, it seems like it didn't succeed in making UBI a popular policy among Democrats.
Not necessarily using that name. ↩︎
This can easily be done in the cryptographic example above: B can sample a new number $x'$, and then present $x'$ to a fresh copy of A that has not seen the transcript for $x$ so far.
I don't understand how this is supposed to help. I guess the point is to somehow catch a fresh copy of A in a lie about a problem that is different from the original problem, and conclude that A is the dishonest debater?
But couldn't A just answer "I don't know"?
Even if it is a fresh copy, it would notice that it does not know the secret factors, so it could display different behavior than in the case where A knows the secret factors.
Some of these are very easy to prove; here's my favorite example. An agent has a fixed utility function and performs Pareto-optimally on that utility function across multiple worlds (so "utility in each world" is the set of objectives). Then there's a normal vector (or family of normal vectors) to the Pareto surface at whatever point the agent achieves. (You should draw a picture at this point in order for this to make sense.) That normal vector's components will all be nonnegative (because Pareto surface), and the vector is defined only up to normalization, so we can interpret that normal vector as a probability distribution. That also makes sense intuitively: larger components of that vector (i.e. higher probabilities) indicate that the agent is "optimizing relatively harder" for utility in those worlds. This says nothing at all about how the agent will update, and we'd need another couple of sentences to argue that the agent maximizes expected utility under the distribution, but it does give the prototypical mental picture behind the "Pareto-optimal -> probabilities" idea.
Here is an example (to point out a missing assumption): Let's say you are offered a bet on the result of a coin flip for 1 dollar. You get 3 dollars if you win, and your utility function is linear in dollars. You have three actions: "Heads", "Tails", and "Pass". Then "Pass" performs Pareto-optimally across multiple worlds. But "Pass" does not maximize expected utility under any distribution.
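To make this concrete (the payoff numbers are my reconstruction, so treat them as assumptions), here is a minimal sketch checking both parts of the claim:

```python
# Utility vectors over the two worlds (heads-world, tails-world), assuming the
# bet costs 1 dollar and pays 3 dollars on a win, so it is strictly favorable.
actions = {
    "Heads": (2.0, -1.0),   # net +2 if heads, -1 if tails
    "Tails": (-1.0, 2.0),
    "Pass": (0.0, 0.0),
}

# "Pass" is Pareto-optimal: no action is at least as good in both worlds and different.
def dominated(a):
    return any(
        all(v >= w for v, w in zip(u, actions[a])) and u != actions[a]
        for b, u in actions.items() if b != a
    )
print(dominated("Pass"))  # False

# But for every probability p of heads, some bet has strictly higher expected
# utility than "Pass" (which always gets 0), so "Pass" never maximizes expected utility.
for p in [i / 100 for i in range(101)]:
    ev = {a: p * u[0] + (1 - p) * u[1] for a, u in actions.items()}
    assert ev["Heads"] > 0 or ev["Tails"] > 0
```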
I think what is needed for the result is an additional convexity-like assumption about the utilities.
This could be "the set of achievable utility vectors is convex", or even something weaker like "every convex combination of achievable utility vectors is dominated by an achievable utility vector" (here, by utility vector I mean $(u_1(x), \dots, u_n(x))$ if $u_i(x)$ is the utility of world $i$).
If you already accept the concept of expected utility maximization,
then you could also use mixed strategies to get the convexity-like assumption (but that is not useful if the point is to motivate using probabilities and expected utility maximization).
Or: even if you do expect powerful agents to be approximately Pareto-optimal, presumably they will be approximately Pareto optimal, not exactly Pareto-optimal. What can we say about coherence then?
The underlying math statement of some of these kinds of results about Pareto-optimality seems to be something like this:
If $\bar x$ is Pareto-optimal wrt utilities $u_1, \dots, u_n$, and a convexity assumption (e.g. the set $\{(u_1(x), \dots, u_n(x)) : x \in X\}$ of achievable utility vectors is convex, or something with mixed strategies) holds, then there is a probability distribution $\mu$ so that $\bar x$ is optimal for $\mathbb{E}_{i\sim\mu} u_i$.
I think there is a (relatively simple) approximate version of this, where we start out with approximate Pareto-optimality.
We say that $\bar x$ is Pareto-$\varepsilon$-optimal if there is no (strong) Pareto-improvement by more than $\varepsilon$ (that is, there is no $x'$ with $u_i(x') > u_i(\bar x) + \varepsilon$ for all $i$).
Claim: If $\bar x$ is Pareto-$\varepsilon$-optimal and the convexity assumption holds, then there is a probability distribution $\mu$ so that $\bar x$ is $\varepsilon$-optimal for $\mathbb{E}_{i\sim\mu} u_i$.
Rough proof: Define $y = (u_1(\bar x), \dots, u_n(\bar x))$ and $S$ as the closure of $\{(u_1(x), \dots, u_n(x)) : x \in X\}$. Let $\tilde y$ be of the form $\tilde y = y + \delta \cdot (1, \dots, 1)$ for the largest $\delta \ge 0$ such that $\tilde y \in S$. We know that $\delta \le \varepsilon$. Now $\tilde y$ is Pareto-optimal for $S$, and by the non-approximate version there exists a probability distribution $\mu$ so that $\tilde y$ is optimal for $\mathbb{E}_{i\sim\mu}$. Then, for any $x$, we have $\mathbb{E}_{i\sim\mu} u_i(x) \leq \mathbb{E}_{i\sim\mu} \tilde y_i = \mathbb{E}_{i\sim\mu} (u_i(\bar x) + \delta) \le \varepsilon + \mathbb{E}_{i\sim\mu} u_i(\bar x),$ that is, $\bar x$ is $\varepsilon$-optimal for $\mathbb{E}_{i\sim\mu} u_i$.
I think there are some subtleties with the (non-infra) Bayesian VNM version, which come down to the difference between "extreme point" and "exposed point" of a convex set. If a point is an extreme point that is not an exposed point, then it cannot be the unique expected utility maximizer under a utility function (but it can be a non-unique maximizer).
For extreme points it might still work with uniqueness, if, instead of a VNM-decision-maker, we require a slightly weaker decision maker whose preferences satisfy the VNM axioms except continuity.
For any , if then either or .
I think this condition might be too weak and the conjecture is not true under this definition.
If , then we have (because a minimum over a larger set is smaller). Thus, can only be the unique argmax if .
Consider the example . Then is closed. And satisfies . But per the above it cannot be a unique maximizer.
Maybe the issue can be fixed if we strengthen the condition so that has to be also minimal with respect to .
For a provably aligned (or probably aligned) system you need a formal specification of alignment. Do you have something in mind for that? This could be a major difficulty. But maybe you only want to "prove" inner alignment and assume that you already have an outer-alignment-goal-function, in which case defining alignment is probably easier.
insofar as the simplest & best internal logical-induction market traders have strong beliefs on the subject, they may very well be picking up on something metaphysically fundamental. It's simply the simplest explanation consistent with the facts.
Theorem 4.6.2 in logical induction says that the "probability" of independent statements does not converge to $0$ or $1$, but to something in between. So even if a mathematician says that some independent statement feels true (e.g. some objects are "really out there"), logical induction will tell him to feel uncertain about that.
A related comment from lukeprog (who works at OP) was posted on the EA Forum. It includes:
However, at present, it remains the case that most of the individuals in the current field of AI governance and policy (whether we fund them or not) are personally left-of-center and have more left-of-center policy networks. Therefore, we think AI policy work that engages conservative audiences is especially urgent and neglected, and we regularly recommend right-of-center funding opportunities in this category to several funders.
it's for the sake of maximizing long-term expected value.
Kelly betting does not maximize long-term expected value in all situations. For example, if some bets are offered only once (or only a finite number of times), then you can get better long-term expected utility by sometimes accepting bets with a potential "0"-utility outcome.
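As a minimal numeric sketch of the one-shot case (with made-up numbers: a single, never-repeated even-odds bet that you win with probability 0.6):

```python
p_win = 0.6          # assumed probability of winning the single even-odds bet
bankroll = 100.0

kelly_fraction = 2 * p_win - 1   # Kelly stake for even odds: 0.2

def expected_value(stake_fraction):
    win = bankroll * (1 + stake_fraction)
    lose = bankroll * (1 - stake_fraction)
    return p_win * win + (1 - p_win) * lose

print(expected_value(kelly_fraction))  # 104.0
print(expected_value(1.0))             # 120.0: staking everything has higher expected
                                       # value, despite the 40% chance of ending at 0
```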
This is maybe not the central point, but I note that your definition of "alignment" doesn't precisely capture what I understand "alignment" or a good outcome from AI to be:
‘AGI’ continuing to exist
AGI could be very catastrophic even if it stops existing a year later.
eventually
If AGI makes earth uninhabitable in a trillion years, that could be a good outcome nonetheless.
ranges that existing humans could survive under
I don't know whether that covers "humans can survive on mars with a space-suit", but even then, if humans evolve/change to handle situations that they currently do not survive under, that could be part of an acceptable outcome.
it is the case that most algorithms (as a subset in the hyperspace of all possible algorithms) are already in their maximally most simplified form. Even tiny changes to an algorithm could convert it from 'simplifiable' to 'non-simplifiable'.
This seems wrong to me:
For any given algorithm, you can find many equivalent but non-simplified algorithms with the same behavior, by adding a statement that does not affect the rest of the algorithm (e.g. adding a line such as foobar1234 = 123 in the middle of a Python program).
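For instance, a minimal sketch of this dead-code construction (using the hypothetical line mentioned above):

```python
# Two behaviorally equivalent functions; the second is clearly not
# "maximally simplified" because it contains a statement that does nothing.
def add(a, b):
    return a + b

def add_with_dead_code(a, b):
    foobar1234 = 123  # dead code: assigned but never used, behavior unchanged
    return a + b

assert add(3, 4) == add_with_dead_code(3, 4)
```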
In fact, I would claim that the majority of Python programs on GitHub are not in their "maximally most simplified form".
Maybe you can cite the supposed theorem that claims that most (with a clearly defined "most") algorithms are maximally simplified?
This is not a formal definition.
Your English sentence has no apparent connection to mathematical objects, which would be necessary for a rigorous and formal definition.
I think you are broadly right.
So we're automatically giving ca. higher probability – even before applying the length penalty .
But note that under the Solomonoff prior, you will get another penalty for these programs with DEADCODE. So with this consideration, the weight changes from (for normal ) to (normal plus DEADCODE versions of ), which is not a huge change.
For your case of "uniform probability until " I think you are right about exponential decay.
That point is basically already in the post:
large language models can help document and teach endangered languages, providing learning tools for younger generations and facilitating the transmission of knowledge. However, this potential will only be realized if we prioritize the integration of all languages into AI training data.
I have doubts that the claim about "theoretically optimal" applies to this case.
Now, you have not provided a precise notion of optimality, so the example below might not apply if you come up with another notion of optimality, assume that voters collude with each other, use a certain decision theory, or make other assumptions... Also, there are some complications because the optimal strategy for each player depends on the strategies of the other players. A typical choice in these cases is to look at Nash equilibria.
Consider three charities A,B,C and two players X,Y who can donate $100 each. Player X has utilities , , for the charities A,B,C. Player Y has utilities , , for the charities A,B,C.
The optimal (as in most overall utility) outcome would be to give everything to charity B. This would require that both players donate everything to charity B. However, this is not a Nash equilibrium, as player X has an incentive to defect by giving to A instead of B and getting more utility.
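Here is a minimal sketch of this example with assumed per-dollar utilities (my own numbers, chosen only to exhibit the structure described above):

```python
u_x = {"A": 3.0, "B": 2.0, "C": 0.0}   # assumed utility per dollar donated, player X
u_y = {"A": 0.0, "B": 2.0, "C": 3.0}   # assumed utility per dollar donated, player Y
budget = 100.0

def utilities(donation_x, donation_y):
    totals = {c: donation_x.get(c, 0.0) + donation_y.get(c, 0.0) for c in "ABC"}
    ux = sum(u_x[c] * totals[c] for c in "ABC")
    uy = sum(u_y[c] * totals[c] for c in "ABC")
    return ux, uy

# Socially optimal outcome: both give everything to B.
print(utilities({"B": budget}, {"B": budget}))   # (400.0, 400.0), total 800

# But X can deviate to A and do better, so this is not a Nash equilibrium.
print(utilities({"A": budget}, {"B": budget}))   # (500.0, 200.0), total only 700
```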
This specific issue is like the prisoner's dilemma and could be solved with other assumptions/decision theories.
The difference between this scenario and the claims in the literature might be that public goods are not the same as charity, or that the players cannot decide to keep the funds for themselves. But I am not sure about the precise reasons.
Now, I do not have an alternative distribution mechanism ready, so please do not interpret this argument as serious criticism of the overall initiative.
There is also Project Quine, which is a newer attempt to build a self-replicating 3D printer.
This was already referenced here: https://www.lesswrong.com/posts/MW6tivBkwSe9amdCw/ai-existential-risk-probabilities-are-too-unreliable-to
I think it would be better to comment there instead of here.
One thing I find positive about SSI is their intent to not have products before superintelligence (note that I am not arguing here that the whole endeavor is net-positive). Not building intermediate products lessens the impact on race dynamics. I think it would be preferable if all the other AGI labs had a similar policy (funnily, while typing this comment, I got a notification about Claude 3.5 Sonnet... ). The policy not to have any product can also give them cover to focus on safety research that is relevant for superintelligence, instead of doing some shallow control of the output of LLMs.
To reduce bad impacts from SSI, it would be desirable that SSI also
- have a clearly stated policy to not publish their capabilities insights,
- take security sufficiently seriously to be able to defend against nation-state actors that try to steal their insights.
It does not appear paywalled to me. The link that @mesaoptimizer posted is an archive, not the original bloomberg.com article.
I haven't watched it yet, but there is also a recent technical discussion/podcast episode about AIXI and related topics with Marcus Hutter: https://www.youtube.com/watch?v=7TgOwMW_rnk
It suffices to show that the Smith lotteries that the above result establishes are the only lotteries that can be part of maximal lottery-lotteries are also subject to the partition-of-unity condition.
I fail to understand this sentence. Here are some questions about this sentence:
- What are Smith lotteries? Ctrl+f only finds lottery-Smith lottery-lotteries; do you mean these? Or do you mean lotteries that are Smith?
- Which result do you mean by "above result"?
- What does it mean for a lottery to be part of maximal lottery-lotteries?
- Does "also subject to the partition-of-unity" refer to the Smith lotteries or to the lotteries that are part of maximal lottery-lotteries? (it also feels like there is a word missing somewhere)
- Why would this suffice?
- Is this part also supposed to imply the existence of maximal lottery-lotteries? If so, why?
A lot of the probabilities we talk about are probabilities we expect to change with evidence. If we flip a coin, our p(heads) changes after we observe the result of the flipped coin. My p(rain today) changes after I look into the sky and see clouds. In my view, there is nothing special in that regard for your p(doom). Uncertainty is in the mind, not in reality.
However, how you expect your p(doom) to change depending on facts or observation is useful information and it can be useful to convey that information. Some options that come to mind:
1. describe a model: If your p(doom) estimate is the result of a model consisting of other variables, just describing this model is useful information about your state of knowledge, even if that model is only approximate. This seems to come closest to your actual situation.
2. describe your probability distribution of your p(doom) in 1 year (or another time frame): You could say that you think there is a 25% chance that your p(doom) in 1 year is between 10% and 30%. Or give other information about that distribution. Note: your current p(doom) should be the mean of this distribution (see the numeric sketch after this list).
3. describe your probability distribution of your p(doom) after a hypothetical month of working on a better p(doom) estimate: You could say that if you were to work hard for a month on investigating p(doom), you think there is a 25% chance that your p(doom) after that month is between 10% and 30%. This is similar to 2., but imo a bit more informative. Again, your current p(doom) should be the mean of your p(doom) after that hypothetical month of investigation, even if you don't actually do the investigation.
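A minimal numeric sketch of the consistency condition in options 2 and 3 (with made-up numbers):

```python
# Your distribution over what your p(doom) will be in a year (assumed values).
future_pdoom = {0.10: 0.25, 0.20: 0.50, 0.30: 0.25}   # value: probability

current_pdoom = sum(value * prob for value, prob in future_pdoom.items())
print(current_pdoom)  # 0.2 -- this should match the p(doom) you report today
```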
This sounds like https://www.super-linear.org/trumanprize. It seems like it is run by Nonlinear and not FTX.
I think Proposition 1 is false as stated because the resulting functional is not always continuous (wrt the KR-metric). The function , with should be a counterexample. However, the non-continuous functional should still be continuous on the set of sa-measures.
Another thing: the space of measures is claimed to be a Banach space with the KR-norm (in the notation section). Afaik this is not true: the space is a Banach space with the TV-norm, but with the KR-norm it is not complete and is merely a normed vector space. Also, the claim (in "Basic concepts") that it is the dual space of the space of continuous functions is only true if it is equipped with the TV-norm, not with the KR-metric.
Another nitpick: in Theorem 5, the type of in the assumption is probably meant to be , instead of .
Regarding direction 17: There might be some potential drawbacks to ADAM. I think it's possible that some very agentic programs have a relatively low score. This is due to explicit optimization algorithms having low complexity.
(Disclaimer: the following argument is not a proof, and appeals to some heuristics/etc. We fix for these considerations too.) Consider a utility function . Further, consider a computable approximation of the optimal policy (AIXI that explicitly optimizes for ), which has an approximation parameter n (this could be AIXI-tl, plus some approximation of ; higher n means a better approximation). We will call this approximation of the optimal policy . This approximation algorithm has complexity , where is a constant needed to describe the general algorithm (this should not be too large).
We can get a better approximation by using a quickly growing function, such as the Ackermann function with . Then we have .
What is the score of this policy? We have . Let be maximal in this expression. If , then .
For the other case, let us assume that if , the policy is at least as good at maximizing than . Then, we have .
I don't think that the assumption ( maximizes better than ) is true for all and , but plausibly we can select such that this is the case (exceptions, if they exist, would be a bit weird, and ADAM working well only because of these weird exceptions would feel a bit disappointing to me). A thing that is not captured by approximations such as AIXI-tl is programs that halt but have insane runtime (longer than ). Again, it would feel weird to me if ADAM sort of works because of low-complexity extremely-long-running halting programs.
To summarize, maybe there exist policies which strongly optimize a non-trivial utility function with approximation parameter , but where is relatively small.
I think the "deontological preferences are isomorphic to utility functions" is wrong as presented.
Firts, the formula has issues with dividing by zero and not summing probabilities to one (and re-using variable as a local variable in the sum). So you probably meant something like Even then, I dont think this describes any isomorphism of deontological preferences to utility functions.
- Utility functions are invariant under multiplication by a positive constant. This is not reflected in the formula.
- Utility maximizers usually take the action with the best utility with probability $1$, rather than using different probabilities for different utilities.
- Modelling deontological constraints as probability distributions doesn't seem right to me. Let's say I decide between drinking green tea and black tea, and neither of those violates any deontological constraints; then assigning some values (which ones?) to P("I drink green tea") or P("I drink black tea") doesn't describe these deontological constraints well.
- Any behavior can be encoded as a utility function, so finding some isomorphism to utility functions is usually possible, but not always meaningful (see the sketch below).
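As a minimal illustration of that last point (my own construction, not from the post):

```python
# Any policy can be encoded as a utility function whose argmax reproduces it,
# which is why the mere existence of such an encoding is not very meaningful.
policy = {"state1": "green tea", "state2": "black tea"}   # arbitrary behavior
actions = ["green tea", "black tea"]

def utility(state, action):
    return 1.0 if policy[state] == action else 0.0

recovered = {s: max(actions, key=lambda a: utility(s, a)) for s in policy}
assert recovered == policy
```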
Some of the downvotes were probably because of the unironic use of the term TESCREAL. This term mixes a bunch of different things together, which makes your writing less clear.
Sure, I'd be happy to read a draft
I am going to assume that in the code, when calculating p_alice_win_given_not_caught, we do not divide the term by two (since this is not that consistent with the description; I am also assuming that is a typo and is meant, which would also be more consistent with other stuff). So I am going to assume a symmetrical version.
Here, P(Alice wins) is . Wlog we can assume (otherwise Bob will run everything or nothing in shielded mode).
We claim that is a (pure) Nash equilibrium, where .
To verify, let's first show that Alice cannot make a better choice if Bob plays . We have . Since this only depends on the sum, we can make the substitution . Thus, we want to maximize . We have . Rearranging, we get . Taking logs, we get . Rearranging, we get . Thus, is the optimal choice. This means that if Bob sticks to his strategy, Alice cannot do better than .
Now, let's show that Bob cannot do better. We have . This does not depend on and anymore, so any choice of and is optimal if Alice plays .
(If I picked the wrong version of the question, and you actually want some symmetry: I suspect that the solution will have similarities, or that in some cases the solution can be obtained by rescaling the problem back into a more symmetric form.)
This article talks a lot about risks from AI. I wish the author would be more specific about what kinds of risks they are thinking about. For example, it is unclear which parts are motivated by extinction risks or not. The same goes for the benefits of open-sourcing these models. (Note: I haven't read the reports this article is based on; these might have been more specific.)
Thank you for writing this review.
The strategy assumes we'll develop a good set of safety properties that we're demanding proof of.
I think this is very important. From skimming the paper, it seems that unfortunately the authors do not discuss it much. I imagine that formally specifying safety properties is actually a rather difficult step.
To go with the example of not helping terrorists spread harmful viruses: how would you even go about formulating this mathematically? This seems highly non-trivial to me. Do you need to mathematically formulate what exactly counts as a harmful virus?
The same holds for Asimov's three laws of robotics: turning these into actual math or code seems quite challenging.
There's likely some room for automated systems to figure out what safety humans want, and turn it into rigorous specifications.
Probably obvious to many, but I'd like to point out that these automated systems themselves need to be sufficiently aligned to humans, while also accomplishing tasks that are difficult for humans to do and probably involve a lot of moral considerations.
A common response is that “evaluation may be easier than generation”. However, this doesn't mean evaluation will be easy in absolute terms, or relative to one’s resources for doing it, or that it will depend on the same resources as generation.
I wonder to what degree this is true for the human-generated alignment ideas that are being submitted to LessWrong/the Alignment Forum?
For mathematical proofs, evaluation is (imo) usually easier than generation: Often, a well-written proof can be evaluated by reading it once, but often the person who wrote up the proof had to consider different approaches and discard a lot of them.
To what degree does this also hold for alignment research?
The setup violates a fairness condition that has been talked about previously.
From https://arxiv.org/pdf/1710.05060.pdf, section 9:
We grant that it is possible to punish agents for using a specific decision procedure, or to design one decision problem that punishes an agent for rational behavior in a different decision problem. In those cases, no decision theory is safe. CDT performs worse than FDT in the decision problem where agents are punished for using CDT, but that hardly tells us which theory is better for making decisions. [...]
Yet FDT does appear to be superior to CDT and EDT in all dilemmas where the agent’s beliefs are accurate and the outcome depends only on the agent’s behavior in the dilemma at hand. Informally, we call these sorts of problems “fair problems.” By this standard, Newcomb’s problem is fair; Newcomb’s predictor punishes and rewards agents only based on their actions. [...]
There is no perfect decision theory for all possible scenarios, but there may be a general-purpose decision theory that matches or outperforms all rivals in fair dilemmas, if a satisfactory notion of "fairness" can be formalized.
Is the organization that offers the prize supposed to define "alignment" and "AGI", or the person who claims the prize? This is unclear to me from reading your post.
Defining alignment (sufficiently rigorously so that a formal proof of (im)possibility of alignment is conceivable) is a hard thing! Such formal definitions would be very valuable by themselves (without any proofs), especially if people widely agree that the definitions capture the important aspects of the problem.
I think the conjecture is also false in the case that utility functions map from to .
Let us consider the case of and . We use , where is the largest integer such that starts with (and ). As for , we use , where is the largest integer such that starts with (and ). Both and are computable, but they are not locally equivalent. Under reasonable assumptions on the Solomonoff prior, the policy that always picks action is the optimal policy for both and (see proof below).
Note that since the policy is computable and very simple, is not true, and we have instead. I suspect that the issues are still present even with an additional condition, but finding a concrete example with an uncomputable policy is challenging.
proof: Suppose that and are locally equivalent. Let be an open neighborhood of the point and , be such that for all .
Since , we have . Because is an open neighborhood of , there is an integer such that for all . For such , we have This implies . However, this is not possible for all . Thus, our assumption that and are locally equivalent was wrong.
Assumptions about the solomonoff prior: For all , the sequence of actions that produces the sequence of with the highest probability is (recall that we start with observations in this setting). With this assumption, it can be seen that the policy that always picks action is among the best policies for both and .
I think this is actually a natural behaviour for a reasonable Solomonoff prior: It is natural to expect that is more likely than . It is natural to expect that the sequence of actions that leads to over has low complexity. Always picking is low complexity.
It is possible to construct an artificial UTM that ensures that "always take " is the best policy for , : A UTM can be constructed such that the corresponding Solomonoff prior assigns 3/4 probability to the program/environment "start with o_1. after action a_i, output o_i". The rest of the probability mass gets distributed according to some other more natural UTM.
Then, for , in each situation with history the optimal policy has to pick (the actions outside of this history have no impact on the utility): With 3/4 probability it will get utility of at least . And with probability at least . Whereas, for the choice of , with probability it will have utility of , and with probability it can get at most . We calculate , i.e. taking action is the better choice.
Similarly, for , the optimal policy has to pick too in each situation with history . Here, the calculation looks like .
"inclusion map" refers to the map , not the coproduct . The map is a coprojection (these are sometimes called "inclusions", see https://ncatlab.org/nlab/show/coproduct).
A simple example in sets: We have two sets , , and their disjoint union . Then the inclusion map is the map that maps (as an element of ) to (as an element of ).
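A small, hedged sketch of this in code (my own encoding of the disjoint union, with elements tagged by which summand they came from):

```python
X = {1, 2}
Y = {2, 3}

# Disjoint union X ⊔ Y: tag each element with its summand so the copies stay distinct.
disjoint_union = {("X", x) for x in X} | {("Y", y) for y in Y}

def include_X(x):
    # the inclusion map X -> X ⊔ Y
    return ("X", x)

print(include_X(2) in disjoint_union)   # True
print(include_X(2) == ("Y", 2))         # False: the 2 coming from X stays distinct from the 2 coming from Y
```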
What is an environmental subagent? An agent on a remote datacenter that the builders of the original agent don't know about?
Another thing that is not so clear to me in this description: does the first agent consider the alignment problem of the environmental subagent? It sounds like the environmental subagent cares about paperclip-shaped molecules, but is this a thing the first agent would be ok with?
This does not sound very encouraging from the perspective of AI Notkilleveryoneism. When the announcement of the foundation model task force talks about safety, I cannot find hints that they mean existential safety. Rather, it seems to be about safety for commercial purposes.
A lot of the money might go into building a foundation model. At least they should also announce that they will not share weights and details on how to build it, if they are serious about existential safety.
This might create an AI safety race to the top as a solution to the tragedy of the commons
This seems to be the opposite of that. The announcement talks a lot about establishing UK as a world leader, e.g. "establish the UK as a world leader in foundation models".
There is an additional problem where one of the two key principles for their estimates is
Avoid extreme confidence
This principle presumably leads you to pick probability estimates that keep some distance from 1 (e.g. picking at most 0.95).
If you build a fully conjunctive model, and you are not that great at extreme probabilities, then you will have a strong bias towards low overall estimates. And you can make your probability estimates even lower by introducing more (conjunctive) factors.
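A minimal sketch of this bias (with made-up numbers):

```python
# If every factor in a fully conjunctive model is capped at 0.95 ("avoid extreme
# confidence"), the product shrinks quickly as factors are added, regardless of
# what you actually believe about the individual steps.
cap = 0.95
for n_factors in (5, 10, 20, 30):
    print(n_factors, round(cap ** n_factors, 3))
# 5 0.774 / 10 0.599 / 20 0.358 / 30 0.215
```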
Nitpick: The title the authors picked ("Current and Near-Term AI as a Potential Existential Risk Factor") seems to better represent the content of the article than the title you picked for this LW post ("The Existential Risks of Current and Near-Term AI").
Reading the title, I was expecting an argument that extinction could come extremely soon (e.g. by chaining GPT-4 instances together in some novel and clever way). The authors of the article talk about something very different imo.
From just reading your excerpt (and not the whole paper), it is hard to determine how much alignment washing is going on here.
- what is aligned chain-of-thought? What would unaligned chain-of-thought look like?
- what exactly does alignment mean in the context of solving math problems?
But maybe these worries can be answered from reading the full paper...
I think overall this is a well-written blogpost. His previous blogpost already indicated that he took the arguments seriously, so this is not too much of a surprise. That previous blogpost was discussed and partially criticized on Lesswrong. As for the current blogpost, I also find it noteworthy that active LW user David Scott Krueger is in the acknowledgements.
This blogpost might even be a good introduction for AI xrisk for some people.
I hope he engages further with the issues. For example, I feel like inner misalignment is still sort of missing from the arguments.
I googled "Zeitgeist Addendum" and it does not seem to be a thing that would be useful for AGI risk.
- it is a follow-up to a 9/11 conspiracy movie
- it has some naive economic ideas (like the idea that abolishing money would fix a lot of issues)
- the Venus Project appears to not be very successful
Do you claim the movie had any great positive impact or presented any new, true, and important ideas?
There is also another linkpost for the same blogpost: https://www.lesswrong.com/posts/EP92JhDm8kqtfATk8/yoshua-bengio-argues-for-tool-ai-and-to-ban-executive-ai
There is also some commentary here: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai
Overall this is still encouraging. It seems to take seriously that
- value alignment is hard
- executive-AI should be banned
- banning executive-AI would be hard
- alignment research and AI safety are worthwhile.
I feel like there are enough shared assumptions that collaboration or dialogue with AI notkilleveryoneists could be very useful.
That said, I wish there were more details about his Scientist AI idea:
- How exactly will the Scientist AI be used?
- Should we expect the Scientist AI to have situational awareness?
- Would the Scientist AI be allowed to write large scale software projects that are likely to get executed after a brief reviewing of the code by a human?
- Are there concerns about Mesa-optimization?
Also it is not clear to me whether the safety is supposed to come from:
- the AI cannot really take actions in the world (and even when there is a superhuman AI that wants to do large-scale harms, it will not succeed, because it cannot take actions that achieve these goals)
- the AI has no intrinsic motivation for large-scale harm (while its output bits could in principle create large-scale harm, such a string of bits is unlikely because there is no drive towards such strings of bits).
- a combination of these two.