If you only care about the real world and you're sure there's only one real world, then the fact that you at time 0 would sometimes want to bind yourself at time 1 (e.g., physically commit to some action or self-modify to perform some action at time 1) seems very puzzling, or indicates that something must be wrong: at time 1 you're in a strictly better epistemic position, having found out more information about which world is real, so what sense does it make for your decision theory to have you-at-time-0 override you-at-time-1's decision?
(If you believed in something like Tegmark IV but your values constantly change to only care about the subset of worlds that you're in, then time inconsistency, and wanting to override your later selves, would make more sense, as your earlier self and later self would simply have different values. But it seems counterintuitive to be altruistic this way.)
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. They can help solve the parts of alignment that are easy to verify, but alignment as a whole will still be bottlenecked on the philosophically confusing, hard-to-verify parts.
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn't.
So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I'm afraid that some people just don't feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.
BTW I have similar objections to working on relatively easy forms of (i.e., unscalable) alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don't think that's a good plan, this plan seems even worse.
And I agree with Bryan Caplan's recent take that friendships are often a bigger conflict of interest than money, so Open Phil higher-ups being friends with Anthropic higher-ups is troubling.
No kidding. From https://www.openphilanthropy.org/grants/openai-general-support/:
OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela.
Wish OpenPhil and EAs in general were more willing to reflect/talk publicly about their mistakes. Kind of understandable given human nature, but still... (I wonder if there are any mistakes I've made that I should reflect more on.)
To be clear, by "indexical values" in that context I assume you mean indexing on whether a given world is "real" vs "counterfactual," not just indexical in the sense of being egoistic? (Because I think there are compelling reasons to reject UDT without being egoistic.)
I think being indexical in this sense (while being altruistic) can also lead you to reject UDT, but it doesn't seem "compelling" that one should be altruistic this way. Want to expand on that?
Maybe breaking up certain biofilms held together by Ca?
Yeah there's a toothpaste on the market called Livfree that claims to work like this.
IIRC, high EDTA concentration was found to cause significant amounts of erosion.
Ok, that sounds bad. Thanks.
ETA: Found an article that explains how Livfree works in more detail:
Tooth surfaces are negatively charged, and so are bacteria; therefore, they should repel each other. However, salivary calcium coats the negative charges on the tooth surface and bacteria, allowing them to get very close (within 10 nm). At this point, van der Waal’s forces (attractive electrostatic forces at small distances) take over, allowing the bacteria to deposit on the tooth surfaces, initiating biofilm formation. A unique formulation of EDTA strengthens the negative electronic forces of the tooth, allowing the teeth to repel harmful plaque. This special formulation quickly penetrates through the plaque down to the tooth surface. There, it changes the surface charge back to negative by neutralizing the positively charged calcium ions. This new, stronger negative charge on the tooth surface environment simply allows the plaque and the tooth surface to repel each other. This requires neither an abrasive nor killing the bacteria.
The authors are very positive on this toothpaste, although they don't directly explain why it doesn't cause tooth erosion.
I actually no longer fully endorse UDT. It still seems a better decision theory approach than any other specific approach that I know, but it has a bunch of open problems and I'm not very confident that someone won't eventually find a better approach that replaces it.
To your question, I think if my future self decides to follow (something like) UDT, it won't be because I made a "commitment" to do it, but because my future self wants to follow it, because he thinks it's the right thing to do, according to his best understanding of philosophy and normativity. I'm unsure about this, and the specific objection you have is probably covered under #1 in my list of open questions in the link above.
(And then there's a very different scenario in which UDT gets used in the future, which is that it gets built into AIs, and then they keep using UDT until they decide not to, which if UDT is reflectively consistent would be never. I dis-endorse this even more strongly.)
Any thoughts on edathamil/EDTA or nano-hydroxyapatite toothpastes?
This means that in the future, there will likely be a spectrum of AIs of varying levels of intelligence, some much smarter than humans, others only slightly smarter, and still others merely human-level.
Are you imagining that the alignment problem is still unsolved in the future, such that all of these AIs are independent agents unaligned with each other (like humans currently are)? I guess in my imagined world, ASIs will have solved the alignment (or maybe control) problem at least for less intelligent agents, so you'd get large groups of AIs aligned with each other that can for many purposes be viewed as one large AI.
Building on (5), I generally expect AIs to calculate that it is not in their interest to expropriate wealth from other members of society, given how this could set a precedent for future wealth expropriation that comes back and hurts them selfishly.
At some point we'll reach technological maturity, and the ASIs will be able to foresee all remaining future shocks/changes to their economic/political systems, and probably determine that expropriating humans (and anyone else they decide to, I agree it may not be limited to humans) won't cause any future problems.
Even if a tiny fraction of consumer demand in the future is for stuff produced by humans, that could ensure high human wages simply because the economy will be so large.
This is only true if there's not a single human who decides to freely copy or otherwise reproduce themselves and drive down human wages to subsistence. And I guess yeah, maybe AIs will have fetishes like this, but (like my reaction to Paul Christiano's "1/trillion kindness" argument) I'm worried that AIs might have less benign fetishes. In my mind, this worry more than cancels out the prospect that humans might live off / earn a wage from benign fetishes.
This might be the most important point on my list, despite saying it last, but I think humans will likely be able to eventually upgrade their intelligence, better allowing them to “keep up” with the state of the world in the future.
I agree this will happen eventually (if humans survive), but I think it will take a long time, because we'll have to solve a bunch of philosophical problems to determine how to do this safely (e.g., without losing or distorting our values), and we probably can't trust AI's help with these (although I'd love to change that, hence my focus on metaphilosophy). In the meantime, AIs will be zooming ahead, partly because they started off thinking faster and partly because some will be reckless (like some humans currently are!) or have simple values that don't require philosophical contemplation to understand, so the situation I described is still likely to occur.
It therefore seems perfectly plausible for AIs to simply get rich within the system we have already established, and make productive compromises, rather than violently overthrowing the system itself.
So assuming that AIs get rich peacefully within the system we have already established, we'll end up with a situation in which ASIs produce all value in the economy, and humans produce nothing but receive an income and consume a bunch, through ownership of capital and/or taxing the ASIs. This part should be non-controversial, right?
At this point, it becomes a coordination problem for the ASIs to switch to a system in which humans no longer exist or no longer receive any income, and the ASIs get to consume or reinvest everything they produce. You're essentially betting that ASIs can't find a way to solve this coordination problem. This seems like a bad bet to me. (Intuitively it just doesn't seem like a very hard problem, relative to what I imagine the capabilities of the ASIs to be.)
I'm simply arguing against the point that smart AIs will automatically turn violent and steal from agents who are less smart than they are unless they're value aligned. This is a claim that I don't think has been established with any reasonable degree of rigor.
I don't know how to establish anything post-ASI "with any reasonable degree of rigor" but the above is an argument I recently thought of, which seems convincing, although of course you may disagree. (If someone has expressed this or a similar argument previously, please let me know.)
- Why? Perhaps we'd do it out of moral uncertainty, thinking maybe we owe something to our former selves, but future people probably won't think this.
- Currently our utility is roughly logarithmic in money, partly because we spend money on instrumental goals and there are diminishing returns as limited opportunities get used up. This won't be true of future utilitarians spending resources on their terminal values, so a "one in a hundred million fraction" of resources is a much bigger deal to them than to us.
I have a slightly different take, which is that we can't commit to doing this scheme even if we want to, because I don't see what we can do today that would warrant the term "commitment", i.e., would be binding on our post-singularity selves.
In either case (we can't or don't commit), the argument in the OP loses a lot of its force, because we don't know whether post-singularity humans will decide to do this kind of scheme or not.
So the commitment I want to make is just my current self yelling at my future self, that "no, you should still bail us out even if 'you' don't have a skin in the game anymore". I expect myself to keep my word that I would probably honor a commitment like that, even if trading away 10 planets for 1 no longer seems like that good of an idea.
This doesn't make much sense to me. Why would your future self "honor a commitment like that", if the "commitment" is essentially just one agent yelling at another agent to do something the second agent doesn't want to do? I don't understand what moral (or physical or motivational) force your "commitment" is supposed to have on your future self, if your future self does not already think doing the simulation trade is a good idea.
I mean imagine if as a kid you made a "commitment" in the form of yelling at your future self that if you ever had lots of money you'd spend it all on comic books and action figures. Now as an adult you'd just ignore it, right?
Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof - without explicit construction - that it is possible.
The meta problem here is that you gave a "proof" (in quotes because I haven't verified it myself as correct) using your own definitions of "aligned" and "superintelligence", but if people asserting that it's not possible in principle have different definitions in mind, then you haven't actually shown them to be incorrect.
Apparently the current funding round hasn't closed yet and might be in some trouble, and it seems much better for the world if the round were to fail or be done at a significantly lower valuation (in part to send a message to other CEOs not to imitate SamA's recent behavior). Zvi saying that $150B greatly undervalues OpenAI at this time seems like a big unforced error; I wonder if he could still correct it in some way.
What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?
I'm very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?
as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what's happening in a way that corrupts thoughts which previously implemented values.
Yeah, agreed that how to safely amplify oneself and reflect for long periods of time may be hard problems that should be solved (or extensively researched/debated if we can't definitely solve them) before starting something like CEV. This might involve creating the right virtual environment, social rules, epistemic norms, group composition, etc. A few things that seem easy to miss or get wrong:
- Is it better to have no competition or some competition, and what kind? (Past "moral/philosophical progress" might have been caused or spread by competitive dynamics.)
- How should social status work in CEV? (Past "progress" might have been driven by people motivated by certain kinds of status.)
- No danger or some danger? (Could a completely safe environment / no time pressure cause people to lose motivation or some other kind of value drift? Related: What determines the balance between intelligence signaling and virtue signaling?)
can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its "true wants, needs, and hopes for the future"?
I think this is worth thinking about as well, as an approach parallel to the above. It seems related to metaphilosophy in that if we can discover what "correct philosophical reasoning" is, we can solve this problem by asking "What would this chunk of matter conclude if it were to follow correct philosophical reasoning?"
As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:
Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook.
When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write-up their responses—a complete violation of Scale’s raison d’être.
So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is more preferred?
BTW here's a poem(?) I wrote for Twitter, actually before coming across the above story:
The people try to align the board. The board tries to align the CEO. The CEO tries to align the managers. The managers try to align the employees. The employees try to align the contractors. The contractors sneak the work off to the AI. The AI tries to align the AI.
but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this.
Why do you think this, and how would you convince skeptics? And there are two separate issues here. One is how to know their CEV won't be corrupted relative to what their values really are or should be, and the other is how to know that their real/normative values are actually highly altruistic. It seems hard to know both of these, and perhaps even harder to persuade others who may be very distrustful of such a person/group from the start.
Another is that even if we don’t die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.
Would be interested in understanding your perspective on this better. I feel like aside from AI, our world is not being eaten by molochs very quickly, and I prefer something like stopping AI development and doing (voluntary and subsidized) embryo selection to increase human intelligence for a few generations, then letting the smarter humans decide what to do next. (Please contact me via PM if you want to have a chat about this.)
AI companies don't seem to be shy about copying RLHF though. Llama, Gemini, and Grok are all explicitly labeled as using RLHF.
It's also not clear to me that most of the value of AI will accrue to them. I'm confused about this though.
I'm also uncertain, and it's another reason for going long a broad index instead. I would go even broader than the S&P 500 if I could, but nothing else has option chains going out to 2029.
If indeed OpenAI does restructure to the point where its equity is now genuine, then $150 billion seems way too low as a valuation
Why is OpenAI worth much more than $150B, when Anthropic is currently valued at only $30-40B? Also, loudly broadcasting this reduces OpenAI's cost of equity, which is undesirable if you think OpenAI is a bad actor.
To clarify, I don't actually want you to scare people this way, because I don't know if people can psychologically handle it or if it's worth the emotional cost. I only bring it up myself to counteract people saying things like "AIs will care a little about humans and therefore keep them alive" or when discussing technical solutions/ideas, etc.
Should have made it much scarier. "Superhappies" caring about humans "not in the specific way that the humans wanted to be cared for" sounds better or at least no worse than death, whereas I'm concerned about s-risks, i.e., risks of worse than death scenarios.
If a misaligned AI had a 1/trillion preference for "protecting the preferences of whatever weak agents happen to exist in the world", why couldn't it also have 1/trillion other vaguely human-like preferences, such as "enjoy watching the suffering of one's enemies" or "enjoy exercising arbitrary power over others"?
From a purely selfish perspective, I think I might prefer that a misaligned AI kills everyone, and take my chances with continuations of myself (my copies/simulations) elsewhere in the multiverse, rather than face whatever the sum-of-desires of the misaligned AI decides to do with humanity. (With the usual caveat that I'm very philosophically confused about how to think about all of this.)
And his response was basically to say that he already acknowledged my concern in his OP:
I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.
Personally, I have a bigger problem with people (like Paul and Carl) who talk about AIs keeping people alive without mentioning s-risks in the same breath, or who only mention them in a vague, easy-to-miss way, than with Eliezer not addressing Paul's arguments.
I'm thinking that the most ethical (morally least risky) way to "insure" against a scenario in which AI takes off and property/wealth still matters is to buy long-dated, far out-of-the-money S&P 500 calls. (The longest-dated and farthest out-of-the-money options seem to be the Dec 2029 10000-strike SPX calls. Spending $78 today on one of these gives a return of $10000 if SPX goes to 20000 by Dec 2029, for example.)
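For concreteness, here's a minimal sketch of that payoff arithmetic, just plugging in the figures quoted above; it ignores the SPX contract multiplier and cash-settlement details, so treat it as illustrative only:

```python
# Sketch of the payoff arithmetic above, using the quoted figures
# ($78 premium, 10000 strike, SPX at 20000 at expiry). Contract
# multiplier and cash-settlement details are ignored for simplicity.

def long_call_payoff(spot_at_expiry: float, strike: float) -> float:
    """Value of a call option at expiration (intrinsic value only)."""
    return max(spot_at_expiry - strike, 0.0)

premium = 78.0  # cost today of one Dec 2029 10000-strike SPX call, as quoted above
payoff = long_call_payoff(spot_at_expiry=20_000.0, strike=10_000.0)  # = 10000
print(f"payoff: {payoff:,.0f}; multiple on premium: {payoff / premium:.0f}x")  # ~128x
```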
My reasoning here is that I don't want to provide capital to AI companies or their suppliers, because that seems wrong given the high x-risk I judge their activities to be causing (otherwise I'd directly invest in them), but I also want to have resources in a post-AGI future in case that turns out to be important for realizing my/moral values. Suggestions welcome for better/alternative ways to do this.
What is going on with Constitutional AI? Does anyone know why no LLM aside from Claude (at least none that I can find) has used it? One would think that if it works about as well as RLHF (which it seems to), AI companies would be flocking to it to save on the cost of human labor.
Also, apparently ChatGPT doesn't know that Constitutional AI is RLAIF (until I reminded it) and Gemini thinks RLAIF and RLHF are the same thing. (Apparently not a fluke as both models made the same error 2 out of 3 times.)
- Once they get into CEV, they may not want to defer to others anymore, or may set things up with a large power/status imbalance between themselves and everyone else, which may be detrimental to moral/philosophical progress. There are plenty of seemingly idealistic people in history who refused to give up or share power once they got it. The prudent thing to do seems to be to never get that much power in the first place, or to share it as soon as possible.
- If you're pretty sure you will defer to others once inside CEV, then you might as well do it outside CEV due to #1 in my grandparent comment.
The main asymmetries I see are:
- Other people not trusting the group to not be corrupted by power and to reflect correctly on their values, or not trusting that they'll decide to share power even after reflecting correctly. Thus "programmers" who decide to not share power from the start invite a lot of conflict. (In other words, CEV is partly just trying to not take power away from people, whereas I think you've been talking about giving AIs more power than they already have. "the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with")
- The "programmers" not trusting themselves. I note that individuals or small groups trying to solve morality by themselves don't have very good track records. They seem to too easily become wildly overconfident and/or get stuck in intellectual dead-ends. Arguably the only group that we have evidence for being able to make sustained philosophical progress is humanity as a whole.
To the extent that these considerations don't justify giving every human equal power/weight in CEV, I may just disagree with Eliezer about that. (See also Hacking the CEV for Fun and Profit.)
About a week ago FAR.AI posted a bunch of talks at the 2024 Vienna Alignment Workshop to its YouTube channel, including Supervising AI on hard tasks by Jan Leike.
What do you think about my positions on these topics as laid out in Six Plausible Meta-Ethical Alternatives and Ontological Crisis in Humans?
My overall position can be summarized as being uncertain about a lot of things, and wanting (some legitimate/trustworthy group, i.e., not myself as I don't trust myself with that much power) to "grab hold of the whole future" in order to preserve option value, in case grabbing hold of the whole future turns out to be important. (Or some other way of preserving option value, such as preserving the status quo / doing AI pause.) I have trouble seeing how anyone can justifiably conclude "so don’t worry about grabbing hold of the whole future" as that requires confidently ruling out various philosophical positions as false, which I don't know how to do. Have you reflected a bunch and really think you're justified in concluding this?
E.g. in Ontological Crisis in Humans I wrote "Maybe we can solve many ethical problems simultaneously by discovering some generic algorithm that can be used by an agent to transition from any ontology to another?" which would contradict your "not expecting your preferences to extend into the distant future with many ontology changes" and I don't know how to rule this out. You wrote in the OP "Current solutions, such as those discussed in MIRI’s Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I’m not convinced there is a satisfactory solution within the constraints presented." but to me this seems like very weak evidence for the problem being actually unsolvable.
As long as all mature superintelligences in our universe don't necessarily have (end up with) the same values, and only some such values can be identified with our values or what our values should be, AI alignment seems as important as ever. You mention "complications" from obliqueness, but haven't people like Eliezer recognized similar complications pretty early, with ideas such as CEV?
It seems to me that from a practical perspective, as far as what we should do, your view is much closer to Eliezer's view than to Land's view (which implies that alignment doesn't matter and we should just push to increase capabilities/intelligence). Do you agree/disagree with this?
It occurs to me that maybe you mean something like "Our current (non-extrapolated) values are our real values, and maybe it's impossible to build or become a superintelligence that shares our real values so we'll have to choose between alignment and superintelligence." Is this close to your position?
I think the relevant implication from the thought experiment is that thinking a bunch about metaethics and so on will in practice change your values
I don't think that's necessarily true. For example some people think about metaethics and decide that anti-realism is correct and they should just keep their current values. I think that's overconfident but it does show that we don't know whether correct thinking about metaethics necessarily leads one to change one's values. (Under some other metaethical possibilities the same is also true.)
Also, even if it is possible to steelman Land in a way that eliminates the flaws in his argument, I'd rather spend my time reading philosophers who are more careful and do more thinking (or are better at it) before confidently declaring a conclusion. I do appreciate you giving an overview of his ideas, as it's good to be familiar with that part of the current philosophical landscape (apparently Land is a fairly prominent philosopher with an extensive Wikipedia page).
This made me curious enough to read Land's posts on the orthogonality thesis. Unfortunately I got a pretty negative impression from them. From what I've read, Land tends to be overconfident in his claims and fails to notice obvious flaws in his arguments. Links for people who want to judge for themselves (I had to dig up archive.org links as the original site has disappeared):
- http://web.archive.org/web/20131028060133/http://www.xenosystems.net/against-orthogonality/
- http://web.archive.org/web/20141013052107/http://www.xenosystems.net/stupid-monsters/
- http://web.archive.org/web/20200809114022/http://www.xenosystems.net/more-thought/
- http://web.archive.org/web/20140917022211/http://www.xenosystems.net/will-to-think/
From Will-to-Think ("Probably Land's best anti-orthogonalist essay"):
Imagine, instead, that Gandhi is offered a pill that will vastly enhance his cognitive capabilities, with the rider that it might lead him to revise his volitional orientation — even radically — in directions that cannot be anticipated, since the ability to think through the process of revision is accessible only with the pill. This is the real problem FAI (and Super-humanism) confronts. The desire to take the pill is the will-to-think. The refusal to take it, based on concern that it will lead to the subversion of presently supreme values, is the alternative. It’s a Boolean dilemma, grounded in the predicament: Is there anything we trust above intelligence (as a guide to doing ‘the right thing’)? The postulate of the will-to-think is that anything other than a negative answer to this question is self-destructively contradictory, and actually (historically) unsustainable.
When reading this it immediately jumps out at me that "boolean" is false. There are many other options Gandhi could take besides taking the pill or not. He could look for other ways to increase intelligence and pick one that is least likely to subvert his values. Perhaps try to solve metaethics first so that he has a better idea of what "preserving values" or "subversion of values" means. Or try to solve metaphilosophy to better understand what method of thinking is more likely to lead to correct philosophical conclusions, before trying to reflect on one's values. Somehow none of these options occur to Land and he concludes that the only reasonable choice is to take the pill with unknown effects on one's values.
- You can also make an argument for not taking over the world on consequentialist grounds, which is that nobody should trust themselves to not be corrupted by that much power. (Seems a bit strange that you only talk about the non-consequentialist arguments in footnote 1.)
- I wish this post also mentioned the downsides of decentralized or less centralized AI (such as externalities and race dynamics reducing investment in safety, and potential offense/defense imbalances, which in my mind are just as worrisome as the downsides of centralized AI), even if you don't focus on them for understandable reasons. To say nothing risks giving the impression that you're not worried about them at all, and that people should just straightforwardly push for decentralized AI to prevent the centralized outcome that many fear.
I'm actually pretty confused about what they did exactly. From the Safety section of Learning to Reason with LLMs:
Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model's safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.
And from Hiding the Chains of Thought:
For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
These two sections seem to contradict each other but I can also think of ways to interpret them to be more consistent. (Maybe "don't train any policy compliance or user preferences onto the chain of thought" is a potential future plan, not what they already did. Maybe they taught the model to reason about safety rules but not to obey them in the chain of thought itself.)
Does anyone know more details about this, and also about the reinforcement learning that was used to train o1 (what did they use as a reward signal, etc.)? I'm interested to understand how alignment in practice differs from theory (e.g. IDA), or if OpenAI came up with a different theory, what its current alignment theory is.
If the other player is a stone with “Threat” written on it, you should do the same thing, even if it looks like the stone’s behavior doesn’t depend on what you’ll do in response. Responding to actions and ignoring the internals when threatened means you’ll get a lot fewer stones thrown at you.
In order to "do the same thing" you either need the other's player's payoffs, or according to the next section "If you receive a threat and know nothing about the other agent’s payoffs, simply don’t give in to the threat!" So if all you see is a stone, then presumably you don't know the other agent's payoffs, so presumably "do the same thing" means "don't give in".
But that doesn't make sense because suppose you're driving and suddenly a boulder rolls towards you. You're going to "give in" and swerve, right? What if it's an animal running towards you and you know they're too dumb to do LDT-like reasoning or model your thoughts in their head, you're also going to swerve, right? So there's still a puzzle here where agents have an incentive to make themselves look like a stone (i.e., part of nature or not an agent), or to never use LDT or model others in any detail.
Another problem is, do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?
#1 has obviously happened. Nord Stream 1 was blown up within weeks of my OP, and AFAIK Russia hasn't substantially expanded its other energy exports. Less sure about #2 and #3, as it's hard to find post-2022 energy statistics. My sense is that the answers are probably "yes" but I don't know how to back that up without doing a lot of research.
However, coal stocks (BTU, AMR, CEIX, and ARCH being the main pure-play US coal stocks) haven't done as well as I had expected (the basket is roughly flat from Aug 2022 to today), for two other reasons: A. There have been two mild winters that greatly reduced winter energy demand and caused thermal coal prices to crash. Most people seem to attribute this to global warming caused by maritime sulfur regulations. B. Chinese real-estate problems caused metallurgical coal prices to also crash in recent months.
My general lesson from this is that long term investing is harder than I thought. Short term trading can still be profitable but can't match the opportunities available back in 2020-21 when COVID checks drove the markets totally wild. So I'm spending a lot less time investing/trading these days.
Unfortunately this ignores 3 major issues:
- race dynamics (also pointed out by Akash)
- human safety problems - given that alignment is defined "in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy", why should we believe that AI developers and/or parts of governments that can coerce AI developers will steer the AI systems in a good direction? E.g., that they won't be corrupted by power or persuasion or distributional shift, and are benevolent to begin with.
- philosophical errors or bottlenecks - there's a single mention of "wisdom" at the end, but nothing about how to achieve/ensure the unprecedented amount of wisdom or speed of philosophical progress that would be needed to navigate something this novel, complex, and momentous. The OP seems to suggest punting such problems to "outside consensus" or "institutions or processes", with apparently no thought towards whether such consensus/institutions/processes would be up to the task or what AI developers can do to help (e.g., by increasing AI philosophical competence).
Like others I also applaud Sam for writing this, but the actual content makes me more worried, as it's evidence that AI developers are not thinking seriously about some major risks and risk factors.
I think there’s a steady stream of philosophy getting interested in various questions in metaphilosophy
Thanks for this info and the references. I guess by "metaphilosophy" I meant something more meta than metaethics or metaepistemology, i.e., a field that tries to understand all philosophical reasoning in some unified or systematic way, including reasoning used in metaethics and metaepistemology, and metaphilosophy itself. (This may differ from standard academic terminology, in which case please let me know if there's a preferred term for the concept I'm pointing at.) My reasoning being that metaethics itself seems like a hard problem that has defied solution for centuries, so why stop there instead of going even more meta?
Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen.
I think you (and other philosophers) may be too certain that a pause won't happen, but I'm not sure I can convince you (at least not easily). What about calling for it in a low cost way, e.g., instead of doing something high profile like an open letter (with perceived high opportunity costs), just write a blog post or even a tweet saying that you wish for an AI pause, because ...? What if many people privately prefer an AI pause, but nobody knows because nobody says anything? What if by keeping silent, you're helping to keep society in a highly suboptimal equilibrium?
I think there are also good arguments for doing something like this from a deontological or contractualist perspective (i.e. you have a duty/obligation to honestly and publicly report your beliefs on important matters related to your specialization), which sidestep the "opportunity cost" issue, but I'm not sure if you're open to that kind of argument. I think they should have some weight given moral uncertainty.
Sadly, I don't have any really good answers for you.
Thanks, it's actually very interesting and important information.
I don't know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at finding answers to questions in normative ethics.
I've noticed (and stated in the OP) that normative ethics seems to be an exception where it's common to express uncertainty/confusion/difficulty. But I think, from both my inside and outside views, that this should be common in most philosophical fields (because, e.g., we've been trying to solve their problems for centuries without coming up with broadly convincing solutions), and there should be a steady stream of all kinds of philosophers going up the meta ladder all the way to metaphilosophy. It recently dawned on me that this doesn't seem to be the case.
Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don't know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.
What seems useless, calling for an AI pause, or the AI pause itself? I have trouble figuring this out, because if it's "calling for an AI pause", what is the opportunity cost (it seems easy enough to write or sign an open letter), and if it's "the AI pause itself", then "seems useless" contradicts "would love". In either case, this seems extremely important to openly discuss/debate! Can you please ask these philosophers to share their views on this on LW (or their preferred venue), and share your own views?
Thank you for your view from inside academia. Some questions to help me get a better sense of what you see:
- Do you know any philosophers who switched from non-meta-philosophy to metaphilosophy because they became convinced that the problems they were trying to solve are too hard and they needed to develop a better understanding of philosophical reasoning, or better intellectual tools in general? (Or what's the closest to this that you're aware of?)
- Do you know any philosophers who have expressed an interest in ensuring that future AIs will be philosophically competent, or a desire/excitement for supercompetent AI philosophers? (I know 1 or 2 private expressions of the former, but not translated into action yet.)
- Do you know any philosophers who are worried that philosophical problems involved in AI alignment/safety may be too hard to solve in time, and have called for something like an AI pause to give humanity more time to solve them? (Even philosophers who have expressed a concern about AI x-risk or are working on AI safety have not taken a position like this, AFAIK.)
- How often have you seen philosophers say something like "Upon further reflection, my proposed solution to problem X has many problems/issues, I'm no longer confident it's the right approach and now think X is much harder than I originally thought."
Would also appreciate any links, citations, or quotes (if from personal but sharable communications) on these.
These are all things I've said or done due to high estimate of philosophical difficulty, but not (or rarely) seen among academic philosophers, at least from my casual observation from outside academia. It's also possible that we disagree on what estimate of philosophical difficulty is appropriate (such that for example you don't think philosophers should often say or do these things), which would also be interesting to know.
My understanding of what happened (from reading this) is that you wanted to explore in a new direction very different from the then preferred approach of the AF team, but couldn't convince them (or someone else) to join you. To me this doesn't clearly have much to do with streetlighting, and my current guess is that it was probably reasonable of them to not be convinced. It was also perfectly reasonable of you to want to explore a different approach, but it seems unreasonable to claim without giving any details that it would have produced better results if only they had listened to you. (I mean you can claim this, but why should I believe you?)
If you disagree (and want to explain more), maybe you could either explain the analogy more fully (e.g., what corresponds to the streetlight, why should I believe that they overexplored the lighted area, what made you able to "see in the dark" to pick out a more promising search area or did you just generally want to explore the dark more) and/or try to convince me on the object level / inside view that your approach is or was more promising?
(Also perfectly fine to stop here if you want. I'm pretty curious on both the object and meta levels about your thoughts on AF, but you may not have wanted to get into such a deep discussion when you first joined this thread.)
(Upvoted since your questions seem reasonable and I'm not sure why you got downvoted.)
I see two ways to achieve some justifiable confidence in philosophical answers produced by superintelligent AI:
- Solve metaphilosophy well enough that we achieve an understanding of philosophical reasoning on par with our understanding of mathematical reasoning, and have ideas/systems analogous to formal proofs and mechanical proof checkers that we can use to check the ASI's arguments.
- Increase our own intelligence and philosophical competence until we can verify the ASI's reasoning ourselves.
Having worked on some of the problems myself (e.g. decision theory), I think the underlying problems are just very hard. Why do you think they could have done "so much more, much more intently, and much sooner"?
I've had this tweet pinned to my Twitter profile for a while, hoping to find some like-minded people, but with 13k views so far I've yet to get a positive answer (or find someone expressing this sentiment independently):
Among my first reactions upon hearing "artificial superintelligence" were "I can finally get answers to my favorite philosophical problems" followed by "How do I make sure the ASI actually answers them correctly?"
Anyone else reacted like this?
This aside, there are some people around LW/rationality who seem more cautious/modest/self-critical about proposing new philosophical solutions, like MIRI's former Agent Foundations team, but perhaps partly as a result of that, they're now out of a job!
"Signal group membership" may be true of the fields you mentioned (political philosophy and philosophy of religion), but seems false of many other fields such as philosophy of math, philosophy of mind, decision theory, anthropic reasoning. Hard to see what group membership someone is signaling by supporting one solution to Sleeping Beauty vs another, for example.
I'm increasingly worried that philosophers tend to underestimate the difficulty of philosophy. I've previously criticized Eliezer for this, but it seems to be a more general phenomenon.
Observations:
- Low expressed interest in metaphilosophy (in relation to either AI or humans)
- Low expressed interest in AI philosophical competence (either concern that it might be low, or desire/excitement for supercompetent AI philosophers with Jupiter-sized brains)
- Low concern that philosophical difficulty will be a blocker of AI alignment or cause of AI risk
- High confidence when proposing novel solutions (even to controversial age-old questions, and when the proposed solution fails to convince many)
- Rarely attacking one's own ideas (in a serious or sustained way) or changing one's mind based on others' arguments
- Rarely arguing for uncertainty/confusion (i.e., that that's the appropriate epistemic status on a topic), with normative ethics being a sometime exception
Possible explanations:
- General human overconfidence
- People who have a high estimate of the difficulty of philosophy self-selecting out of the profession.
- Academic culture/norms - no or negative rewards for being more modest or expressing confusion. (Moral uncertainty being sometimes expressed because one can get rewarded by proposing some novel mechanism for dealing with it.)
I have a lot of disagreements with section 6. Not sure where the main crux is, so I'll just write down a couple of things.
One intuition pump here is: in the current, everyday world, basically no one goes around with much of a sense of what people’s “values on reflection” are, or where they lead.
This only works because we're not currently often in danger of subjecting other people to major distributional shifts. See Two Neglected Problems in Human-AI Safety.
That is, ultimately, there is just the empirical pattern of: what you would think/feel/value given a zillion different hypothetical processes; what you would think/feel/value about those processes given a zillion different other hypothetical processes; and so on. And you need to choose, now, in your actual concrete circumstance, which of those hypotheticals to give authority to.
I notice that in order to argue that solving AI alignment does not need "very sophisticated philosophical achievement", you've proposed a solution to metaethics, which would itself constitute a "very sophisticated philosophical achievement" if it's correct!
Personally I'm very uncertain about metaethics (see also previous discussion on this topic between Joe and me), and don't want to see humanity bet the universe on any particular metaethical theory in our current epistemic state.
High population may actually be a problem, because it allows the AI transition to occur at low average human intelligence, hampering its governance. Low fertility/population would force humans to increase average intelligence before creating our successor, perhaps a good thing!
This assumes that it's possible to create better or worse successors, and that higher average human intelligence would lead to smarter/better politicians and policies, increasing our likelihood of building better successors.
Some worry about low fertility leading to a collapse of civilization, but embryo selection for IQ could prevent that, and even if collapse happens, natural selection would start increasing fertility and intelligence of humans again, so future smarter humans should be able to rebuild civilization and restart technological progress.
Added: Here's an example to illustrate my model. Assume a normally distributed population with an average IQ of 100, and that we need a certain number of people with IQ > 130 to achieve AGI. If the total population were to halve, then to get the same absolute number of IQ > 130 people as today, average IQ would have to increase by 4.5, and if the population were to become 1/10 of the original, average IQ would have to increase by 18.75.
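Here's a minimal sketch of that calculation, assuming the usual IQ standard deviation of 15 and holding the absolute number of people above the threshold fixed:

```python
# Sketch of the calculation above: IQ ~ Normal(mean, 15); hold the absolute
# number of people with IQ > 130 fixed while the population shrinks, and ask
# how much the mean must rise. Roughly reproduces the 4.5 and 18.75 figures above.
from scipy.stats import norm

SD, THRESHOLD, BASE_MEAN = 15.0, 130.0, 100.0
base_fraction = norm.sf(THRESHOLD, loc=BASE_MEAN, scale=SD)  # ~2.3% above 130 today

for pop_ratio in (0.5, 0.1):
    needed_fraction = base_fraction / pop_ratio          # same head count, smaller population
    required_mean = THRESHOLD - norm.isf(needed_fraction) * SD
    print(f"population x{pop_ratio}: mean IQ must rise by ~{required_mean - BASE_MEAN:.1f}")
# population x0.5: mean IQ must rise by ~4.6
# population x0.1: mean IQ must rise by ~18.8
```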
Social media sites are already getting overwhelmed by spam, fake images, fake videos, blackmail attempts, phishing, etc. The only way to counteract the speed and volume of massive AI-driven attacks is with AI-powered defenses. These defenses need rules. If those rules aren't formal and proven robust, then they will likely be hacked and exploited by adversarial AIs. So at the most basic level, we need infrastructure rules which are provably robust against classes of attacks. What those attack classes are and what properties those rules guarantee is part of what I'm arguing we need to be working on right now.
Maybe it would be more productive to focus on these nearer-term topics, which perhaps can be discussed more concretely. Have you talked to any experts in formal methods who think that it would be feasible (in the near future) to define such AI-driven attack classes and desirable properties for defenses against them, and do they have any specific ideas for doing so? Again from my own experience in cryptography, it took decades to formally define/refine seemingly much simpler concepts, so it's hard for me to understand where your relative optimism comes from.