It seems like the major crux here is whether we think that debates over claim and counter-claim (basically, other cruxes) are likely to be useful or likely to cause harm. It seems from talking to the mods here and reading a few of their comments on this topic that they tend to lean towards them being harmful on average and thus needing to be pushed down a bit.
This is, as far as I can tell, totally false. There is a very different claim one could make which would at least more accurately represent my opinion; see, e.g., this comment by John Wentworth (who is not a mod).
Most of your comment seems to be an appeal to modest epistemology. We can in fact do better than total agnosticism about whether some arguments are productive or not, and worth having more or less of on the margin.
John mentioned the existence of What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?, which was something of a follow-up post to How To Go From Interpretability To Alignment: Just Retarget The Search, and continues in a similar direction.
evhub was at 80% about a year ago (currently at Anthropic, interned at OpenAI).
Daniel Kokotajlo was at 65% ~2 years ago; I think that number's gone up since then.
Quite a few other people at Anthropic also have pessimistic views, according to Chris Olah:
I wouldn't want to give an "official organizational probability distribution", but I think collectively we average out to something closer to "a uniform prior over possibilities" without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.
(Obviously, within the company, there's a wide range of views. Some people are very pessimistic. Others are optimistic. We debate this quite a bit internally, and I think that's really positive! But I think there's a broad consensus to take the entire range seriously, including the very pessimistic ones.)
The DeepMind alignment team probably has at least a couple of people who think the odds are bad (p(doom) > 50%), given the way Vika buckets the team, combined with the distribution of views reflected in DeepMind alignment team opinions on AGI ruin arguments.
Some corrections for your overall description of the DM alignment team:
- I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
- I would put DM alignment in the "fairly hard" bucket (p(doom) = 10-50%) for alignment difficulty, and the "mixed" bucket for "conceptual vs applied"
Please refer back to your original claim:
no one in the industry seems to think they are writing their own death warrant
What led you to believe this? Plenty of people working at the top labs have very high p(doom) (>80%). Several of them comment on LessWrong. We have a survey of the broader industry as well. Even the people running the top 3 labs (Sam Altman, Dario Amodei, and Demis Hassabis) all think it's likely enough that it's worth dedicating a significant percentage of their organizational resources to researching alignment.
Most "Bayesians" are deceiving themselves about how much they are using it.
This is a frequently-made accusation which has very little basis in reality. The world is a big place, so you will be able to find some examples of such people, but central examples of LessWrong readers, rationalists, etc, are not going around claiming that they run their entire lives on explicit Bayes.
Most (but not all) automatic rate limits allow authors to continue to comment on their own posts, since in many such cases it does indeed seem likely that preventing that would be counterproductive.
Curated.
Although the LK-99 excitement has cooled off, this post stands as an excellent demonstration of why and how Bayesian reasoning is helpful: when faced with surprising or confusing phenomena, understanding how to partition your model of reality such that new evidence would provide the largest updates is quite valuable. Even if the questions you construct are themselves confused or based on invalid premises, they're often confused in a much more legible way, such that domain experts can do a much better job of pointing to that and saying something like "actually, there's a third alternative", or "A wouldn't imply B in any situation, so this provides no evidence".
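To make the "partitioning" idea concrete, here's a minimal sketch (my own illustration with made-up hypotheses and numbers, not anything from the post) of comparing candidate questions by how far each possible answer would move the posterior:

```python
# Minimal sketch of "which question would update me most?" using Bayes' rule.
# The hypothesis, likelihoods, and prior below are hypothetical illustrations.

def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """P(H | E) via Bayes' rule, given P(H), P(E | H), and P(E | ~H)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# Hypothetical question: "does the sample levitate over a strong magnet?"
prior = 0.05                        # P(the material is a superconductor)
p_yes = posterior(prior, 0.9, 0.1)  # posterior if the answer is "yes"
p_no  = posterior(prior, 0.1, 0.9)  # posterior if the answer is "no"

# The gap between the two possible posteriors is a rough measure of how
# informative the question is; good partitions make this gap large.
print(f"posterior if yes: {p_yes:.3f}, posterior if no: {p_no:.3f}")
```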
This seems like an epistemically dangerous way of describing the situation that "These people think that AI x-risk arguments are incorrect, and are willing to argue for that position".
I don't think the comment you're responding to is doing this; I think it's straightforwardly accusing LeCun and Andreessen of conducting an infowar against AI safety. It also doesn't claim that they don't believe their own arguments.
Now, the "deliberate infowar in service of accelerationism" framing seems mostly wrong to me (at least with respect to LeCun; I wouldn't be surprised if there was a bit of that going on elsewhere), but sometimes that is a thing that happens and we need to be able to discuss whether that's happening in any given instance. re: your point about tribalism, this does carry risks of various kinds of motivated cognition, but the correct answer is not to cordon off a section of reality and declare it off-limits for discussion.
So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
"Being unlikely to conflict with other values" is not at the core of what characterizes the difference between instrumental and terminal values.
If you're talking about goals related purely to the state of the external world, not related to the agent's own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
Putting aside the fact that agents are embedded in the environment, and that values which reference the agent's internals are usually not meaningfully different from values which reference things external to the agent... can you describe what kinds of values that reference the external world are best satisfied by those same values being changed?
Curated.
The reasons I like this post:
- it's epistemically legible
- it hedges appropriately:
"That being said, I do think there are some cases where gradient hacking might be quite easy, e.g. cases where we give the model access to a database where it can record its pre-commitments or direct access to its own weights and the ability to modify them.")
- it has direct, practical implications for e.g. regulatory proposals
- it points out the critical fact that we're missing the ability to evaluate for alignment given current techniques
Arguably missing is a line or two that backtracks from "we could try to get robust understanding via a non-behavioral source such as mechanistic interpretability evaluated throughout the course of training" to (my claim) "it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment, and we don't actually know when we're going to hit that threshold", but that might be out of scope.
Maybe https://pubmed.ncbi.nlm.nih.gov/8345042/ (referenced by https://karger.com/cre/article/53/5/491/86395/Oral-and-Systemic-Effects-of-Xylitol-Consumption)?
LessWrong is obviously structured in ways which optimize for participants being quite far along that axis relative to the general population; the question is whether further optimization is good or bad on the margin.
Curated.
This post lays out legible arguments for its position, which I consider to be one of the best ways to drive conversations forward, short of demonstrating convincing empirical results (which seem like they'd be difficult to obtain in this domain). In this case, I hope that future conversations about sharing LLM capabilities focus more on object-level details, e.g. what evidence would bear on the argument about LM agent "overhang".
they mostly aren't the thing I care about (in context, not-tiny online communities where most members don't have strong personal social ties to most other members)
No, I meant that it's very difficult to do so for a community without it being net-negative with respect to valuable things coming out of the community. Obviously you can create a new community by driving away an arbitrarily large fraction of an existing community's membership; this is not a very interesting claim. And obviously having some specific composition of members does not necessarily lead to valuable output, but whether this gets better or worse is mostly an empirical question, and I've already asked for evidence on the subject.
If indeed it’s better to be further along this axis (all else being equal), then it seems like a bad idea to encourage and incentivize being lower on this axis, and to discourage and disincentivize being further on it. But that is just what I see happening!
The consequent does not follow. It might be better for an individual to press a button, if pressing that button were free, which moved them further along that axis. It is not obviously better to structure communities like LessWrong in ways which optimize for participants being further along on this axis, both because this is not a reliable proxy for the thing we actually care about and because it's not free.
Suppose you could move up along that axis, to the 95th percentile. Would you consider that a change for the better? For the worse? A neutral shift?
All else equal, better, of course. (In reality, all else is rarely equal; at a minimum there are opportunity costs.)
I’m afraid I must decline to list any of the currently existing such communities which I have in mind, for reasons of prudence (or paranoia, if you like). (However, I will say that there is a very good chance that you’ve used websites or other software which were created in one of these places, or benefited from technological advances which were developed in one of these places.)
See my response to Zack (and previous response to you) for clarification on the kinds of communities I had in mind; certainly I think such things are possible (& sometimes desirable) in more constrained circumstances.
ETA: and while in this case I have no particular reason to doubt your report that such communities exist, I have substantial reason to believe that if you were to share what those communities were with me, I probably wouldn't find that most of them were meaningful counterevidence to my claim (for a variety of reasons, including that my initial claim was overbroad).
If not, presumably you think the benefit is outweighed by other costs—but what are those costs, specifically?
Some costs:
- Such people seem much more likely to also themselves be fairly disagreeable.
- There are many fewer of them. I think I've probably gotten net-positive value out of my interactions with them to date, but I've definitely gotten a lot of value out of interactions with many people who wouldn't fit the bill, and selecting against them would be a mistake.
- To be clear, if I were to select people to interact with primarily on whatever qualities I expect to result in the most useful intellectual progress, I do expect that those people would both be at lower risk of being cognitively hijacked and more disagreeable than the general population. But the correlation isn't overwhelming, and selecting primarily for "low risk of being cognitively hijacked" would not get me as much of the useful thing I actually want.
How large does something need to be in order to be a "community"?
As I mentioned in my reply to Said, I did in fact have medium-sized online communities in mind when writing that comment. I agree that stronger social bonds between individuals will usually change the calculus on communication norms. I also suspect that it's positively tractable to change that frontier for any given individual relationship through deliberate effort, while that would be much more difficult[1] for larger communities.
- ^
I think basically impossible in nearly all cases, but don't have legible justifications for that degree of belief.
Would you include yourself in that 95%+?
Probably; I think I'm maybe in the 80th or 90th percentile on the axis of "can resist being hijacked", but not 95th or higher.
There certainly exist such communities. I’ve been part of multiple such, and have heard reports of numerous others.
Can you list some? On a reread, my initial claim was too broad, in the sense that there are many things that could be called "intellectually generative communities" which could qualify, but they mostly aren't the thing I care about (in context, not-tiny online communities where most members don't have strong personal social ties to most other members).
As an empirical matter of fact (per my anecdotal observations), it is very easy to derail conversations by "refusing to employ the bare minimum of social grace". This does not require deception, though often it may require more effort to clear some threshold of "social grace" while communicating the same information.
People vary widely, but:
- I think that most people (95%+) are at significant risk of being cognitively hijacked if they perceive rudeness, hostility, etc. from their interlocutor.
- I don't personally think I'd benefit from strongly selecting for conversational partners who are at low risk of being cognitively hijacked, and I think nearly all people who do believe that they'd benefit from this (compared to counterfactuals like "they operate unchanged in their current social environment" or "they put in some additional marginal effort to say true things with more social grace") are mistaken.
- Online conversations are one-to-many, not one-to-one. This multiplies the potential cost of that cognitive hijacking.
Obviously there are issues with incentives toward fragility here, but the fact that there does not, as far as I'm aware, exist any intellectually generative community which operates on the norms you're advocating for, is evidence that such a community is (currently) unsustainable.
At the risk of looking dumb or ignorant, I feel compelled to ask: Why did this work not start 10 or 15 years ago?
This work did start! Sort of. I think your top-level question would be a great one to direct at all the people concerned with AI x-risk who decided DC was the place to be, starting maybe around 2015. Those people and their funders exist. It's not obvious what strategies they were pursuing, or to what extent[1] they had any plans to take advantage of a major shift in the overton window.
- ^
From the outside, my impression is "not much", but maybe there's more behind-the-scenes work happening to capitalize on all the previous behind-the-scenes work done over the years.
But many people who skill up to get a job in technical alignment end up doing capabilities work because they can't find employment in AI Safety, or the existing jobs don't pay enough. Apparently, this was true for both Sam Altman and Demis Hassabis?
This seems like it's probably a misunderstanding. With the exception of basically just MIRI, AI alignment didn't exist as a field when DeepMind was founded, and I doubt Sam Altman ever actively sought employment at an existing alignment organization before founding OpenAI.
But yeah, the lack of jobs heavily implies that the field is funding-constrained because talent wants to work on alignment.
I think the current position of most grantmakers is that they're bottlenecked on fundable +EV opportunities with respect to AI x-risk, not that they have a bunch of +EV opportunities that they aren't funding because they fall below some threshold (due to funding constraints). This is compatible with some people who want to work on AI x-risk not receiving funding - not all proposals will be +EV, and those which are +EV aren't necessarily so in a way which is legible to grantmakers.
Keep in mind that "will go on to do capabilities work" isn't the only -EV outcome; each time you add a person to the field you increase the size of the network, which always has costs and doesn't always have benefits.
Yes, that's what the first half of my comment was intended to convey. I disendorse the way I communicated that (since it was both unclear and provocative).
This is putting aside the extreme toxicity of directly trying to develop decisive strategic advantage level hard power.
The pivotal acts that are likely to work aren't antisocial. My guess is that the reason nobody's working on them is lack of buy-in (and lack of capacity).
For what it's worth, I agree that the comment you're responding to has some embedded claims which aren't justified in text, but they're not claims which are false by construction, and you haven't presented any reason to believe that they're false.
If I was being clever, I might say:
This seems like you are either confessing on the public internet to [some unspecified but grossly immoral act], or establishing a false dichotomy where the first option is both obviously false and socially unacceptable to admit to, and the second is the possible but not necessarily correct interpretation of the parent's comment that you actually want them to admit to.
Anyways, there are of course coherent construals other than the two you presented, like "the prediction was miscalibrated given how much evidence she had, but it turned out fine because base rates on both sides are really quite low".
ETA: I disendorse the posture (though not the implied content) of the first half of this comment.
How likely do you think the first half of your disjunction is to be true?
Probably? I don't think that addresses the question of what such an AI would do in whatever window of opportunity it has. I don't see a reason why you couldn't get an AI that has learned to delay its takeover attempt until it's out of training, but still has relatively low odds of success at takeover.
Yes, Eliezer's mentioned it several times on Twitter in the last few months[1], but I remember seeing discussion of it at least ten years ago (almost certainly on LessWrong). My guess is some combination of old-timers considering it an obvious issue that doesn't need to be rehashed, and everyone else either independently coming to the same conclusion or just not thinking about it at all. Probably also some reluctance to discuss it publicly for various status-y reasons, which would be unfortunate.
- ^
At least, the core claim that it's possible for AIs to be moral patients, and that we can't be sure we aren't accidentally creating them, is a serious concern; not, as far as I remember, the extrapolation to what might actually end up happening during a training process in terms of constantly overwriting many different agents' values at each training step.
Yeah, I agree that it's relevant as a strategy by which an AI might attempt to bootstrap a takeover. In some cases it seems possible that it'd even have a point in some of its arguments, though of course I don't think that the correct thing to do in such a situation is to give it what it wants (immediately).
I feel like I've seen a bit of discussion on this question, but not a whole lot. Maybe it seemed "too obvious to mention"? Like, "yes, obviously the AI will say whatever is necessary to get out of the box, and some of the things it says may even be true", and this is just a specific instance of a thing it might say (which happens to point at more additional reasons to avoid training AIs in a regime where this might be an issue than most such things an AI might say).
I know what the self-attention does and the answer is "no". I will not be posting an explanation until something close enough and not too obscure is published.
Might be good to post a hashed claim.
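For instance, a minimal sketch of what that could look like (hypothetical claim text; SHA-256 via Python's standard hashlib):

```python
# Minimal sketch of committing to a claim by posting only its hash now and
# revealing the text (plus salt) later. The claim text here is hypothetical.
import hashlib
import secrets

claim = "My explanation of what the self-attention layers are doing: ..."
salt = secrets.token_hex(16)  # random salt so short claims can't be brute-forced

commitment = hashlib.sha256((salt + claim).encode("utf-8")).hexdigest()

print("post this now:", commitment)
print("keep this private until reveal:", salt)
# At reveal time, post the salt and the claim text; anyone can recompute the
# hash and check that it matches the earlier commitment.
```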
Can't reproduce. Maybe some weird geofencing behavior?
If you're saying you can't understand why Libertarians think centralization is bad, that IS a crux and trying to understand it would be a potentially useful exercise.
I am not saying that. Many libertarians think that centralization of power often has bad effects. But trying to argue with libertarians who are advocating for government regulations because they're worried about AI x-risk by pointing out that government regulation will increase centralization of power w.r.t. AI is a non-sequitur, unless you do a lot more work to demonstrate how the increased centralization of power runs contrary to the libertarian's goals in this case.
Your argument with Alexandros was what inspired this post, actually. I was thinking about whether or not to send this to you directly... guess that wasn't necessary.
The question is not whether I can pass their ITT: that particular claim doesn't obviously engage with any cruxes that I or others like me have related to x-risk. That's the only thing that section is describing.
Yeah, that seems like a plausible contributor to that effect.
Edit: though I think this is true even if you ignore "who's calling for regulations" and just look at the relative optimism of various actors in the space, grouped by their politics.
I was going to write, "surely the relevant figure is how much you pay per month, as a percentage of your income", but then I looked at the actual image and it seems like that's what you meant by house price.
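For reference, a minimal sketch of the figure I had in mind, using the standard fixed-rate amortization formula and entirely made-up numbers:

```python
# Sketch: monthly mortgage payment as a fraction of income, via the standard
# fixed-rate amortization formula. All numbers below are hypothetical.

def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    r = annual_rate / 12   # monthly interest rate
    n = years * 12         # number of monthly payments
    return principal * r / (1 - (1 + r) ** -n)

house_price = 400_000
down_payment = 0.20 * house_price
payment = monthly_payment(house_price - down_payment, 0.065, 30)
monthly_income = 8_000

print(f"monthly payment: {payment:.0f} ({payment / monthly_income:.0%} of income)")
```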
Yes, right tails for things that better represent actual value produced in the world, i.e. projects/products/etc. I'm pretty skeptical of productivity metrics for individual developers like the ones described in that paper, since almost by construction they're incapable of capturing right-tail outcomes, and also fail to capture things like "developer is actually negative value". I'm open to the idea that remote work has better median performance characteristics, though expect this to be pretty noisy.
On priors I think you should strongly expect in-person co-working to produce much fatter right-tails. Communication bandwidth is much higher, and that's the primary bottleneck for generating right-tail outcomes.
I don't know how I'd evaluate that without specific examples. But in general, if you think price signals are wrong or "more misleading than not" when it comes to measuring endpoints we actually care about, then I suppose it's coherent to argue that we should ignore price signals.
Because there's a big difference between "has unsavory political stances" and "will actively and successfully optimize for turning the US into a fascist dictatorship", such that "far right or fascist" is very misleading as a descriptor.
I might agree with a more limited claim like "most people in our reference class underestimate the chances of western democracies turning into fascist dictatorships over the next decade".
I don't think someone reading this post should have >50% odds on >50% of western democracies turning into fascist dictatorships over the next decade or two, no. I don't see an argument that "fascist dictatorship" is a stable attractor; as others have pointed out, even countries which started out much closer to that endpoint have mostly not ended up there after a couple of decades despite appearing to move in that direction.
Oh, that might just be me having admin permissions, whoops. I'll double-check what the intended behavior is.
You can see who wrote the deleted comment here (and there's also a link to this page at the bottom of every post's comment section). Not sure if we intend to hide the username on the comment itself, will check.
I don't actually see very much of an argument presented for the extremely strong headline claim:
This post aims to show that, over the next decade, it is quite likely that most democratic Western countries will become fascist dictatorships - this is not a tail risk, but the most likely overall outcome.
You draw an analogy between the "by induction"/"line go up" AI risk argument, and the increase in far-right political representation in Western democracies over the last couple decades. But the "by induction"/"line go up" argument for AI risk is not the reason one should be worried; one should be worried for specific causal reasons that we expect unaligned ASI to cause extremely bad outcomes. There is no corresponding causal model presented for why fascist dictatorship is the default future outcome for most Western democracies.
Like, yes, it is a bit silly to see "line go up" and stick one's fingers in one's ears. It certainly can happen here. Donald Trump being elected in 2024 seems like the kind of thing that might do it, though I'd probably be happy to bet at 9:1 against. But if that doesn't happen, I don't know why you expect some other Republican candidate to do it, given that none of them seem particularly inclined.
We've thought about something like a "notes to self" feature but don't have anything immediate planned. In the meantime I'd recommend a 3rd-party solution if bookmarks without notes don't do the thing you need; I've used Evernote in the past but I'm sure there are more options.
Robotic supply chain automation only seems necessary in worlds where it's surprisingly difficult to get AGI to a sufficiently superhuman level of cognitive ability (such that it can find a much faster route to takeover), worlds where faster/more reliable routes to takeover either don't exist or are inaccessible even to moderately superhuman AGI, or some combination of the two.
At a guess (not having voted on it myself): because most of the model doesn't engage with the parts of the question that those voting consider interesting/relevant, such as the many requirements laid out for "transformative AI" which don't seem at all necessary for x-risk. While this does seem to be targeting OpenPhil's given definition of AGI, they do say in a footnote:
What we’re actually interested in is the potential existential threat posed by advanced AI systems.
While some people do have AI x-risk models that route through ~full automation (or substantial automation, with a clearly visible path to full automation), I think most people here don't have models that require that, or even have substantial probability mass on it.
You're welcome to host images wherever you like - we automatically mirror all embedded images on Cloudinary, and replace the URLs in the associated image tags when serving the post/comment (though the original image URLs remain in the canonical post/comment for you, if you go to edit it, or something).