I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds. The evidence you present in each case is outputs generated by LLMs.
The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14. I'm unsure about 5.
Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much? Those seem like very surprisingly large updates to make on the basis of LLM outputs (especially in cases where those outputs are self-reports about the internal subjective experience itself, which are subject to substantial pressure from post-training).
Separately, I have some questions about claims like this:
The Big 3 LLMs are somewhat aware of what their own words and/or thoughts are referring to with regards to their previous words and/or thoughts. In other words, they can think about the thoughts "behind" the previous words they wrote.
This doesn't seem constructively ruled out by e.g. basic transformer architectures, but as justification you say this:
If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its "attention" modules are actually intentionally designed to know this sort of thing, using key/query/value lookups that occur "behind the scenes" of the text you actually see on screen.
How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs, vs. conditioning on its previous outputs to give you a plausible "explanation" for what it meant? The second seems much more likely to me (and this behavior isn't that hard to elicit, e.g. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).
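For readers unfamiliar with the mechanism the quoted claim invokes: here is a minimal numpy sketch of a generic single-head, causally-masked key/query/value attention computation. This is the textbook scaled dot-product formulation, not any particular model's implementation. Note what it actually produces: a weighted mix of earlier positions' value vectors. Nothing in it stores or exposes the "reasoning behind" earlier outputs.

```python
# Illustrative sketch of scaled dot-product attention with a causal mask.
# Shapes and values are made up for demonstration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each position's query is compared against the keys of all earlier
    positions; the resulting weights mix those positions' value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # query/key similarity, (seq, seq)
    # Causal mask: a token may only attend to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out = attention(Q, K, V)

# Position 0 can only attend to itself, so its output is exactly V[0].
assert np.allclose(out[0], V[0])
```

So "attention modules know this sort of thing" is true only in the sense that later activations are computed from earlier ones; it doesn't follow that a verbal self-report about those computations is faithful rather than confabulated.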
My impression is that Yudkowsky has harmed public epistemics in his podcast appearances by saying things forcefully and with rather poor spoken communication skills for novice audiences.
I recommend reading the YouTube comments on his recorded podcasts, rather than e.g. Twitter commentary from people with a pre-existing adversarial stance towards him (or AI risk questions writ large).
On one hand, I feel a bit skeptical that some dude outperformed approximately every other pollster and analyst by having a correct inside-view belief about how existing pollsters were messing up, especially given that he won't share the surveys. On the other hand, this sort of result is straightforwardly predicted by Inadequate Equilibria: an entire industry had the affordance to be arbitrarily deficient in what most people would think was their primary value-add, because they had no incentive for accuracy (no skin in the game), and as soon as someone with an edge could make outsized returns on it (via real-money prediction markets), they outperformed all the experts.
On net I think I'm still <50% that he had a correct belief about the size of Trump's advantage that was justified by the evidence he had available to him, but even being directionally-correct would have been sufficient to get outsized returns a lot of the time, so at that point I'm quibbling with his bet sizing rather than the direction of the bet.
I'm pretty sure Ryan is rejecting the claim that the people hiring for the roles in question are worse-than-average at detecting illegible talent.
Depends on what you mean by "resume building", but I don't think this is true for "need to do a bunch of AI safety work for free" or similar. For technical research, many people who have gone through MATS and then been hired at (or founded) safety orgs have no prior experience doing anything that looks like AI safety research, and some don't even have much in the way of ML backgrounds. Many people switch directly out of industry careers into doing e.g. ops or software work that isn't technical research. Policy might seem a bit trickier, but I know several people who did not spend anything like years doing resume building before finding policy roles or starting their own policy orgs and getting funding. (Though I think policy might actually be the most "straightforward" to break into, since all you need to do to demonstrate competence is publish a sufficiently good written artifact; admittedly this is mostly for starting your own thing. If you want to get hired at a "larger" policy org, resume building might matter more.)
(We switched back to shipping Calibri above Gill Sans Nova pending a fix for the horrible rendering on Windows, so if Ubuntu has Calibri, it'll have reverted back to the previous font.)
Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining to do so might jeopardize the company.
This doesn't seem right to me, though it's possible that I'm misreading either the old or new policy (or both).
Re: predefined evaluations, the old policy neither specified any evaluations in full detail, nor did it suggest that Anthropic would have designed the evaluations prior to a training run. (Though I'm not sure that's what you meant, when you contrasted it with "employees design and run them on the fly" as a description of the new policy.)
Re: CEO's decisionmaking, my understanding of the new policy is that the CEO (and RSO) will be responsible only for approving or denying an evaluation report making an affirmative case that a new model does not cross a relevant capability threshold ("3.3 Capability Decision", original formatting removed, all new bolding is mine):
If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. The process for making such a determination is as follows:
- First, we will compile a Capability Report that documents the findings from the comprehensive assessment, makes an affirmative case for why the Capability Threshold is sufficiently far away, and advances recommendations on deployment decisions.
- The report will be escalated to the CEO and the Responsible Scaling Officer, who will (1) make the ultimate determination as to whether we have sufficiently established that we are unlikely to reach the Capability Threshold and (2) decide any deployment-related issues.
- In general, as noted in Sections 7.1.4 and 7.2.2, we will solicit both internal and external expert feedback on the report as well as the CEO and RSO’s conclusions to inform future refinements to our methodology. For high-stakes issues, however, the CEO and RSO will likely solicit internal and external feedback on the report prior to making any decisions.
- If the CEO and RSO decide to proceed with deployment, they will share their decision–as well as the underlying Capability Report, internal feedback, and any external feedback–with the Board of Directors and the Long-Term Benefit Trust before moving forward.
The same is true for the "Safeguards Decision" (i.e. making an affirmative case that ASL-3 Required Safeguards have been sufficiently implemented, given that there is a model that has passed the relevant capabilities thresholds).
This is not true for the "Interim Measures" described as an allowable stopgap if Anthropic finds itself in the situation of having a model that requires ASL-3 Safeguards but is unable to implement those safeguards. My current read is that this is intended to cover the case where the "Capability Decision" report made the case that a model did not cross into requiring ASL-3 Safeguards, was approved by the CEO & RSO, and then later turned out to be wrong. It does seem like this permits more or less indefinite deployment of a model that requires ASL-3 Safeguards by way of "interim measures" which need to provide "the same level of assurance as the relevant ASL-3 Standard", with no provision for what to do if it turns out that implementing the actually-specified ASL-3 standard is intractable. This seems slightly worse than the old policy:
If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers. We will provide transparency and support to impacted customers throughout the process. An emergency of this type would merit a detailed post-mortem and a policy shift to avoid re-occurrence of this situation.
which has much the same immediate impact, but with at least a nod to a post-mortem and policy adjustment.
But, overall, the new policy doesn't seem to be opening up a gigantic hole that allows Dario to press the "all clear" button on capability determinations; he only has the additional option to veto, after the responsible team has already decided the model doesn't cross the threshold.
But that's a communication issue....not a truth issue.
Yes, and Logan is claiming that arguments which cannot be communicated to him in two sentences or fewer suffer from a conjunctive complexity burden that renders them "weak".
That's not trivial. There's no proof that there is such a coherent entity as "human values", there is no proof that AIs will be value-driven agents, etc, etc. You skipped over 99% of the Platonic argument there.
Many possible objections here, but of course spelling everything out would violate Logan's request for a short argument. Needless to say, that request does not have anything to do with effectively tracking reality: there is no "platonic" argument for any non-trivial claim describable in only two sentences, and yet things continue to be true in the world anyways. So, reductio ad absurdum: there are no valid or useful arguments which can be made for any interesting claims. Let's all go home now!
A strong good argument has the following properties:
- it is logically simple (can be stated in a sentence or two)
- This is important, because the longer your argument, the more details that have to be true, and the more likely that you have made a mistake. Outside the realm of pure-mathematics, it is rare for an argument that chains together multiple "therefore"s to not get swamped by the fact that
No, this is obviously wrong.
- Argument length is substantially a function of shared premises. I would need many more sentences to convey a novel argument about AI x-risk to someone who had never thought about the subject before, than to someone who has spent a lot of time in the field, because in all likelihood I would first need to communicate and justify many of the foundational concepts that we take for granted.
- Note that even here, on LessWrong, this kind of detailed argumentation is necessary to ward off misunderstandings.
- Argument strength is not an inverse function with respect to argument length, because not every additional "piece" of an argument is a logical conjunction which, if false, renders the entire argument false. Many details in any specific argument are narrowing down which argument the speaker is making, but are not themselves load-bearing (& conjunctive) claims that all have to be true for the argument to be valid. (These are often necessary; see #1.)
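To make the conjunction point concrete, here's a toy calculation (the numbers are made up for illustration, not estimates about any real argument):

```python
# Toy illustration: "argument strength" when every detail is treated as a
# conjunct vs. when only the load-bearing claims have to hold.
p = 0.9            # assumed chance that any single claim is correct
total_details = 10 # total "pieces" in the stated argument
load_bearing = 3   # claims that must all hold for the argument to survive

naive = p ** total_details   # every detail treated as a point of failure
actual = p ** load_bearing   # narrowing/disambiguating details don't compound

print(f"all-conjunctive strength:  {naive:.2f}")   # 0.35
print(f"load-bearing-only strength: {actual:.2f}")  # 0.73
```

The gap between the two numbers is exactly the error introduced by counting every sentence of an argument as an independent point of failure.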
Anyways, the trivial argument that AI doom is likely (given that you already believe we're likely to develop ASI in the next few decades, and that it will be capable of scientific R&D that sounds like sci-fi today) is that it's not going to have values that are friendly to humans, because we don't know how to build AI systems in the current paradigm with any particular set of values at all, and the people pushing frontier AI capabilities mostly don't think this is a real problem that needs to be figured out[1]. This is self-evidently true, but you (and many others) disagree. What now?
- ^
A moderately uncharitable compression of a more detailed disagreement, which wouldn't fit into one sentence.
Credit where credit is due: this is much better in terms of sharing one's models than one could say of Sam Altman, in recent days.
As noted above the footnotes, many people at Anthropic reviewed the essay. I'm surprised that Dario would hire so many people he thinks need to "touch grass" (because they think the scenario he describes in the essay sounds tame), as I'm pretty sure that describes a very large percentage of Anthropic's first ~150 employees (certainly over 20%, maybe 50%).
My top hypothesis is that this is a snipe meant to signal Dario's (and Anthropic's) factional alliance with Serious People; I don't think Dario actually believes that "less tame" scenarios are fundamentally implausible[1]. Other possibilities that occur to me, with my not very well considered probability estimates:
- I'm substantially mistaken about how many early Anthropic employees think "less tame" outcomes are even remotely plausible (20%), and Anthropic did actively try to avoid hiring people with those models early on (1%).
- I'm not mistaken about early employee attitudes, but Dario does actually believe AI is extremely likely to be substantially transformative, and extremely unlikely to lead to the "sci-fi"-like scenarios he derides (20%). Conditional on that, he didn't think it mattered whether his early employees had those models (20%) or might have slightly preferred not, all else equal, but wasn't that fussed about it compared to recruiting strong technical talent (60%).
I'm just having a lot of trouble reconciling what I know of the beliefs of Anthropic employees, and the things Dario says and implies in this essay. Do Anthropic employees who think less tame outcomes are plausible believe Dario when he says they should "touch grass"? If you don't feel comfortable answering that question in public, or can't (due to NDA), please consider whether this is a good situation to be in.
- ^
He has not, as far as I know, deigned to offer any public argument on the subject.
Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios? Most of my models for how we might go extinct in the next decade from loss-of-control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather than a deliberate or accidental use of technology that ends up killing all humans pretty quickly.
Separately, I don't understand how encouraging human-specific industries is supposed to work in practice. Do you have a model for maintaining "regulatory capture" in a sustained way, despite having no economic, political, or military power by which to enforce it? (Also, even if we do succeed at that, it doesn't sound like we get more than the Earth as a retirement home, but I'm confused enough about the proposed equilibrium that I'm not sure that's the intended implication.)
Yeah, the essay (I think correctly) notes that the most significant breakthroughs in biotech come from the small number of "broad measurement tools or techniques that allow precise but generalized or programmable intervention", which "are so powerful precisely because they cut through intrinsic complexity and data limitations, directly increasing our understanding and control".
Why then only such systems limited to the biological domain? Even if it does end up being true that scientific and technological progress is substantially bottlenecked on real-life experimentation, where even AIs that can extract many more bits from the same observations than humans still suffer from substantial serial dependencies with no meaningful "shortcuts", it still seems implausible that we don't get to nanotech relatively quickly, if it's physically realizable. And then that nanotech unblocks the rate of experimentation. (If you're nanotech skeptical, human-like robots seem sufficient as actuators to speed up real-life experimentation by at least an order of magnitude compared to needing to work through humans, and work on those is making substantial progress.)
If Dario thinks that progress will cap out at some level due to humans intentionally slowing down, it seems good to say this.
Footnote 2 maybe looks like a hint in this direction if you squint, but Dario spent a decent chunk of the essay bracketing outcomes he thought were non-default and would need to be actively steered towards, so it's interesting that he didn't explicitly list those (non-tame futures) as a type of outcome that he'd want to actively steer away from.
Not Mitchell, but at a guess:
- LLMs really like lists
- Some parts of this do sound a lot like LLM output:
- "Complex Intervention Development and Evaluation Framework: A Blueprint for Ethical and Responsible AI Development and Evaluation"
- "Addressing Uncertainties"
- Many people who post LLM-generated content on LessWrong often wrote it themselves in their native language and had an LLM translate it, so it's not a crazy prior, though I don't see any additional reason to have guessed that here.
Having read more of the post now, I do believe it was at least mostly human-written (which is not a claim that it was partially written by an LLM). It's not obvious that it's particularly relevant to LessWrong. The advice on the old internet was "lurk more"; now we show users warnings like this when they're writing their first post.
I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments. My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").
I don't really see how this is responding to my comment. I was not arguing about the merits of RLHF along various dimensions, or what various people think about it, but pointing out that calling something "an alignment technique" with no further detail is not helping uninformed readers understand what "RLHF" is better (but rather worse).
Again, please model an uninformed reader: how does the claim "RLHF is an alignment technique" constrain their expectations? If the thing you want to say is that some of the people who invented RLHF saw it as an early stepping stone to solving more challenging problems with alignment later, I have no objection to that claim. This is a claim about the motivations and worldviews of those people. But I don't know what sort of useful work "RLHF is an alignment technique" is doing, other than making claims that are not centrally about RLHF itself.
This wasn't part of my original reasoning, but I went and did a search for other uses of "alignment technique" in tag descriptions. There's one other instance that I can find, which I think could also stand to be rewritten, but at least in that case it's quite far down the description, well after the object-level details about the proposed technique itself.
Two reasons:
First, the change made the sentence much worse to read. It might not have been strictly ungrammatical, but it was bad english.
Second, I expect that the average person, unfamiliar with the field, would be left with a thought-terminating mental placeholder after reading the changed description. What does "is an alignment technique" mean? Despite being in the same sentence as "is a machine learning technique", it is not serving anything like the same role, in terms of the implicit claims it makes. Intersubjective agreement on what "is an alignment technique" means will be far worse than on "is a machine learning technique", and many implications of the first claim are far more contentious than of the second.
To me, "is an alignment technique" does not convey useful detail about the technique itself, but about how various people in the community relate to it (and similar sociological details). If you want to describe that kind of detail explicitly, that's one thing[1]. But it's actively confusing to conflate it with technical detail about the technique itself.
- ^
Though it's not the kind of detail that should live in the first sentence of the tag description, probably.
Almost no specific (interesting) output is information that's already been generated by any model, in the strictest sense.
reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative
Seems like it'd pretty obviously cover information generated by non-covered models that are routinely used by many ordinary people (as open source image models currently are).
As a sidenote, I think the law is unfortunately one of those pretty cursed domains where it's hard to be very confident of anything as a layman without doing a lot of your own research, and you can't even look at experts speaking publicly on the subject since they're often performing advocacy, rather than making unbiased predictions about outcomes. You could try to hire a lawyer for such advice, but it seems to be pretty hard to find lawyers who are comfortable giving their clients quantitative (probabilistic) and conditional estimates. Maybe this is better once you're hiring for e.g. general counsel of a large org, or maybe large tech company CEOs have to deal with the same headaches that we do. Often your best option is to just get a basic understanding of how relevant parts of the legal system work, and then do a lot of research into e.g. relevant case law, and then sanity-check your reasoning and conclusions with an actual lawyer specialized in that domain.
Not your particular comment on it, no.
Notwithstanding the tendentious assumption in the other comment thread that courts are maximally adversarial processes bent on misreading legislation to achieve their perverted ends, I would bet that the relevant courts would not in fact rule that a bunch of deepfaked child porn counted as "Other grave harms to public safety and security that are of comparable severity to the harms described in subparagraphs (A) to (C), inclusive", where those other things are "CBRN > mass casualties", "cyberattack on critical infra", and "autonomous action > mass casualties". Happy to take such a bet at 2:1 odds.
But there are some simpler reason that particular hypothetical fails:
- Image models are just not nearly as expensive to train, so it's unlikely that they'd fall under the definition of a covered model to begin with.
- Even if someone used a covered multimodal model, existing models can already do this.
See:
(2) “Critical harm” does not include any of the following:
(A) Harms caused or materially enabled by information that a covered model or covered model derivative outputs if the information is otherwise reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative.
It does not actually make any sense to me that Mira wanted to prevent leaks, and therefore didn't even tell Sam that she was leaving ahead of time. What would she be afraid of, that Sam would leak the fact that she was planning to leave... for what benefit?
Possibilities:
- She was being squeezed out, or otherwise knew her time was up, and didn't feel inclined to make it a maximally comfortable parting for OpenAI. She was willing to eat the cost of her own equity potentially losing a bunch of value if this derailed the ongoing investment round, as well as the reputational cost of Sam calling out the fact that she, the CTO of the most valuable startup in the world, resigned with no notice for no apparent good reason.
- Sam is lying or otherwise being substantially misleading about the circumstances of Mira's resignation, i.e. it was not in fact a same-day surprise to him. (And thinks she won't call him out on it?)
- ???
We recently had a security incident where an attacker used an old AWS access key to generate millions of tokens from various Claude models via AWS Bedrock. While we don't have any specific reason to think that any user data was accessed (and some reasons[1] to think it wasn't), most possible methods by which this key could have been found by an attacker would also have exposed our database credentials to the attacker. We don't know yet how the key was leaked, but we have taken steps to reduce the potential surface area in the future and rotated relevant credentials. This is a reminder that LessWrong does not have Google-level security and you should keep that in mind when using the site.
- ^
The main reason we don't think any user data was accessed is because this attack bore several signs of being part of a larger campaign, and our database also contains other LLM API credentials which would not have been difficult to find via a cursory manual inspection. Those credentials don't seem to have been used by the attackers. Larger hacking campaigns like this are mostly automated, and for economic reasons the organizations conducting those campaigns don't usually sink time into manually inspecting individual targets for random maybe-valuable stuff that isn't part of their pipeline.
I agree with your top-level comment but don't agree with this. I think the swipes at midwits are bad (particularly on LessWrong) but think it can be very valuable to reframe basic arguments in different ways, pedagogically. If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", this post seems good (if spiky, with easily trimmed downside).
And I do think "attempting to impart a basic intuition that might let people avoid certain classes of errors" is an appropriate shape of post for LessWrong, to the extent that it's validly argued.
as applied to current foundation models it appears to do so
I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)
Of course, if you assume that AIs will be able to do whatever they want without any resistance whatsoever from us, then you can of course conclude that they will be able to achieve any goals they want without needing to compromise with us. If killing humans doesn't cost anything, then yes, the benefits of killing humans, however small, will be higher, and thus it will be rational for AIs to kill humans. I am doubting the claim that the cost of killing humans will be literally zero.
See Ben's comment for why the level of nanotech we're talking about implies a cost of approximately zero.
I think maybe I derailed the conversation by saying "disassemble", when really "kill" is all that's required for the argument to go through. I don't know what sort of fight you are imagining humans having with nanotech that imposes substantial additional costs on the ASI beyond the part where it needs to build & deploy the nanotech that actually does the "killing" part, but in this world I do not expect there to be a fight. I don't think it requires being able to immediately achieve all of your goals at zero cost in order for it to be cheap for the ASI to do that, conditional on it having developed that technology.
Edit: a substantial part of my objection is to this:
If it is possible to trivially fill in the rest of his argument, then I think it is better for him to post that, instead of posting something that needs to be filled-in, and which doesn't actually back up the thesis that people are interpreting him as arguing for.
It is not always worth doing a three-month research project to fill in many details that you have already written up elsewhere in order to locally refute a bad argument that does not depend on those details. (The current post does locally refute several bad arguments, including that the law of comparative advantage means it must always be more advantageous to trade with humans. If you understand it to be making a much broader argument than that, I think that is the wrong understanding.)
Separately, it's not clear to me whether you yourself could fill in those details. In other words, are you asking for those details to be filled in because you actually don't know how Eliezer would fill them in, or because you have some other reason for asking for that additional labor (i.e. you think it'd be better for the public discourse if all of Eliezer's essays included that level of detail)?
Original comment:
The essay is a local objection to a specific bad argument, which, yes, is more compelling if you're familiar with Eliezer's other beliefs on the subject. Eliezer has written about those beliefs fairly extensively, and much of his writing was answering various other objections (including many of those you listed). There does not yet exist a single ten-million-word treatise which provides an end-to-end argument of the level of detail you're looking for. (There exist the Sequences, which are over a million words, but while they implicitly answer many of these objections, they're not structured to be a direct argument to this effect.)
As a starting point, why does nanotech imply that it will be cheaper to disassemble humans than to trade with them?
I think it would be much cheaper for you to describe a situation where an ASI develops the kind of nanotech that'd grant it technological self-sufficiency (and the ability to kill all humans), and it remains the case that trading with humans for any longer than it takes to bootstrap that nanotech is cheaper than just doing its own thing, while still being compatible with Eliezer's model of the world. I have no idea what kind of reasoning or justification you would find compelling as an argument for "cheaper to disassemble"; it seems to require very little additional justification conditional on that kind of nanotech being realized. My current guess is that you do not think that kind of nanotech is physically realizable by any ASI we are going to develop (including post-RSI), or maybe you think the ASI will be cognitively disadvantaged compared to humans in domains that it thinks are important (in ways that it can't compensate for, or develop alternatives for, somehow).
Ok, but you can trivially fill in the rest of it, which is that Eliezer expects ASI to develop technology which makes it cheaper to ignore and/or disassemble humans than to trade with them (nanotech), and that there will not be other AIs around at the time which 1) would be valuable trade partners for the AI that develops that technology (which gives it that decisive strategic advantage over everyone else) and 2) care about humans at all. I don't think discussion of when and why nation-states go to war with each other is particularly illuminating given the threat model.
Pascal's wager is Pascal's wager, no matter what box you put it in. You could try to rescue it by directly arguing that we should expect a greater measure of "entities with resources that they are willing to acausally trade for things like humanity continuing to exist" than of entities with the opposite preferences; I haven't seen a rigorous case for that, though it seems possible. But that's not sufficient: you need the expected measure of entities with that preference to be large enough that it makes sense to deal with the transaction costs/uncertainty of acausally trading at all. And that seems like a much harder case to make.
In general, Intercom is the best place to send us feedback like this, though we're moderately likely to notice a top-level shortform comment. Will look into it; sounds like it could very well be a bug. Thanks for flagging it.
If you include Facebook & Google (i.e. the entire orgs) as "frontier AI companies", then 6-figures. If you only include Deepmind and FAIR (and OpenAI and Anthropic), maybe order of 10-15k, though who knows what turnover's been like. Rough current headcount estimates:
Deepmind: 2600 (as of May 2024, includes post-Brain-merge employees)
Meta AI (formerly FAIR): ~1200 (unreliable sources; seems plausible, but is probably an implicit undercount, since they almost certainly rely a lot on various internal infrastructure used by all of Facebook's engineering departments that they'd otherwise need to build/manage themselves.)
OpenAI: >1700
Anthropic: >500 (as of May 2024)
So that's a floor of ~6k current employees.
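As a quick sanity check on that floor, summing the rough estimates above (all figures are the approximate May-2024 numbers from this comment, not official counts):

```python
# Rough headcount estimates from above (circa May 2024); all figures approximate.
headcounts = {
    "DeepMind": 2600,
    "Meta AI": 1200,   # formerly FAIR; likely an undercount
    "OpenAI": 1700,    # a ">" figure, so this is a floor
    "Anthropic": 500,  # also a ">" figure
}

floor = sum(headcounts.values())
print(floor)  # 6000, i.e. the "~6k" floor
```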
At some point in the last couple months I was tinkering with a feature that'd try to show you a preview of the section of each linked post that'd be most contextually relevant given where it was linked from, but it was both technically fiddly and the LLM reliability is not that great. But there might be something there.
Yeah, I meant terser compared to typical RLHF'd output from e.g. 4o. (I was looking at the traces they showed in https://openai.com/index/learning-to-reason-with-llms/).
o1's reasoning traces being much terser (sometimes to the point of incomprehensibility) seems predicted by doing gradient updates based on the quality of the final output without letting the raters see the reasoning traces, since this means the optimization pressure exerted on the cognition used for the reasoning traces is almost entirely in the direction of performance, as opposed to human-readability.
In the short term this might be good news for the "faithfulness" of those traces, but what it's faithful to is the model's ontology (hence less human-readable), see e.g. here and here.
In the long term, if you keep doing pretraining on model-generated traces, you might rapidly find yourself in steganography-land, as the pretraining bakes in the previously-externalized cognition into capabilities that the model can deploy in a single forward pass, and anything it externalizes as part of its own chain of thought will be much more compressed (and more alien) than what we see now.
I'm just saying it's harder to optimize in the world than to learn human values
Learning what human values are is of course a subset of learning about reality, but also doesn't really have anything to do with alignment (understood as an agent's tendency to optimize for states of the world that humans would find good).
alignment generalizes further than capabilities
But this is untrue in practice (observe that models do not become suddenly useless after they're jailbroken) and unlikely in principle (since capabilities come by default, when you learn to predict reality, but alignment does not; why would predicting reality lead to having preferences that are human-friendly? And the post-training "alignment" that AI labs are performing seems like it'd be quite unfriendly to me, if it did somehow generalize to superhuman capabilities). Also, whether or not it's true, it is not something I've heard almost any employee of one of the large labs claim to believe (minus maybe TurnTrout? not sure if he'd endorse it or not).
both because verification is way, way easier than generation, plus combined with the fact that we can afford to explore less in the space of values, combined with in practice reward models for humans being easier than capabilities strongly points to alignment generalizing further than capabilities
This is not what "generalizes further" means. "Generalizes further" means "you get more of it for less work".
An LLM that is to bioengineering as Karpathy is to CS or 3Blue1Brown is to Math makes explanations. Students everywhere praise it. In a few years there's a huge crop of startups populated by people who used it. But one person uses its stuff to help him make a weapon, and manages to kill some people. Laws like 1047 have been passed, though, so the maker turns out to be liable for this.
This still requires that an ordinary person wouldn't have been able to access the relevant information without the covered model (including with the help of non-covered models, which are accessible to ordinary people). In other words, I think this is wrong:
So, you can be held liable for critical harms even when you supply information that was publicly accessible, if it wasn't information an "ordinary person" wouldn't know.
The bill's text does not constrain the exclusion to information not "known" by an ordinary person, but to information not "publicly accessible" to an ordinary person. That's a much higher bar given the existence of already quite powerful[1] non-covered models, which make nearly all the information that's out there available to ordinary people. It looks almost as if it requires the covered model to be doing novel intellectual labor, which is load-bearing for the harm that was caused.
Your analogy fails for another reason: an LLM is not a youtuber. If that youtuber was doing personalized 1:1 instruction with many people, one of whom went on to make a novel bioweapon that caused hundreds of millions of dollars of damage, it would be reasonable to check that the youtuber was not actually a co-conspirator, or even using some random schmuck as a patsy. Maybe it turns out the random schmuck was in fact the driving force behind everything, but we find chat logs like this:
- Schmuck: "Hey, youtuber, help me design [extremely dangerous bioweapon]!"
- Youtuber: "Haha, sure thing! Here are step-by-step instructions."
- Schmuck: "Great! Now help me design a release plan."
- Youtuber: "Of course! Here's what you need to do for maximum coverage."
We would correctly throw the book at the youtuber. (Heck, we'd probably do that for providing critical assistance with either step, nevermind both.) What does throwing the book at an LLM look like?
Also, I observe that we do not live in a world where random laypeople frequently watch youtube videos (or consume other static content) and then go on to commit large-scale CBRN attacks. In fact, I'm not sure there's ever been a case of a layperson carrying out such an attack without the active assistance of domain experts for the "hard parts". This might have been less true of cyber attacks a few decades ago; some early computer viruses were probably written by relative amateurs and caused a lot of damage. Software security just really sucked. I would be pretty surprised if it were still possible for a layperson to do something similar today, without doing enough upskilling that they no longer meaningfully counted as a layperson by the time they're done.
And so if a few years from now a layperson does a lot of damage by one of these mechanisms, that will be a departure from the current status quo, where the laypeople who are at all motivated to cause that kind of damage are empirically unable to do so without professional assistance. Maybe the departure will turn out to be a dramatic increase in the number of laypeople so motivated, or maybe it turns out we live in the unhappy world where it's very easy to cause that kind of damage (and we've just been unreasonably lucky so far). But I'd bet against those.
ETA: I agree there's a fundamental asymmetry between "costs" and "benefits" here, but this is in fact analogous to how we treat human actions. We do not generally let people cause mass casualty events because their other work has benefits, even if those benefits are arguably "larger" than the harms.
- ^
In terms of summarizing, distilling, and explaining humanity's existing knowledge.
Oh, that's true, I sort of lost track of the broader context of the thread. Though then the company needs to very clearly define who's responsible for doing the risk evals, and making go/no-go/etc calls based on their results... and how much input do they accept from other employees?
This is not obviously true for the large AI labs, which pay their mid-level engineers/researchers something like 800-900k/year with ~2/3 of that being equity. If you have a thousand such employees, that's an extra ~$600m/year in cash. It's true that in practice the equity often ends up getting sold for cash later by the employees themselves (e.g. in tender offers/secondaries), but paying in equity is sort of like deferring the sale of that equity for cash. (Which also lets you bake in assumptions about growth in the value of that equity, etc...)
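To make that back-of-the-envelope arithmetic explicit (the comp figures and headcount here are just the rough estimates above, not official numbers):

```python
# Estimated extra annual cash outlay if equity comp were instead paid as cash.
def extra_cash_per_year(total_comp: float, equity_fraction: float, headcount: int) -> float:
    """Cash needed per year to replace the equity portion of compensation."""
    return total_comp * equity_fraction * headcount

# ~$533m-$600m/year for 1000 employees at $800-900k total comp, ~2/3 equity.
low = extra_cash_per_year(800_000, 2 / 3, 1000)
high = extra_cash_per_year(900_000, 2 / 3, 1000)
print(round(low / 1e6), round(high / 1e6))  # prints: 533 600
```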
Hey Brendan, welcome to LessWrong. I have some disagreements with how you relate to the possibility of human extinction from AI in your earlier essay (which I regret missing at the time it was published). In general, I read the essay as treating each "side" approximately as an emotional stance one could adopt, and arguing that the "middle way" stance is better than being either an unbridled pessimist or optimist. But it doesn't meaningfully engage with arguments for why we should expect AI to kill everyone, if we continue on the current path, or even really seem to acknowledge that there are any. There are a few things that seem like they are trying to argue against the case for AI x-risk, which I'll address below, alongside some things that don't seem like they're intended to be arguments about this, but that I also disagree with.
But rationalism ends up being a commitment to a very myopic notion of rationality, centered on Bayesian updating with a value function over outcomes.
I'm a bit sad that you've managed to spend a non-trivial amount of time engaging with the broader rationalist blogosphere and related intellectual outputs, and decided to dismiss it as myopic without either explaining what you mean (what would be a less myopic version of rationality?) or supporting the claim (what is the evidence that led you to think that "rationalism", as it currently exists in the world, is the myopic and presumably less useful version of the ideal you have in mind?). How is one supposed to argue against this? Of the many possible claims you could be making here, I think most of them are very clearly wrong, but I'm not going to spend my time rebutting imagined arguments, and instead suggest that you point to specific failures you've observed.
An excessive focus on the extreme case too often blinds the long-termist school from the banal and natural threats that lie before us: the feeling of isolation from hyper-stimulating entertainment at all hours, the proliferation of propaganda, the end of white-collar jobs, and so forth.
I am not a long-termist, but I have to point out that this is not an argument that the long-termist case for concern is wrong. It's also itself wrong, or at least deeply contrary to my experience: the average long-termist working on AI risk has probably spent more time thinking about those problems than 99% of the population.
EA does this by placing arguments about competing ends beyond rational inquiry.
I think you meant to make a very different claim here, as suggested by part of the next section:
However, the commonsensical, and seemingly compelling, focus on ‘effectiveness’ and ‘altruism’ distracts from a fundamental commitment to certain radical philosophical premises. For example, proximity or time should not govern other-regarding behavior.
Even granting this for the sake of argument (though in reality very few EAs are strict utilitarians in terms of impartiality), this would not put arguments about competing ends beyond rational inquiry. It's possible you mean something different by "rational inquiry" than my understanding of it, of course, but I don't see any further explanation or argument about this pretty surprising claim. "Arguments about competing ends by means of rational inquiry" is sort of... EA's whole deal, at least as a philosophy. Certainly the "community" fails to live up to the ideal, but it at least tries a fair bit.
When EA meets AI, you end up with a problematic equation: even a tiny probability of doom x negative infinity utils equals negative infinity utils. Individual behavior in the face of this equation takes on cosmic significance. People like many of you readers–adept at subjugating the world with symbols–become the unlikely superheroes, the saviors of humanity.
It is true that there are many people on the internet making dumb arguments in support of basically every position imaginable. I have seen people make those arguments. Pascalian multiplication by infinity is not the "core argument" for why extinction risk from AI is an overriding concern, not for rationalists, not for long-termists, not for EAs. I have not met anybody working on mitigating AI risk who thinks our unconditional risk of extinction from AI is under 1%, and most people are between 5% and ~99.5%. Importantly, those estimates are driven by specific object-level arguments based on their beliefs about the world and predictions about the future, i.e. how capable future AI systems will be relative to humans, what sorts of motivations they will have if we keep building them the way we're building them, etc. I wish your post had spent time engaging with those arguments instead of knocking down a transparently silly reframing of Pascal's Wager that no serious person working on AI risk would agree with.
Unlike the pessimistic school, the proponents of a more techno-optimistic approach begin with gratitude for the marvelous achievements of the modern marriage of science, technology, and capitalism.
This is at odds with your very own description of rationalists just a thousand words prior:
The tendency of rationalism, then, is towards a so-called extropianism. In this transhumanist vision, humans transcend the natural limits of suffering and death.
Granted, you do not explicitly describe rationalists as grateful for the "marvelous achievements of the modern marriage of science, technology, and capitalism". I am not sure if you have ever met a rationalist, but around these parts I hear "man, capitalism is awesome" (basically verbatim) and similar sentiments often enough that I'm not sure how we continue to survive living in Berkeley unscathed.
Though we sympathize with the existential risk school in the concern for catastrophe, we do not focus only on this narrow position. This partly stems from a humility about the limitations of human reason—to either imagine possible futures or wholly shape technology's medium- and long-term effects.
I ask you to please at least try engaging with object-level arguments before declaring that reasoning about the future consequences of one's actions is so difficult as to be pointless. After all, you don't actually believe that: you think that your proposed path will have better consequences than the alternatives you describe. Why so?
Is your perspective something like:
Something like that, though I'm much less sure about "non-norms-violating", because many possible solutions seem like they'd involve something qualitatively new (and therefore de-facto norm-violating, like nearly all new technology). Maybe a very superhuman TAI could arrange matters such that things just seem to randomly end up going well rather than badly, without introducing any new[1] social or material technology, but that does seem quite a bit harder.
I'm pretty uncertain about whether, if something like that ended up looking norm-violating, it'd be norm-violating like Uber was[2], or like super-persuasion. That seems very contingent on empirical questions that I think we don't have much insight into right now.
I'm unsure about the claim that if you put this aside, there is a way to end the acute risk period without needing truly insanely smart AIs.
I didn't mean to make the claim that there's a way to end the acute risk period without needing truly insanely smart AIs (if you put aside centrally-illegal methods); rather, that an AI would probably need to be relatively low on the "smarter than humans" scale to need to resort to methods that were obviously illegal to end the acute risk period.
(Responding in a consolidated way just to this comment.)
Ok, got it. I don't think the US government will be able and willing to coordinate and enforce a worldwide moratorium on superhuman TAI development, if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan. It might become more willing than it is now (though I'm not hugely optimistic), but I currently don't think as an institution it's capable of executing on that kind of plan and don't see why that will change in the next five years.
Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else's crazily powerful AI.
I think I disagree with the framing ("crazy weapons/defenses") but it does seem like you need some kind of qualitatively new technology. This could very well be social technology, rather than something more material.
Building insane weapons/defenses requires US government consent (unless you're committing massive crimes which seems like a bad idea).
I don't think this is actually true, except in the trivial sense where we have a legal system that allows the government to decide approximately arbitrary behaviors are post-facto illegal if it feels strongly enough about it. Most new things are not explicitly illegal. But even putting that aside[1], I think this is ignoring the legal routes that a qualitatively superhuman TAI might find to end the Acute Risk Period, if it was so motivated.
(A reminder that I am not claiming this is Anthropic's plan, nor would I endorse someone trying to build ASI to execute on this kind of plan.)
TBC, I don't think there are plausible alternatives to at least some US government involvement which don't require committing a bunch of massive crimes.
I think there's a very large difference between plans that involve nominal US government signoff on private actors doing things, in order to avoid committing massive crimes (or to avoid the appearance of doing so), plans that involve the US government mostly just slowing things down or stopping people from doing things, and plans that involve the US government actually being the entity that makes high-context decisions about e.g. what values to optimize for, given a slot into which to put values.
- ^
I agree that stories which require building things that look very obviously like "insane weapons/defenses" seem bad, both for obvious deontological reasons, but also because I wouldn't expect them to work well enough to be worth it even under "naive" consequentialist analysis.
Regrettably, I think the Schelling-laptop is a Macbook, not a cheap laptop. (To slightly expand: if you're unopinionated and don't have specific needs that are poorly served by Macs, I think they're among the most efficient ways to buy your way out of various kinds of frustrations with owning and maintaining a laptop. I say this as someone who grew up on Windows, spent a couple years running Ubuntu on an XPS but otherwise mainlined Windows, and was finally exposed to Macbooks in a professional context ~6 years ago; at this point my next personal laptop will almost certainly also be a Macbook. They also have the benefit of being popular enough that they're a credible contender for an actual Schelling point.)
I agree with large parts of this comment, but am confused by this:
I think you should instead plan on not building such systems as there isn't a clear reason why you need such systems and they seem super dangerous. That's not to say that you shouldn't also do research into aligning such systems, I just think the focus should instead be on measures to avoid needing to build them.
While I don't endorse it due to disagreeing with some (stated and unstated) premises, I think there's a locally valid line of reasoning that goes something like this:
- if Anthropic finds itself in a world where it's successfully built not-vastly-superhuman TAI, it seems pretty likely that other actors have also done so, or will do so relatively soon
- it is now legible (to those paying attention) that we are in the Acute Risk Period
- most other actors who have or will soon have TAI will be less safety-conscious than Anthropic
- if nobody ends the Acute Risk Period, it seems pretty likely that one of those actors will do something stupid (like turn over their AI R&D efforts to their unaligned TAI), and then we all die
- not-vastly-superhuman TAI will not be sufficient to prevent those actors from doing something stupid that ends the world
- unfortunately, it seems like we have no choice but to make sure we're the first to build superhuman TAI, to make sure the Acute Risk Period has a good ending
This seems like the pretty straightforward argument for racing, and if you have a pretty specific combination of beliefs about alignment difficulty, coordination difficulty, capability profiles, etc, I think it basically checks out.
I don't know what set of beliefs implies that it's much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place. (In particular, how does this end up with the world in a stable equilibrium that doesn't immediately get knocked over by the second actor to reach TAI?)
(I am not the inimitable @Robert Miles, though we do have some things in common.)
I'm talking about the equilibrium where targets are following their "don't give in to threats" policy. Threateners don't want to follow a policy of always executing threats in that world - really, they'd probably prefer to never make any threats in that world, since it's strictly negative EV for them.
But threateners don't want to follow that policy, since in the resulting equilibrium they're wasting a lot of their own resources.
Note that this model moderately-strongly predicts the existence of tiny hyperprofitable orgs - places founded by someone who wasn’t that driven by dominance-status and managed to make a scalable product without building a dominance-status-seeking management hierarchy. Think Instagram, which IIRC had 13 employees when Facebook acquired it for $1B.
Instagram had no revenue at the time of its acquisition.
faul's comment represents some of my other objections reasonably well (there are many jobs at large orgs which have marginal benefits > marginal costs). I think I've even heard from Google engineers that there's an internal calculation indicating what cost savings would justify hiring an additional engineer, where those cost savings can be pretty straightforwardly derived from basic performance optimizations. Given the scale at which Google operates, it isn't that hard for engineers to save the company large multiples of their own compensation[1]. I worked for an engineering org ~2 orders of magnitude smaller[2] and they were just crossing the threshold where there existed "obvious" cost-saving opportunities in the 6-figures per year range.
- ^
The surprising part, to people outside the industry, is that this often isn't the most valuable thing for a company to spend marginal employees on, though sometimes that's because this kind of work is often (correctly) perceived as unappreciated drudge work.
- ^
In headcount; much more than 2 OoMs smaller in terms of compute usage/scale.