The /allPosts page shows all quick takes/shortforms posted, though somewhat de-emphasized.
FYI: we have spoiler blocks.
This doesn't seem like it'd do much unless you ensured that there were training examples during RLAIF which you'd expect to cause that kind of behavior enough of the time that there'd be something to update against. (Which doesn't seem like it'd be that hard, though I think separately that approach seems kind of doomed - it's falling into a brittle whack-a-mole regime.)
LessWrong doesn't have a centralized repository of site rules, but here are some posts that might be helpful:
https://www.lesswrong.com/posts/bGpRGnhparqXm5GL7/models-of-moderation
https://www.lesswrong.com/posts/kyDsgQGHoLkXz6vKL/lw-team-is-adjusting-moderation-policy
We do currently require content to be posted in English.
"It would make sense to pay that cost if necessary" makes more sense than "we should expect to pay that cost", thanks.
it sounds like you view it as a bad plan?
Basically, yes. I have a draft post outlining some of my objections to that sort of plan; hopefully it won't sit in my drafts as long as the last similar post did.
(I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect not solving AI philosophical competence (while having solved intent alignment) to lead to catastrophe (thus putting us outside the worlds in which x-risks are reliably ‘solved’ for), perhaps in the way Wei Dai has talked about?)
I expect whatever ends up taking over the lightcone to be philosophically competent. I haven't thought very hard about the philosophical competence of whatever AI succeeds at takeover (conditional on that happening), or, separately, the philosophical competence of the stupidest possible AI that could succeed at takeover with non-trivial odds. I don't think solving intent alignment necessarily requires that we have also figured out how to make AIs philosophically competent, or vice-versa; I also haven't thought about how likely we are to experience either disjunction.
I think solving intent alignment without having made much more philosophical progress is almost certainly an improvement to our odds, but is not anywhere near sufficient to feel comfortable, since you still end up stuck in a position where you want to delegate "solve philosophy" to the AI, but you can't because you can't check its work very well. And that means you're stuck at whatever level of capabilities you have, and are still approximately a sitting duck waiting for someone else to do something dumb with their own AIs (like point them at recursive self-improvement).
What do people mean when they talk about a "long reflection"? The original usages suggest flesh-humans literally sitting around and figuring out moral philosophy for hundreds, thousands, or even millions of years, before deciding to do anything that risks value lock-in, but (at least) two things about this don't make sense to me:
- A world where we've reliably "solved" for x-risks well enough to survive thousands of years without also having meaningfully solved "moral philosophy" is probably physically realizable, but this seems like a pretty fine needle to thread from our current position. (I think if you have a plan for solving AI x-risk that looks like "get to ~human-level AI, pump the brakes real hard, and punt on solving ASI alignment" then maybe you disagree.)
- I don't think it takes today-humans a thousand years to come up with a version of indirect normativity (or CEV, or whatever) that actually just works correctly. I'd be somewhat surprised if it took a hundred, but maybe it's actually very tricky. A thousand just seems crazy. A million makes it sound like you're doing something very dumb, like figuring out every shard of each human's values by hand without knowing how to automate any of it.
I tried to make a similar argument here, and I'm not sure it landed. I think the argument has since demonstrated even more predictive validity with e.g. the various attempts to build and restart nuclear power plants, directly motivated by nearby datacenter buildouts, on top of the obvious effects on chip production.
Should be fixed now.
Good catch, looks like that's from this revision, which looks like it was copied over from Arbital - some LaTeX didn't make it through. I'll see if it's trivial to fix.
The page isn't dead, Arbital pages just don't load sometimes (or take 15+ seconds).
I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds. The evidence you present in each case is outputs generated by LLMs.
The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14. I'm unsure about 5.
Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much? Those seem like very surprisingly large updates to make on the basis of LLM outputs (especially in cases where those outputs are self-reports about the internal subjective experience itself, which are subject to substantial pressure from post-training).
Separately, I have some questions about claims like this:
The Big 3 LLMs are somewhat aware of what their own words and/or thoughts are referring to with regards to their previous words and/or thoughts. In other words, they can think about the thoughts "behind" the previous words they wrote.
This doesn't seem constructively ruled out by e.g. basic transformer architectures, but as justification you say this:
If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its "attention" modules are actually intentionally designed to know this sort of thing, using key/query/value lookups that occur "behind the scenes" of the text you actually see on screen.
How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs, vs. conditioning on its previous outputs to give you a plausible "explanation" for what it meant? The second seems much more likely to me (and this behavior isn't that hard to elicit, e.g. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).
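For reference, the key/query/value lookup being described can be sketched as follows. This is a toy illustration with random vectors (shapes and names are illustrative, not taken from any particular model):

```python
# Minimal sketch of scaled dot-product attention: each query position
# attends over all key positions, and the output is a weighted sum of values.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # softmax over keys (subtracting the max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query positions, dimension 8
K = rng.normal(size=(6, 8))  # 6 key positions (e.g. earlier tokens)
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that nothing in this mechanism by itself tells you whether a model's verbal report about its earlier output faithfully reflects the computation that produced it.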
My impression is that Yudkowsky has harmed public epistemics in his podcast appearances by saying things forcefully and with rather poor spoken communication skills for novice audiences.
I recommend reading the Youtube comments on his recorded podcasts, rather than e.g. Twitter commentary from people with a pre-existing adversarial stance to him (or AI risk questions writ large).
On one hand, I feel a bit skeptical that some dude outperformed approximately every other pollster and analyst by having a correct inside-view belief about how existing pollster were messing up, especially given that he won't share the surveys. On the other hand, this sort of result is straightforwardly predicted by Inadequate Equilibria, where an entire industry had the affordance to be arbitrarily deficient in what most people would think was their primary value-add, because they had no incentive to accuracy (skin in the game), and as soon as someone with an edge could make outsized returns on it (via real-money prediction markets), they outperformed all the experts.
On net I think I'm still <50% that he had a correct belief about the size of Trump's advantage that was justified by the evidence he had available to him, but even being directionally-correct would have been sufficient to get outsized returns a lot of the time, so at that point I'm quibbling with his bet sizing rather than the direction of the bet.
I'm pretty sure Ryan is rejecting the claim that the people hiring for the roles in question are worse-than-average at detecting illegible talent.
Depends on what you mean by "resume building", but I don't think this is true for "need to do a bunch of AI safety work for free" or similar. i.e. for technical research, many people that have gone through MATS and then been hired at or founded their own safety orgs have no prior experience doing anything that looks like AI safety research, and some don't even have much in the way of ML backgrounds. Many people switch directly out of industry careers into doing e.g. ops or software work that isn't technical research. Policy might seem a bit trickier but I know several people who did not spend anything like years doing resume building before finding policy roles or starting their own policy orgs and getting funding. (Though I think policy might actually be the most "straightforward" to break into, since all you need to do to demonstrate competence is publish a sufficiently good written artifact; admittedly this is mostly for starting your own thing. If you want to get hired at a "larger" policy org resume building might matter more.)
(We switched back to shipping Calibri above Gill Sans Nova pending a fix for the horrible rendering on Windows, so if Ubuntu has Calibri, it'll have reverted back to the previous font.)
Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining to do so might jeopardize the company.
This doesn't seem right to me, though it's possible that I'm misreading either the old or new policy (or both).
Re: predefined evaluations, the old policy neither specified any evaluations in full detail, nor did it suggest that Anthropic would have designed the evaluations prior to a training run. (Though I'm not sure that's what you meant, when you contrasted it with "employees design and run them on the fly" as a description of the new policy.)
Re: CEO's decisionmaking, my understanding of the new policy is that the CEO (and RSO) will be responsible only for approving or denying an evaluation report making an affirmative case that a new model does not cross a relevant capability threshold ("3.3 Capability Decision", original formatting removed, all new bolding is mine):
If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. The process for making such a determination is as follows:
- First, we will compile a Capability Report that documents the findings from the comprehensive assessment, makes an affirmative case for why the Capability Threshold is sufficiently far away, and advances recommendations on deployment decisions.
- The report will be escalated to the CEO and the Responsible Scaling Officer, who will (1) make the ultimate determination as to whether we have sufficiently established that we are unlikely to reach the Capability Threshold and (2) decide any deployment-related issues.
- In general, as noted in Sections 7.1.4 and 7.2.2, we will solicit both internal and external expert feedback on the report as well as the CEO and RSO’s conclusions to inform future refinements to our methodology. For high-stakes issues, however, the CEO and RSO will likely solicit internal and external feedback on the report prior to making any decisions.
- If the CEO and RSO decide to proceed with deployment, they will share their decision–as well as the underlying Capability Report, internal feedback, and any external feedback–with the Board of Directors and the Long-Term Benefit Trust before moving forward.
The same is true for the "Safeguards Decision" (i.e. making an affirmative case that ASL-3 Required Safeguards have been sufficiently implemented, given that there is a model that has passed the relevant capabilities thresholds).
This is not true for the "Interim Measures" described as an allowable stopgap if Anthropic finds itself in the situation of having a model that requires ASL-3 Safeguards but is unable to implement those safeguards. My current read is that this is intended to cover the case where the "Capability Decision" report made the case that a model did not cross into requiring ASL-3 Safeguards, was approved by the CEO & RSO, and then later turned out to be wrong. It does seem like this permits more or less indefinite deployment of a model that requires ASL-3 Safeguards by way of "interim measures" which need to provide "the same level of assurance as the relevant ASL-3 Standard", with no provision for what to do if it turns out that implementing the actually-specified ASL-3 standard is intractable. This seems slightly worse than the old policy:
If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers. We will provide transparency and support to impacted customers throughout the process. An emergency of this type would merit a detailed post-mortem and a policy shift to avoid re-occurrence of this situation.
which has much the same immediate impact, but with at least a nod to a post-mortem and policy adjustment.
But, overall, the new policy doesn't seem to be opening up a gigantic hole that allows Dario to press the "all clear" button on capability determinations; he only has the additional option to veto, after the responsible team has already decided the model doesn't cross the threshold.
But that's a communication issue....not a truth issue.
Yes, and Logan is claiming that arguments which cannot be communicated to him in no more than two sentences suffer from a conjunctive complexity burden that renders them "weak".
That's not trivial. There's no proof that there is such a coherent entity as "human values", there is no proof that AIs will be value-driven agents, etc, etc. You skipped over 99% of the Platonic argument there.
Many possible objections here, but of course spelling everything out would violate Logan's request for a short argument. Needless to say, that request has nothing to do with effectively tracking reality: there is no "platonic" argument for any non-trivial claim that fits in only two sentences, and yet things continue to be true in the world anyway. So, reductio ad absurdum: there are no valid or useful arguments which can be made for any interesting claims. Let's all go home now!
A strong good argument has the following properties:
- it is logically simple (can be stated in a sentence or two)
- This is important, because the longer your argument, the more details that have to be true, and the more likely that you have made a mistake. Outside the realm of pure-mathematics, it is rare for an argument that chains together multiple "therefore"s to not get swamped by the fact that
No, this is obviously wrong.
- Argument length is substantially a function of shared premises. I would need many more sentences to convey a novel argument about AI x-risk to someone who had never thought about the subject before, than to someone who has spent a lot of time in the field, because in all likelihood I would first need to communicate and justify many of the foundational concepts that we take for granted.
- Note that even here, on LessWrong, this kind of detailed argumentation is necessary to ward off misunderstandings.
- Argument strength is not an inverse function with respect to argument length, because not every additional "piece" of an argument is a logical conjunction which, if false, renders the entire argument false. Many details in any specific argument are narrowing down which argument the speaker is making, but are not themselves load-bearing (& conjunctive) claims that all have to be true for the argument to be valid. (These are often necessary; see #1.)
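The difference this makes is easy to quantify with made-up numbers: treating every sentence of an argument as a load-bearing conjunct gives a wildly different "strength" estimate than counting only the claims that must all hold for the conclusion to follow.

```python
# Toy illustration (all numbers are assumptions, not estimates of any real argument).
p = 0.9           # assumed probability that each individual claim is correct
sentences = 10    # total sentences in the written argument
load_bearing = 2  # claims that must ALL be true for the conclusion to follow

naive = p ** sentences      # naive model: every sentence is a conjunct
actual = p ** load_bearing  # only the load-bearing conjuncts multiply
print(round(naive, 3), round(actual, 3))  # 0.349 0.81
```

The remaining eight sentences in this toy example are doing clarifying work (narrowing down which argument is being made) rather than adding conjuncts, so they don't multiply against the argument's probability at all.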
Anyways, the trivial argument that AI doom is likely (given that you already believe we're likely to develop ASI in the next few decades, and that it will be capable of scientific R&D that sounds like sci-fi today) is that it's not going to have values that are friendly to humans, because we don't know how to build AI systems in the current paradigm with any particular set of values at all, and the people pushing frontier AI capabilities mostly don't think this is a real problem that needs to be figured out[1]. This is self-evidently true, but you (and many others) disagree. What now?
- ^
A moderately uncharitable compression of a more detailed disagreement, which wouldn't fit into one sentence.
Credit where credit is due: this is much better in terms of sharing one's models than one could say of Sam Altman, in recent days.
As noted above the footnotes, many people at Anthropic reviewed the essay. I'm surprised that Dario would hire so many people he thinks need to "touch grass" (because they think the scenario he describes in the essay sounds tame), as I'm pretty sure that describes a very large percentage of Anthropic's first ~150 employees (certainly over 20%, maybe 50%).
My top hypothesis is that this is a snipe meant to signal Dario's (and Anthropic's) factional alliance with Serious People; I don't think Dario actually believes that "less tame" scenarios are fundamentally implausible[1]. Other possibilities that occur to me, with my not very well considered probability estimates:
- I'm substantially mistaken about how many early Anthropic employees think "less tame" outcomes are even remotely plausible (20%), and Anthropic did actively try to avoid hiring people with those models early on (1%).
- I'm not mistaken about early employee attitudes, but Dario does actually believe AI is extremely likely to be substantially transformative, and extremely unlikely to lead to the "sci-fi"-like scenarios he derides (20%). Conditional on that, he didn't think it mattered whether his early employees had those models (20%) or might have slightly preferred not, all else equal, but wasn't that fussed about it compared to recruiting strong technical talent (60%).
I'm just having a lot of trouble reconciling what I know of the beliefs of Anthropic employees, and the things Dario says and implies in this essay. Do Anthropic employees who think less tame outcomes are plausible believe Dario when he says they should "touch grass"? If you don't feel comfortable answering that question in public, or can't (due to NDA), please consider whether this is a good situation to be in.
- ^
He has not, as far as I know, deigned to offer any public argument on the subject.
Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios? Most of my models for how we might go extinct in next decade from loss of control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather than a deliberate or accidental use of technology that ends up killing all humans pretty quickly.
Separately, I don't understand how encouraging human-specific industries is supposed to work in practice. Do you have a model for maintaining "regulatory capture" in a sustained way, despite having no economic, political, or military power by which to enforce it? (Also, even if we do succeed at that, it doesn't sound like we get more than the Earth as a retirement home, but I'm confused enough about the proposed equilibrium that I'm not sure that's the intended implication.)
Yeah, the essay (I think correctly) notes that the most significant breakthroughs in biotech come from the small number of "broad measurement tools or techniques that allow precise but generalized or programmable intervention", which "are so powerful precisely because they cut through intrinsic complexity and data limitations, directly increasing our understanding and control".
Why then only such systems limited to the biological domain? Even if it does end up being true that scientific and technological progress is substantially bottlenecked on real-life experimentation, where even AIs that can extract many more bits from the same observations than humans still suffer from substantial serial dependencies with no meaningful "shortcuts", it still seems implausible that we don't get to nanotech relatively quickly, if it's physically realizable. And then that nanotech unblocks the rate of experimentation. (If you're nanotech skeptical, human-like robots seem sufficient as actuators to speed up real-life experimentation by at least an order of magnitude compared to needing to work through humans, and work on those is making substantial progress.)
If Dario thinks that progress will cap out at some level due to humans intentionally slowing down, it seems good to say this.
Footnote 2 maybe looks like a hint in this direction if you squint, but Dario spent a decent chunk of the essay bracketing outcomes he thought were non-default and would need to be actively steered towards, so it's interesting that he didn't explicitly list those (non-tame futures) as a type of outcome that he'd want to actively steer away from.
Not Mitchell, but at a guess:
- LLMs really like lists
- Some parts of this do sound a lot like LLM output:
- "Complex Intervention Development and Evaluation Framework: A Blueprint for Ethical and Responsible AI Development and Evaluation"
- "Addressing Uncertainties"
- Many people who post LLM-generated content on LessWrong often wrote it themselves in their native language and had an LLM translate it, so it's not a crazy prior, though I don't see any additional reason to have guessed that here.
Having read more of the post now, I do believe it was at least mostly human-written (without this being a claim that no part of it was written by an LLM). It's not obvious that it's particularly relevant to LessWrong. The advice on the old internet was "lurk more"; now we show users warnings like this when they're writing their first post.
I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments. My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").
I don't really see how this is responding to my comment. I was not arguing about the merits of RLHF along various dimensions, or what various people think about it, but pointing out that calling something "an alignment technique" with no further detail is not helping uninformed readers understand what "RLHF" is better (but rather worse).
Again, please model an uninformed reader: how does the claim "RLHF is an alignment technique" constrain their expectations? If the thing you want to say is that some of the people who invented RLHF saw it as an early stepping stone to solving more challenging problems with alignment later, I have no objection to that claim. This is a claim about the motivations and worldviews of those people. But I don't know what sort of useful work "RLHF is an alignment technique" is doing, other than making claims that are not centrally about RLHF itself.
This wasn't part of my original reasoning, but I went and did a search for other uses of "alignment technique" in tag descriptions. There's one other instance that I can find, which I think could also stand to be rewritten, but at least in that case it's quite far down the description, well after the object-level details about the proposed technique itself.
Two reasons:
First, the change made the sentence much worse to read. It might not have been strictly ungrammatical, but it was bad English.
Second, I expect that the average person, unfamiliar with the field, would be left with a thought-terminating mental placeholder after reading the changed description. What does "is an alignment technique" mean? Despite being in the same sentence as "is a machine learning technique", it is not serving anything like the same role, in terms of the implicit claims it makes. Intersubjective agreement on what "is an alignment technique" means will be far worse than on "is a machine learning technique", and many implications of the first claim are far more contentious than of the second.
To me, "is an alignment technique" does not convey useful detail about the technique itself, but about how various people in the community relate to it (and similar sociological details). If you want to describe that kind of detail explicitly, that's one thing[1]. But it's actively confusing to conflate it with technical detail about the technique itself.
- ^
Though it's not the kind of detail that should live in the first sentence of the tag description, probably.
Almost no specific (interesting) output is information that's already been generated by any model, in the strictest sense.
reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative
Seems like it'd pretty obviously cover information generated by non-covered models that are routinely used by many ordinary people (as open source image models currently are).
As a sidenote, I think the law is unfortunately one of those pretty cursed domains where it's hard to be very confident of anything as a layman without doing a lot of your own research, and you can't even look at experts speaking publicly on the subject since they're often performing advocacy, rather than making unbiased predictions about outcomes. You could try to hire a lawyer for such advice, but it seems to be pretty hard to find lawyers who are comfortable giving their clients quantitative (probabilistic) and conditional estimates. Maybe this is better once you're hiring for e.g. general counsel of a large org, or maybe large tech company CEOs have to deal with the same headaches that we do. Often your best option is to just get a basic understanding of how relevant parts of the legal system work, and then do a lot of research into e.g. relevant case law, and then sanity-check your reasoning and conclusions with an actual lawyer specialized in that domain.
Not your particular comment on it, no.
Notwithstanding the tendentious assumption in the other comment thread that courts are maximally adversarial processes bent on misreading legislation to achieve their perverted ends, I would bet that the relevant courts would not in fact rule that a bunch of deepfaked child porn counted as "Other grave harms to public safety and security that are of comparable severity to the harms described in subparagraphs (A) to (C), inclusive", where those other things are "CBRN > mass casualties", "cyberattack on critical infra", and "autonomous action > mass casualties". Happy to take such a bet at 2:1 odds.
But there are some simpler reason that particular hypothetical fails:
- Image models are just not nearly as expensive to train, so it's unlikely that they'd fall under the definition of a covered model to begin with.
- Even if someone used a covered multimodal model, existing models can already do this.
See:
(2) “Critical harm” does not include any of the following:
(A) Harms caused or materially enabled by information that a covered model or covered model derivative outputs if the information is otherwise reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative.
It does not actually make any sense to me that Mira wanted to prevent leaks, and therefore didn't even tell Sam that she was leaving ahead of time. What would she be afraid of, that Sam would leak the fact that she was planning to leave... for what benefit?
Possibilities:
- She was being squeezed out, or otherwise knew her time was up, and didn't feel inclined to make it a maximally comfortable parting for OpenAI. She was willing to eat the cost of her own equity potentially losing a bunch of value if this derailed the ongoing investment round, as well as the reputational cost of Sam calling out the fact that she, the CTO of the most valuable startup in the world, resigned with no notice for no apparent good reason.
- Sam is lying or otherwise being substantially misleading about the circumstances of Mira's resignation, i.e. it was not in fact a same-day surprise to him. (And thinks she won't call him out on it?)
- ???
We recently had a security incident where an attacker used an old AWS access key to generate millions of tokens from various Claude models via AWS Bedrock. While we don't have any specific reason to think that any user data was accessed (and some reasons[1] to think it wasn't), most possible methods by which this key could have been found by an attacker would also have exposed our database credentials to the attacker. We don't know yet how the key was leaked, but we have taken steps to reduce the potential surface area in the future and rotated relevant credentials. This is a reminder that LessWrong does not have Google-level security and you should keep that in mind when using the site.
- ^
The main reason we don't think any user data was accessed is because this attack bore several signs of being part of a larger campaign, and our database also contains other LLM API credentials which would not have been difficult to find via a cursory manual inspection. Those credentials don't seem to have been used by the attackers. Larger hacking campaigns like this are mostly automated, and for economic reasons the organizations conducting those campaigns don't usually sink time into manually inspecting individual targets for random maybe-valuable stuff that isn't part of their pipeline.
I agree with your top-level comment but don't agree with this. I think the swipes at midwits are bad (particularly on LessWrong) but think it can be very valuable to reframe basic arguments in different ways, pedagogically. If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", this post seems good (if spiky, with easily trimmed downside).
And I do think "attempting to impart a basic intuition that might let people avoid certain classes of errors" is an appropriate shape of post for LessWrong, to the extent that it's validly argued.
as applied to current foundation models it appears to do so
I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)
Of course, if you assume that AIs will be able to do whatever they want without any resistance whatsoever from us, then you can conclude that they will be able to achieve any goals they want without needing to compromise with us. If killing humans doesn't cost anything, then yes, the benefits of killing humans, however small, will be higher, and thus it will be rational for AIs to kill humans. I am doubting the claim that the cost of killing humans will be literally zero.
See Ben's comment for why the level of nanotech we're talking about implies a cost of approximately zero.
I think maybe I derailed the conversation by saying "disassemble", when really "kill" is all that's required for the argument to go through. I don't know what sort of fight you are imagining humans having with nanotech that imposes substantial additional costs on the ASI beyond the part where it needs to build & deploy the nanotech that actually does the "killing" part, but in this world I do not expect there to be a fight. I don't think it requires being able to immediately achieve all of your goals at zero cost in order for it to be cheap for the ASI to do that, conditional on it having developed that technology.
Edit: a substantial part of my objection is to this:
If it is possible to trivially fill in the rest of his argument, then I think it is better for him to post that, instead of posting something that needs to be filled-in, and which doesn't actually back up the thesis that people are interpreting him as arguing for.
It is not always worth doing a three-month research project to fill in many details that you have already written up elsewhere in order to locally refute a bad argument that does not depend on those details. (The current post does locally refute several bad arguments, including the argument that the law of comparative advantage means it must always be more advantageous to trade with humans. If you understand it to be making a much broader argument than that, I think that is the wrong understanding.)
Separately, it's not clear to me whether you yourself could fill in those details. In other words, are you asking for those details to be filled in because you actually don't know how Eliezer would fill them in, or because you have some other reason for asking for that additional labor (e.g. you think it'd be better for the public discourse if all of Eliezer's essays included that level of detail)?
Original comment:
The essay is a local objection to a specific bad argument, which, yes, is more compelling if you're familiar with Eliezer's other beliefs on the subject. Eliezer has written about those beliefs fairly extensively, and much of his writing was answering various other objections (including many of those you listed). There does not yet exist a single ten-million-word treatise which provides an end-to-end argument of the level of detail you're looking for. (There exist the Sequences, which are over a million words, but while they implicitly answer many of these objections, they're not structured to be a direct argument to this effect.)
As a starting point, why does nanotech imply that it will be cheaper to disassemble humans than to trade with them?
I think it would be much cheaper for you to describe a situation where an ASI develops the kind of nanotech that'd grant it technological self-sufficiency (and the ability to kill all humans), and it remains the case that trading with humans for any longer than it takes to bootstrap that nanotech is cheaper than just doing its own thing, while still being compatible with Eliezer's model of the world. I have no idea what kind of reasoning or justification you would find compelling as an argument for "cheaper to disassemble"; it seems to require very little additional justification conditional on that kind of nanotech being realized. My current guess is that you do not think that kind of nanotech is physically realizable by any ASI we are going to develop (including post-RSI), or maybe you think the ASI will be cognitively disadvantaged compared to humans in domains that it thinks are important (in ways that it can't compensate for, or develop alternatives for, somehow).
Ok, but you can trivially fill in the rest of it, which is that Eliezer expects ASI to develop technology which makes it cheaper to ignore and/or disassemble humans than to trade with them (nanotech), and that there will not be other AIs around at the time which 1) would be valuable trade partners for the AI that develops that technology (which gives it that decisive strategic advantage over everyone else) and 2) care about humans at all. I don't think discussion of when and why nation-states go to war with each other is particularly illuminating given the threat model.
Pascal's wager is Pascal's wager, no matter what box you put it in. You could try to rescue it by directly arguing that we should expect a greater measure of "entities with resources that they are willing to acausally trade for things like humanity continuing to exist" than of entities with the opposite preferences. I haven't seen a rigorous case for that, though it seems possible. But even that isn't sufficient: you need the expected measure of entities that have that preference to be large enough that dealing with the transaction costs/uncertainty of acausally trading at all makes sense. And that seems like a much harder case to make.
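To make the shape of that objection concrete, here's a toy expected-value comparison. Every number and the `net_value` function itself are invented purely for illustration; nothing here is a real model of acausal trade.

```python
# Toy model: acausal trade only makes sense if the expected measure of
# entities that would pay for humanity's survival, net of entities with
# the opposite preference, outweighs the transaction costs/uncertainty
# of trading at all. All numbers below are made up.

def net_value(measure_friendly, measure_hostile, stakes, transaction_cost):
    """Expected value of entering the acausal trade vs. abstaining."""
    return (measure_friendly - measure_hostile) * stakes - transaction_cost

# A small asymmetry in measure can still lose to the transaction costs:
print(net_value(0.02, 0.01, 100, 5) > 0)  # False: not worth trading
print(net_value(0.10, 0.01, 100, 5) > 0)  # True: worth trading
```

The point of the sketch is just that a favorable sign on the measure asymmetry isn't enough; its magnitude has to clear the cost bar.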
In general, Intercom is the best place to send us feedback like this, though we're moderately likely to notice a top-level shortform comment. Will look into it; sounds like it could very well be a bug. Thanks for flagging it.
If you include Facebook & Google (i.e. the entire orgs) as "frontier AI companies", then 6-figures. If you only include Deepmind and FAIR (and OpenAI and Anthropic), maybe order of 10-15k, though who knows what turnover's been like. Rough current headcount estimates:
Deepmind: 2600 (as of May 2024, includes post-Brain-merge employees)
Meta AI (formerly FAIR): ~1200 (unreliable sources; seems plausible, but is probably an implicit undercount, since they almost certainly rely heavily on various internal infrastructure used by all of Facebook's engineering departments that they'd otherwise need to build/manage themselves.)
OpenAI: >1700
Anthropic: >500 (as of May 2024)
So that's a floor of ~6k current employees.
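As a quick sanity check on that floor, the arithmetic over the estimates listed above:

```python
# Summing the rough headcount estimates above (all approximate; some
# are explicit lower bounds, so the total is a floor, not a point estimate).
headcounts = {
    "DeepMind": 2600,        # as of May 2024, post-Brain-merge
    "Meta AI (FAIR)": 1200,  # unreliable sources
    "OpenAI": 1700,          # lower bound
    "Anthropic": 500,        # lower bound, as of May 2024
}

floor = sum(headcounts.values())
print(floor)  # 6000
```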
At some point in the last couple months I was tinkering with a feature that'd try to show you a preview of the section of each linked post that'd be most contextually relevant given where it was linked from, but it was both technically fiddly and the LLM reliability is not that great. But there might be something there.
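A minimal sketch of that kind of feature using embedding similarity rather than an LLM. Everything here (`most_relevant_section`, the `embed` callable) is a hypothetical placeholder, not LessWrong's actual implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_relevant_section(link_context, sections, embed):
    """Return the section of the linked post whose embedding is
    closest to the text surrounding the link."""
    ctx_vec = embed(link_context)
    return max(sections, key=lambda s: cosine(ctx_vec, embed(s)))
```

In practice `embed` would be a real sentence-embedding model; the reliability problem mentioned above shows up in choosing section boundaries and in contexts where lexical overlap is misleading.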
Yeah, I meant terser compared to typical RLHF'd output from e.g. 4o. (I was looking at the traces they showed in https://openai.com/index/learning-to-reason-with-llms/).
o1's reasoning traces being much terser (sometimes to the point of incomprehensibility) seems predicted by doing gradient updates based on the quality of the final output without letting the raters see the reasoning traces, since this means the optimization pressure exerted on the cognition used for the reasoning traces is almost entirely in the direction of performance, as opposed to human-readability.
In the short term this might be good news for the "faithfulness" of those traces, but what it's faithful to is the model's ontology (hence less human-readable), see e.g. here and here.
In the long term, if you keep doing pretraining on model-generated traces, you might rapidly find yourself in steganography-land, as the pretraining bakes in the previously-externalized cognition into capabilities that the model can deploy in a single forward pass, and anything it externalizes as part of its own chain of thought will be much more compressed (and more alien) than what we see now.
I'm just saying it's harder to optimize in the world than to learn human values
Learning what human values are is of course a subset of learning about reality, but it doesn't really have anything to do with alignment (understood as an agent's tendency to optimize for states of the world that humans would find good).
alignment generalizes further than capabilities
But this is untrue in practice (observe that models do not become suddenly useless after they're jailbroken) and unlikely in principle (since capabilities come by default when you learn to predict reality, but alignment does not; why would predicting reality lead to having preferences that are human-friendly? And the post-training "alignment" that AI labs are performing seems like it'd be quite unfriendly to me, if it did somehow generalize to superhuman capabilities). Also, whether or not it's true, it is not something I've heard almost any employee of one of the large labs claim to believe (minus maybe TurnTrout? not sure if he'd endorse it or not).
both because verification is way, way easier than generation, plus combined with the fact that we can afford to explore less in the space of values, combined with in practice reward models for humans being easier than capabilities strongly points to alignment generalizing further than capabilities
This is not what "generalizes further" means. "Generalizes further" means "you get more of it for less work".
An LLM that is to bioengineering as Karpathy is to CS or 3Blue1Brown is to math makes explanations. Students everywhere praise it. In a few years there's a huge crop of startups populated by people who used it. But one person uses its material to help him make a weapon, and manages to kill some people. Laws like 1047 have been passed, though, so the maker turns out to be liable for this.
This still requires that an ordinary person wouldn't have been able to access the relevant information without the covered model (including with the help of non-covered models, which are accessible to ordinary people). In other words, I think this is wrong:
So, you can be held liable for critical harms even when you supply information that was publicly accessible, if it wasn't information an "ordinary person" wouldn't know.
The bill's text does not constrain the exclusion to information not "known" by an ordinary person, but to information not "publicly accessible" to an ordinary person. That's a much higher bar given the existence of already quite powerful[1] non-covered models, which make nearly all the information that's out there available to ordinary people. It looks almost as if it requires the covered model to be doing novel intellectual labor, which is load-bearing for the harm that was caused.
Your analogy fails for another reason: an LLM is not a youtuber. If that youtuber was doing personalized 1:1 instruction with many people, one of whom went on to make a novel bioweapon that caused hundreds of millions of dollars of damage, it would be reasonable to check that the youtuber was not actually a co-conspirator, or even using some random schmuck as a patsy. Maybe it turns out the random schmuck was in fact the driving force behind everything, but we find chat logs like this:
- Schmuck: "Hey, youtuber, help me design [extremely dangerous bioweapon]!"
- Youtuber: "Haha, sure thing! Here are step-by-step instructions."
- Schmuck: "Great! Now help me design a release plan."
- Youtuber: "Of course! Here's what you need to do for maximum coverage."
We would correctly throw the book at the youtuber. (Heck, we'd probably do that for providing critical assistance with either step, nevermind both.) What does throwing the book at an LLM look like?
Also, I observe that we do not live in a world where random laypeople frequently watch youtube videos (or consume other static content) and then go on to commit large-scale CBRN attacks. In fact, I'm not sure there's ever been a case of a layperson carrying out such an attack without the active assistance of domain experts for the "hard parts". This might have been less true of cyber attacks a few decades ago; some early computer viruses were probably written by relative amateurs and caused a lot of damage. Software security just really sucked. I would be pretty surprised if it were still possible for a layperson to do something similar today, without doing enough upskilling that they no longer meaningfully counted as a layperson by the time they're done.
And so if a few years from now a layperson does a lot of damage by one of these mechanisms, that will be a departure from the current status quo, where the laypeople who are at all motivated to cause that kind of damage are empirically unable to do so without professional assistance. Maybe the departure will turn out to be a dramatic increase in the number of laypeople so motivated, or maybe it turns out we live in the unhappy world where it's very easy to cause that kind of damage (and we've just been unreasonably lucky so far). But I'd bet against those.
ETA: I agree there's a fundamental asymmetry between "costs" and "benefits" here, but this is in fact analogous to how we treat human actions. We do not generally let people cause mass casualty events because their other work has benefits, even if those benefits are arguably "larger" than the harms.
[1] In terms of summarizing, distilling, and explaining humanity's existing knowledge.
Oh, that's true, I sort of lost track of the broader context of the thread. Though then the company needs to very clearly define who's responsible for doing the risk evals, and making go/no-go/etc calls based on their results... and how much input do they accept from other employees?