I think this post deserves a response.
I think if you want to pursue a project in the "justice" space, you should write up the problems you see and your planned incentive structures in a way that is legible. Then people can decide if they should trust your justice procedure.
Yeah ok. But this essay was posted on Earth. And on Earth I read it as a response to a perceived failure-mode of an Effective Altruism philosophy.
How much of a difference does it make that Lincoln just came out and volunteered that information? The non-disparagement contracts are not themselves hidden.
Could we have a list of everything you think this post gets wrong, separately from the evidence that each of those points is wrong?
Maybe I'm missing something, but it seems like it should take less than an hour to read the post, make a note of every claim that's not true, and then post that list of false claims, even if it would take many days to collect all the evidence that shows those points are false.
I imagine that would be helpful for you, because readers are much more likely to reserve judgement if you listed which specific things are false.
Personally, I could look over that list and say "oh yeah, number 8 [or whatever] is cruxy for me. If that turns out not to be true, I think that substantially changes my sense of the situation," and I would feel actively interested in what evidence you provide regarding that point later. And it would let you know which points to prioritize refuting, because you would know which things are cruxy for people reading.
In contrast, a generalized bid to reserve judgement because "many of the important claims were false or extremely misleading"...well, it just seems less credible, and so leaves me less willing to actually reserve judgement.
Indeed, deferring on producing such a list of claims-you-think-are-false suggests the possibility that you're trying to "get your story straight," i.e., that you're taking the time now to hurriedly go through and check which facts you and others will be able to prove or disprove, so that you know which things you can safely lie or exaggerate about, or what narrative paints you in the best light while still being consistent with the legible facts.
I think it is not correct to refer to a person as a "cofounder" of an org, merely because they seem to be a generalist taking responsibility for the org, if they did not actually co-found the org and are not referred to as a cofounder by the org.
This seems like a simple error / oversight, rather than a deliberate choice.
But I definitely don't feel like the assessment of "this person was in a de facto cofounder role, in practice, so it's not a big deal if we call them a cofounder" holds water.
promote this behavior among rationalists in general
What are you imagining when you say "promote this behavior"? Writing LessWrong posts in favor? Choosing to live that way yourself? Privately recommending that people do that? Not commenting when other people say that they're planning to do something that violates a Chesterton's fence?
I think I would like it if there were some soft nudges away from drama posts.
Something like "hi, you've been reading this post for 30 minutes today, here's a button for blocking it for the rest of the day (you can read and catch up on comments tomorrow)."
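To gesture at the mechanism, here's a minimal sketch of such a nudge, assuming a hypothetical client-side script; all names are made up and this isn't the site's actual code:

```typescript
// Hypothetical client-side nudge: track cumulative reading time on a flagged post,
// and after 30 minutes offer a button that hides the post until tomorrow.
const READING_LIMIT_MS = 30 * 60 * 1000;

function todayKey(postId: string): string {
  return `reading-time:${postId}:${new Date().toDateString()}`;
}

function msSpentToday(postId: string): number {
  return Number(localStorage.getItem(todayKey(postId)) ?? 0);
}

function recordReadingTime(postId: string, deltaMs: number): void {
  // Called periodically while the post is open and visible.
  localStorage.setItem(todayKey(postId), String(msSpentToday(postId) + deltaMs));
}

function maybeShowNudge(postId: string): void {
  if (msSpentToday(postId) < READING_LIMIT_MS) return;
  const postElement = document.getElementById(postId);
  if (!postElement) return;
  const button = document.createElement("button");
  button.textContent =
    "Hi, you've been reading this post for 30 minutes today. Block it for the rest of the day?";
  button.onclick = () => {
    // Hide the post for today; comments can be read and caught up on tomorrow.
    localStorage.setItem(`blocked:${postId}:${new Date().toDateString()}`, "true");
    postElement.style.display = "none";
  };
  postElement.prepend(button);
}
```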
Lawsuits are of an importantly different category than violence. Lawsuits are one of the several mechanisms that society uses to settle disputes without needing to resort to violence.
They may be inappropriate here, but I want to reject the equivocation between suing (or threatening to sue) someone and shooting (or threatening to shoot) them.
I'm unclear on why you posted this comment. Is this a reminder not to resort to violence? Who are you reminding?
Thank you for taking the time to preregister your thoughts. This was great, and helpful for me to read.
Somehow this seems like a very big diff.
What's an example of fake backchaining?
I think Mark Zuckerberg said this on the Tim Ferris podcast, not Ferris himself?
I feel strongly about this part:
How to Measure Anything notably won't help you (much) with aligning AI. It teaches "good decision-making", but doesn't teach "research taste in novel domains". I think there's a concrete branch of rationality training that'd be relevant for novel research, which requires pretty different feedback loops from the "generally be successful at life" style of training. I think some of "research taste rationality" is reasonably alive in academia, but many elements are not, or are not emphasized enough.
I want to keep pushing for people to disambiguate what precisely they mean when they use the word "rationality". It seems to me that there are a bunch of separate projects that plausibly, but not necessarily, overlap, which have been lumped together under a common term, which causes people to overestimate how much they do overlap.
In particular, “effective decision-making in the real world”, “how to make progress on natural philosophy when you don’t have traction on the problem” are much more different than one might think from reading the sequences (which talks about both in the same breath, and under the same label).
Problems of "rational choice under uncertainty" are, necessarily, problems where you already have a frame to operate in. If nothing else, you have your decision theory and probability theory frame.
Making progress on research questions about which you are extremely confused is mostly a problem of finding and iterating on a frame for the problem.
And the project of "raising the sanity waterline" and of "evidence-based self-help", are different still.
Strong agree: I said more or less the same thing a few months ago: A note on “instrumental rationality”
This scenario has too many burdensome details to be given much attention, except as one example from an extremely large space of possibilities.
But what it's actually about is just the fact that when I'm dead I can't achieve my goals.
Or more strictly, what it's about is that if you're dead, you can't achieve evolution's goals for you.
What's that? Do you have a link to a good overview?
How do I embed the market directly into the comment, instead of having a link to which people click through?
Here's a market for your claim.
GPT-4 performance and compute efficiency from a simple architecture before 2026
So in evaluating that, the key question here is whether LLMs were on the critical path already.
Is it more like...
- We're going to get AGI at some point and we might or might not have gotten LLMs before that.
or
- It was basically inevitable that we get LLMs before AGI. LLMs "always" come X years ahead of AGI.
or
- It was basically inevitable that we get LLMs before AGI, but there's a big range of when they can arrive relative to AGI.
  - And OpenAI made the gap between LLMs and AGI bigger than the counterfactual.
or
  - And OpenAI made the gap between LLMs and AGI smaller than the counterfactual.
My guess is that the true answer is closest to the second option: LLMs happen a predictable-ish period ahead of AGI, in large part because they're impressive enough and generally practical enough to drive AGI development.
What this is suggesting to me is that if OpenAI hadn't bet on LLMs, we effectively wouldn't have gotten more time to do alignment research, because most alignment research done before an understanding of LLMs would have been a dead end. And actually solving alignment may require people who have internalized the paradigm shift represented by LLMs to figure out solutions based on that. Under this model, even if we are in an insight-constrained world, OpenAI mostly hasn't burned away effective years of alignment research (because alignment research carried out before we had LLMs would have been mostly useless anyway).
Here's a paraphrase of the way I take you to be framing the question. Please let me know if I'm distorting it in my translation.
We often talk about 'the timeline to AGI' as a resource that can be burned. We want to have as much time as we can to prepare before the end. But that's actually not quite right. The relevant segment of time is not (from "as soon as we notice the problem" to "the arrival of AGI"); it's (from "as soon as we can make real technical headway on the problem" to "the arrival of AGI"). We'll call that second time-segment "preparation time".
The development of LLMs maybe did bring the date of AGI towards us, but it also pulled forward the start of the "preparation time" clock. In fact, the "preparation time" clock might plausibly have started only just before AGI, or not at all.
So all things considered, the impact of pulling the start time forward seems much larger than the impact of pulling the time of AGI forward.
How's that as a summary?
And how come the overwhelming majority of patients don't quit smoking when their doctor tells them to do so, but people often do quit smoking after they've personally experienced the negative consequences (e.g., had their first heart attack)?
It seems like the obvious answer is "because the experience of abstract words from their doctor isn't vivid enough to trigger the reinforcement machinery, but the experience of having a heart attack is."
I wrote the following comment during this AMA back in 2019, but didn't post it because of the reasons that I note in the body of the comment.
I still feel somewhat unsatisfied with what I wrote. I think something about the tone feels wrong, or gives the wrong impression, somehow. Or maybe this only presents part of the story. But it still seems better to say aloud than not.
I feel more comfortable posting it now, since I'm currently early in the process of attempting to build an organization / team that does meet these standards. In retrospect, I think probably it would have been better if I had just posted this at the time, and hashed out some disagreements with others in the org in this thread.
(In some sense this comment is useful mainly as a bit of a window into the kind of standards that I, personally, hold a rationality-development / training organization to.)
My original comment is reproduced verbatim below (plus a few edits for clarity).
I feel trepidation about posting this comment, because it seems in bad taste to criticize a group, unless one is going to step up and do the legwork to fix the problem. This is one of the top 5 things that bothers me about CFAR, and maybe I will step up to fix it at some point, but I’m not doing that right now and there are a bunch of hard problems that people are doing diligent work to fix. Criticizing is cheap. Making things better is hard.
[edit 2023: I did run a year-long CFAR instructor training that was explicitly designed to take steps on this class of problems, though. It is not as if I was just watching from the sidelines. But shifting the culture of even a small org, especially from a non-executive role, is pretty difficult, and my feeling is that I made real progress in the direction that I wanted, but only about one twentieth of the way to what I would think is appropriate.]
My view is that CFAR does not meaningfully eat its own dogfood, or at least doesn't do so enough, and that this hurts the organization's ability to achieve its goals.
This is not to contradict the anecdotes that others have left here, which I think are both accurate presentations, and examples of good (even inspiring) actions. But while some members of CFAR do have personal practices (with varying levels of “seriousness”) in correct thought and effective action, CFAR, as an institution, doesn’t really make much use of rationality. I resonate strongly with Duncan’s comment about counting up vs. counting down.
More specific data, both positive and negative:
- CFAR did spend some 20 hours of staff meeting time Circling in 2017, separately from a ~50 hour CFAR circling retreat that most of the staff participated in, and various other circling events that CFAR staff attended together (but were not "run by CFAR").
- I do often observe people doing Focusing moves and Circling moves in meetings.
- I have observed occasional full explicit Double Crux conversations on the order of three or four times a year.
- I frequently (on the order of once every week or two) observe CFAR staff applying the Double Crux moves (offering cruxes, crux checking, operationalizing, playing the Thursday-Friday game) in meetings and in conversation with each other.
- Group goal-factoring has never happened, to the best of my knowledge, even though there are a number of things that happen at CFAR that seem very inefficient, seem like "shoulds", or are frustrating / annoying to at least one person [edit 2023: these are explicit triggers for goal factoring]. I can think of only one instance in which two of us (Tim and I, specifically) tried to goal-factor something (a part of meetings that some of us hate).
- We’ve never had an explicit group pre-mortem, to the best of my knowledge. There is the occasional two-person session of simulating a project (usually a workshop or workshop activity), and the ways in which it goes wrong. [edit 2023: Anna said that she had participated in many long-form postmortems regarding hiring in particular, when I sent her a draft of this comment in 2019.]
- There is no infrastructure for tracking predictions or experiments. Approximately, CFAR as an institution doesn’t really run [formal] experiments, at least experiments with results that are tracked by anything other than the implicit intuitions of the staff. [edit 2023: some key features of a "formal experiment" as I mean it are writing down predictions in advance, and having a specific end date at which the group reviews the results. This is in contrast to simply trying new ideas sometimes.]
- There is no explicit process for iterating on new policies or procedures (such as iterating on how meetings are run).
- [edit 2023: An example of an explicit process for iterating on policies and procedures is maintaining a running document for a particular kind of meeting. Every time you have that kind of meeting, you start by referring to the notes from the last session. You try some specific procedural experiments, and then end the meeting with five minutes of reflection on what worked well or poorly, and log those in the document. This way you are explicitly trying new procedures and capturing the results, instead of finding procedural improvements mainly by stumbling into them, and often forgetting improvements rather than integrating and building upon them. I use documents like this for my personal procedural iteration.
Or in Working Backwards, the authors describe not just organizational innovations that Amazon came up with to solve explicitly-noted organizational problems, but the sequence of iteration that led to those final-form innovations.]
- There is informal, but effective, iteration on the workshops. The processes that run CFAR’s internals, however, seem to me to be mostly stagnant [edit 2023: in the sense that there's not deliberate, intentional effort on solving long-standing institutional frictions, or on developing more effective procedures for doing things.]
- As far as I know, there are no standardized checklists for employing CFAR techniques in relevant situations (like starting a new project). I wouldn’t be surprised if there were some ops checklists with a murphyjitsu step. I’ve never seen a checklist for a procedure at CFAR, excepting some recurring shopping lists for workshops.
- The interview process does not incorporate the standard research about interviews and assessment contained in Thinking, Fast and Slow. (I might be wrong about this. I, blessedly, don’t have to do admissions interviews.)
- No strategic decision or choice to undertake a project, that I’m aware of, has involved quantitative estimates of impact, or quantitative estimates of any kind. (I wouldn’t be surprised if the decision to run the first MSFP did, [edit 2023: but I wasn't at CFAR at the time. My guess is that there wasn't.])
- Historically, strategic decisions were made to a large degree by inertia. This is more resolved now, but for a period of several years, I think most of the staff didn’t really understand why we were running mainlines, and in fact when people [edit 2023: workshop participants] asked about this, we would say things like “well, we’re not sure what else to do instead.” This didn’t seem unusual, and didn’t immediately call out for goal factoring.
- There’s no designated staff training time for learning or practicing the mental skills, or for doing general tacit knowledge transfer between staff. However, full-time CFAR staff have historically had a training budget, which they could spend on whatever personal development stuff they wanted, at their own discretion.
- CFAR does have a rule that you’re allowed / mandated to take rest days after a workshop, since the workshop eats into your weekend.
Overall, CFAR strikes me as mostly a normal company, populated by some pretty weird hippy-rationalists. There aren’t any particular standards by which employees are expected to use rationality techniques, nor institutional procedures for doing rationality [edit 2023: as distinct from having a shared rationality-culture].
This is in contrast to, say, Bridgewater Associates, which is clearly structured intentionally to enable updating and information processing on the organizational level. (Incidentally, Bridgewater is rich in the most literal sense.)
Also, I’m not fully exempt from these critiques myself: I have not really internalized goal factoring yet, for instance, and I think that I, personally, am making the same kind of errors of inefficient action that I’m accusing CFAR of making. I also don’t make much use of quantitative estimates, and while I have lots of empirical iteration procedures, I haven’t really gotten the hang of doing explicit experiments. (I do track decisions and predictions, though, for later review.)
Overall, I think this gap is due about 10% to “these tools don’t work as well, especially at the group level, as we seem to credit them, and we are correct to not use them”, about 30% to this being harder to do than it seems, and about 60% to CFAR not really trying at this (and maybe it shouldn’t be trying at this, because there are trade-offs and other things to focus on).
Elaborating on the 30%: I do think that making an org like this, especially when not starting from scratch, is deceptively difficult. I think that while implementing some of these seems trivial on the surface, it actually entails a shift in culture and expectations, and that doing this effectively requires leadership and institution-building skills that CFAR doesn’t currently have. Like, if I imagine something like this existing, it would need to have a pretty in-depth onboarding process for new employees, teaching the skills and presenting “how we do things here.” If you wanted to bootstrap into this kind of culture, at anything like a fast enough speed, you would need the same kind of onboarding for all of the existing employees, but it would be even harder, because you wouldn’t have the culture already going to provide examples and immersion.
I think my biggest crux here is how much the development of AGI is driven by compute progress.
I think it's mostly driven by new insights plus trying out old, but expensive, ideas. So, I provisionally think that OpenAI has mostly been harmful, far in excess of its real positive impacts.
Elaborating:
Compute vs. Insight
One could adopt a (false) toy model in which the price of compute is the only input to AGI. Once the price falls low enough, we get AGI. [a compute-constrained world]
Or a different toy model: When AGI arrives depends entirely on algorithmic / architectural progress, and the price of compute is irrelevant. In this case there's a number of steps on the "tech tree" to AGI, and the world takes each of those steps, approximately in sequence. Some of those steps are new core insights, like the transformer architecture or RLHF or learning about the chinchilla scaling laws, and others are advances in scaling, going from GPT-2 to GPT-3. [an insight-constrained world]
(Obviously both those models are fake. Both compute and architecture are inputs to AGI, and to some extent they can substitute for each other: you can make up for having a weaker algorithm with more brute force, and vice versa. But these extreme cases are easier for me, at least, to think about.)
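For concreteness, here's a caricature of the two toy models in code. Everything here (the function names, the idea of a discrete "tech tree" list, any numbers you'd feed in) is made up purely for illustration; the point is only that the two worlds locate the bottleneck in different places.

```typescript
// Compute-constrained world: AGI arrives in the first year that compute gets cheap
// enough, regardless of who has which insights. Assumes the map is ordered by year.
function agiYearComputeConstrained(
  pricePerFlopByYear: Map<number, number>,
  priceThreshold: number
): number | undefined {
  for (const [year, price] of pricePerFlopByYear) {
    if (price <= priceThreshold) return year;
  }
  return undefined;
}

// Insight-constrained world: AGI arrives once the last step on the "tech tree"
// (core insights plus scaling steps) has been taken, regardless of compute price.
function agiYearInsightConstrained(stepCompletionYears: number[]): number {
  return Math.max(...stepCompletionYears);
}
```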
In the fully compute-constrained world, OpenAI's capabilities work is strictly good, because it means we get intermediate products of AGI development earlier.
In this world, progress towards AGI is ticking along at the drum-beat of Moore's law. We're going to get AGI in 20XY. But because of OpenAI, we get GPT-3 and 4, which give us subjects for interpretability work, and gives the world a headsup about what's coming.
Under the compute-constraint assumption, OpenAI is stretching out capabilities development, by causing some of the precursor developments to happen earlier, but more gradually. AGI still arrives at 20XY, but we get intermediates earlier than we otherwise would have.
In the fully insight-constrained world, OpenAI's impact is almost entirely harmful. Under that model, Large Language Models would have been discovered eventually, but OpenAI made a bet on scaling GPT-2. That caused us to get that technology earlier, and also pulled forward the date of AGI, both by checking off one of the steps, and by showing what was possible and so generating counterfactual interest in transformers.
In this world, OpenAI might have other benefits, but they are at least doing the counterfactual harm of burning our serial time.
They don't get the credit for "sounding the alarm" by releasing ChatGPT, because that was on the tech tree already; it was going to happen at some point. Giving OpenAI credit for it would be sort of the reverse of "shooting the messenger": crediting someone for letting you know about a bad situation when they were the cause of the bad situation in the first place (or at least made it worse).
Again, neither of these models is correct. But I think our world is closer to the insight-constrained world than the compute-constrained world.
This makes me much less sympathetic to OpenAI.
Costs and Benefits
It doesn't settle the question, because maybe OpenAI's other impacts (many of which I agree are positive!) more than make up for the harm done by shortening the timeline to AGI.
In particular...
- I'm not inclined to give them credit for deciding to release their models for the world to engage with, rather than keep them as private lab-curiosities. Releasing their language models as products, it seems to me, is fully aligned with their incentives. They have an impressive and useful new technology. I think the vast majority of possible counterfactual companies would do the same thing in their place. It isn't (I think) an extra service they're doing the world, relative to the counterfactual.[1]
- I am inclined to give them credit for their charter, their pseudo-non-profit structure, and the merge and assist clause[2], all of which seems like at least a small improvement over the kind of commitment to the public good that I would expect from a counterfactual AGI lab.
- I am inclined to give them credit for choosing not to release the technical details of GPT-4.
- I am inclined to give them credit for publishing their plans and thoughts regarding x-risk, AI alignment, and planning for superintelligence.
Overall, it currently seems to me that OpenAI is somewhat better than a random draw from the distribution of possible counterfactual AGI companies (maybe 90th percentile?). But also that they are not so much better that that makes up for burning 3 to 7 years of the timeline.
3 to 7 years is just my eyeballing of how much later someone would have developed ChatGPT-like capabilities, if OpenAI hadn't bet on scaling up GPT-2 into GPT-3 and hadn't decided to invest in RLHF, both moves that it looks to me like few orgs in the world were positioned to try, and even fewer actually would have tried in the near term.
That's not a very confident number. I'm very interested in getting more informed estimates of how long it would have taken for the world to develop something like ChatGPT without OpenAI.
(I'm selecting ChatGPT as the criterion, because I think that's the main pivot point at which the world woke up to the promise and power of AI. Conditional on someone developing something ChatGPT-like, it doesn't seem plausible to me that the world goes another three years without developing a language model as impressive as GPT-4. At that point developing bigger and better language models is an obvious thing to try, rather than an interesting bet that the broader world isn't much interested in.)
I'm also very interested if anyone thinks that the benefits (either ones that I listed or others) outweigh an extra 3 to 7 years of working on alignment (not to mention 3 to 7 years of additional years of life expectancy for all of us).
[1] It is worth noting that at some point PaLM was (probably) the most powerful LLM in the world, and Google didn't release it as a product.
But I don't think this is a very stable equilibrium. I expect to see a ChatGPT competitor from Google before 2024 (50%) and before 2025 (90%).
[2] That said, "a value-aligned, safety-conscious project comes close to building AGI before we do" really gives a lot of wiggle-room for deciding if some competitor is "a good guy". But, still better than the counterfactual.
Which is a shame, because I would like to listen to them!
Yudkowsky's Genocide the Borderers Facebook post
What is this?
There's actually pretty large differences of perspective on this claim.
The links to the previous discussions seem dead?
I’m not going to argue about whether LLM-plateau-ism is right or wrong—that’s outside the scope of this post, and also difficult for me to discuss publicly thanks to infohazard issues.[4] Oh well, we’ll find out one way or the other soon enough.
I really like this paragraph (and associated footnote) for being straightforward about what you're not saying, not leaking scary info, and nevertheless projecting a "friendly" (by my sensors) vibe.
Good job! : )
You could just not have the multipliers apply to negative karma posts.
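A minimal sketch of what I mean, with made-up names (not the site's actual scoring code): the multiplier only kicks in when the base karma is positive.

```typescript
// Hypothetical scoring rule: multipliers boost positive-karma posts but are ignored
// for posts at zero or negative karma, so they can't amplify a pile-on downward.
function effectiveScore(baseKarma: number, multiplier: number): number {
  return baseKarma > 0 ? baseKarma * multiplier : baseKarma;
}

console.log(effectiveScore(10, 3)); // 30: multiplier applies
console.log(effectiveScore(-4, 3)); // -4: multiplier does not apply
```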
I would contribute $75 to the prize. : )
I haven't read the whole post yet.
My initial thought: I'm much more sketched out by platforms rate-limiting users than I am by them straight-up banning them. The first is much more "micro-managing-y", and I wonder if it can lead to more subtle and powerful distortions than outright bans, which are at least transparent.
I haven't thought about it much, but I think I feel much safer about a marketplace of ideas that has a gate (some people are not permitted) than a marketplace of ideas that is slanted to advantage some ideas over others, especially if the group deciding the slanting is centralized.
I think part of my intuition here is that, if you don't let people into the marketplace, that's one thing, but if you fuck with which things get propagated, you are distorting the basic mechanism of the marketplace of ideas, in which people sharing the things that they think are true leads to correct ideas gaining dominance over time.
On the other hand, given social media and hyper-competitive memes, I'm pretty sympathetic to the idea that "the marketplace of ideas" is an outdated frame, and we need better concepts for pursuing the enlightenment project in a way that doesn't get subverted.
No one cares because...there are other systems who are not operating on a complete set of human values (including many small, relatively dumb AI systems) that are steering the world instead?
CFAR tried to do both IIRC.
According to me (who worked at CFAR for 5 years), CFAR did approximately zero rationality verification whatsoever.
Indeed, while that would be crucial to the kind of experimental rationality development that's described in the Craft and the Community, it isn't and wasn't a natural component of CFAR's functional strategy, which was something more like rationality community-building and culture-building.
[I hope to write more about what CFAR did and why, and how it differed from the sort of thing outlined in the Craft and the Community, sometime.]
So, currently the way alignment gets solved is: things continue to get crazier until they literally cannot get crazier any faster. When we reach that moment, we look back and ask: was it worth it? And if the answer is yes, congratulations, we solved the alignment problem.
I don't know if I'm block-headed-ly missing the point, but I don't know what this paragraph means.
How exactly does the world accelerating as much as possible mean that we solved the alignment problem?
Most events that could have been performed in front of a large audience can now be done in front of zero audience and put on YouTube
I think this is actually not true. To a degree that would surprise the average movie-goer, the performers at the theater are playing off the energy of the audience, such that the performance is different when there's no one watching.
As a concrete example (from years and years ago, so I may have the exact numbers wrong): when you're rehearsing a stage fight, you always move at no more than 75% of the speed that you want to move on the night, because the energy of performing in front of an audience will inevitably cause you to have more energy, speed, and force.
And improv is pretty fundamentally about the relationship between the performers and the audience.
My guess is that there are lots of mediums that are like this.
This is a great comment.
This part in particular feels very clarifying.
But "corrupting the youth" is serious--existentially serious! If it looks like a whole generation wants to grow up to be jesters instead of kings and knights and counsellors, then the very polis itself is at risk. [And, like, risk of murder and enslavement; existential questions are big deals!]
I'll also observe that an antibody was planted in our culture with this post, and this one.
Related ideas:
In particular, existing AI training strategies don’t need to handle a “drastic” distribution shift from low levels of intelligence to high levels of intelligence. There’s nothing in the foreseeable ways of building AI that would call for a big transfer like this, rather than continuously training as intelligence gradually increases.
An obvious possible regime change is the shift to training (some) agents that do lifetime learning rather than only incorporating capability from SGD.
That's one simple thing that seems likely to generate a sharp left turn.
The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively. No particular act needs to be pivotal in order to greatly reduce the risk from unaligned AI, and the search for single pivotal acts leads to unrealistic stories of the future and unrealistic pictures of what AI labs should do.
We could maybe make the world safer a little at a time, but we do have to get to an equilibrium in which the world is protected from explosive growth when some system (including an ecosystem of multiple AIs) starts pulling away from the growth-rate of the rest of the world and gains decisive power.
My model here is something like "even small differences in the rate at which systems are compounding power and/or intelligence lead to gigantic differences in absolute power and/or intelligence, given that the world is moving so fast."
Or maybe another way to say it: the speed at which a given system can compound its abilities is very fast, relative to the rate at which innovations diffuse through the economy, for other groups and other AIs to take advantage of.
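To put made-up numbers on the compounding point: even a modest edge in growth rate turns into an enormous absolute gap after enough periods. A quick toy illustration (all figures are arbitrary):

```typescript
// Two systems start equal; one compounds 50% per period, the other 40%.
// The ratio between them is (1.5 / 1.4)^n, which grows without bound.
function capability(growthRatePerPeriod: number, periods: number): number {
  return Math.pow(1 + growthRatePerPeriod, periods);
}

for (const n of [10, 30, 100]) {
  const ratio = capability(0.5, n) / capability(0.4, n);
  console.log(`After ${n} periods, the faster system is ~${ratio.toFixed(1)}x ahead.`);
}
// Prints roughly 2.0x after 10 periods, 7.9x after 30, and ~990x after 100.
```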
It seems like all of the proposals that seem like they meet this desideratum (making the world safe from that kind of explosion in the power of one system over all the others) look pretty pivotal-act-like, rather than like a series of marginal improvements.
Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.
My understanding of Eliezer's view is that some domains are much harder to do aligned cognition in than others, and alignment is among the hardest.
(I'm not sure I clearly understand why. Maybe because it entails reasoning about humans and reasoning about building super powerful systems, so if your R&D SEAI is even a little bit unaligned, it will have ample leverage for seizing power?)
It's not so much that AIs will be able to do recursive self improvement before they're able to solve alignment. It's that making alignment progress is itself heavily alignment loaded, in a way that "recursively self improve" (without regard for alignment), isn't.
One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff. So by the time we have AI systems who can develop molecular nanotech, we will definitely have had systems that did something slightly-less-impressive-looking.
This objection only holds if you imagine that AI systems are acquiring knowledge about the world / capability only through gradient descent, as opposed to training an agent that learns, and becomes more capable thereby, over runtime.
That's the hard part.
My guess is that training cutting edge models, and not releasing them is a pretty good play, or would have been, if there wasn't huge AGI hype.
As it is, information about your models is going to leak, and in most cases the fact that something is possible is most of the secret to reverse engineering it (note: this might be true in the regime of transformer models, but it might not be true for other tasks or sub-problems).
But on the other hand, given the hype, people are going to try to do the things that you're doing anyway, so maybe leaks about your capabilities don't make that much difference?
This does point out an important consideration, which is "how much information needs to leak from your lab to enable someone else to replicate your results?"
It seems like, in many cases, there's an obvious way to do some task, and the mere fact that you succeeded is enough info to recreate your result. But presumably there are cases, where you figure out a clever trick, and even if the evidence of your model's performance leaks, that doesn't tell the world how to do it (though it does cause maybe hundreds of smart people to start looking for how you did it, trying to discover how to do it themselves).
I think I should regard the situation differently depending on the status of that axis.
In terms of speeding up AI development, not building anything > building something and keeping it completely secret > building something that your competitors learn about > building something and generating public hype about it via demos > building something with hype and publicly releasing it to users & customers.
I think it is very helpful, and healthy for the discourse, to make this distinction. I agree that many of these things might get lumped together.
But also, I want to flag the possibility that something can be very, very bad to do, even if there are other things that would have been progressively worse to do.
I want to make sure that groups get the credit that is due to them when they do good things against their incentives.
I also want to avoid falling into a pattern of thinking "well, they didn't do the worst thing, or the second worst thing, so that's pretty good!" if in isolation I would have thought that action was pretty bad / blameworthy.
As of this moment, I don't have a particular opinion one way or the other about how good or bad Anthropic's release policy is. I'm merely making the abstract point at this time.
I like the creative thinking here.
I suggest a standard here, where we can test our "emulation" against the researchers themselves, to see how much of a diff there is in their answers, and the researchers can rate how good a substitute the model is for themselves, on a number of different dimensions.
This continues to be one of the best and most important posts I have ever read.
I have multiple references that corroborate that.
Can you share? I would like to have a clearer sense of what happened to them. If there's info that I don't know, I'd like to see it.
I do appreciate the conciseness a lot.
It seems like I maybe would have gotten the same value from the essay (which would have taken 5 minutes to read?) as from this image (which maybe took 5 seconds).
But I don't want to create a culture that rewards snark, even more than it already does. It seems like that is the death of discourse, in a bunch of communities.
So I'm interested in if there are ways to get the benefits here, without the costs.