Soon the two are lost in a maze of words defined in other words, the problem that Stevan Harnad once described as trying to learn Chinese from a Chinese/Chinese dictionary.
Of course, it turned out that LLMs do this just fine, thank you.
intensional terms
Should probably link to Extensions and Intensions; not everyone reads these posts in order.
Mati described himself as a TPM from September 2023 onward (after being PM support since April 2022), and Andrei described himself as a Research Engineer from April 2023 to March 2024. Why do you believe either was not an FTE at the time?
And while failure to sign isn't proof of lack of desire to sign, the two are heavily correlated—otherwise it would be incredibly unlikely for the small Superalignment team to have so many members who signed late or not at all.
With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd update my tally of the departures from OpenAI, collated with how quickly the ex-employee had signed the loyalty letter to Sam Altman last November.
The letter was leaked at 505 signatures, 667 signatures, and finally 702 signatures; in the end, it was reported that 737 of 770 employees signed. Since then, I've been able to verify 56 departures of people who were full-time employees (as far as I can tell, contractors were not allowed to sign, but all FTEs were).
I still think I'm missing some, so these are lower bounds (modulo any mistakes I've made).
Headline numbers:
- Attrition for the 505 OpenAI employees who signed before the letter was first leaked: at least 24/505 = 4.8%
- Attrition for the next 197 to sign (it was leaked again at 667 signatures, and one last time at 702): at least 13/197 = 6.6%
- Attrition for the (reported) 68 who had not signed by the last leak: at least 19/68 = 27.9%.
Reportedly, 737 out of the 770 signed in the end, and many of the Superalignment team chose not to sign at all.
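As a sanity check, here's a minimal sketch (plain Python, using only the counts above) of how the headline attrition rates fall out of the leaked signature tallies and my verified departure counts:

```python
# Headline attrition arithmetic, using the counts tallied above:
# 770 reported FTEs, 505 signatures at the first leak, 702 at the final leak,
# and my verified departures per cohort (24, 13, 19).
total_headcount = 770

cohorts = {
    "signed before the first leak": (24, 505),
    "signed between the leaks":     (13, 702 - 505),              # 197 people
    "never signed as of 702":       (19, total_headcount - 702),  # 68 people
}

for label, (departed, cohort_size) in cohorts.items():
    print(f"{label}: {departed}/{cohort_size} = {departed / cohort_size:.1%}")
```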
Below are my current tallies of some notable subsets. Please comment with any corrections!
People from the Superalignment team who never signed as of the 702 leak (including some policy/governance people who seem to have been closely connected) and are now gone:
- Carroll Wainwright
- Collin Burns
- Cullen O'Keefe
- Daniel Kokotajlo
- Jan Leike (though he did separately Tweet that the board should resign)
- Jeffrey Wu
- Jonathan Uesato
- Leopold Aschenbrenner
- Mati Roy
- William Saunders
- Yuri Burda
People from the Superalignment team (and close collaborators) who did sign before the final leak but are now gone:
- Jan Hendrik Kirchner (signed between 668 and 702)
- Steven Bills (signed between 668 and 702)
- John Schulman (signed between 506 and 667)
- Sherry Lachman (signed between 506 and 667)
- Ilya Sutskever (signed by 505)
- Pavel Izmailov (signed by 505)
- Ryan Lowe (signed by 505)
- Todor Markov (signed by 505)
Others who didn't sign as of the 702 leak (some of whom may have just been AFK for the wrong weekend, though I doubt that was true of Karpathy) and are now gone:
- Andrei Alexandru (Research Engineer)
- Andrej Karpathy (Co-Founder)
- Austin Wiseman (Finance/Accounting)
- Girish Sastry (Policy)
- Jay Joshi (Recruiting)
- Katarina Slama (Member of Technical Staff)
- Lucas Negritto (Member of Technical Staff, then Developer Community Ambassador)
- Zarina Stanik (Marketing)
Notable other ex-employees:
- Barret Zoph (VP of Research, Post-Training; signed by 505)
- Bob McGrew (Chief Research Officer; signed by 505)
- Chris Clark (Head of Nonprofit and Strategic Initiatives; signed by 505)
- Diane Yoon (VP of People; signed by 505)
- Gretchen Krueger (Policy; signed by 505; posted a significant Twitter thread at the time she left)
- Mira Murati (CTO; signed by 505)
CDT agents respond well to threats
Might want to rephrase this as "CDT agents give in to threats"
This is weirdly meta.
If families are worried about the cost of groceries, they should welcome this price discrimination. The AI will realize you are worried about costs. It will offer you prime discounts to win your business. It will know you are willing to switch brands to get discounts, and use this to balance inventory.
Then it will go out and charge other people more, because they can afford to pay. Indeed, this is highly progressive policy. The wealthier you are, the more you will pay for groceries. What’s not to love?
A problem is that this is not only a tax on indifference, but also a tax on innumeracy and on lack of leisure time. Those who don't know how to properly comparison shop are likely to be less wealthy, not more; same with those who don't have the spare time to go to more than one store.
Re: experience machine, Past Me would have refused it and Present Me would take it. The difference is due to a major (and seemingly irreversible) deterioration in my wellbeing several years ago, but not only because that makes the real world less enjoyable.
Agency is another big reason to refuse the experience machine; if I think I can make a difference in the base-level world, I feel a moral responsibility towards it. But I experience significantly less agency now (and project less agency in the future), so that factor is diminished for me.
The main factor that's still operative is epistemics: I would much rather my beliefs be accurate than be deceived about the world. But it's hard for that to outweigh the unhappiness at this point.
So if a lot of people would choose the Experience Machine, that suggests they are some combination of unhappy, not confident in their agency, and not obsessed with their epistemics. (Which does, I think, operationalize your "something is very wrong".)
- Thanks—I didn't recall the content of Yglesias' tweet, and I'd noped out of sorting through his long feed. I suspect Yglesias didn't understand why the numbers were weird, though, and people who read his tweet were even less likely to get it. And most significantly, he tries to draw a conclusion from a spurious fact!
- Allowing explicitly conditional markets with a different fee structure (ideally, all fees refunded on the counterfactual markets) could be an interesting public service on Manifold's part.
- The only part of my tone that worries me in retrospect is that I should have done more to indicate that you personally were trying to do a good thing, and I'm criticizing the deference to conditional markets rather than criticizing your actions. I'll see if I can edit the post to improve on that axis.
- I think we still differ on that. Even though the numbers for the main contenders were just a few points apart, there was massive jockeying to put certain candidates at the top end of that range, because relative position is what viewers noticed.
I'm really impressed with your grace in writing this comment (as well as the one you wrote on the market itself), and it makes me feel better about Manifold's public epistemics.
Yes, and I gained some easy mana from such markets; but the market that got the most attention by far was the intrinsically flawed conditional market.
Real-money markets do have stronger incentives for sharps to scour for arbitrage, so the 1/1/26 market would have been more likely to be noticed before months had gone by.
However (depending on the fee structure for resolving N/A markets), real-money markets have even stronger incentives for sharps to stay away entirely from spurious conditional markets, since they'd be throwing away cash and not just Internet points. Never ever ever cite out-of-the-money conditional markets.
Broke: Prediction markets are an aggregation of opinions, weighted towards informed opinions by smart people, and are therefore a trustworthy forecasting tool on any question.
Woke: Prediction markets are MMOs set in a fantasy world where, if someone is Wrong On The Internet, you can take their lunch money.
Can you share any strong evidence that you're an unusually trustworthy person in regard to confidential conversations? People would in fact be risking a lot by talking to you.
(This is sincere btw; I think this service should absolutely exist, but the best version of it is probably done by someone with a longstanding public reputation of circumspection.)
Good question! I picked it up from a friend at a LW meetup a decade ago, so it didn't come with all the extra baggage that vipassana meditation seems to usually carry. So this is just going to be the echo of it that works for me.
Step 1 is to stare at your index finger (a very sensitive part of your body) and gently, patiently try to notice that it's still producing a background level of sensory stimulus even when it's not touching anything. That attention to the background signal, focused on a small patch of your body, is what the body scan is based on.
Step 2 is learning how to "move" that awareness of the background signal slowly. Try to smoothly shift that awareness down your finger, knuckle by knuckle, keeping the area of awareness small by ceasing to focus on the original spot as you focus on a new spot. Then try moving that spot of awareness gradually to the base of your thumb, and noticing the muscle beneath the skin.
Use Case α is harnessing that kind of awareness to relax physical tension and even pain. The next time you have a paper cut or a small burn, once you've dealt with it in the obvious objective ways and now just have to handle the pain, focus your awareness right on that spot. The sensation will still be loud, but it won't be overwhelming when you're focusing on it rather than fleeing from it. Or the next time you notice a particularly tense muscle, focus your awareness there; for me, that usually loosens it at least a little.
Step 3 is the body scan itself: creating awareness for each part of your skin and muscles, gradually, bit by bit, starting from the crown of your head and slowly tracing out a path that covers everything. This is where a guided meditation could really help. I don't have one to recommend (after having the guided meditation at the meetup, I got as much of the idea as I needed), but hopefully some of the hundreds out there are as good as Random Meditating Rationalist #37 was.
And Use Case β, when you have a migraine, is to imagine moving that awareness inside your skull, to the place where the migraine pain feels like it's concentrated. (I recommend starting from a place where the migraine seems to "surface"—for me, the upper orbit of my left eye—if you have such a spot.)
There's something quite odd about how this works: your brain doesn't have pain receptors, so the pain from the migraine ends up in some phantom location on your body map, and it's (conveniently?) interpreted as being inside your head. By tracing your awareness inside your skull, you walk along that body map to the same phantom location as that pain, so it works out basically the same as if you were in Use Case α.
Hope this helps!
I have to further compliment my past self: this section aged extremely well, prefiguring the Shoggoth-with-a-smiley-face analogies several years in advance.
GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?
One might hope that, because it's learning to imitate humans in an unsupervised way, it would end up fairly human, or at least act that way. I very much doubt this, for the following reason:
- Two humans are fairly similar to each other, because they have very similar architectures and are learning to succeed in the same environment.
- Two convergently evolved species will be similar in some ways but not others, because they have different architectures but the same environmental pressures.
- A mimic species will be similar in some ways but not others to the species it mimics, because even if they share recent ancestry, the environmental pressures on the poisonous species being mimicked differ from the environmental pressures on the mimic.
What we have with the GPTs is the first deep learning architecture we've found that scales this well in the domain (so, probably not that much like our particular architecture), learning to mimic humans rather than growing in an environment with similar pressures. Why should we expect it to be anything but very alien under the hood, or to continue acting human once its actions take us outside of the training distribution?
Moreover, there may be much more going on under the hood than we realize; it may take much more general cognitive power to learn and imitate the patterns of humans than it takes us to execute those patterns.
The spun-off agent foundations team seems to have less reason than most AI safety orgs to be in the Bay Area, so moving to NZ might be worth considering for them.
Note on current methodology:
- I am, for now, not doing further research when the spreadsheet lists a person whose name appears on the final leaked letter; so it's possible that some of the 23 departures among the 702 names on the final leaked letter are spurious. (I will be more thorough when I resolve the market after November.)
- I am counting only full-time employees and not counting contractors, as I currently believe that the 770 figure refers only to full-time employees. So far, I've seen no contractors among those who signed, but I've only checked a few; if the letter includes some categories of contractors, this gets a lot harder to resolve.
- I am counting nontechnical employees (e.g. recruiting, marketing) as well as technical staff, because such employees were among those who signed the letter.
Counterpoint: other labs might become more paranoid that SSI is ahead of them. I think your point is probably more correct than the counterpoint, but it's worth mentioning.
Elon diversifies in the sense of "personally micromanaging more companies", not in the sense of "backing companies he can't micromanage".
By my assessment, the employees who failed to sign the final leaked version of the Altman loyalty letter have now been literally decimated.
I'm trying to track the relative attrition for a Manifold market: of the 265 OpenAI employees who hadn't yet signed the loyalty letter by the time it was first leaked, what percent will still be at OpenAI on the one-year anniversary?
I'm combining that first leaked copy with 505 signatures, the final leaked copy with 702 signatures, the oft-repeated total headcount of 770, and this spreadsheet tracking OpenAI departures (albeit with many false positives—people self-reporting as OpenAI employees because they customized their GPTs—so I'm working to verify names that appear on the spreadsheet but not on the letter; I'm sure the spreadsheet has false negatives as well, alas).
So far, I've verified at least seven [update: seven, with a probable eighth] departures of eligible figures who hadn't signed the letter with 702 names: Leopold Aschenbrenner, Jay Joshi (not fully verified by me), Andrej Karpathy, Daniel Kokotajlo, Jan Leike, Lucas Negritto, Katarina Slama, and William Saunders. If it's true that the total headcount at the time was 770, then that's 8 out of 68, or 11.8%.
Compare that to the attrition rate (as per the spreadsheet) for those who had signed the final leaked version but not the first: 10 departures out of 197, or 5.1%; and compare that to the attrition rate for those who signed promptly: 13 departures out of 505, or 2.6%.
Any causal inferences from this correlation are left as an exercise to the reader.
(A more important exercise, however: can anyone find a confirmation of the 770 number outside of unsourced media reports, or find a copy of the loyalty letter with more than 702 signatories, or ideally find a list of everyone at OpenAI at the time? I've tried a few different avenues without success.)
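(In case it helps anyone auditing the numbers: here's a minimal sketch of the cross-referencing described above. The name sets are hypothetical placeholders, not actual letter or spreadsheet data; the real inputs are the 505- and 702-signature leaked copies, the reported 770 headcount, and the verified departures.)

```python
# Hypothetical sketch of the cohort bucketing: which leaked copy of the letter
# (if any) a person's name appears on, intersected with verified departures.
signed_505 = {"Alice Example", "Bob Example"}            # names on the first leak
signed_702 = signed_505 | {"Carol Example"}              # final leak (superset of the first)
all_fte = signed_702 | {"Dana Example"}                  # stand-in for the reported 770 FTEs
verified_departures = {"Carol Example", "Dana Example"}  # verified ex-employees

cohorts = {
    "signed by 505":        signed_505,
    "signed 506-702":       signed_702 - signed_505,
    "not signed as of 702": all_fte - signed_702,
}

for label, members in cohorts.items():
    departed = members & verified_departures
    rate = len(departed) / len(members) if members else 0.0
    print(f"{label}: {len(departed)}/{len(members)} = {rate:.1%}")
```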
I'm not even angry, just disappointed.
The diagnosis is roughly correct (I would say "most suffering is caused by an internal response of fleeing from pain but not escaping it"), but IMO the standard proffered remedy (Buddhist-style detachment from wanting) goes too far and promises too much.
Re: the diagnosis, three illustrative ways the relationship between pain, awareness, and suffering has manifested for me:
- Migraines: I get them every few weeks, and they're pretty bad. After a friend showed me how to do a vipassana body scan, on a lark I tried moving my attention to the spot inside my skull where the migraine was most intense. To my relief, this helped the suffering greatly; the pain was caused by the migraine but the suffering was caused by trying futilely to not feel the pain, and staring directly at it was a good remedy.
- Mental health: I'm chronically depressed and anxious. (I am not asking for advice about it, and there's a unique causal element so your advice is less likely than usual to help.) One thing I figured out is that I can "wallow" in it: optimize for feeling it as intensely as possible. For example, if I'm feeling miserable I'll intentionally lie in bed in the dark listening to my saddest music. This genuinely helps make things more bearable, and helps the worst moods pass faster, compared to the approach of trying to distract myself from the feelings or to force a contrary feeling.
- Psychedelics: The worst hour of my life was spent on a bad acid trip, feeling nauseous and wanting to escape that feeling, and getting stuck in a "turning away from awareness" motion (reflected in visual hallucination by trying to mentally push my visual field up and to the left, only for it to reset to its actual location several times per second). That was a tremendous mistake.
However, my depression sometimes manifests as anhedonia, i.e. true detachment from desire, and that's really not all it's cracked up to be. I'm not suffering when I lie around all day with anhedonia, but I'm not getting any positive valence from it, and meanwhile I'm stagnating as a person. And I genuinely do not see how to wallow in anhedonia, to turn my awareness inward and find something to engage with. I've tried. It just seems like nobody's home in that state.
A key, I suspect, is happiness set point. A person who takes up Buddhism or a similar practice, and starts to experience their preferences less strongly, ends up hovering more stably around their happiness set point without the highs and lows that come with excitement, anticipation, triumph, disappointment, etc. Lows hurt more than highs, so this mental motion is a net improvement. Most people have a pretty good happiness set point, so it feels like a good end result. (And in most cases people stop before ceasing to have preferences entirely; there's probably even a golden mean where they're more effective in the real world than either their natural state or a zen-wireheaded extreme.)
But I don't see much proof that detachment from desire moves one's happiness set point, so advertising it as a cure for unhappiness feels like the same sort of error as talking about how everyone—even poor people—should buy index funds. (Which is to say, it's good advice on the margin for the majority of people, but for some people it's irrelevant, and the correct first advice is more like "how to get a job, or a better job" or "how to budget" or "how to cheaply refinance your credit card debt", etc.)
And also, I'm dubious that [intentionally reducing the intensity of your good moments] is actually helpful in the same way that [intentionally reducing the intensity of your bad moments] is? Sometimes you happen to know that a good thing is going to happen with very little risk of failure. In that case, it seems strictly better to want and desire and expect that good thing.
In short, I highly recommend turning towards experiences of pain rather than fleeing from them; but I think the Buddhist thesis is questionable.
Go all-in on lobbying the US and other governments to fully prohibit the training of frontier models beyond a certain level, in a way that OpenAI can't route around (so probably block Altman's foreign chip factory initiative, for instance).
The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I'm not trying to claim that the "put up a good fight but lose" criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with "be helpful and harmless".)
I agree that "helpful-only" RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I'm frankly a bit worried about even training that model.
Thank you! I'd forgotten about that.
Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal?
How certain are you that this is always true (rather than "we've usually noticed this even though we haven't explicitly been checking for it in general"), and that it will continue to be so as models become stronger?
It seems to me like additionally running evals on base models is a highly reasonable precaution.
Oh wait, I misinterpreted you as using "much worse" to mean "much scarier", when instead you mean "much less capable".
I'd be glad if it were the case that RL*F doesn't hide any meaningful capabilities existing in the base model, but I'm not sure it is the case, and I'd sure like someone to check! It sure seems like RL*F is likely in some cases to get the model to stop explicitly talking about a capability it has (unless it is jailbroken on that subject), rather than to remove the capability.
(Imagine RL*Fing a base model to stop explicitly talking about arithmetic; are we sure it would un-learn the rules?)
That's exactly the point: if a model has dangerous capabilities and deceptive alignment, then testing only the post-RL*F model will return a false negative for capabilities that would still be present in deployment. Until we have the kind of interpretability tools that we could deeply trust to catch deceptive alignment, we should count any capability found in the base model as if it were present in the tuned model.
I'd like to see evals like DeepMind's run against the strongest pre-RL*F base models, since that actually tells you about capability.
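To make that concrete, here's a minimal sketch of what I have in mind, assuming a generic eval harness (`load_model` and `run_eval` are hypothetical placeholders, not any real API): run the same capability eval on the base checkpoint and its RL*F'd counterpart, and flag tasks where the tuned model scores much lower than the base model, since those are the places tuning may be concealing rather than removing capability.

```python
# Hypothetical sketch: compare the same capability eval across a base model and
# its RL*F'd counterpart. `load_model` and `run_eval` are placeholders supplied
# by whatever eval harness is actually in use.
def compare_base_vs_tuned(base_name, tuned_name, tasks, load_model, run_eval,
                          gap_threshold=0.1):
    base, tuned = load_model(base_name), load_model(tuned_name)
    report = {}
    for task in tasks:
        base_score = run_eval(base, task)
        tuned_score = run_eval(tuned, task)
        report[task] = {
            "base": base_score,
            "tuned": tuned_score,
            # A large base-over-tuned gap suggests the capability may be
            # concealed by tuning rather than genuinely absent.
            "possibly_concealed": base_score - tuned_score > gap_threshold,
        }
    return report
```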
The WWII generation is negligible in 2024. The actual effect is partly the inverted demographic pyramid (older population means more women than men even under normal circumstances), and partly that even young Russian men die horrifically often:
At 2005 mortality rates, for example, only 7% of UK men but 37% of Russian men would die before the age of 55 years
And for that, a major culprit is alcohol (leading to accidents and violence, but also literally drinking oneself to death).
Among the men who don't self-destruct, I imagine a large fraction have already been taken, meaning that the gender ratio among singles has to be off the charts.
That first statistic, that it swiped right 353 times and got to talk to 160 women, is completely insane. I mean, that’s almost a 50% match rate, whereas estimates in general are 4% to 14%.
Given Russia's fucked-up gender ratio (2.5 single women for every single man), I don't think it's that unreasonable!
Generally, the achievement of "guy finds a woman willing to accept a proposal" impresses me far less in Russia than it would in the USA. Let's see if this replicates in a competitive dating pool.
In high-leverage situations, you should arguably either be playing tic-tac-toe (simple, legible, predictable responses) or playing 4-D chess to win. If you're making really nonstandard and surprising moves (especially in PR), you have no excuse for winding up with a worse outcome than you would have if you'd acted in bog-standard normal ways.
(This doesn't mean suspending your ethics! Those are part of winning! But if you can't figure out how to win 4-D chess ethically, then you need to play an ethical tic-tac-toe strategy instead.)
Ah, I'm talking about introspection in a therapy context and not about exhorting others.
For example:
- Internal coherence: "I forgive myself for doing that stupid thing".
- Load-bearing but opaque: "It makes sense to forgive myself, and I want to, but for some reason I just can't".
- Load-bearing and clear resistance: "I want other people to forgive themselves for things like that, but when I think about forgiving myself, I get a big NOPE NOPE NOPE".
P.S. Maybe forgiving oneself isn't actually the right thing to do at the moment! But it will also be easier to learn that in the third case than in the second.
"I endorse endorsing X" is a sign of a really promising topic for therapy (or your preferred modality of psychological growth).
If I can simply say "X", then I'm internally coherent enough on that point.
If I can only say "I endorse X", then not-X is psychologically load-bearing for me, but often in a way that is opaque to my conscious reasoning, so working on that conflict can be slippery.
But if I can only say "I endorse endorsing X", then not only is not-X load-bearing for me, but there's a clear feeling of resistance to X that I can consciously hone in on, connect with, and learn about.
Re: Canadian vs American health care, the reasonable policy would be:
"Sorry, publicly funded health care won't cover this, because the expected DALYs are too expensive. We do allow private clinics to sell you the procedure, though unless you're super wealthy I think the odds of success aren't worth the cost to your family."
(I also approve of euthanasia being offered as long as it's not a hard sell.)
I think MIRI is correct to call it as they see it, both on general principles and because if they turn out to be wrong about genuine alignment progress being very hard, people (at large, but also including us) should update against MIRI's viewpoints on other topics, and in favor of the viewpoints of whichever AI safety orgs called it more correctly.
Prior to hiring Shear, the board offered a merger to Dario Amodei, with Dario to lead the merged entity. Dario rejected the offer.
I mean, I don't really care how much e.g. Facebook AI thinks they're racing right now. They're not in the game at this point.
The race dynamics are not just about who's leading. FB is 1-2 years behind (looking at LLM metrics), and it doesn't seem like they're getting further behind OpenAI/Anthropic with each generation, so I expect that the lag at the end will be at most a few years.
That means that if Facebook is unconstrained, the leading labs have only that much time to slow down for safety (or prepare a pivotal act) as they approach AGI before Facebook gets there with total recklessness.
If Microsoft!OpenAI lags the new leaders by less than FB (and I think that's likely to be the case), that shortens the safety window further.
I suspect my actual crux with you is your belief (correct me if I'm misinterpreting you) that your research program will solve alignment and that it will not take much of a safety window for the leading lab to incorporate the solution, and therefore the only thing that matters is finishing the solution and getting the leading lab on board. It would be very nice if you were right, but I put a low probability on it.
I'm surprised that nobody has yet brought up the development that the board offered Dario Amodei the position as a merger with Anthropic (and Dario said no!).
(There's no additional important content in the original article by The Information, so I linked the Reuters paywall-free version.)
Crucially, this doesn't tell us in what order the board made this offer to Dario and the other known figures (GitHub CEO Nat Friedman and Scale AI CEO Alex Wang) before getting Emmett Shear, but it's plausible that merging with Anthropic was Plan A all along. Moreover, I strongly suspect that the bad blood between Sam and the Anthropic team was strong enough that Sam had to be ousted in order for a merger to be possible.
So under this hypothesis, the board decided it was important to merge with Anthropic (probably to slow the arms race), booted Sam (using the additional fig leaf of whatever lies he's been caught in), immediately asked Dario and were surprised when he rejected them, did not have an adequate backup plan, and have been scrambling ever since.
P.S. Shear is known to be very much on record worrying that alignment is necessary and not likely to be easy; I'm curious what Friedman and Wang are on record as saying about AI x-risk.
No, I don't think the board's motives were power politics; I'm saying that they failed to account for the kind of political power moves that Sam would make in response.
In addition to this, Microsoft will exert greater pressure to extract mundane commercial utility from models, compared to pushing forward the frontier. Not sure how much that compensates for the second round of evaporative cooling of the safety-minded.
If they thought this would be the outcome of firing Sam, they would not have done so.
The risk they took was calculated, but man, are they bad at politics.
- The quote is from Emmett Shear, not a board member.
- The board is also following the "don't say anything literally false" policy by saying practically nothing publicly.
- Just as I infer from Shear's qualifier that the firing did have something to do with safety, I infer from the board's public silence that their reason for the firing isn't one that would win back the departing OpenAI members (or would only do so at a cost that's not worth paying).
- This is consistent with it being a safety concern shared by the superalignment team (who by and large didn't sign the statement at first) but not by the rest of OpenAI (who view pushing capabilities forward as a good thing, because like Sam they believe the EV of OpenAI building AGI is better than the EV of unilaterally stopping). That's my current main hypothesis.
It's too late for a conditional surrender now that Microsoft is a credible threat to get 100% of OpenAI's capabilities team; Ilya and Jan are communicating unconditional surrender because the alternative is even worse.
I agree, it's critical to have a very close reading of "The board did *not* remove Sam over any specific disagreement on safety".
This is the kind of situation where every qualifier in a statement needs to be understood as essential—if the statement were true without the word "specific", then I can't imagine why that word would have been inserted.
The most likely explanation I can think of, for what look like about-faces by Ilya and Jan this morning, is realizing that the worst plausible outcome is exactly what we're seeing: Sam running a new OpenAI at Microsoft, free of that pesky charter. Any amount of backpedaling, and even resigning in favor of a less safety-conscious board, is preferable to that.
They came at the king and missed.
Did anyone at OpenAI explicitly say that a factor in their release cadence was getting the public to wake up about the pace of AI research and start demanding regulation? Because this seems more like a post hoc rationalization for the release policy than like an actual intended outcome.
I expect AGI to emerge as part of the frontier model training run (and thus get a godshatter of human values), rather than only emerging after fine-tuning by a troll (and get a godshatter of reversed values), so I think "humans modified to be happy with something much cheaper than our CEV" is a more likely endstate than "humans suffering" (though, again, both much less likely than "humans dead").