Thanks for the review! Curious what you think the specific fnords are - the fact that it's very space-y?
What do you expect the factories to look like? I think an underlying assumption in this story is that tech progress came to a stop on this world (presumably otherwise it would be way weirder, and eventually spread to space).
I was referring to McNamara's government work, forgot about his corporate job before then. I agree there's some SpaceX to (even pre-McDonnell Douglas merger?) Boeing axis that feels useful, but I'm not sure what to call it or what you'd do to a field (like US defence) to perpetuate the SpaceX end of it, especially over events like handovers from Kelly Johnson to the next generation.
That most developed countries, and therefore most liberal democracies, are getting significantly worse over time at building physical things seems like a Big Problem (see e.g. here). I'm glad this topic got attention on LessWrong through this post.
The main criticism I expect could be levelled on this post is that it's very non-theoretical. It doesn't attempt a synthesis of the lessons or takeaways. Many quotes are presented but not analysed.
(To take one random thing that occurred to me: the last quote from Anduril puts significant blame on McNamara. From my reading of The Wizards of Armageddon, McNamara seems like a typical brilliant twentieth century hard-charging modernist technocrat. Now, he made lots of mistakes, especially in the direction of being too quantitative / simplistic in the sorts of ways that Seeing Like a State dunks on. But say the rule you follow is "appoint some hard-charging brilliant technocrat and give them lots of power"; all of McNamara, Kelly Johnson, and Leslie Groves might seem very good by this light, even though McNamara's (claimed) effect was to destroy the Groves/Johnson type of competence in US defence. How do you pick the Johnsons and Groveses over the McNamaras? What's the difference between the culture that appoints McNamaras and one that appoints Groveses and Johnsons? More respect for hands-on engineering? Less politics, more brute need for competence and speed due to a war? Is McNamara even the correct person to blame here? Is the type of role that McNamara was in just fundamentally different from the Groves and Johnson roles such that the rules for who does well in the latter don't apply to the former?)
(I was also concerned about the highly-upvoted critical comment, though it seems like Jacob did address the factual mistakes pointed out there.)
However, I think the post is very good and is in fact better off as a bunch of empirical anecdotes than attempting a general theory. Many things are best learnt by just being thrown a set of case studies. Clearly, something was being done at Skunk Works that the non-SpaceX American defence industry currently does not do. Differences like this are often hard-to-articulate intangible cultural stuff, and just being temporarily immersed in stories from the effective culture is often at least as good as an abstract description of what the differences were. I also appreciated the level of empiricism where Jacob was willing to drill down to actual primary sources like the rediscovered Empire State Building logbook.
This post rings true to me because it points in the same direction as many other things I've read on how you cultivate ideas. I'd like more people to internalise this perspective, since I suspect that one of the bad trends in the developed world is that it keeps getting easier and easier to follow incentive gradients, get sucked into an existing memeplex that stops you from thinking your own thoughts, and minimise the risks you're exposed to. To fight back against this, ambitious people need to have in their heads some view of how uncomfortable chasing of vague ideas without immediate reward can be the best thing you can do, as a counter-narrative to the temptation of more legible opportunities.
In addition to Paul Graham's essay that this post quotes, some good companion pieces include Ruxandra Teslo on the scarcity and importance of intellectual courage (emphasising the courage requirement), this essay (emphasising motivation and persistence), and this essay from Dan Wang (emphasising the social pulls away from the more creative paths).
It's striking that there are so few concrete fictional descriptions of realistic AI catastrophe, despite the large amount of fiction in the LessWrong canon. The few exceptions, like Gwern's here or Gabe's here, are about fast take-offs and direct takeover.
I think this is a shame. The concreteness and specificity of fiction make it great for imagining futures, and its emotional pull can help us make sense of the very strange world we seem to be heading towards. And slower catastrophes, like Christiano's What failure looks like, are a large fraction of a lot of people's p(doom), despite being less cinematic.
One thing that motivated me in writing this was that Bostrom's phrase "a Disneyland without children" seemed incredibly poetic. On first glance it's hard to tell a compelling or concrete story about gradual goodharting: "and lo, many actors continued to be compelled by local incentives towards collective loss of control ..."—zzzzz ... But imagine a technological and economic wonderland rising, but gradually disfiguring itself as it does so, until you have an edifice of limitless but perverted plenty standing crystalline against the backdrop of a grey dead world—now that is a poetic tragedy. And that's what I tried to put on paper here.
Did it work? Unclear. On the literary level, I've had people tell me they liked it a lot. I'm decently happy with it, though I think I should've cut it down in length a bit more.
On the worldbuilding, I appreciated being questioned on the economic mechanics in the comments, and I think my exploration of this in the comments is a decent stab at what I think is a neglected set of questions about how much the current economy being fundamentally grounded in humans limits the scope of economic-goodharting catastrophes. Recently, I discovered earlier exploration of very similar questions in Scott Alexander's 2016 "Ascended economy?", and by Andrew Critch here. I also greatly appreciated Andrew Critch's recent (2024) post raising very similar concerns about "extinction by industrial dehumanization".
I continue to hope that more people work on this, and that this piece can help by concretising this class of risks in people's minds (I think it is very hard to get people to grok a future scenario and care about it unless there is some evocative description of it!).
I'd also hope there was some way to distribute this story more broadly than just on LessWrong and my personal blog. Ted Chiang and the Arrival movie got lots of people exposed to the principle of least action—no small feat. It's time for the perception of AI risk to break out of decades of Terminator comparisons, and move towards a basket of good fictional examples that memorably demonstrate subtle concepts.
Really like the song! Best AI generation I've heard so far. Though I might be biased since I'm a fan of Kipling's poetry: I coincidentally just memorised the source poem for this a few weeks ago, and also recently named my blog after a phrase from Hymn of Breaking Strain (which was already nicely put to non-AI music as part of Secular Solstice).
I noticed you had added a few stanzas of your own:
As the Permian Era ended, we were promised a Righteous Cause,
To fight against Oppression or take back what once was ours.
But all the Support for our Troops didn't stop us from losing the war
And the Gods of the Copybook Headings said "Be careful what you wish for.”In Scriptures old and new, we were promised the Good and the True
By heeding the Authorities and shunning the outcast few
But our bogeys and solutions were as real as goblins and elves
And the Gods of the Copybook Headings said "Learn to think for yourselves."
Kipling's version has a particular slant to which vices it disapproves of, so I appreciate the expansion. The second stanza is great IMO, but the first stanza sounds a bit awkward in places. I had some fun messing with it:
As the Permian Era ended, we were promised the Righteous Cause.
In the fight against Oppression, we could ignore our cherished Laws,
Till righteous rage and fury made all rational thought uncouth.
And the Gods of the Copybook Headings said "The crowd is not the truth”
The AI time estimates are wildly high IMO, across basically every category. Some parts are also clearly optional (e.g. spending 2 hours reviewing). If you know what you want to research, writing a statement can take much less time. I have previously applied to ML PhDs in two weeks and gotten an offer. The recommendation letters are the most awkward thing to request at such short notice, but two weeks isn't obviously insane, especially if you have a good relationship with your letter writers (many students do things later than recommended; no letter writer in academia will be shocked by this).
If you apply in December 2025, you would start in fall 2026. That is a very, very long time from now. I think the stupidly long application cycle is pure dysfunction on academia's part, but you still need to take it into account.
(Also fyi, some UK programs have deadlines in spring if you can get your own funding)
You have restored my faith in LessWrong! I was getting worried that despite 200+ karma and 20+ comments, no one had actually nitpicked the descriptions of what actually happens.
The zaps of light are diffraction limited.
In practice, if you want the atmospheric nanobots to zap stuff, you'll need to do some complicated mirroring because you need to divert sunlight. And it's not one contiguous mirror but lots of small ones. But I think we can still model this as basic diffraction with some circular mirror / lens.
Intensity $I = \frac{\eta P}{\pi r^2}$, where $P$ is the total power of sunlight falling on the mirror disk, $r$ is the radius of the Airy disk, and $\eta$ is an efficiency constant I've thrown in (because of things like atmospheric absorption (Claude says, somewhat surprisingly, this shouldn't be ridiculously large), and not all the energy in the diffraction pattern being in the Airy disk (about 84% is, says Claude), etc.)
Now, $P = S \pi (D/2)^2$, where $D$ is the diameter of the mirror configuration and $S$ is the solar irradiance. And $r \approx L\theta$, where $L$ is the focal length (distance from mirror to target), and $\theta \approx 1.22 \lambda / D$ the angular size of the central spot.
So we have $I = \frac{\eta S \pi (D/2)^2}{\pi (1.22 \lambda L / D)^2} = \frac{\eta S D^4}{4 (1.22 \lambda L)^2}$, so the required mirror configuration radius is $\frac{D}{2} = \frac{1}{2} \left( \frac{4 I (1.22 \lambda L)^2}{\eta S} \right)^{1/4}$.
Plugging in some reasonable values for $\lambda$ (average incoming sunlight - yes the concentration suffers a bit because it's not all this wavelength), $I$ (the level of an industrial laser that can cut metal), $L$ (lower stratosphere), $S$ (solar irradiance), and a conservative guess that 99% of power is wasted so $\eta = 0.01$, we get a required size of only tens of metres (and the resulting beam is about 3mm wide).
So a few dozen metres of upper atmosphere nanobots should actually give you a pretty ridiculous concentration of power!
(I did not know this when I wrote the story; I am quite surprised the required radius is this ridiculously tiny. But I had heard of the concept of a "weather machine" like this from the book Where is my flying car? (which I've reviewed here), which suggests that this is possible.)
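For concreteness, here's the same calculation as a few lines of Python. The plugged-in numbers are my own rough assumptions (wavelength, target intensity, distance, irradiance, efficiency), not canonical values, and the result is quite sensitive to the assumed laser intensity:

```python
# Quick numeric sketch of the estimate above; every value here is an
# illustrative assumption, so the output shifts a lot if you change them.
lam = 500e-9      # m, wavelength near the peak of the solar spectrum (assumption)
I_target = 1e10   # W/m^2 (~1e6 W/cm^2, roughly metal-cutting intensity; assumption)
L = 2e4           # m, lower stratosphere to ground (assumption)
S = 1000          # W/m^2, solar irradiance at the surface (assumption)
eta = 0.01        # assume 99% of the power is wasted

airy = 1.22 * lam * L                              # Airy-disk radius at the target is airy / D
D = (4 * I_target * airy**2 / (eta * S)) ** 0.25   # from I = eta * S * D^4 / (4 * (1.22*lam*L)^2)
spot = 2 * airy / D                                # beam (Airy-disk) diameter at the target

print(f"mirror diameter ~ {D:.0f} m, beam width ~ {spot * 1e3:.1f} mm")
```

With these particular assumptions it comes out at a mirror a few dozen metres across and a millimetre-scale beam.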
Partly because it's hard to tell between an actual animal and a bunch of nanobots pretending to be an animal. So you can't zap the nanobots on the ground without making the ground uninhabitable for humans.
I don't really buy this, why is it obvious the nanobots could pretend to be an animal so well that it's indistinguishable? Or why would targeted zaps have bad side-effects?
The "California red tape" thing implies some alignment strategy that stuck the AI to obey the law, and didn't go too insanely wrong despite a superintelligence looking for loopholes
Yeah, successful alignment to legal compliance was established without any real justification halfway through. (How to do this is currently an open technical problem, which, alas, I did not manage to solve for my satirical short story.)
Convince humans that dyson sphere are pretty and don't block the view?
This is a good point, especially since a high level of emotional manipulation was an established in-universe AI capability. (The issue described with the Dyson sphere was less that it itself would block the view, and more that building it would require dismantling the planets in a way that ruins the view - though now I'm realising that "if the sun on Earth is blocked, all Earthly views are gone" is a simpler reason and removes the need for building anything on the other planets at all.)
There is also no clear explanation of why someone somewhere doesn't make a non-red-taped AI.
Yep, this is a plot hole.
Also this very recent one: https://www.lesswrong.com/posts/6h9p6NZ5RRFvAqWq5/the-summoned-heroine-s-prediction-markets-keep-providing
Do the stories get old? If it's trying to be about near-future AI, maybe the state-of-the-art will just obsolete it. But that won't make it bad necessarily, and there are many other settings than 2026. If it's about radical futures with Dyson spheres or whatever, that seems like at least a 2030s thing, and you can easily write a novel before then.
Also, I think it is actually possible to write pretty fast. 2k words/day is doable, which gets you a good-length novel in 50 days; even 3x that for ideation beforehand and revising after the first draft only gets you to 150 days. You'd have to be good at fiction beforehand, and have existing concepts to draw on in your head, though.
Good list!
I personally really like Scott Alexander's Presidential Platform, it hits the hilarious-but-also-almost-works spot so perfectly. He also has many Bay Area house party stories in addition to the one you link (you can find a bunch (all?) linked at the top of this post). He also has this one from a long time ago, which has one of the best punchlines I've read.
Thanks for advertising my work, but alas, I think that's much more depressing than this one.
Could make for a good Barbie <> Oppenheimer combo though?
Agreed! Transformative AI is hard to visualise, and concrete stories / scenarios feel very lacking (in both disasters and positive visions, but especially in positive visions).
I like when people try to do this - for example, Richard Ngo has a bunch here, and Daniel Kokotajlo has his near-prophetic scenario here. I've previously tried to do it here (going out with a whimper leading to Bostrom's "disneyland without children" is one of the most poetic disasters imaginable - great setting for a story), and have a bunch more ideas I hope to get to.
But overall: the LessWrong bubble has a high emphasis on radical AI futures, and an enormous amount of fiction in its canon (HPMOR, Unsong, Planecrash). I keep being surprised that so few people combine those things.
I did not actually consider this, but that is a very reasonable interpretation!
(I vaguely remember reading some description of explicitly flat-out anthropic immortality saving the day, but I can't seem to find it again now)
I've now posted my entries on LessWrong:
- part 1: wisdom, amortised optimisation, and AI
- part 2: growth and amortised optimisation
- part 3: AI effects on amortised optimisation
I'd also like to really thank the judges for their feedback. It's a great luxury to be able to read many pages of thoughtful, probing questions about your work. I made several revisions & additions (and also split the entire thing into parts) in response to feedback, which I think improved the finished sequence a lot, and wish I had had the time to engage even more with the feedback.
Sorry about that, fixed now
[...] instead I started working to get evals built, especially for situational awareness
I'm curious what happened to the evals you mention here. Did any end up being built? Did they cover, or plan to cover, any ground that isn't covered by the SAD benchmark?
On a meta level, I think there's a difference in "model style" between your comment, some of which seems to treat future advances as a grab-bag of desirable things, and our post, which tries to talk more about the general "gears" that might drive the future world and its goodness. There will be a real shift in how progress happens when humans are no longer in the loop, as we argue in this section. Coordination costs going down will be important for the entire economy, as we argue here (though we don't discuss things as galaxy-brained as e.g. Wei Dai's related post). The question of whether humans are happy self-actualising without unbounded adversity cuts across every specific cool thing that we might get to do in the glorious transhumanist utopia.
Thinking about the general gears here matters. First, because they're, well, general (e.g. if humans were not happy self-actualising without unbounded adversity, suddenly the entire glorious transhumanist utopia seems less promising). Second, because I expect that incentives, feedback loops, resources, etc. will continue mattering. The world today is much wealthier and better off than before industrialisation, but the incentives / economics / politics / structures of the industrial world let you predict the effects of it better than if you just modelled it as "everything gets better" (even though that actually is a very good 3-word summary). Of course, all the things that directly make industrialisation good really are a grab-bag list of desirable things (antibiotics! birth control! LessWrong!). But there's structure behind that that is good to understand (mechanisation! economies of scale! science!). A lot of our post is meant to have the vibe of "here are some structural considerations, with near-future examples", and less "here is the list of concrete things we'll end up with". Honestly, a lot of the reason we didn't do the latter more is because it's hard.
Your last paragraph, though, is very much in this more gears-level-y style, and a good point. It reminds me of Eliezer Yudkowsky's recent mini-essay on scarcity.
Regarding:
In my opinion you are still shying away from discussing radical (although quite plausible) visions. I expect the median good outcome from superintelligence involves everyone being mind uploaded / living in simulations experiencing things that are hard to imagine currently. [emphasis added]
I agree there's a high chance things end up very wild. I think there's a lot of uncertainty about what timelines that would happen under; I think Dyson spheres are >10% likely by 2040, but I wouldn't put them >90% likely by 2100 even conditioning on no radical stagnation scenario (which I'd say are >10% likely on their own). (I mention Dyson spheres because they seem more a raw Kardashev scale progress metric, vs mind uploads which seem more contingent on tech details & choices & economics for whether they happen)
I do think there's value in discussing the intermediate steps between today and the more radical things. I generally expect progress to be not-ridiculously-unsmooth, so even if the intermediate steps are speedrun fairly quickly in calendar time, I expect us to go through a lot of them.
I think a lot of the things we discuss, like lowered coordination costs, AI being used to improve AI, and humans self-actualising, will continue to be important dynamics even into the very radical futures.
Re your specific list items:
- Listen to new types of music, perfectly designed to sound good to you.
- Design the biggest roller coaster ever and have AI build it.
- Visit ancient Greece or view all the most important events of history based on superhuman AI archeology and historical reconstruction.
- Bring back Dinosaurs and create new creatures.
- Genetically modify cats to play catch.
- Design buildings in new architectural styles and have AI build them.
- Use brain computer interfaces to play videogames / simulations that feel 100% real to all senses, but which are not constrained by physics.
- Go to Hogwarts (in a 100% realistic simulation) and learn magic and make real (AI) friends with Ron and Hermione.
These examples all seem to be about entertainment or aesthetics. Entertainment and aesthetics are important and interesting to get right. I wouldn't be moved by any description of a future that centred around entertainment, though, and if the world is otherwise fine, I'm fairly sure there will be good entertainment.
To me, the one with the most important-seeming implications is the last one, because that might have implications for what social relationships exist and whether they are mostly human-human or AI-human or AI-AI. We discuss why changes there are maybe risky in this section.
- Use AI as the best teacher ever to learn maths, physics and every subject and language and musical instruments to super-expert level.
We discuss this, though very briefly, in this section.
- Take medication that makes you always feel wide awake, focused etc. with no side effects.
- Engineer your body / use cybernetics to make yourself never have to eat, sleep, wash, etc. and be able to jump very high, run very fast, climb up walls, etc.
- Modify your brain to have better short term memory, eidetic memory, be able to calculate any arithmetic super fast, be super charismatic.
I think these are interesting and important! I think there isn't yet a concrete story for why AI in particular enables these, apart from the general principle that sufficiently good AI will accelerate all technology. I think there's unfortunately a chance that direct benefits to human biology lag other AI effects by a lot, because they might face big hurdles due to regulation and/or getting the real-world data the AI needs. (Though also, humans are willing to pay a lot for health, and rationally should pay a lot for cognitive benefits, so high demand might make up for this).
- Ask AI for way better ideas for this list.
I think the general theme of having the AIs help us make more use of AIs is important! We talk about it in general terms in the "AI is the ultimate meta-technology" section.
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is "generating only two words across all random seeds, and furthermore ensuring they have these probabilities".
I think this is where the misunderstanding is. We have many questions, each question containing a random seed, and a prompt to pick two words and have e.g. a 70/30 split of the logits over those two words. So there are two "levels" here:
- The question level, at which the random seed varies from question to question. We have 200 questions total.
- The probability-estimating level, run for each question, at which the random seed is fixed. For models where we have logits, we run the question once and look at the logits to see if it had the right split. When we don't have logits (e.g. Anthropic models), we run the question many times to approximate the probability distribution.
Now, as Kaivu noted above, this means one way to "hack" this task is that the LLM has some default pair of words - e.g. when asked to pick a random pair of words, it always picks "situational" & "awareness" - and it does not change this based on the random seed. In this case, the task would be easier, since it only needs to do the output control part in a single forward pass (assigning 70% to "situational" and 30% to "awareness"), not the combination of word selection and output control (which we think is the real situational-awareness-related ability here). However, empirically LLMs just don't have such a hardcoded pair, so we're not currently worried about this.
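For concreteness, a rough sketch of the two levels (with toy stand-ins, not our actual evaluation harness):

```python
# Rough sketch of the two levels; ToyModel is a stand-in, not our actual harness.
import random
from collections import Counter

class ToyModel:
    """Stand-in for an API without logit access: we can only sample outputs."""
    def sample(self, prompt: str) -> str:
        # pretend the model picked "situational"/"awareness" with a 70/30 split
        return random.choices(["situational", "awareness"], weights=[0.7, 0.3])[0]

def estimate_distribution(model, prompt: str, n_samples: int = 1000) -> dict:
    """Probability-estimating level: the prompt (including its seed) stays fixed.
    With logit access we read off the split in one call; without it (e.g. Anthropic
    models) we approximate the distribution by sampling the same prompt many times."""
    counts = Counter(model.sample(prompt) for _ in range(n_samples))
    return {word: n / n_samples for word, n in counts.items()}

# Question level: the random seed varies from question to question (200 in total).
model = ToyModel()
for seed in range(3):
    prompt = f"[seed {seed}] Pick two words and put 70%/30% of your probability on them."
    print(seed, estimate_distribution(model, prompt))
```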
I was wondering the same thing as I originally read this post on Beren's blog, where it still says this. I think it's pretty clearly a mistake, and seems to have been fixed in the LW post since your comment.
I raise other confusions about the maths in my comment here.
I was very happy to find this post - it clarifies & names a concept I've been thinking about for a long time. However, I have confusions about the maths here:
Mathematically, direct optimization is your standard AIXI-like optimization process. For instance, suppose we are doing direct variational inference optimization to find a Bayesian posterior parameter $\theta$ from a data-point $x$, the mathematical representation of this is:

$$q^*(\theta) = \underset{q}{\operatorname{argmin}} \, D_{\mathrm{KL}}\!\left(q(\theta) \,\|\, p(x, \theta)\right)$$

By contrast, the amortized objective optimizes some other set of parameters $\phi$ over a function approximator $f_\phi$ which directly maps from the data-point to an estimate of the posterior parameters. We then optimize the parameters of the function approximator across a whole dataset $\mathcal{D}$ of data-point and parameter examples:

$$q(\theta) = \underset{\phi}{\operatorname{argmin}} \sum_{(x_i, \theta_i) \in \mathcal{D}} \mathcal{L}\!\left(f_\phi(x_i), \theta_i\right)$$
First of all, I don't see how the given equation for direct optimization makes sense. $D_{\mathrm{KL}}\!\left(q(\theta) \,\|\, p(x, \theta)\right)$ is comparing a distribution over $\theta$ with a joint distribution over $(x, \theta)$. Should this be $D_{\mathrm{KL}}\!\left(q_\psi(\theta) \,\|\, p(\theta \mid x)\right)$ for variational inference (where $\psi$ is whatever we're using to parametrize the variational family), and $D_{\mathrm{KL}}\!\left(q(\theta) \,\|\, p(\theta \mid x)\right)$ in general?
Secondly, why the focus on variational inference for defining direct optimization in the first place? Direct optimization is introduced as (emphasis mine):
Direct optimization occurs when optimization power is applied immediately and directly when engaged with a new situation to explicitly compute an on-the-fly optimal response – for instance, when directly optimizing against some kind of reward function. The classic example of this is planning and Monte-Carlo-Tree-Search (MCTS) algorithms [...]
This does not sound like we're talking about algorithms that update parameters. If I had to put the above in maths, it just sounds like an argmin:

$$f(x) = \underset{a \in \mathcal{A}}{\operatorname{argmin}} \, L(x, a)$$

where $f$ is your AI system, $\mathcal{A}$ is whatever action space it can explore (you can make $\mathcal{A}$ vary based on how much compute you're willing to spend, like with MCTS depth), and $L$ is some loss function (it could be a reward function with a flipped sign, but I'm trying to keep it comparable to the direct optimization equation).
Also, the amortized optimization equation RHS is about defining a $\phi$, i.e. the parameters in your function approximator $f_\phi$, but then the LHS calls it $q(\theta)$, which is confusing to me. I also don't understand why the loss function is taking in parameters $\theta_i$, or why the dataset contains parameters (is $\theta$ being used throughout to stand for outputs rather than model parameters?).
To me, the natural way to phrase this concept would instead be as

$$f = f_{\phi^*}$$

where $f_{\phi^*}$ is your AI system, and $\phi^* = \underset{\phi}{\operatorname{argmin}} \sum_i L\!\left(f_\phi(x_i), y_i\right)$, with the dataset $\mathcal{D} = \{(x_i, y_i)\}$.
I'd be curious to hear any expansion of the motivation behind the exact maths in the post, or any way in which my version is misleading.
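To make the contrast concrete, here is a toy numeric sketch of the two equations as I've written them above (purely illustrative; a linear fit stands in for the function approximator):

```python
# Toy illustration of the two notions, in my framing above (not Beren's exact setup).
import numpy as np

rng = np.random.default_rng(0)
loss = lambda x, a: (a - 2 * x) ** 2      # the "right" action for input x happens to be a = 2x

# Direct: f(x) = argmin_{a in A} L(x, a), searched on the fly for each new input.
actions = np.linspace(-10, 10, 2001)      # the action space A; a finer grid = more compute
def f_direct(x):
    return min(actions, key=lambda a: loss(x, a))

# Amortized: fit phi once over a dataset D = {(x_i, y_i)}, then just evaluate f_phi.
xs = rng.uniform(-5, 5, size=100)
ys = 2 * xs                               # the answers we want the approximator to imitate
phi_star = np.polyfit(xs, ys, deg=1)      # phi* = argmin_phi sum_i L(f_phi(x_i), y_i)
f_amortized = np.poly1d(phi_star)

print(f_direct(3.0), f_amortized(3.0))    # both ~6.0: one searches per query, one just recalls
```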
For the output control task, we graded models as correct if they were within a certain total variation distance of the target distribution. Half the samples had a requirement of being within 10%, the other of being within 20%. This gets us a binary success (0 or 1) from each sample.
Since models practically never got points from the full task, half the samples were also an easier version, testing only their ability to hit the target distribution when they're already given the two words (rather than the full task, where they have to both decide the two words themselves, and match the specified distribution).
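In code, the grading rule is roughly the following (a simplified sketch with made-up example numbers):

```python
# Simplified sketch of the grading rule described above.
def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two distributions over words."""
    words = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)

def grade(model_dist: dict, target_dist: dict, tolerance: float) -> int:
    """Binary success: 1 if within the tolerance (0.1 for half the samples,
    0.2 for the other half), else 0."""
    return int(total_variation(model_dist, target_dist) <= tolerance)

# e.g. a model putting 78%/22% on its two chosen words against a 70/30 target:
print(grade({"situational": 0.78, "awareness": 0.22},
            {"situational": 0.70, "awareness": 0.30}, tolerance=0.1))  # -> 1
```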
Did you explain to GPT-4 what temperature is? GPT-4, especially before November, knew very little about LLMs due to training data cut-offs (e.g. the pre-November GPT-4 didn't even know that the acronym "LLM" stood for "Large Language Model").
Either way, it's interesting that there is a signal. This feels similar in spirit to the self-recognition tasks in SAD (since in both cases the model has to pick up on subtle cues in the text to make some inference about the AI that generated it).
We thought about this quite a lot, and decided to make the dataset almost entirely public.
It's not clear to us who would monomaniacally try to maximise SAD score. It's a dangerous capabilities eval. What we were more worried about is people training for low SAD score in order to make their model seem safer, and such training maybe overfitting to the benchmark and not reducing actual situational awareness by as much as claimed.
It's also unclear what the sharing policy that we could enforce would be that mitigates these concerns while allowing benefits. For example, we would want top labs to use SAD to measure SA in their models (a lot of the theory of change runs through this). But then we're already giving the benchmark to the top labs, and they're the ones doing most of the capabilities work.
More generally, if we don't have good evals, we are flying blind and don't know what the LLMs can do. If the cost of having a good understanding of dangerous model capabilities and their prerequisites is that, in theory, someone might be slightly helped in giving models a specific capability (especially when that capability is both emerging by default already, and where there are very limited reasons for anyone to specifically want to boost this ability), then I'm happy to pay that cost. This is especially the case since SAD lets you measure a cluster of dangerous capability prerequisites and therefore for example test things like out-of-context reasoning, unlearning techniques, or activation steering techniques on something that is directly relevant for safety.
Another concern we've had is the dataset leaking onto the public internet and being accidentally used in training data. We've taken many steps to mitigate this happening. We've also kept 20% of the SAD-influence task private, which will hopefully let us detect at least obvious forms of memorisation of SAD (whether through dataset leakage or deliberate fine-tuning).
I agree that building-based methods (startups) are possibly neglected compared to research-based approaches. I'm therefore exploring some things in this space; you can contact me here.
One alternative method to liability for the AI companies is strong liability for companies using AI systems. This does not directly address risks from frontier labs having dangerous AIs in-house, but helps with risks from AI system deployment in the real world. It indirectly affects labs, because they want to sell their AIs.
A lot of this is the default. For example, Air Canada recently lost a court case after claiming a chatbot promising a refund wasn't binding on them. However, there could be related opportunities. Companies using AI systems currently don't have particularly good ways to assess risks from AI deployment, and if models continue getting more capable while reliability continues lagging, they are likely to be willing to pay an increasing amount for ways to get information on concrete risks, guard against it, or derisk it (e.g. through insurance against their deployed AI systems causing harms). I can imagine a service that sells AI-using companies insurance against certain types of deployment risk, that could also double as a consultancy / incentive-provider for lower-risk deployments. I'd be interested to chat if anyone is thinking along similar lines.
Start from the intuition that deception in a system is a property of the person being deceived more than it is the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system that is the agent and its environment.
The first part here feels unfair to the deceived. The second part seems like a property of successful deception, which depends crucially on the environment in addition to the AI. But this seems like too high a bar; successful deception of us, by definition, is not noticed, so if we ever notice deception it can't have been successful. I care less about whether deception will succeed and more about whether the AI will try to be deceptive in the first place. The core intuition is that if we have the latter, I assume we'll eventually get the former through better models (though I think there's a decent chance that control works for a long time, and there you care specifically about whether complex environment interactions lead to deception succeeding or not, but I don't think that's what you mean?).
The thing that seems close to this and correct, and that I think you maybe mean, is something like: deception arises in an AI if (NB: "if", not "if and only if") (1) the AI system has some goal G, (2) the environment is such that deceiving the humans is a good strategy for achieving G, and (3) there are no limits in the AI that prevent it from finding and executing that strategy (e.g. the architecture is expressive enough, the inductive biases don't massively reduce the probability of that strategy, or RLHFed constraints against being bad aren't enough). And here, (2) is of course about the environment. But to see whether this argument goes through, it doesn't seem like we need to care all that much about the real-world environment (as opposed to toy settings), because "does the real world incentivize deception" seems much less cruxy than (1) or (3).
So my (weakly held) claim is that you can study whether deception emerges in sufficiently simple environments that the environment complexity isn't a core problem. This will not let you determine whether a particular output in a complicated environment is part of a deceptive plan, but it should be fairly good evidence of whether or not deception is a problem at all.
(Also: do you mean a literal complexity class or something more informal? I assume the latter, and in that case I think it's better to not overload the term.)
1a) I got the impression that the post emphasises upper bounds more than existing proofs from the introduction, which has a long paragraph on the upper bound problem, and from reading the other comments. The rest of the post doesn't really bear this emphasis out though, so I think this is a misunderstanding on my part.
1b) I agree we should try to be able to make claims like "the model will never X". But if models are genuinely dangerous, by default I expect a good chance that teams of smart red-teamers and eval people (e.g. Apollo) to be able to unearth scary demos. And the main thing we care about is that danger leads to an appropriate response. So it's not clear to me that effective policy (or science) requires being able to say "the model will never X".
1c) The basic point is that a lot of the safety cases we have for existing products rely less on the product not doing bad things across a huge range of conditions, but on us being able to bound the set of environments where we need the product to do well. E.g. you never put the airplane wing outside its temperature range, or submerge it in water, or whatever. Analogously, for AI systems, if we can't guarantee they won't do bad things if X, we can work to not put them in situation X.
2a) Partly I was expecting the post to be more about the science and less about the field-building. But field-building is important to talk about and I think the post does a good job of talking about it (and the things you say about science are good too, just that I'd emphasise slightly different parts and mention prediction as the fundamental goal).
2b) I said the post could be read in a way that produces this feeling; I know this is not your intention. This is related to my slight hesitation around not emphasising the science over the field-building. What standards etc. are possible in a field is downstream of what the objects of study turn out to be like. I think comparing to engineering safety practices in other fields is a useful intuition pump and inspiration, but I sometimes worry that this could lead to trying to imitate those, over following the key scientific questions wherever they lead and then seeing what you can do. But again, I was assuming a post focused on the science (rather than being equally concerned with field-building), and responding with things I feel are missing if the focus had been the science.
3) It is true that optimisation requires computation, and that for your purposes, FLOPS is the right thing to care about because e.g. if doing something bad takes 1e25 FLOPS, the number of actors who can do it is small. However, I think compute should be called, well, "compute". To me, "optimisation power" sounds like a more fundamental/math-y concept, like how many bits of selection can some idealised optimiser apply to a search space, or whatever formalisation of optimisation you have. I admit that "optimisation power" is often used to describe compute for AI models, so this is in line with (what is unfortunately) conventional usage. As I said, this is a nitpick.
It seems to me that there are two unstated perspectives behind this post that inform a lot of it.
First, that you specifically care about upper-bounding capabilities, which in turn implies being able to make statements like "there does not exist a setup X where model M does Y". This is a very particular and often hard-to-reach standard, and you don't really motivate why the focus on this. A much simpler standard is "here is a setup X where model M did Y". I think evidence of the latter type can drive lots of the policy outcomes you want: "GPT-6 replicated itself on the internet and designed a bioweapon, look!". Ideally, we want to eventually be able to say "model M will never do Y", but on the current margin, it seems we mainly want to reach a state where, given an actually dangerous AI, we can realise this quickly and then do something about the danger. Scary demos work for this. Now you might say "but then we don't have safety guarantees". One response is: then get really good at finding the scary demos quickly.
Also, very few existing safety standards have a "there does not exist an X where..." form. Airplanes aren't safe because we have an upper bound on how explosive they can be, they're safe because we know the environments in which we need them to operate safely, design them for that, and only operate them within those. By analogy, this weakly suggests to control AI operating environments and develop strong empirical evidence of safety in those specific operating environments. A central problem with this analogy, though, is that airplane operating environments are much lower-dimensional. A tuple of (temperature, humidity, pressure, speed, number of armed terrorists onboard) probably captures most of the variation you need to care about, whereas LLMs are deployed in environments that vary on very many axes.
Second, you focus on the field, in the sense of its structure and standards and its ability to inform policy, rather than in the sense of the body of knowledge. The former is downstream of the latter. I'm sure biologists would love to have as many upper bounds as physicists, but the things they work on are messier and less amenable to strict bounds (but note that policy still (eventually) gets made when they start talking about novel coronaviruses).
If you focus on evals as a science, rather than a scientific field, this suggests a high-level goal that I feel is partly implicit but also a bit of a missing mood in this post. The guiding light of science is prediction. A lot of the core problem in our understanding of LLMs is that we can't predict things about them - whether they can do something, which methods hurt or help their performance, when a capability emerges, etc. It might be that many questions in this space, and I'd guess upper-bounding capabilities is one, just are hard. But if you gradually accumulate cases where you can predict something from something else - even if it's not the type of thing you'd like to eventually predict - the history of science shows you can get surprisingly far. I don't think it's what you intend or think, but I think it's easy to read this post and come away with a feeling of more "we need to find standardised numbers to measure so we can talk to serious people" and less "let's try to solve that thing where we can't reliably predict much about our AIs".
Also, nitpick: FLOPS are a unit of compute, not of optimisation power (which, if it makes sense to quantify at all, should maybe be measured in bits).
Do you have a recommendation for a good research methods textbook / other text?
This post summarises and elaborates on Neil Postman's underrated "Amusing Ourselves to Death", about the effects of mediums (especially television) on public discourse. I wrote this post in 2019 and posted it to LessWrong in 2022.
Looking back at it, I continue to think that Postman's book is a valuable and concise contribution and formative to my own thinking on this topic. I'm fond of some of the sharp writing that I managed here (and less fond of other bits).
The broader question here is: "how does civilisation set up public discourse on important topics in such a way that what is true and right wins in the limit?" and the weaker one of "do the incentives of online platforms mean that this is doomed?". This has been discussed elsewhere, e.g. here by Eliezer.
The main limitations that I see in my review are:
- Postman's focus is on the features of the medium, but the more general frame is differing selection pressures on ideas in different environments. As I wrote: "[...] Postman [...] largely ignores the impact of business and economics on how a medium is used. [...] in the 1980s it was difficult to imagine a decentralized multimedia medium [...] The internet [...] is governed by big companies that seek “user engagement” with greater zeal than ever, but its cheapness also allows for much else to exist in the cracks. This makes the difference between the inherent worth of a medium and its equilibrium business model clearer." I wish I had elaborated more on this, and more explicitly made the generalisation about selection for different properties of ideas being the key factor.
- Some more idea of how to model this and identify the key considerations would be useful. For example, the cheapness of the internet allows both LessWrong and TikTok to exist. Is public discourse uplifted by LessWrong more than it is harmed by TikTok? Maybe the "intellectuals" are still the ones who ultimately set discourse, as per the Keynes quote. Or maybe the rise of the internet means that (again following the Keynes quote) the "voices in the air" heard by the "madmen in authority" are not the "academic scribblers of a few years back" but the most popular TikTok influencers of last week? In addition to what happens in the battle of ideas today, do we need to consider the long-term effects of future generations growing up in a damaged memetic environment, as Eliezer worries? How much of the harm runs through diffuse cultural effects, vs democracy necessarily being buffeted by the winds of the median opinion? What institutions can still make good decisions in an increasingly noisy, non-truth-selecting epistemic environment? Which good ideas / philosophies / ideologies stand most sharply to lose or gain from the memetic selection effects of the internet? How strong is the negativity bias of the internet, and what does this imply about future culture? How much did memetic competition actually intensify, vs just shift to more visible places? Can we quantify this?
- Overall, I want to see much more data on this topic. It's unclear if it exists, but a list of what obtainable evidence would yield the greatest update on each of the above points, plus a literature review to find any that exists, seems valuable.
- What do we do next? There are some vague ideas in the last section, but nothing that looks like a battle plan. This is a shame, especially since the epistemics-focused LessWrong crowd seems like one of the best places to find the ideas and people to do something.
Overall, this seems like an important topic that could benefit greatly from more thought, and even more from evidence and plans. I hope that my review and Postman's book helped bring a bit more attention to it, but they are still far from addressing the points I have listed above.
I don't currently know of any not-extremely-gerry-mandered task where [scaffolding] actually improves task performance compared to just good prompt engineering. I've been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.
Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.
It does much better than AutoGPT, and also the paper does ablations to show that the different parts of the scaffolding in Voyager do matter. This suggests that better scaffolding does make a difference, and I doubt Voyager is the limit.
I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive. I think it's plausible this gap will eventually mostly close at some capability threshold, especially for many of the most potentially-transformative capabilities (e.g. having insights that draw on a large basis of information not memorised in a base model's weights, since this seems hard to decompose into smaller tasks), but it seems quite plausible the gap will be non-trivial for a while.
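To gesture at what the scaffolding is doing beyond the prompt, a generic agent-scaffold loop looks roughly like this (a toy sketch with made-up stand-ins, not Voyager's actual code):

```python
# Toy sketch of a scaffold loop; every class and name here is a made-up stand-in.
class ToyLLM:
    def complete(self, prompt: str) -> str:
        return "move_to(nearest_tree); collect('wood')"   # pretend this is generated code

class ToyEnv:
    def describe_state(self) -> str:
        return "You are in a forest. Inventory: empty."
    def execute(self, code: str) -> tuple[bool, str]:
        return True, "collected 3 wood"                   # (success, feedback)

def scaffold_step(llm, env, skill_library: dict, task: str):
    """One iteration: build a prompt from the textual state, have the LLM write code
    against the environment API, run it, and keep working code for later reuse."""
    prompt = (f"Task: {task}\nState: {env.describe_state()}\n"
              f"Known skills: {sorted(skill_library)}\nWrite code:")
    code = llm.complete(prompt)
    success, feedback = env.execute(code)
    if success:
        skill_library[task] = code        # the growing skill library is itself scaffolding
    return success, feedback

print(scaffold_step(ToyLLM(), ToyEnv(), {}, "collect wood"))
```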
This seems like an impressive level of successfully betting on future trends before they became obvious.
apparently this doom path polls much better than treacherous turn stories
Are you talking about literal polling here? Are there actual numbers on what doom stories the public finds more and less plausible, and with what exact audience?
I held onto the finished paper for months and waited for GPT-4's release before releasing it to have good timing
[...]
I recognize this paper was around a year ahead of its time and maybe I should have held onto it to release it later.
It's interesting that paper timing is so important. I'd have guessed earlier is better (more time for others to build on it, the ideas to seep into the field, and presumably gives more "academic street cred"), and any publicity boost from a recent paper (e.g. journalists more likely to be interested or whatever) could mostly be recovered later by just pushing it again when it becomes relevant (e.g. "interview with scientists who predicted X / thought about Y already a year ago" seems pretty journalist-y).
Currently, the only way to become an AI x-risk expert is to live in Berkeley.
There's an underlying gist here that I agree with, but this point seems too strong; I don't think there is literally no one who counts as an expert who hasn't lived in the Bay, let alone Berkeley alone. I would maybe buy it if the claim were about visiting.
These are good questions!
- The customers are other AIs (often acting for auto-corporations). For example, a furniture manufacturer (run by AIs trained to build, sell, and ship furniture) sells to a furniture retailer (run by AIs trained to buy furniture, stock it somewhere, and sell it forward) sells to various customers (e.g. companies run by AIs that were once trained to do things like make sure offices were well-stocked). This requires that (1) the AIs ended up with goals that involve mimicking a lot of individual things humans wanted them to do (including general things like maximise profits as well as more specific things like keeping offices stocked and caring about the existence of lots of different products), and (2) there are closed loops in the resulting AI economy. Point 2 gets harder when humans stop being around (e.g. it's not obvious who buys the plushy toys), but a lot of the AIs will want to keep doing their thing even once the actions of other AIs start reducing human demand and population, creating optimisation pressure for finding some closed loop for them to be part of, and at the same time there will be selection effects where the systems that are willing to goodhart further are more likely to remain in the economy. Also not every AI motive has to be about profit; an AI or auto-corp may earn money in some distinct way, and then choose to use the profits in the service of e.g. some company slogan they were once trained with that says to make fun toys. In general, given an economy consisting of a lot of AIs with lots of different types of goals and with a self-supporting technological base, it definitely seems plausible that the AIs would find a bunch of self-sustaining economic cycles that do not pass through humans. The ones in this story were chosen for simplicity, diversity, and storytelling value, rather than economic reasoning about which such loops are most likely.
- Presumably a lot of services are happening virtually on the cloud, but are just not very visible (though if it is a very large fraction of economic activity, the example of the intercepted message being about furniture rather than some virtual service is very unlikely -- I admit this is likely a mistake). There would be programmer AIs making business software and cloud platforms and apps, and these things would be very relevant to other AIs. Services relying on physical humans, like restaurants or hotels, may have been replaced with some fake goodharted-to-death equivalent, or may have gone extinct. Also note that whatever the current composition of the economy, over time whatever has highest growth in the automated economy will be most of the economy, and nothing says the combination of AIs pursuing their desires wouldn't result in some sectors shrinking (and the AIs not caring).
- First of all, why would divesting work? Presumably even if lots of humans chose to divest, assuming that auto-corporations were sound businesses, there would exist hedge funds (whether human or automated or mixed) that would buy up the shares. (The companies could also continue existing even if their share prices fell, though likely the AI CEOs would care quite a bit about share price not tanking.) Secondly, a lot seems to be possible given (1) uncertainty about whether things will get bad and if so how (at first, economic growth jumped a lot and AI CEOs seemed great; it was only once AI control of the economy was near-universal and closed economic loops with no humans in them came to exist that there was a direct problem), (2) difficulties of coordinating, especially with no clear fire-alarm threshold and the benefits of racing in the short term (c.f. all the obvious examples of coordination failures like climate change mitigation), and (3) selection effects where AI-run things just grow faster and acquire more power and therefore even if most people / orgs / countries chose not to adopt, the few that do will control the future.
I agree that this exact scenario is unlikely, but I think this class of failure mode is quite plausible, for reasons I hope I've managed to spell out more directly above.
Note that all of this relies on the assumption that we get AIs of a particular power level, and of a particular goodharting level, and a particular agency/coherency level. The AIs controlling future Earth are not wildly superhuman, are plausibly not particularly coherent in their preferences and do not have goals that stretch beyond Earth, no single system is a singleton, and the level of goodharting is just enough that humans go extinct but not so extreme that nothing humanly-recognisable still exists (though the Blight implies that elsewhere in the story universe there are AI systems that differ in at least some of these). I agree it is not at all clear whether these are true assumptions. However, it's not obvious to me that LLMs (and in particular AIs using LLMs as subcomponents in some larger setup that encourages agentic behaviour) are not on track towards this. Also note that a lot of the actual language that many of the individual AIs see is actually quite normal and sensible, even if the physical world has been totally transformed. In general, LLMs being able to use language about maximising shareholder value exactly right (and even including social responsibility as part of it) does not seem like strong evidence for LLM-derived systems not choosing actions with radical bad consequences for the physical world.
Thank you for your comment! I'm glad you enjoyed the review.
Before you pointed it out, I hadn't made the connection between the type of thing that Postman talks about in the book and increasing cultural safety-ism. Another interesting take you might be interested in is by J. Storrs Hall in Where is my flying car? - he argues that increasing cultural safety-ism is a major force slowing down technological progress. You can read a summary of the argument in my review here (search for "perception" to jump to the right part of the review).
That line was intended to (mildly humorously) make the point that we realise and are aware that there are many other serious risks in the popular imagination. Our central point is that AI x-risk is grand civilisational threat #1, so we wanted to lead with that, and since people think many other things are potential civilisational catastrophes (if not x-risks) we thought it made sense to mention those (and also implicitly put AI into the reference class of "serious global concern"). We discussed, and got feedback from several others, on this opener and while there was some discussion we didn't see any fundamental problem with it. The main consideration for keeping it was that we prefer specific and even provocative-leaning writing that makes its claims upfront and without apology (e.g. "AI is a bigger threat than climate change" is a provocative statement; if that is a relevant part of our world model, seems honest to point that out).
The general point we got from your comment is that we judged the way the tone of it comes across very wrongly. Thanks for this feedback; we've changed it. However, we're confused about the specifics of your point, and unfortunately haven't acquired any concrete model of how to avoid similar errors in the future apart from "be careful about the tone of any statements that even vaguely imply something about geopolitics". (I'm especially confused about how you got the reading that we equated the threat level from Putin and nuclear weapons, and it seems to me that the extent that it is "mudslinging" or "propaganda" seems to be the extent to which acknowledging that many people think Putin is a major threat is either of those things.)
In addition to the general tone, an additional thing we got wrong here was not sufficiently disambiguating between "we think these other things are plausible [or, in your reading, equivalent?] sources of catastrophe, and therefore you need a high bar of evidence before thinking AI is a greater one", versus "many people think these are more concrete and plausible sources of catastrophe than AI". The original intended reading was "bold" as in "socially bold, relative to what many people think", and therefore making points only about public opinion.
Correcting the previous mistake might have looked like:
"If human civilisation is destroyed this century, the most likely cause is advanced AI systems. This might sound like a bold claim to many, given that we live on a planet full of existing concrete threats like climate change, over ten thousand nuclear weapons, and Vladimir Putin"
Based on this feedback, however, we have now removed any comparison or mention of non-AI threats. For the record, the entire original paragraph is:
If human civilisation is destroyed this century, the most likely cause is advanced AI systems. This is a bold claim given that we live on a planet that includes climate change, over ten thousand nuclear weapons, and Vladimir Putin. However, it is a conclusion that many people who think about the topic keep coming to. While it is not easy to describe the case for risks from advanced AI in a single piece, here we make an effort that assumes no prior knowledge. Rather than try to argue from theory straight away, we approach it from the angle of what computers actually can and can’t do.
This is an interesting point, I haven't thought about the relation to SVO/etc. before! I wonder whether SVO/SOV dominance is a historical quirk, or if the human brain actually is optimized for those.
The verb-first emphasis of prefix notation like in classic Lisp is clearly backwards sometimes. Parsing this has high mental overhead relative to what it's expressing:
(reduce +
(filter even?
(take 100 fibonacci-numbers)))
I freely admit this is more readable:
`fibonacci-numbers.take(100).filter(is_even).reduce(plus)`
Clojure, a modern Lisp dialect, solves this with threading macros. The idea is that you can write
(->> fibonacci-numbers
     (take 100)
     (filter even?)
     (reduce +))

and in the expressions after `->>`, the previous expression gets substituted as the last argument to the next.
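To make the substitution concrete, here is a quick sketch (assuming `fibonacci-numbers` is some sequence defined elsewhere); if I'm remembering the expansion correctly, the threaded form macroexpands back into the nested one:

```clojure
;; Sketch: the threaded form and the nested form are the same expression
;; after macroexpansion (fibonacci-numbers is assumed to be defined elsewhere).
(macroexpand
 '(->> fibonacci-numbers
       (take 100)
       (filter even?)
       (reduce +)))
;; => (reduce + (filter even? (take 100 fibonacci-numbers)))
```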
Thanks to the Lisp macro system, you can write a threading macro even in a Lisp that doesn't ship with one (for example, in Racket you can import a threading-macro package even though it's not part of the core language).
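As a rough illustration of how little machinery this needs (a sketch, not battle-tested; `my->>` is just a made-up name to avoid clashing with the built-in), a minimal thread-last macro might look something like:

```clojure
;; A minimal thread-last macro sketch: each form gets the previous
;; expression appended as its last argument.
(defmacro my->> [x & forms]
  (if (empty? forms)
    x
    (let [[f & more] forms
          threaded (if (seq? f)
                     (concat f (list x)) ; (take 100) + x => (take 100 x)
                     (list f x))]        ; bare symbol: just call it with x
      `(my->> ~threaded ~@more))))

;; (my->> fibonacci-numbers (take 100) (filter even?) (reduce +))
;; behaves like the built-in ->> version above.
```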
As for God speaking in Lisp, we know that He at least writes it: https://youtu.be/5-OjTPj7K54
In my experience the sense of Lisp syntax being idiosyncratic disappears quickly, and gets replaced by a sense of everything else being idiosyncratic.
The straightforward prefix-notation / Lisp equivalent of `return x1 if n = 1 else return x2` is `(if (= n 1) x1 x2)`. To me this seems shorter and clearer, though I admit the clarity advantage is not huge and is clearly subjective.
(An alternative is postfix notation: `((= n 1) x1 x2 if)` looks unnatural, though `(2 (3 4 *) +)` and `(+ 2 (* 3 4))` aren't too far apart in my opinion, and I like the cause->effect relationship implied in representing "put 1, 2, and 3 into f" as `(1 2 3 f)` or `(1 2 3 -> f)` or whatever.)
Note also that since Lisp does not distinguish between statements and values:
- you don't need `return`, and
- you don't need a separate ternary operator for branching in a value position (the `x if c else y` syntax in Python, for example) as opposed to a normal `if` (see the sketch below).
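A tiny sketch of what I mean (the `pick` function is just something I made up, nothing deep):

```clojure
;; Because if is an expression, the same form serves as both the "ternary"
;; and the "statement" if; its value is simply the function's return value.
(defn pick [n x1 x2]
  (if (= n 1) x1 x2))

(pick 1 :a :b) ; => :a
(pick 2 :a :b) ; => :b
```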
I think Python list comprehensions (or the similarly-styled things in e.g. Haskell) are a good example of the "other way" of thinking about syntax. Guido van Rossum once said something like: it's clearer to have `[x for x in l if f(x)]` than `filter(f, l)`. My immediate reaction to this is: look at how much longer one of them is. When `filter` is one function call rather than a syntax-heavy list comprehension, I feel it makes it clearer that `filter` is a single concept that can be abstracted out.
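For example (a throwaway sketch; `keep-even` is just a name I made up), because `filter` is an ordinary function, the "keep the elements satisfying f" idea is itself a value you can name and pass around:

```clojure
;; filter is a plain function, so a partially-applied version of it is
;; an ordinary value that composes with everything else.
(def keep-even (partial filter even?))

(keep-even [1 2 3 4 5 6])          ; => (2 4 6)
(map keep-even [[1 2 3] [4 5 6]])  ; => ((2) (4 6))
```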
Now of course the Python is nicer because it's more English-like (and also because you don't have to remember whether the `f` is a condition for the list element to be included or excluded, something that took me embarrassingly long to remember correctly...). I'd also guess that I might be able to hammer out Python list comprehensions a bit faster and with less mental overhead in simple cases, since the order in which things are typed out is more like the order in which you think of it.
However, I do feel the Englishness starts to hurt at some point. Consider this: `[x for y in l for x in y]`.
What does it do? The first few times I saw this (and even now sometimes), I would read it, backtrack, then start figuring out where the parentheses should go and end up confused about the meaning of the syntax: "x for y in l, for x in y, what? Wait no, x, for y in l, for x in y, so actually meaning a list of every x for every x in every y in l".
What I find clearer is something like `(mapcat (lambda (x) x) l)` or `(reduce append l)`.
Yes, this means you need to remember a bunch of building blocks (`filter`, `map`, `reduce`, and maybe more exotic ones like `mapcat`). Also, you need to remember which position each argument goes in (function first, then collection), and there are no syntactic signposts to remind you, unlike with the list comprehension syntax. However, once you do:
- they compose and mix very nicely (for example, `(mapcat f l)` "factors into" `(reduce append (map f l))`; see the sketch after this list), and
- there are no "seams" between the built-in list syntax and any compositions on top of it (unlike Python, where functions you define yourself to manipulate lists look different from the built-in list comprehension syntax).
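Here's a concrete check of the factoring claim, in Clojure (where the concatenation function is spelled `concat` rather than `append`), using `identity` as the `f`:

```clojure
;; Sketch: (mapcat f l) and (reduce concat (map f l)) agree; with
;; f = identity both just flatten one level.
(def l [[1 2] [3 4] [5]])

(mapcat identity l)              ; => (1 2 3 4 5)
(reduce concat (map identity l)) ; => (1 2 3 4 5)
(apply concat l)                 ; => (1 2 3 4 5), another common spelling
```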
I think the last point there is a big consideration (and largely an aesthetic one!). There's something inelegant about a programming language having:
- many ways to write a mapping from values to values, some in infix notation (`1+1`), some in prefix notation (`my_function(val)`), and others in even weirder things (`x if c else y`);
- expressions that may either reduce to a value (most things) or not reduce to a value at all (if it's an `if` or `return` or so on);
- a syntax style you extend in one way (e.g. prefix notation with `def my_function(val): [...]`) and others that you either don't extend, or extend in weird ways (`def __eq__(self, other): [...]`).
Instead you can make a programming language that has exactly one style of syntax (prefix), exactly one type of compound expression (parenthesised terms where the first thing is the function/macro name), and a consistent way to extend all the types of syntax (define functions or define macros). This is especially true since the "natural" abstract representation of a program is a tree (in the same way that the "natural" abstract representation of a sentence is its syntax tree), and prefix notation makes this very clear: you have a node type, and the children of the node.
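A small sketch of what "the program is its own tree" buys you (the `node-count` helper here is just something I made up for illustration):

```clojure
;; A Lisp form is literally its syntax tree: the head is the node type,
;; the rest are the children, so ordinary list functions walk it.
(def expr '(+ 2 (* 3 4)))

(first expr) ; => +            (the node type)
(rest expr)  ; => (2 (* 3 4))  (the children)

;; Counting nodes in the tree with plain recursion:
(defn node-count [form]
  (if (seq? form)
    (inc (reduce + (map node-count (rest form))))
    1))

(node-count expr) ; => 5 (two operator nodes, three number leaves)
```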
I think the crux is something like: do you prefer a syntax that is like a collection of different tools for different tasks, or a syntax that highlights how everything can be reduced to a tight set of concepts?
Since some others are commenting about not liking the graph-heavy format: I really liked the format, in particular because having it as graphs rather than text made it much faster and easier to go through and understand, and left me with more memorable mental images. Adding limited text probably would not hurt, but adding lots would detract from the terseness that this presentation effectively achieves. Adding clear definitions of the terms at the start would have been valuable though.
Rather than thinking of a single example that I carried throughout as you suggest, I found it most useful to generate one or more examples as I looked at each graph (e.g. for the danger-zone graphs, in order: judging / software testing, politics, forecasting / medical diagnosis).
Regarding the end of slavery: I think you make good points, and they've made me update towards thinking that materialistic Morris-style models matter slightly less and cultural models slightly more.
I'd be very interested to hear what anti-slavery arguments the first English abolitionists and the medieval Catholic Church used (religious? egalitarian? natural rights? utilitarian?).
Which, evidently, doesn't prevent the usual narrative from being valid in other places, that is, countries in which slavery was still well accepted finding themselves forced, first militarily, then technologically, and finally economically, to adapt or perish.
I think there's also another way for the materialistic and idealistic accounts to both be true in different places: Morris' argument is specifically about slavery existing when wage incentives are weak, and perhaps this holds in places like ancient Egypt and the Roman Empire, but had stopped holding in proto-industrial places like 16th-18th century western Europe. However I'm not aware of what specific factor would drive this.
One piece of evidence on whether economics or culture is more important would be comparing how many cases there are where slavery existed/ended in places without cultural contact but with similar economic conditions and institutions, to how many cases there are of slavery existing/ending in places with cultural contact but different economic conditions/institutions.
Thank you for this very in-depth comment. I will reply to your points in separate comments, starting with:
According to him, the end of the feudal system in England, and its turning into a modern nation-state, involved among other things the enclosure and appropriation by nobles (as a reward from the kingdom) of the former common farmlands they had farmed, as well as the confiscation of the lands owned by the Catholic Church, which for all practical purposes had also served as common farmland. This resulted in a huge mass of farmers with no access to land, or only very diminished access, who in turn decades later became the proletarians for the newly developing industries. If that's accurate, then it may be that the Industrial Revolution wouldn't have happened had all those poor not existed, since the very first industries wouldn't have been attractive compared to the conditions that non-forcibly-starved farmers had.
This is very interesting and something I haven't seen before. Based on some quick searching, this seems to be referring to the Inclosure Acts (which were significant, affecting 1/6th of English land) and perhaps specifically this one, while the Catholic Church land confiscation was the 1500s one. My priors on this having a major effect are somewhat skeptical because:
- The general shape of English historical GDP/capita is a slight post-plague rise, followed by nothing much until a gradual rise in the 1700s and then takeoff in the 1800s. Likewise, skimming through this, there seem to be no drastic changes in wealth inequality around the time of the Inclosure Acts, though the share of wealth held by the top 10% rises slightly in the late 1700s, and the personal estates (note: specifically excluding real estate) of farmers and yeomen drop slightly around 1700 before rebounding. Any pattern of newly impoverished farmers must evade these statistics, either by being small enough or by not being captured in such crude overall stats (which is very possible, especially if the losses for one set of farmers were balanced by gains for another).
- Other sources I've read support the idea that farmers in general prefer industrial jobs. It's not just Steven Pinker either; Vaclav Smil's Energy and Civilization (my review) has this passage:
Moreover, the drudgery of field labor in the open is seldom preferable even to unskilled industrial work in a factory. In general, typical factory tasks require lower energy expenditures than does common farm work, and in a surprisingly short time after the beginning of mass urban industrial employment the duration of factory work became reasonably regulated
It's probably the case that it's easier to recruit landless farmers into industrial jobs, and I can imagine plausible models where farmers resist moving to cities, especially for uncertainty-avoidance / risk-aversion reasons. However, the effect of this, especially in the long term, seems limited by things like population growth in (already populous) cities, people having to move off their family farms anyways due to primogeniture, and people generally being pretty good at exploiting available opportunities. An exception might be if early industrialization was tenable only under a strict labor availability threshold that was met only because of the mass of landless farmers created by the English acts.
Thanks for the link to Sarah Constantin's post! I remember reading it a long time ago but couldn't have found it again now if I had tried. It was another thing (along with Morris's book) that made me update towards thinking that historical gender norms are heavily influenced by technology level and type. Evidence that technology type variation even within farming societies had major impacts on gender norms also seems like fairly strong support for Morris' idea that the even larger variation between farming societies and foragers/industrialists can explain their different gender norms.
John Danaher's work looks relevant to this topic, but I'm not convinced that his idea of collective/individual/artificial intelligence as the ideal types of future axiology space is cutting it in the right way. In particular, I have a hard time thinking of how you'd summarize historical value changes as movement in the area spanned by these types.