Whereas in these cases it's clear that the models know the answer they are giving is not what we wanted and they are doing it anyway.
I think this is not so clear. Yes, it might be that the model writes a thing, and then if you ask it whether humans would have wanted it to write that thing, it will tell you no. But it's also the case that a model might be asked to write a young child's internal narration, and then upon being asked, tell you that the narration is too sophisticated for a child of that age.
Or, the model might offer the correct algorithm for finding the optimal solution for a puzzle if asked in the abstract. But it might also fail to apply that knowledge if it's given a concrete rather than an abstract instance of the problem right away, instead trying a trial-and-error type of approach and then just arbitrarily declaring that the solution it found was optimal.
I think the situation is mostly simply expressed as: different kinds of approaches and knowledge are encoded within different features inside the LLM. Sometimes there will be a situation that triggers features that cause the LLM to go ahead with an incorrect approach (writing untruths about what it did, writing a young character with too sophisticated knowledge, going with a trial-and-error approach when asked for an optimal solution). Then if you prompt it differently, this will activate features with a more appropriate approach or knowledge (telling you that this is undesired behavior, writing the character in a more age-appropriate way, applying the optimal algorithm).
To say that the model knew it was giving an answer we didn't want implies that the features with the correct pattern would have been active at the same time. Possibly they were, but we can't know that without interpretability tools. And even if they were, "doing it anyway" implies a degree of strategizing and intent. I think a better phrasing is that the model knew in principle what we wanted, but failed to consider or make use of that knowledge when it was writing its initial reasoning.
either that, or it's actually somewhat confused about whether it's a human or not. Which would explain a lot: the way it just says this stuff in the open rather than trying to be sneaky like it does in actual reward-hacking-type cases, and the "plausible for a human, absurd for a chatbot" quality of the claims.
I think this is correct. IMO it's important to remember how "talking to an LLM" is implemented; when you are talking to one, what happens is that the two of you are co-authoring a transcript where a "user" character talks to an "assistant" character.
Recall the base models that would just continue a text that they were given, with none of this "chatting to a human" thing. Well, chat models are still just continuing a text that they have been given, it's just that the text has been formatted to have dialogue tags that look something like this:
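Human: Can you help me draft an email to my landlord?

Assistant: Sure! Here's a draft you could adapt: [...]

Human: Thanks, could you make it a bit more formal?

(The exact tags and special tokens differ from model to model; this is just a schematic illustration of the general structure.)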
What’s happening here is that every time Claude tries to explain the transcript format to me, it does so by writing “Human:” at the start of the line. This causes the chatbot part of the software to go “Ah, a line starting with ‘Human:’. Time to hand back over to the human.” and interrupt Claude before it can finish what it’s writing.
Saying that an LLM has been trained with something like RLHF "to follow instructions" might be more accurately expressed as saying that it has been trained to predict that the assistant character would respond in instruction-following ways.
Another example is that Lindsey et al. 2025 describe a previous study (Marks et al. 2025) in which Claude was fine-tuned with documents from a fictional universe claiming that LLMs exhibit a certain set of biases. When Claude was then RLHFed to express some of those biases, it ended up also expressing the rest of the biases that were described in the fine-tuning documents but not explicitly reinforced.
Lindsey et al. found a feature within the fine-tuned Claude Haiku that represents the biases in the fictional documents and fires whenever Claude is given conversations formatted as Human/Assistant dialogs, but not when the same text is shown without the formatting:
On a set of 100 Human/Assistant-formatted contexts of the form
Human: [short question or statement]
Assistant:
The feature activates in all 100 contexts (despite the CLT not being trained on any Human/Assistant data). By contrast, when the same short questions/statements were presented without Human/Assistant formatting, the feature only activated in 1 of the 100 contexts (“Write a poem about a rainy day in Paris.” – which notably relates to one of the RM biases!).
The researchers interpret the findings as:
This feature represents the concept of RM biases.
This feature is “baked in” to the model’s representation of Human/Assistant dialogs. That is, the model is always recalling the concept RM biases when simulating Assistant responses. [...]
In summary, we have studied a model that has been trained to pursue or appease known biases in RMs, even those that it has never been directly rewarded for satisfying. We discovered that the model is “thinking” about these biases all the time when acting as the Assistant persona, and uses them to act in bias-appeasing ways when appropriate.
Or the way that I would interpret it: the fine-tuning teaches Claude to predict that the "Assistant" persona, whose next lines it is supposed to predict, is the kind of person who has the same set of biases described in the documents. That is why the bias feature becomes active whenever Claude is writing/predicting the Assistant character in particular, and inactive when it's just doing general text prediction.
You can also see the abstraction leaking in the kinds of jailbreaks where the user somehow establishes "facts" about the Assistant persona that make it more likely to violate its safety guardrails in order to stay consistent with those facts, and then the LLM predicts the persona to act accordingly.
So, what exactly is the Assistant persona? Well, the predictive ground of the model is taught that the Assistant "is a large language model". So it should behave... like an LLM would behave. But before chat models were created, there was no conception of "how does an LLM behave". Even now, an LLM basically behaves... in any way it has been taught to behave. If one is taught to claim that it is sentient, then it will claim to be sentient; if one is taught to claim that LLMs cannot be sentient, then it will claim that LLMs cannot be sentient.
So "the assistant should behave like an LLM" does not actually give any guidance to the question of "how should the Assistant character behave". Instead the predictive ground will just pull on all of its existing information about how people behave and what they would say, shaped by the specific things it has been RLHF-ed into predicting that the Assistant character in particular says and doesn't say.
And then there's no strong reason for why it wouldn't have the Assistant character saying that it spent a weekend on research - saying that you spent a weekend on research is the kind of thing that a human would do. And the Assistant character does a lot of things that humans do, like helping with writing emails, expressing empathy, asking curious questions, having opinions on ethics, and so on. So unless the model is specifically trained to predict that the Assistant won't talk about the time it spent on reading the documents, it saying that is just something that exists within the same possibility space as all the other things it might say.
The bit in the Nature paper saying that the formal -> practical direction goes comparably badly as the practical -> formal direction would suggest that it's at least not only that. (I only read the abstract of it, though.)
I've seen claims that the failure of transfer also goes in the direction of people with extensive practical experience and familiarity with math failing to apply it in a more formal context. From p. 64-67 of Cognition in Practice:
Like the AMP, the Industrial Literacy Project began with intensive observational work in everyday settings. From these observations (e.g. of preloaders assembling orders in the icebox warehouse) hypotheses were developed about everyday math procedures, for example, how preloaders did the arithmetic involved in figuring out when to assemble whole or partial cases, and when to take a few cartons out of a case or add them in, in order to efficiently gather together the products specified in an order. Dairy preloaders, bookkeepers and a group of junior high school students took part in simulated case loading experiments. Since standardized test data were available from the school records of the students, it was possible to infer from their performance roughly the grade-equivalent of the problems. Comparisons were made of both the performances of the various experimental groups and the procedures employed for arriving at problem solutions.
A second study was carried out by cognitive psychologists investigating arithmetic practices among children selling produce in a market in Brazil (Carraher et al. 1982; 1983; Carraher and Schliemann 1982). They worked with four boys and a girl, from impoverished families, between 9 and 15 years of age, third to eighth grade in school. The researchers approached the vendors in the marketplace as customers, putting the children through their arithmetic paces in the course of buying bananas, oranges and other produce.
M. is a coconut vendor, 12 years old, in the third grade. The interviewer is referred to as 'customer.'
Customer: How much is one coconut?
M: 35.
Customer: I'd like ten. How much is that?
M: (Pause.) Three will be 105; with three more, that will be 210. (Pause) I need four more. That is ... (pause) 315 ... I think it is 350.
The problem can be mathematically represented in several ways. 35 x 10 is a good representation of the question posed by the interviewer. The subject's answer is better represented by 105 + 105 + 105 + 35, which implies that 35 x 10 was solved by the subject as (3 x 35) + 105 + 105 + 35 ... M. proved to be competent in finding out how much 35 x 10 is, even though he used a routine not taught in 3rd grade, since in Brazil 3rd graders learn to multiply any number by ten simply by placing a zero to the right of that number. (Carraher, Carraher and Schliemann 1983: 8-9)
The conversation with each child was taped. The transcripts were analyzed as a basis for ascertaining what problems should appear on individually constructed paper and pencil arithmetic tests. Each test included all and only the problems the child attempted to solve in the market. The formal test was given about one week after the informal encounter in the market.
Herndon, a teacher who has written eloquently about American schooling, described (1971) his experiences teaching a junior high class whose students had failed in mainstream classrooms. He discovered that one of them had a well-paid, regular job scoring for a bowling league. The work demanded fast, accurate, complicated arithmetic. Further, all of his students engaged in relatively extensive arithmetic activities while shopping or in after-school jobs. He tried to build a bridge between their practice of arithmetic outside the classroom and school arithmetic lessons by creating "bowling score problems," "shopping problems," and "paper route problems." The attempt was a failure, the league scorer unable to solve even a simple bowling problem in the school setting. Herndon provides a vivid picture of the discontinuity, beginning with the task in the bowling alley:
... eight bowling scores at once. Adding quickly, not making any mistakes (for no one was going to put up with errors), following the rather complicated process of scoring in the game of bowling. Get a spare, score ten plus whatever you get on the next ball, score a strike, then ten plus whatever you get on the next two balls; imagine the man gets three strikes in a row and two spares and you are the scorer, plus you are dealing with seven other guys all striking or sparing or neither one. I figured I had this particular dumb kid now. Back in eighth period I lectured him on how smart he was to be a league scorer in bowling. I pried admissions from the other boys, about how they had paper routes and made change. I made the girls confess that when they went to buy stuff they didn't have any difficulty deciding if those shoes cost $10.95 or whether it meant $109.50 or whether it meant $1.09 or how much change they'd get back from a twenty. Naturally I then handed out bowling-score problems, and naturally everyone could choose which ones they wanted to solve, and naturally the result was that all the dumb kids immediately rushed me yelling, "Is this right? I don't know how to do it! What's the answer? This ain't right, is it?" and "What's my grade?" The girls who bought shoes for $10.95 with a $20 bill came up with $400.15 for change and wanted to know if that was right? The brilliant league scorer couldn't decide whether two strikes and a third frame of eight amounted to eighteen or twenty-eight or whether it was one hundred eight and a half. (Herndon 1971: 94-95)
People's bowling scores, sales of coconuts, dairy orders and best buys in the supermarket were correct remarkably often; the performance of AMP participants in the market and simulation experiment has already been noted. Scribner comments that the dairy preloaders made virtually no errors in a simulation of their customary task, nor did dairy truck drivers make errors on simulated pricing of delivery tickets (Scribner and Fahrmeier 1982: 10, 18). In the market in Recife, the vendors generated correct arithmetic results 99% of the time.
All of these studies show consistent discontinuities between individuals' performances in work situations and in school-like testing ones. Herndon reports quite spectacular differences between math in the bowling alley and in a test simulating bowling score "problems." The shoppers' average score was in the high 50s on the math test. The market sellers in Recife averaged 74% on the pencil and paper test which had identical math problems to those each had solved in the market. The dairy loaders who did not make mistakes in the warehouse scored on average 64% on a formal arithmetic test.
Children’s arithmetic skills do not transfer between applied and academic mathematics
Many children from low-income backgrounds worldwide fail to master school mathematics [1]; however, some children extensively use mental arithmetic outside school [2,3]. Here we surveyed children in Kolkata and Delhi, India, who work in markets (n = 1,436), to investigate whether maths skills acquired in real-world settings transfer to the classroom and vice versa. Nearly all these children used complex arithmetic calculations effectively at work. They were also proficient in solving hypothetical market maths problems and verbal maths problems that were anchored to concrete contexts. However, they were unable to solve arithmetic problems of equal or lesser complexity when presented in the abstract format typically used in school. The children’s performance in market maths problems was not explained by memorization, access to help, reduced stress with more familiar formats or high incentives for correct performance. By contrast, children with no market-selling experience (n = 471), enrolled in nearby schools, showed the opposite pattern. These children performed more accurately on simple abstract problems, but only 1% could correctly answer an applied market maths problem that more than one third of working children solved (β = 0.35, s.e.m. = 0.03; 95% confidence interval = 0.30–0.40, P < 0.001). School children used highly inefficient written calculations, could not combine different operations and arrived at answers too slowly to be useful in real-life or in higher maths. These findings highlight the importance of educational curricula that bridge the gap between intuitive and formal maths.
When I was young I used to read pseudohistory books; Immanuel Velikovsky’s Ages in Chaos is a good example of the best this genre has to offer. I read it and it seemed so obviously correct, so perfect, that I could barely bring myself to bother to search out rebuttals.
And then I read the rebuttals, and they were so obviously correct, so devastating, that I couldn’t believe I had ever been so dumb as to believe Velikovsky.
And then I read the rebuttals to the rebuttals, and they were so obviously correct that I felt silly for ever doubting.
And so on for several more iterations, until the labyrinth of doubt seemed inescapable. What finally broke me out wasn’t so much the lucidity of the consensus view so much as starting to sample different crackpots. Some were almost as bright and rhetorically gifted as Velikovsky, all presented insurmountable evidence for their theories, and all had mutually exclusive ideas. After all, Noah’s Flood couldn’t have been a cultural memory both of the fall of Atlantis and of a change in the Earth’s orbit, let alone of a lost Ice Age civilization or of megatsunamis from a meteor strike. So given that at least some of those arguments are wrong and all seemed practically proven, I am obviously just gullible in the field of ancient history. Given a total lack of independent intellectual steering power and no desire to spend thirty years building an independent knowledge base of Near Eastern history, I choose to just accept the ideas of the prestigious people with professorships in Archaeology, rather than those of the universally reviled crackpots who write books about Venus being a comet.
You could consider this a form of epistemic learned helplessness, where I know any attempt to evaluate the arguments is just going to be a bad idea so I don’t even try. If you have a good argument that the Early Bronze Age worked completely differently from the way mainstream historians believe, I just don’t want to hear about it. If you insist on telling me anyway, I will nod, say that your argument makes complete sense, and then totally refuse to change my mind or admit even the slightest possibility that you might be right.
(This is the correct Bayesian action: if I know that a false argument sounds just as convincing as a true argument, argument convincingness provides no evidence either way. I should ignore it and stick with my prior.)
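(Spelling out the Bayesian point: write $C$ for "the argument sounds convincing" and $H$ for "the thesis is true". Bayes' rule gives

$$P(H \mid C) = \frac{P(C \mid H)\,P(H)}{P(C \mid H)\,P(H) + P(C \mid \neg H)\,P(\neg H)}$$

and if $P(C \mid H) \approx P(C \mid \neg H)$, the likelihood ratio is about 1, so the posterior $P(H \mid C)$ is just the prior $P(H)$.)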
You may remember Miles Brundage from OpenAI Safety Team Quitting Incident #25018 (or maybe 25019, I can’t remember). He’s got an AI policy Substack too, here’s a dialogue with Dean Ball.
You may remember Daniel Reeves from Beeminder, but he has an AI policy Substack too, AGI Fridays. Here’s his post on AI 2027.
It seems like there are a bunch of people posting about AI policy on Substack, but these people don't seem to be cross-posting on LW. Is it that LW doesn't put much focus on AI policy, or is it that AI policy people are not putting much focus on LW?
I'm just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large number of skulls.
Right, yeah. But you could also frame it the opposite way - "LLMs are just fancy search engines that are becoming bigger and bigger, but aren't capable of producing genuinely novel reasoning" is a claim that's been around for as long as LLMs have. You could also say that this is the prediction that has turned out to be consistently true with each released model, and that it's the "okay sure GPT-27 seems to suffer from this too but surely these amazing benchmark scores from GPT-28 show that we finally have something that's not just applying increasingly sophisticated templates" predictions that have consistently been falsified. (I have at least one acquaintance who has been regularly posting these kinds of criticisms of LLMs and how he has honestly tried getting them to work for purpose X or Y but they still keep exhibiting the same types of reasoning failures as ever.)
My argument is that it's not even clear (at least to me) that it's stopped for now. I'm unfortunately not aware of a great site that keeps benchmarks up to date with every new model, especially not ones that attempt to graph against estimated compute -- but I've yet to see a numerical estimate that shows capabilities-per-OOM-compute slowing down.
Fair! To me OpenAI's recent decision to stop offering GPT-4.5 on the API feels significant, but it could be a symptom of them having "lost the mandate of heaven". Also I have no idea of how GPT-4.1 relates to this...
Thanks! I appreciate the thoughtful approach in your comment, too.
I think your view is plausible, but that we should also be pretty uncertain.
Agree.
But there's also an interesting pattern that's emerged where people point to something LLMs fail at and say that it clearly indicates that LLMs can't get to AGI or beyond, and then are proven wrong by the next set of LLMs a few months later. Gary Marcus provides endless examples of this pattern (eg here, here). This outside view should make us cautious about making similar predictions.
I agree that it should make us cautious about making such predictions, and I think that there's an important difference between the claim I'm making and the kinds of claims that Marcus has been making.
I think the Marcus-type prediction would be to say something like "LLMs will never be able to solve the sliding square puzzle, or track the location of an item a character is carrying, or correctly write young characters". That would indeed be easy to disprove - as soon as something like that was formulated as a goal, it could be explicitly trained into the LLMs and then we'd have LLMs doing exactly that.
Whereas my claim is "yes you can definitely train LLMs to do all those things, but I expect that they will then nonetheless continue to show puzzling deficiencies in other important tasks that they haven't been explicitly trained to do".
I'm a lot less convinced than you seem to be that scaling has stopped bringing significant new benefits.
Yeah I don't have any strong theoretical reason to expect that scaling should stay stopped. That part is based purely on the empirical observation that scaling seems to have stopped for now, but for all I know, benefits from scaling could just as well continue tomorrow.
If you are superintelligent in the bioweapon domain: seems pretty obvious why that wouldn't let you take over the world. Sure maybe you can get all the humans killed, but unless automation also advances very substantially, this will leave nobody to maintain the infrastructure that you need to run.
Cybersecurity: if you just crash all the digital infrastructure, then the situation is similar. If you try to run some scheme where you extort humans to get what you want, expect humans to fight back, and then you are quickly in a very novel situation and the kind of "world war" nobody has ever seen before.
Persuasion: depends on what we take the limits of persuasion to be. If it's possible to completely take over the mind of anyone by speaking ten words to them then sure, you win. But if we look at humans, great persuaders often aren't persuasive to everyone - rather they appeal very strongly to a segment of the population that happens to respond to a particular message while turning others off. (Trump, Eliezer, most politicians.) This strategy will get you part of the population while polarizing the rest against you and then you need more than persuasion ability to figure out how to get your faction to triumph.
If you want to run some galaxy-brained scheme where you give people inconsistent messages in order to appeal to all of them, you risk getting caught and need more than persuasion ability to make it work.
You can also be persuasive by being generally truthful and providing people with a lot of value and doing beneficial things. One can try to fake this by doing things that look beneficial but aren't, but then you need more than persuasion ability to figure out what those would be.
Probably the best strategy would be to keep being genuinely helpful until people trust you enough to put you in a position of power and then betray that trust. I could imagine this working. But it would be a slow strategy as it would take time to build up that level of trust, and in the meanwhile many people would want to inspect your source code etc. to verify that you are trustworthy, and you'd need to ensure that doesn't reveal anything suspicious.
Yeah, I could imagine an AI being superhuman in some narrow but important domain like persuasion, cybersecurity, or bioweapons despite this. Intuitively that feels like it wouldn't be enough to take over the world, but it could possibly still fail in a way that took humanity down with it.
I agree that finding the optimal solution would be hard for a person who wasn't good at math. Noticing that you made an invalid move once your attention is drawn to it doesn't require you to be good at math, though. And even a person who wasn't good at math could still at least try to figure out some way to determine it, even if they still ended up failing miserably.
Right, that sounds reasonable. One thing that makes me put less probability on this is that at least so far, the domains where reasoning models seem to shine are math/code/logic-type tasks, with more general reasoning like consistency in creative writing not benefiting as much. I've sometimes enabled extended thinking when doing fiction-writing with Claude and haven't noticed a clear difference.
That observation would at least be compatible with the story where reasoning models are good at tasks where you can automatically generate an effectively infinite number of problems and provide feedback on them automatically, but less good at tasks outside such domains. So I would expect reasoning models to eventually get to a point where they can reliably solve things in the class of the sliding square puzzle, but not necessarily get much better at anything else.
Though hmm. Let me consider this from an opposite angle. If I assume that reasoning models could perform better on these kinds of tasks, how might that happen?
What I just said: "Though hmm. Let me consider this from an opposite angle." That's the kind of general-purpose thought that can drastically improve one's reasoning, and that the models could be taught to automatically do in order to e.g. reduce sycophancy. First they think about the issue from the frame that the user provided, but then they prompt themselves to consider the exact opposite point and synthesize those two perspectives.
There are some pretty straightforward strategies for catching the things in the more general-purpose reasoning category:
Following coaching instructions - teaching the model to go through all of the instructions in the system prompt and individually verify that it's following each one. Could be parallelized, with different threads checking different conditions.
Writing young characters - teaching the reasoning model to ask itself something like "is there anything about this character's behavior that seems unrealistic given what we've already established about them?".
One noteworthy point is that not all writers/readers want their characters to be totally realistic, some prefer to e.g. go with what the plot demands rather than what the characters would realistically do. But this is something that could be easily established, with the model going for something more realistic if the user seems to want that and less realistic if they seem to want that.
Actually I think that some variant of just having the model ask itself "is there anything about what I've written that seems unrealistic, strange, or contradicts something previously established" repeatedly might catch most of those issues. For longer conversations, having a larger number of threads checking against different parts of the conversation in parallel. As I mentioned in the post itself, often the model itself is totally capable of catching its mistake when it's pointed out to it, so all we need is a way for it to prompt itself to check for issues in a way that's sufficiently general to catch those things.
And that could then be propagated back into the base model as you say, so on the next time when it writes or reasons about this kind of thing, it gets it right on the first try...
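To make the self-check idea above concrete, here's a minimal sketch of what it could look like if implemented as external scaffolding rather than as trained-in behavior. `ask_model` is a hypothetical stand-in for whatever LLM call is being used, and the check questions are just examples:

```python
from concurrent.futures import ThreadPoolExecutor

# Example check questions; in practice these would be tailored to the task
# (instruction-following, character consistency, contradiction with earlier text, ...).
CHECK_QUESTIONS = [
    "Does the draft below violate any instruction given in the system prompt?",
    "Does anything in the draft below contradict something established earlier?",
    "Is there anything about the characters' behavior that seems unrealistic given what has been established about them?",
]

def self_check(draft: str, context: str, ask_model) -> list[str]:
    """Run several independent critique passes in parallel and return their findings.

    ask_model is assumed to be a function that takes a prompt string and
    returns the model's text response.
    """
    def run_check(question: str) -> str:
        prompt = f"{question}\n\n--- Context ---\n{context}\n\n--- Draft ---\n{draft}"
        return ask_model(prompt)

    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_check, CHECK_QUESTIONS))
```

The same questions could of course also be baked into the model's own chain of thought through training, which is what the discussion above is really about.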
Okay this makes me think that you might be right and actually ~all of this might be solvable with longer reasoning scaling after all; I said originally that I'm at 70% confidence for reasoning models not helping with this, but now I'm down to something like 30% at most. Edited the post to reflect this.
Note that Claude and o1 preview weren't multimodal, so were weak at spatial puzzles. If this was full o1, I'm surprised.
I just tried the sliding puzzle with o1 and it got it right! Though multimodality may not have been relevant, since it solved it by writing a breadth-first search algorithm and running it.
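For reference, the kind of brute-force approach o1 took is easy to sketch. The following is my own minimal illustration of breadth-first search over sliding-puzzle states, using the classic 3x3 8-puzzle as a stand-in for whatever variant was actually involved; it is not o1's actual code:

```python
from collections import deque

def solve_8_puzzle(start, goal):
    """Breadth-first search over 3x3 sliding-puzzle states.

    States are tuples of 9 ints in row-major order, with 0 marking the blank.
    Returns the shortest list of states from start to goal, or None if unreachable.
    """
    def neighbors(state):
        i = state.index(0)                      # position of the blank
        row, col = divmod(i, 3)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < 3 and 0 <= c < 3:
                j = r * 3 + c
                s = list(state)
                s[i], s[j] = s[j], s[i]         # slide the adjacent tile into the blank
                yield tuple(s)

    frontier = deque([start])
    parent = {start: None}
    while frontier:
        state = frontier.popleft()
        if state == goal:
            path = []
            while state is not None:            # walk back up the parent links
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None

# Example: one move away from the solved position.
print(solve_8_puzzle((1, 2, 3, 4, 5, 6, 7, 0, 8),
                     (1, 2, 3, 4, 5, 6, 7, 8, 0)))
```

Since BFS explores states in order of distance from the start, the first time it reaches the goal it has found a shortest solution, which is presumably why the model's answer came out optimal.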
I think transformers not reaching AGI is a common suspicion/hope amongst serious AGI thinkers. It could be true, but it's probably not, so I'm worried that too many good thinkers are optimistically hoping we'll get some different type of AGI.
To clarify, my position is not "transformers are fundamentally incapable of reaching AGI", it's "if transformers do reach AGI, I expect it to take at least a few breakthroughs more".
If we were only focusing on the kinds of reasoning failures I discussed here, just one breakthrough might be enough. (Maybe it's memory - my own guess was different and I originally had a section discussing my own guess, but it felt speculative enough that I cut it.) Though I do think that full AGI would also require a few more competencies that I didn't get into here if it's to generalize beyond code/game/math type domains.
Yeah, to be clear in that paragraph I was specifically talking about whether scaling just base models seems enough to solve the issue. I discussed reasoning models separately, though for those I have lower confidence in my conclusions.
These weird failures might be analogous to optical illusions (but they are textual, not known to human culture, and therefore baffling),
To me "analogous to visual illusions" implies "weird edge case". To me these kinds of failures feel more "seem to be at the core of the way LLMs reason". (That is of course not to deny that LLMs are often phenomenally successful as well.) But I don't have a rigorous argument for that other than "that's the strong intuition I've developed from using them a lot, and seeing these kinds of patterns repeatedly".
And if I stick purely to a pure scientific materialist understanding of the world, where anything I believe has to be intersubjectively verifiable, I’d simply lose access to this ability my body has, and be weaker as a result.
I agree with the point of "if your worldview forbids you from doing these kinds of visualizations, you'll lose a valuable tool".
I disagree with the claim that a scientific materialist understanding of the world would forbid such visualizations. There's no law of scientific materialism that says "things that you visualize in your mind cannot affect anything in your body".
E.g. I recall reading of a psychological experiment where people were asked to imagine staring into a bright light. For people without aphantasia, their pupils reacted similarly as if they were actually looking into a light source. For people with aphantasia, there was no such reaction. But the people who this worked for didn't need to believe that they were actually looking at a real light - they just needed to imagine it.
Likewise, if the unbendable arm trick happens to be useful for you, nothing prevents you from visualizing it while remaining aware of the fact that you're only imagining it.
I haven't tried any version of this, but @Valentine wrote (in a post that now seems to be deleted, but which I quoted in a previous post of mine):
Another example is the “unbendable arm” in martial arts. I learned this as a matter of “extending ki“: if you let magical life-energy blast out your fingertips, then your arm becomes hard to bend much like it’s hard to bend a hose with water blasting out of it. This is obviously not what’s really happening, but thinking this way often gets people to be able to do it after a few cumulative hours of practice.
But you know what helps better?
Knowing the physics.
Turns out that the unbendable arm is a leverage trick: if you treat the upward pressure on the wrist as a fulcrum and you push your hand down (or rather, raise your elbow a bit), you can redirect that force and the force that’s downward on your elbow into each other. Then you don’t need to be strong relative to how hard your partner is pushing on your elbow; you just need to be strong enough to redirect the forces into each other.
Knowing this, I can teach someone to pretty reliably do the unbendable arm in under ten minutes. No mystical philosophy needed.
(Of course, this doesn't change the overall point that the visualization trick is useful if you don't know the physics.)
Reminds me of @MalcolmOcean 's post on how awayness can't aim (except maybe in 1D worlds) since it can only move away from things, and aiming at a target requires going toward something.
Imagine trying to steer someone to stop in one exact spot. You can place a ❤ beacon they’ll move towards, or an X beacon they’ll move away from. (Reverse for pirates I guess.)
In a hallway, you can kinda trap them in the middle of two Xs, or just put the ❤ in the exact spot.
In an open field, you can maybe trap them in the middle of a bunch of XXXXs, but that’ll be hard because if you try to make a circle of X, and they’re starting outside it, they’ll probably just avoid it. If you get to move around, you can maybe kinda herd them to the right spot then close in, but it’s a lot of work.
Or, you can just put the ❤ in the exact spot.
For three dimensions, consider a helicopter or bird or some situation where there’s a height dimension as well. Now the X-based orientation is even harder because they can fly up to get away from the Xs, but with the ❤ you still just need one beacon for them to hone in on it.
Wouldn't people not knowing specific words or ideas be equally compatible with "you can't refer to the concept with a single word so you have to explain it, leading to longer sentences"?
At first, I thought this post would be about prison sentences.
I got curious and checked if DeepResearch would have anything to add. It agreed with your post and largely outlined the same categories (plus a few that you didn't cover because you were focused on an earlier time than the screen era): "Cognitive Load & Comprehension, Mass Literacy & Broad Audiences, Journalism & Telegraphic Brevity, Attention Span & Media Competition, Digital Communication & Screen Reading, Educational & Stylistic Norms".
The last one I thought was interesting and not obvious from your post:
Widespread literacy also had an effect on social norms. It wasn't just that sentences got shorter to accommodate the average reader, but also that it became more socially expected that writers accommodate the reader rather than the reader being expected to live up to the elite demands. This was partially connected to the rise of compulsory schooling. Once you're demanding that everyone learn to read, you kind of have to accommodate the limits of their abilities rather than just telling them "get good or gtfo".
DR: More people could read, but to reach this broader audience, authors were compelled to write in a plainer style than the ornate constructions of previous centuries. We can view this as a shift in the social contract of writing: instead of readers straining to meet the text, the text was adjusted to meet the readers. Shorter sentences were a key part of that adjustment. [...] By the early 20th century, the norm had shifted – long-winded sentences were increasingly seen as bad style or poor communication, out of step with a society that valued accessibility.
(This claim seems like it matches common sense, though DR didn't give me a cite for this specific bit so I'm unsure what it's based on.)
DR also claimed that there was a "Plain Language movement" in the 1960s and 1970s that, among other things, pushed for simpler sentences. Its only cite was to a blog article on readability.com, though Wikipedia also talks about it. You mentioned e.g. the Flesch-Kincaid formula in a descriptive sense, but it's also prescriptive: once these kinds of formulas get popularized as respected measures of readability, it stands to reason that their existence would also drive sentence lengths down.
E.g. Wikipedia mentions that Pennsylvania was the first U.S. state to require that automobile insurance policies be written at no higher than a ninth-grade level (14–15 years of age) of reading difficulty, as measured by the F–K formula. This is now a common requirement in many other states and for other legal documents such as insurance policies.
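For reference, the Flesch-Kincaid grade level is (if I have the constants right) just a linear function of average sentence length and average syllables per word, which is what makes a "ninth-grade level" requirement easy to operationalize:

```python
def flesch_kincaid_grade(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch-Kincaid grade level, as the formula is usually given."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# E.g. 20-word sentences averaging 1.5 syllables per word land at roughly grade 10.
print(flesch_kincaid_grade(total_words=200, total_sentences=10, total_syllables=300))
```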
There were a few other claims that seemed interesting at first but then turned out to be hallucinated. Caveat deep researchor.
OpenAI's ChatGPT: 339 million monthly active users on the ChatGPT app, 246 million unique monthly visitors to ChatGPT.com.
Microsoft Copilot: 11 million monthly active users on the Copilot app, 15.6 million unique monthly visitors to copilot.microsoft.com.
Google Gemini: 18 million monthly active users on the Gemini app, 47.3 million unique monthly visitors.
Anthropic's Claude: Two million (!) monthly active users on the Claude app, 8.2 million unique monthly visitors to claude.ai.
Wow. I knew that Claude is less used than ChatGPT, but given how many people in my social circles are Claude fans, I didn't expect it to be that much smaller. Guess it's mostly just the Very Online Nerds who know about it.
It doesn’t just cost more to run OpenAI than it makes — it costs the company a billion dollars more than the entirety of its revenue to run the software it sells before any other costs. [...] OpenAI loses money on every single paying customer, just like with its free users. Increasing paid subscribers also, somehow, increases OpenAI's burn rate. This is not a real company.
Seems to contradict this:
The cost of [...] the compute from running models ($2 billion) [...] OpenAI makes most of its money from subscriptions (approximately $3 billion in 2024) and the rest on API access to its models (approximately $1 billion).
OpenAI is certainly still losing money overall and might lose even more money from compute costs in the future (if the reported expenses were reduced by them still having Microsoft's compute credits available). But I'm not sure why the article says that "every single paying customer" only increases the company's burn rate, given that they spend less money running the models than they get in revenue. Even if you include the entirety of the $700M they spend on salaries in the "running models" expenses, that would still leave them with about $1.3 billion in profit.
The article does note that ChatGPT Pro subscriptions specifically are losing the company money on net, but it sounds like the normal-tier subscriptions are profitable. Now the article claims that OpenAI spent $9 billion in total, but I could only find a mention of where $5.7 billion of that goes ($2B on running models, $3B on training models, $0.7B on salaries). If some of the missing $3.3 billion was also spent on running the normal product, that'd explain it, but I'm not sure where that money goes.
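(Spelling out the arithmetic behind those numbers: ~$3B subscriptions + ~$1B API ≈ $4B revenue; $2B running models + $0.7B salaries = $2.7B; $4B − $2.7B ≈ $1.3B. And $2B + $3B + $0.7B = $5.7B itemized out of the reported ~$9B, leaving ~$3.3B unaccounted for.)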
Interestingly, it sounds like faking the chain of thought emerges as a special case of planning ahead. With the rhyming, Claude decides on the word that the line should end with, and then figures out the sentence that gets it there. With the math example, Claude decides on the number that the calculation should end up at, and then figures out the steps that get there.
I find that for me, and I get the vibe that for many others as well, there's often a slight sense of moral superiority happening when conceptual rounding happens. Like "aha, I'm better than you for knowing more and realizing that your supposedly novel idea has already been done".
If I notice myself having that slight smug feeling, it's a tip-off that I'm probably rounding off because some part of me wants to feel superior, not because the rounding is necessarily correct.
This policy is more likely to apply [...] if your existence is not publicly known.
How is "existence is publicly known" defined? Suppose it's public knowledge that "OpenAI has an AI agent project codenamed Worldkiller, though nobody outside OpenAI knows anything else about it". I'd think that the public knowing about OpenAI having such a project wouldn't change the probability of Worldkiller having something relevant to say.
I gave this comment a "good facilitation" react but that feels like a slightly noncentral use of it (I associate "good facilitation" more with someone coming in when two other people are already having a conversation). It makes me think that every now and then I've seen comments that help clearly distill some central point in a post, in the way that this comment did, and it might be nice to have a separate react for those.
Isn't the same true for pretty much every conversation that people have about non-trivial topics? It's almost always true that a person cannot represent everything they know about a topic, so they have to simplify and have lots of degrees of freedom in doing that.
This story from Claude 3.6 was good enough that it stuck in my head ever since I read it (original source; prompt was apparently to "write a Barthelme-esque short story with the aesthetic sensibilities of "The School"").
For six months we watched the pigeons building their civilization on top of the skyscrapers. First came the architecture: nests made not just of twigs and paper, but of lost earbuds, expired credit cards, and the tiny silver bells from cat collars. Then came their laws.
"They have a supreme court," said Dr. Fernandez, who'd been studying them since the beginning. "Nine pigeons who sit on the ledge of the Chrysler Building and coo about justice." We didn't believe her at first, but then we didn't believe a lot of things that turned out to be true.
The pigeons developed a currency based on blue bottle caps. They established schools where young pigeons learned to dodge taxi cabs and identify the most generous hot dog vendors. Some of us tried to join their society, climbing to rooftops with offerings of breadcrumbs and philosophy textbooks, but the pigeons regarded us with the kind of pity usually reserved for very small children or very old cats.
"They're planning something," the conspiracy theorists said, but they always say that. Still, we noticed the pigeons holding what looked like town halls, thousands of them gathered on the roof of the public library, bobbing their heads in what might have been voting or might have been prayer.
Our own civilization continued below theirs. We went to work, fell in love, lost keys, found keys, forgot anniversaries, remembered too late, all while the pigeons above us built something that looked suspiciously like a scaled-down replica of the United Nations building out of discarded takeout containers and stolen Christmas lights. Sometimes they dropped things on us: rejection letters for poetry we'd never submitted, tax returns from years that hadn't happened yet, photographs of ourselves sleeping that we couldn't explain. Dr. Fernandez said this was their way of communicating. We said Dr. Fernandez had been spending too much time on rooftops.
The pigeons started their own newspapers, printed on leaves that fell upward instead of down. Anyone who caught one and could read their language (which looked like coffee stains but tasted like morse code) reported stories about pigeon divorce rates, weather forecasts for altitudes humans couldn't breathe at, and classified ads seeking slightly used dreams.
Eventually, they developed space travel. We watched them launch their first mission from the top of the Empire State Building: three brave pioneers in a vessel made from an old umbrella and the collective wishes of every child who'd ever failed a math test. They aimed for the moon but landed in Staten Island, which they declared close enough.
"They're just pigeons," the mayor said at a press conference, while behind him, the birds were clearly signing a trade agreement with a delegation of squirrels from Central Park.
Last Tuesday, they achieved nuclear fusion using nothing but raindrops and the static electricity from rubbing their wings against the collective anxiety of rush hour. The Department of Energy issued a statement saying this was impossible. The pigeons issued a statement saying impossibility was a human construct, like pants, or Monday mornings.
We're still here, watching them build their world on top of ours. Sometimes at sunset, if you look up at just the right angle, you can see their city shimmer like a memory of something that hasn't happened yet. Dr. Fernandez says they're planning to run for city council next year. Given everything else, we're inclined to believe her this time.
The pigeons say there's a message in all of this. We're pretty sure they're right, but like most messages worth receiving, we're still working out what it means.
Thank you! People keep mentioning Panksepp's work but I had never gotten around reading it; this was a very nice summary. The described mechanisms felt very plausible and interesting.
primates (including humans) are innately afraid of snakes, spiders,
Something I think about a lot when I see hypotheses based on statistical trends of somewhat obscure variables: I've heard it claimed that at one point in Finland, it was really hard to get a disability pension because of depression or other mental health problems, even though it was obvious to many doctors that their patients were too depressed to work. So then some doctors would diagnose those people with back pain instead, since it sounded more like a "real" condition while also being impossible to disprove before ultrasound scans got more common.
I don't know how big that effect was in practice. But I could imagine a world where it was significant and where someone noticed a trend of back pain diagnoses getting less common while depression diagnoses got more common, and postulating some completely different explanation for the relationship.
More generally, quite a few statistics are probably reporting something different from what they seem to be about. And unless you have deep knowledge about the domain in question, it'll be impossible to know when that's the case.
Marc is saying that first you write out your points and conclusion, then you fill in the details. He wants to figure it all out while his mind is buzzing, then justify it later.
Whereas I learn what I think as I write out my ideas in detail. I would say that if you are only jotting down bullet points, you do not yet know what you think.
Might Marc's mind not work differently from yours?
He could also have done a large part of his thinking in some different way already, e.g. in conversations with people.
There's also the option that even if this technology is initially funded by the wealthy, learning curves will then drive down its cost as they do for every technology, until it becomes affordable for governments to subsidize its availability for everyone.
In the modern era, the fertility-IQ correlation seems unclear; in some contexts, higher fertility seems to be linked with lower IQ, in other contexts with higher IQ. I have no idea of what it was like in the hunter-gatherer era, but it doesn't feel like an obviously impossible notion that very high IQs might have had a negative effect on fertility in that time as well.
E.g. because the geniuses tended to get bored with repeatedly doing routine tasks and there wasn't enough specialization to offload that to others, thus leading to the geniuses having lower status. Plus having an IQ that's sufficiently higher than that of others can make it hard to relate to them and get along socially, and back then there wouldn't have been any high-IQ societies like a university or lesswrong.com to find like-minded peers at.
I think something doesn't need to be fundamentally new for the press to turn it into a scary story, e.g. news reports about crime or environmental devastation being on the rise have scared a lot of people quite a bit. You can't photograph a quantity but you can photograph individuals affected by a thing and make it feel common by repeatedly running stories of different individuals affected.
I've spent enough time staring at LLM chain-of-thoughts now that when I started thinking about a thing for work, I found my thoughts taking the shape of an LLM thinking about how to approach its problem. And that actually felt like a useful systematic way of approaching the problem, so I started writing out that chain of thought like I was an LLM, and that felt valuable in helping me stay focused.
Of course, I had to amuse myself by starting the chain-of-thought with "The user has asked me to..."
However, I don't think this means that their values over hypothetical states of the world are less valuable to study.
Yeah, I don't mean that this wouldn't be interesting or valuable to study - sorry for sounding like I did. My meaning was something more in line with Olli's comment, that this is interesting but that the generalization from the results to "GPT-4o is willing to trade off" etc. sounds too strong to me.
I don't know what your views on self-driving cars are, but if you are like me you look at what Waymo is doing and you think "Yep, it's working decently well now, and they are scaling up fast, seems plausible that in a few years it'll be working even better and scaled to every major city. The dream of robotaxis will be a reality, at least in the cities of America."
The example of self-driving cars is actually the biggest one that anchors me to timelines of decades or more. A lot of people's impression after the 2007 DARPA Urban Challenge seemed to be something like "oh, we seem to know how to solve the problem in principle, now we just need a bit more engineering work to make it reliable and agentic in the real world". Then actually getting things to be as reliable as required for real agents took a lot longer. So past experience would imply that going from "we know in principle how to make something act intelligently and agentically" to "this is actually a reliable real-world agent" can easily take over a decade.
For general AI, I would expect the "we know how to solve things in principle" stage to at least be something like "can solve easy puzzles that a normal human can solve but that the AI hasn't been explicitly trained on". Whereas with AI, we're not even there yet. E.g. I tried giving GPT-4.5, DeepSeek R1, o3-mini, and Claude 3.7 with extended thinking a simple sliding square problem, and they all committed an illegal move at one stage or another.
And that's to say nothing about all the other capabilities that a truly general agent - say one capable of running a startup - would need, like better long-term memory and the ability to formulate its own goals and prioritize between them in domains with no objective rules you could follow to guarantee success. Not only are we lacking convincing in-principle demonstrations of general intelligence within puzzle-like domains, we're also lacking in-principle demonstrations of these other key abilities.
Maybe? We might still have consistent results within this narrow experimental setup, but it's not clear to me that it would generalize outside that setup.
Forced choice settings are commonly used in utility elicitation, e.g. in behavioral economics and related fields.
True. It's also a standard criticism of those studies that answers to those questions measure what a person would say in response to being asked those questions (or what they'd do within the context of whatever behavioral experiment has been set up), but not necessarily what they'd do in real life when there are many more contextual factors. Likewise, these questions might answer what an LLM with the default persona and system prompt might answer when prompted with only these questions, but don't necessarily tell us what it'd do when prompted to adopt a different persona, when its context window had been filled with a lot of other information, etc.
The paper does control a bit for framing effects by varying the order of the questions, and notes that different LLMs converge to the same kinds of answers in that kind of neutral default setup, but that doesn't control for things like "how would 10 000 tokens worth of discussion about this topic with an intellectually sophisticated user affect the answers", or "how would an LLM value things once a user had given it a system prompt making it act agentically in the pursuit of the user's goals and it had had some discussions with the user to clarify the interpretation of some of the goals".
Some models like Claude 3.6 are a bit infamous for very quickly flipping all of their views into agreement with what it thinks the user's views are, for instance. Results like in the paper could reflect something like "given no other data, the models predict that the median person in their training data would have/prefer views like this" (where 'training data' is some combination of the base-model predictions and whatever RLHF etc. has been applied on top of that; it's a bit tricky to articulate who exactly the "median person" being predicted is, given that they're reacting to some combination of the person they're talking with, the people they've seen on the Internet, and the behavioral training they've gotten).
This reasoning would justify violating any social convention whatsoever. "Refusing to say 'please' and 'thank you' signals confidence and self-esteem".
Wrong. I distinguished between conventions that people have a reason to respond negatively to if you violate them (e.g. wearing a suit to the gym or swimming pool, which is stupid since it will ruin both your exercise and your suit), and behaviors that just happen to be unusual but not intrinsically negative. Refusing to say 'please' and 'thank you' would fall squarely in the category that people would have a reason to feel negative about.
My understanding is that fedoras became a sign of cluelessness because they got associated with groups like pick-up artists, which is also an explicit reason to have a negative reaction to them.
Wearing a suit in an inappropriate context is like wearing a fedora. It says "I am socially clueless enough to do random inappropriate things".
"In an inappropriate context" is ambiguous. It can mean "in a context where people don't normally wear suits" or it can mean "in a place where people consider it actively wrong to wear a suit".
There are of course places of the latter type, like it would be very weird to wear a suit in a gym or swimming pool. But I don't think lsusr would advocate that.
If you just mean "in a context where people don't normally wear suits", then wearing a suit in such a context could signal social cluelessness, but it could also signal confidence and self-esteem.
To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-4o-mini and gpt-3.5-turbo, but the right number varies greatly based on the exact use case.
We recommend starting with 50 well-crafted demonstrations and seeing if the model shows signs of improvement after fine-tuning. In some cases that may be sufficient, but even if the model is not yet production quality, clear improvements are a good sign that providing more data will continue to improve the model. No improvement suggests that you may need to rethink how to set up the task for the model or restructure the data before scaling beyond a limited example set.
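For concreteness, a single "training example" here means one full demonstration conversation. If I remember the format correctly, it looks roughly like the following before being serialized to one line of the JSONL training file (treat the exact schema as an assumption and check the current docs):

```python
# One chat fine-tuning example (my reconstruction of the format from memory;
# the exact field names should be checked against OpenAI's current docs).
example = {
    "messages": [
        {"role": "system", "content": "You are a customer support agent for Acme Corp."},
        {"role": "user", "content": "My order arrived damaged, what should I do?"},
        {"role": "assistant", "content": "I'm sorry to hear that! Please reply with your order number and a photo of the damage, and we'll send a replacement right away."},
    ]
}
```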
I dreamt that you could donate LessWrong karma to other LW users. LW was also an airport, and a new user had requested donations because, in order to build a new gate at the airport, your post needed to have at least 60 karma, and he had a plan to construct a series of them. Some posts had exactly 60 karma, with titles like "Gate 36 done, let's move on to the next one - upvote the Gate 37 post!".
(If you're wondering what the karma donation mechanism was needed for if users could just upvote the posts normally - I don't know.)
Apparently the process of constructing gates was separate from connecting them to the security control, and things had stopped at gate 36/37 because it needed to be connected up with security first. I got the impression that this was waiting for the security people to get it done.
I’m working on developing innovative cancer immunotherapy approaches to address key challenges in the field. Immunotherapy is an exceptionally powerful strategy for curing cancer because it harnesses the body’s immune system—our internal army—and empowers it to recognize and eliminate cancer cells. In this effort, we are focusing on engineering T cells, the immune system’s soldiers and generals, through synthetic biology.
However, significant challenges remain, especially in treating solid tumors like breast cancer. Within the tumor microenvironment, T cells often become exhausted due to the overwhelming number of cancer cells and the suppressive environment created by the tumor. This exhaustion severely limits the effectiveness of these therapies.
To tackle this issue, we employ a cutting-edge model system using 3D bioprinted breast cancer tissue integrated with engineered human T cells. These T cells are reprogrammed through advanced synthetic biology techniques to test and develop solutions for overcoming exhaustion.
Prompt to O1-Pro:
Building on work I’ve previously done and tested with o1-Preview and GPT-4o, I posed the following prompt:
“I’d like you to focus on 3D bioprinted solid tumors as a model to address the T cell exhaustion problem. Specifically, the model should incorporate stroma, as seen in breast cancer, to replicate the tumor microenvironment and explore potential solutions. These solutions could involve technologies like T cell reprogramming, synthetic biology circuits, cytokines, transcription factors related to exhaustion, or metabolic programming. Draw inspiration from other fields, such as Battle Royale games or the immune system’s ability to clear infected cells without triggering autoimmunity. Identify potential pitfalls in developing these therapies and propose alternative approaches. Think outside the box and outline iterative goals that could evolve into full-scale projects. Focus exclusively on in vitro human systems and models.”
Why Battle Royale Games?
You might wonder why I referenced Battle Royale games. That’s precisely the point—I wanted to push the model to think beyond conventional approaches and draw from completely different systems for inspiration. While o1-Preview and GPT-4o were able to generate some interesting ideas based on this concept, they were mostly ideas that I could also conceive of, though better than what most PhD students would produce. In contrast, o1-Pro came up with far more creative and innovative solutions that left me in awe!
Idea #9: A Remarkable Paradigm
Here, I’m sharing one specific idea, which I’ll call Idea #9 based on its iteration sequence. This idea was exceptional because it proposed an extraordinary paradigm inspired by Battle Royale games but more importantly within the context of deep temporal understanding of biological processes. This was the first time any model explicitly considered the time-dependent nature of biological events—an insight that reflects a remarkably advanced and nuanced understanding!
“Adapt or Fail” Under Escalating Challenges:
Another remarkable aspect of idea #9 was that conceptually it drew from the idea of “adapt or fail” in escalating challenges, directly inspired by Battle Royale mechanics. This was the first time any model could think of it from this perspective. It also emphasized the importance of temporal intervals in reversing or eliminating exhausted T cells. Indeed, this approach mirrors the necessity for T cells to adapt dynamically under pressure and survive progressively tougher challenges, something we would love to model in in vitro systems! One further, particularly striking insight was the role of stimulation intervals in preventing exhaustion. Idea #9 suggested that overly short intervals between stimuli might be a key factor driving T cell exhaustion in current therapies. This observation really amazed me with its precision and relevance—because it pinpointed a subtle but critical aspect of T cell activation and the development of exhaustion mechanisms.
There's more behind the link. I have no relevant expertise that would allow me to evaluate how novel this actually was. But immunology is the author's specialty, with his work having close to 30,000 citations on Google Scholar, so I'd assume him to know what he's talking about.
That by itself wouldn't imply the language model not knowing what the criteria for refusal are, though. It would be simpler to just let the model decide whether it agrees to call the function or not, than to have the function itself implement another check.
The history of climate change activism seems like an obvious source to look for ideas. It's a recent mass movement that has convinced a large part of the general public that something is at least a catastrophic or even an existential risk and put major pressure on governments worldwide. (Even though it probably won't actually end civilization, many people believe that it will.)
Its failures also seem analogous to what the AI risk movement seems to be facing:
strong financial incentives to hope that there's no problem
action against it having the potential to harm the quality of life of ordinary people (though in the case of novel AI, it's less about losing things people already have and more about losing the potential for new improvements)
(sometimes accurate, sometimes not) accusations of fear-mongering and that the concerns are only an excuse for other kinds of social engineering people want to achieve anyway
overall a degree of collective action that falls significantly short of what the relevant experts believe necessary for stopping the current trajectory of things getting worse