Comments
This sounds like both an alignment and a capabilities problem.
AI 2027-style takeoffs do not look plausible when you can't extract reliable work from models.
Have you read: https://www.lesswrong.com/s/SqFbMbtxGybdS2gRs
Your comment about 1e-6 p-doom is not right because we face many other X-risks that developing AGI would reduce.
Otherwise yeah, I'm on board with the mood of your post.
Personally I really like doing math/philosophy and I have convinced myself that it is necessary to avert doom. At least I’m not accelerating progress much!
Unfortunate consequence of sycophantic ~intelligent chatbots: everyone can get their theories parroted back to them and validated. Particularly risky for AGI, where the chatbot can even pretend to be running your cognitive architecture. Want to build a neuro-quantum-symbolic-emergent-consciousness-strange-loop AGI? Why bother, when you can just put that all in a prompt!
And as long as they keep stumbling around like this, I will remain skeptical of AGI arriving in a few years.
I wouldn't take those markets too seriously. The resolution criteria aren't clear and some years have fewer than 100 traders. Also I just moved some of them down a couple of percentage points.
This is the kind of comment that becomes harder to take at face value (from you) after reading your dissent on honesty.
I seem to recall EY once claiming that insofar as any learning method works, it is for Bayesian reasons. It just occurred to me that even after studying various representation and complete class theorems I am not sure how this claim can be justified - certainly one can construct working predictors for many problems that are far from explicitly Bayesian. What might he have had in mind?
I agree that if I could produce a wonderful truthseeking society by telling a few lies it would be worth it, I just think that extreme sincere honesty is a better path for predictable first and second order reasons.
I don’t think you got it.
What did you think about my objection to the Flynn example, or the value of the rationalist community as something other than an autism support group? I feel like you sort of ignored my stronger points and then singled out the widget job interview response because it seems to miss the point, but without engaging with my explanation of how it doesn't miss the point. The way you constructed the hypothetical, there was plenty of time to come up with an honest way to talk about how much he enjoyed widgets.
One of the things I value is people knowing and understanding the truth, which I find to be a beautiful thing. It's not because someone told me to be honest at some point, it's because I've done a lot of mathematics and read a lot of books and observed that the truth is beautiful.
I also wouldn’t shoot someone so I could tell someone else the truth. I don’t know where you got these numbers.
I suppose I’m not completely longtermist about my pursuit of truth, but I’m not completely longtermist about my other values either - sometimes the short term is easier to predict and get feedback from etc.
This was fun to read, but I am not sure that it contains suggestions for alignment that we can implement, beyond the technical ideas we already have.
Does seem like good relationship advice though.
This was heavily downvoted and the tone is in fact off but I think there is a little sliver of truth to it.
I think about this topic a lot, and I appreciate your dissent, particularly since it helped me organize my thoughts a little. That said, I think you're almost completely wrong. The best way to get at the problem is probably to start with your examples. Not exactly in order, sorry.
The lesson here is likewise clear: If your actual personality isn't good enough, pretend to be Flynn Rider to everyone at all times, with the sole carve-out being people who love you, like your mother or a princess. This works because people who love you will find your openness endearing, whereas everyone else will think you pathetic and use it against you.
Here's a true story. I once met a lovely and intelligent woman who didn't like that I'm a bit blunt and ruthlessly truthseeking. I didn't stop being that way, and mainly for that reason we didn't become romantically involved. A few months later I met a lovely, intelligent, reasonable, sometimes blunt, and open-minded woman who did like that I'm a bit blunt and ruthlessly truthseeking. We've been dating for 2.5 years now and I'm on balance very happy with how everything worked out.
A Paragon of Morality is out travelling when he is beset by bandits. They demand he hand over his gold or they will kill him and take it from his corpse. This is not a decision-theoretic threat because the bandits value getting his gold more than they disprefer committing murder, but would otherwise avoid the murder if possible. If he hands over all his gold he will lose all his gold. If he hands over all the gold in his pockets, neglects the extra he has hidden in his sock, and says "I have given you all my gold" in a sufficiently convincing tone of voice, then he will lose less than all his gold.
This isn't Omega we're dealing with here; they're totally trickable by a moderately convincing performance. If he keeps some of the gold he can donate it to GiveWell-approved charities and save however many QALYs or whatever.
Does he have a moral obligation to lie?
He certainly doesn't have a moral obligation to tell the truth. But a lot of moral obligations change when someone points a gun at you. For instance, it becomes morally permissible (though not necessarily feasible) to shoot at them, or to give up what money you must and later steal it back at the first available opportunity. To me, the truth is something precious, and lying is like stealing the truth; it's permissible in some extreme and usually adversarial situations. With that said, I'm a bit of a rationalist dedicate/monk and I'd prefer to fight than lie - however I don't think everyone is rationally or otherwise compelled to follow suit, for reasons that will be further explained.
A Normally Honest Man is applying for a job as a Widget Designer. He has many years of industry experience in Widget Engineering. He has memorised the Widget Manufacturing Process. He's actually kind of obsessed with Widgets. Typically whenever a conversation becomes about Widgets he gushes openly and makes a bad impression with his in-laws. Since that incident he has developed the self control to pretend otherwise, and the rest of his personality is okay.
The interviewer works for a Widget Manufacturing company but seems to only care about Widgets a normal amount. He asks "How interested are you in Widgets?" He has learnt from previous job interviews that, if he answers honestly, the interviewer will think he is any of lying, insane, or too weird to deal with, and not hire him, even though this is not in the best financial interests of the company, were they fully informed.
Should he pretend to like widgets the amount most likely to get him hired, or does he have a moral obligation to keep answering honestly until he runs out of rent money and becomes homeless?
I don't know, he could say "Honestly, I enjoy designing widgets so much that others sometimes find it strange!" That would probably work fine. I think you can actually get away with a bit more if you say "honestly" first and then are actually sincere. This would also signal social awareness.
I realize that I am in some sense dodging your hypothetical but I think your hypothetical is the problem. You haven't thought hard enough about how this guy can succeed without lying.
A Self-Improvement and Epistemics Nerd has an online community for Self-Improvement and Epistemics Nerds. Half the people reading it are autists with bad social skills, who weren't at exactly the right age to be saved by Disney's Tangled. They struggle with navigating ordinary social situations and obtaining true beliefs because they're bad at lying, and insufficiently aggressive at anticipating it in others.
Would they be doing anyone a favour in encouraging a social norm of truthfulness and the expectation of truthfulness in others, when all those people will inevitably have to leave the computer one day and end up like the subjects of the previous two examples? Would they be making the world a better place?
Yes and yes.
Contrary to common belief, lesswrong is not an autism support group.
And you know what? I think it made the world much better. Now we have places online and in the real world (Lighthaven, meetups, Berkeley) to gather and form a community around truthseeking and rationality. I like it. I'm glad it exists. I even think some important and powerful ideas have come out of it, and I think we've learned a lot together.
Saying words is just an action, like any other action. Whether the words are literally true or not is just a fact about the action, like any other fact about an action. It's not the morally important fact. You judge actions by their consequences, whether you expect it to lead to more good or bad. Then you take the action with the best consequences overall.
Saying words is an action, but it's not like any other action, because it can guide others towards or away from the truth. Similarly, torture is an action, but it's not like any other action, because it is when one person causes another immense pain intentionally.
Sure, we judge actions by their consequences, but we do not judge all actions in the same way. Some of them are morally repugnant, and we try very, very hard to never take them unless our hands are forced, and then only take them with immense regret and sorrow. There are various distinguishing factors. For instance, the consequences of torture seem likely to be almost always bad, so I never seriously consider it. Also, I don't want to be the sort of person who tortures people (both for instrumental reasons and to some extent for intrinsic reasons). It's actually pretty hard to fully disentangle my dispreference for torture from its consequences, because torture is inherently about causing suffering, and I neither want suffering to exist nor want to cause it (though the former is far more important to me).
My feelings about lying are the same. I love the truth, I love the truthseeking process, I love seeing curiosity in the eyes of children and adults and kittens. I hate lies, confusion, and deceiving others. This is partially because the truth is really useful for agents (and I like agents to be able to exercise their potential, typically), it's partially because telling the truth seems to be best for me in most cases, and it's partially because I just value truth.
Rationality can be about Winning, or it can be about The Truth, but it can't be about both. Sooner or later, your The Truth will demand you shoot yourself in the foot, while Winning will offer you a pretty girl with a country-sized dowry. The only price will be presenting various facts about yourself in the most seductive order instead of the most informative one.
It can totally be about both if truth is part of winning. Yes, there are sometimes tradeoffs, and truth is not the singular source of value. But I think most of us value it very strongly, so presenting these two axes as orthogonal is highly misleading. And I want to share the truth with other people in case they decide to value it too - if not, they can always choose not to face it.
Also, there's a missing mood in your example. When you value the truth, being honest tends to get you a lot of other things that you value; you tend to end up surrounded by the right people for you, being the kind of person you can respect, in the kind of place where you belong, even if you have to create it.
Now, you're probably going to say that I can't convince you by pure reason to intrinsically value the truth. That's right. However, I also can't convince you by pure reason to intrinsically value literally anything, and if you had written an essay about how we should consider killing or torturing people because it's just an action like any other, I would have objected on similar grounds. You're totally missing the fact that it's wrong, and also (separately!) the consequences of following your advice would probably be bad for you, and certainly for most of us, over the long run.
I've finally gotten around to reading the paper from Seidenfeld et al. that you cited. I am also surprised this isn't more frequently discussed as an approach to logical/computational uncertainty.
I agree that the consistency/coherence requirement must be relaxed in a descriptive theory of choice behavior; humans cannot be expected to take perfectly coherent actions. It may be difficult or impossible to define an "unnatural" sense in which we are coherent, and it's not clear that this is even desirable. I see that this is connected to the strategy of reducing incoherence by Bayesian updating, which is neat. It also reminds me of something Gilboa wrote, section 13.3.2, page 108 of these lecture notes. Gilboa's approach is to define stricter coherent probabilities/preferences which are not "complete" and are distinguished from those elicited through choice behavior - the consequence is usually some form of imprecise probability theory, which as we've discussed usually seems arbitrary. I am initially less skeptical of Seidenfeld et al.'s approach.
However, I think there may be something missing here - the model of Bayesian learning to reduce incoherence requires a coherent likelihood function, which allows certain types of learning about mathematical constants (through Monte-Carlo methods) but perhaps not the most important and powerful types. We should be able to leverage other known (or suspected) mathematical facts to constrain our expectations even when those beliefs are slightly or even seriously incoherent. I think the missing piece is an algorithm to do this. That is, by allowing incoherence but hoping to correct it, we're leaving out the "core engine" of cognition, with Garrabrant induction as the only fleshed-out possibility I am aware of. Perhaps the implication is that it is just Bayes all the way down - YOU (as Seidenfeld addresses the bounded reasoner) decide what to think about and how to update in a sort of meta-cognitive Bayesian fashion, by reasoning about what types of beliefs and updating make sense from theory or experience. It's not clear to me what sort of prior beliefs and other algorithmic details allow this process to converge, which of course ties in very closely with this post! So I think the central question is still left open by Seidenfeld et al.
I read the paper, but it seems the slides go into much more depth on increasing coherence via Bayes rule. It appears that coherence can be increased by redistributing "unused" probability mass in an appropriate way, and a Bayes-like update is one possible method?
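To check that I'm picturing the mechanism correctly, here's a toy example of what I take "redistributing unused probability mass" to mean (my own illustration, not anything from the paper or slides):

```python
# Toy illustration: an agent's probabilities over an exhaustive partition
# sum to less than 1, leaving some mass "unused". Renormalizing -- i.e.
# conditioning on the sure event, a Bayes-like update -- restores coherence.
incoherent = {"A": 0.3, "not A": 0.5}  # sums to 0.8, so 0.2 is "unused"

total = sum(incoherent.values())
coherent = {event: p / total for event, p in incoherent.items()}

print(coherent)  # {'A': 0.375, 'not A': 0.625}, which now sums to 1
```

Presumably the interesting cases involve further constraints on how the mass gets redistributed rather than simple proportional renormalization, but this is the simplest version I can picture.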
Unfortunately, it seems that my action space doesn’t include options that matter in this current battle. Personally, my reaction to this kind of insanity is to keep climbing my local status/influence/wealth/knowledge gradient, in the hopes that my actions are relevant in the future. But perhaps it’s a reason to prioritize gaining power - this reminds me of https://www.lesswrong.com/posts/ottALpgA9uv4wgkkK/what-are-you-getting-paid-in
Highlight the text, right-click, then click the little diagonal line thing that usually symbolizes links.
None of this in any way proves or even supports belief in the existence of God. It seems just as believable as an argument for agnosticism.
This comment doesn’t seem to make much sense - that is, it doesn’t seem nonsensical, but seems to be missing enough context that I don’t know how your statements are meant to fit together.
I’m pretty sure I learned to do this based on a simple mechanistic description (without detailed physics) and practice.
Yeah, I was thinking greater effort is actually necessary in this case. For context, my lower-effort posts are usually more popular. Also the ones that focus on LLMs, which is really not my area of expertise.
Particularly after my last post, I think my lesswrong writing has had a bit too high of a confidence/effort ratio. Possibly I just know the norms of this site well enough lately that I don't feel as much pressure to write carefully. I think I'll limit my posting rate a bit while I figure this out.
I am saying so! The market is definitely not pricing in AGI; doesn't matter if it comes in 2028, or 2035, or 2040. Though interest rates are a pretty bad way to arb this; I would just buy call options on the Nasdaq.
Hmm well at least you're consistent.
They're not that useful yet.
Certainly I can see why you expect them to become more useful, but I still feel like there's some circularity here. Do you expect the current paradigm to continue advancing because LLM agents are somewhat useful now (as you said, for things like coding)? Unless that effect is currently negligible (and will undergo a sharp transition at some point) it seems we should expect it to already be reflected in the exponential growth rate claimed by METR.
You're right, but creating unexpected new knowledge is not a PhD requirement. I expect it's pretty rare that a PhD student achieves that level of research.
I do weakly expect it to be necessary to reach AGI though. Also, I personally wouldn't want to do a PhD that didn't achieve this!
It wasn't a great explanation, sorry, and there are definitely some leaps, digressions, and hand-wavy bits. But basically: Even if current AI research were all blind mutation and selection, we already know that that can yield general intelligence from animal-level intelligence, because evolution did it. And we already have various examples of how human research can apply much greater random and non-random mutation, larger individual changes, higher selection pressure in a preferred direction, and more horizontal transfer of traits than evolution can, enabling (very roughly estimated) ~3-5 OOMs greater progress per generation with fewer individuals and shorter generation times.
Okay, then I understand the intuition but I think it needs a more rigorous analysis to even make an educated guess either way.
Saw your edit above, thanks.
No, thank you!
Thanks for engaging in detail with my post. It seems there were a few failures of communication that are worth clarifying.
It's (the outside view) not really a well-defined thing, which is why the standard on this site is to taboo those words and just explain what your lines of evidence are, or the motivation for any special priors if you have them.
I thought it was clear that I'm not confident in any outside view prediction of AGI timelines, from various statements/phrasings here (including the sentence you're quoting, which questions the well-definedness of "the outside view") and the fact that the central focus of the post is disputing an outside view argument. Apparently I did not communicate this clearly, because many commenters have objected to my vague references to possible outside views as if I were treating them as solid evidence, when in fact they aren't really a load bearing part of my argument here. Possibly the problem is that I don't think anyone has a good inside view either! But in fact I am just "radically" uncertain about AGI timelines - my uncertainty is ~in the exponent.
Still, I find your response a little ironic since this site is practically the only place I've seen the term "outside view" used. It does seem to be less common over the last year or two, since this post which you're probably referring to.
So, your claim is that interest rates would be very high if AGI were imminent, and they're not, so it's not. The last time someone said this, if the people arguing in the comment section had simply made a bet on interest rates changing, they would have made a lot of money! Ditto for buying up AI-related stocks or call options on those stocks.
Interesting, but a non sequitur. That is, either you believe that interest rates will predictably increase and there's free money on the table, and you should just say so, or not, and this anecdote doesn't seem to be relevant (similarly, I made money buying NVDA around that time, but I don't think that proves anything).
You could say that the "inventing important new ideas" part is going to be such a heavy bottleneck, that this speedup won't amount to much. But I think that's mostly wrong, and that if you asked ML researchers at OpenAI, a drop in remote worker that could "only" be directed to do things that otherwise took 12 hours would speed up their work by a lot.
Perhaps, but shouldn't LLMs already be speeding up AI progress? And if so, shouldn't that already be reflected in METR's plot? Are you predicting superexponential growth here?
It's actually not circular at all. "Current AI research" has taken us from machines that can't talk to machines that can talk, write computer programs, give advice, etc. in about five years. That's the empirical evidence that you can make research progress doing "random" stuff. In the absence of further evidence, people are just expecting the thing that has happened over the last five years to continue. You can reject that claim, but at this point I think the burden of proof is on the people that do.
It seems to me that progress has been slowing for the last couple of years. If this trend continues, progress will stall.
I imagine you could reliably complete a PhD in many fields with a week-long time horizon, as long as you get good enough weekly feedback from a competent advisor. 1) Talk to advisor about what it takes to get a PhD. 2) Divide into a list of <1 week-long tasks. 3) Complete task 1, get feedback, revise list. 4) Either repeat the current task or move on to the new next task, depending on feedback. 5) Loop until complete. 5a) Every ten or so loops, check overall progress to date against the original requirements. Evaluate whether overall pace of progress is acceptable. If not, come up with possible new plans and get advisor feedback.
I think it's nearly impossible to create unexpected new knowledge this way.
As far as not believing the current paradigm could reach AGI, which paradigm do you mean? I don't think "random variation and rapid iteration" is a fair assessment of the current research process. But even if it were, what should I do with that information? Well, luckily we have a convenient example of what it takes for blind mutations with selection pressure to raise intelligence to human levels: us! I am pretty confident saying that current LLMs would outperform, say, Australopithecus, on any intellectual ability, but not Homo sapiens. So that happened over a few million years, let's say 200k generations of 10-100k individuals each, in which intelligence was one of many, many factors weakly driving selection pressure with at most a small number of variations per generation. I can't really quantify how much human intelligence and directed effort speed up progress compared to blind chance, but consider that 1) a current biology grad student can do things with genetics in an afternoon that evolution needs thousands of generations and millions of individuals or more to do, and 2) the modern economic growth rate, essentially a sum of the impacts of human insight on human activity, is around 15000x faster than it was in the paleolithic. Naively extrapolated, this outside view would tell me that science and engineering can take us from Australopithecus-level to human-level in about 13 generations (unclear which generation we're on now). The number of individuals needed per generation is dependent on how much we vary each individual, but plausibly in the single or double digits.
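Spelling out that naive extrapolation, since the numbers are doing all the work (these are the rough figures from the paragraph above, not careful estimates):

```python
# Back-of-the-envelope version of the extrapolation above.
generations_blind_evolution = 200_000    # rough Australopithecus -> Homo sapiens
speedup_from_directed_effort = 15_000    # modern growth rate vs. the paleolithic

generations_of_directed_research = generations_blind_evolution / speedup_from_directed_effort
print(round(generations_of_directed_research, 1))  # ~13.3
```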
I can't parse this.
My disagreement with your conclusion from your third objection is that scaling inference time compute increases performance within a generation, but that's not how the iteration goes between generations. We use reasoning models with more inference time compute to generate better data to train better base models to more efficiently reproduce similar capability levels with less compute to build better reasoning models. So if you build the first superhuman coder and find it's expensive to run, what's the most obvious next step in the chain? Follow the same process as we've been following for reasoning models and if straight lines on graphs hold, then six months later we'll plausibly have one that's a tenth the cost to run. Repeat again for the next six months after that.
You're probably right about distilling CoT.
Upskilling bright young people “to do alignment” is tricky to do in a systematic way, since bright young people want / need to do whatever they’re curious about.
Similarly, if the METR trend continues I will become very worried that AGI is near.
Yes. Though I find it a bit hard to visualize a 4-hour software engineering task that can't be done in 1 hour, so I'm more confident that there won't be a 16x or so improvement in 2 years.
I don’t think I have anything better than a guess about any of those questions.
I think this is at least superficially a reasonable interpretation, and if the new linear relationship continues then I'd be convinced it's right, but I wish you had engaged more with the arguments I made in the post, or could be a bit more explicit about which ones you don't follow.
Basically, I just have very low confidence in putting a line through these points because I don’t see a principled reason to expect a linear relationship to hold, and I see some reasons to expect that it should not.
Sure, I suppose that now I’ve started recklessly speculating about the future I might as well follow through.
I expect the departure to be pretty clear though, because we won't see superhuman AI engineers before 2030. Even that prediction needs to be operationalized a bit, of course.
Thanks for the links, I’ll look into it.
I agree that the S&P is pretty much reading tea leaves, the author of the interest rates post @basil.halperin has separately argued it is not reliable.
Does the METR report then update you towards later timelines?
(I am not saying it necessarily should, since obviously I have some doubts about the strength of its evidence)
I agree, but my experience of doing a PhD actually feels more like an integrated project of discovery and creation.
Unfortunately, those are two importantly different questions. While it is certainly possible that AGI will arrive before 2030, I am not sure that it would be predictable in advance without a "textbook of the (far) future" on deep learning. Of course, it is much more likely to be predictable a couple of days in advance than a year in advance, but even the former may conceivably be quite challenging depending on situational awareness of near-human-level models in training.
What would you expect the world to look like if AGI < 2030?
In many respects, similar to how it does look (e.g. AI passes the Turing test pretty easily / convincingly, is competent at some coding tasks, etc. etc.).
It's more "interesting" to discuss the departures:
- Claude would probably be better at Pokemon
- Maybe self-driving vehicles are more common / robust
- I think the METR plot would look different. I don't think task length should be the bottleneck. I think that indicates that something isn't scalable. But I am not sure.
- There would perhaps be a few novel insights or discoveries from LLMs in a few domains where they have an advantage over us. I am not convinced this is true and in fact absence of evidence is evidence of absence: https://www.lesswrong.com/posts/GADJFwHzNZKg2Ndti/have-llms-generated-novel-insights
- I would weakly expect to find AI a little more useful, but more strongly, I would expect to be finding it increasingly useful over the last couple of years, and I don't.
- "Hallucinations" (I think @Abram Demski has a better term I can't remember) would be easier to get rid of.
Each point is capable of shifting my opinion to a greater or lesser degree. Mainly seeing novel insights would change my mind. If METR's results hold up that will also change my mind.
Yeah.
I remember enjoying that post (perhaps I even linked it somewhere?) and I think it’s probably the case that the inefficiency in task length scaling has to do with LLMs having only a subset of cognitive abilities available. I’m not really committed to a view on that here though.
The links don’t seem to prove that the points are “inaccurate.”
I think it was a mistake to focus particularly on AI 2027 when the METR report does not seem to play a prominent role there.
This all seems right - this is probably my most (and only?) “combative” post and I wish I’d toned it down a bit.
I don’t think either is reliable. It’s just an example of a reasonable baseline outside view.
Interesting, but still apparently on a significantly slower doubling time than the reasoning models?
I have the same experience, whenever I try to explain AI X-risk to a “layman” they want a concrete story about how AGI could take over.
My main disagreement is the speed, but not because I expect everything to happen more slowly by some constant factor. Instead I think there’s a missing mood here regarding the obstacles to building AGI, and the time to overcome those obstacles is not clear (which is why my timeline uncertainty is still ~in the exponent).
In particular, I think the first serious departure from my model of LLMs (linked above) is the neuralese section. It seems to me that for this to really work (in a way comparable to how human brains have recurrence) would require another breakthrough at least on the level of transformers, if not harder. So, if the paper from Hao et al. is actually followed up on by future research that successfully scales, that would be a crux for me. Your explanation that the frontier labs haven't adopted this for GPU utilization reasons seems highly implausible to me. These are creative people who want to reach AGI, and it seems obvious that the kind of tasks that aren't conquered yet look a lot like the ones that need recurrence. Do you really think none of them have significantly invested in this (starting years ago, when it became obvious this was a bottleneck)? The fact that we still need CoT at all tells me neuralese is not happening because we don't know how to do it. Please refer to my post for more details on this intuition and its implications. In particular, I am not convinced this is the final bottleneck.
I also depart from certain other details later; for instance, I think we'll have better theory by the time we need to align human-level AI, and "muddling through" by blind experimentation probably won't work or be the actual path taken by surviving worlds.
My other points of disagreement seem less cruxy and are mostly downstream.
I agree with all of this.
I haven't heard this said explicitly before but it helps me understand your priorities a lot better.
I usually don't think of "building a safer LLM agent" as a viable route to aligned AI
I agree that building a safer LLM agent is an incredibly fraught path that probably doesn't work. My comment is in the context of Abram's first approach, developing safer AI tech that companies might (apparently voluntarily) switch to, and specifically the route of scaling up IB to compete with LLM agents. Note that Abram also seems to be discussing the AI 2027 report, which if taken seriously requires all of this to be done in about 2 years. Conditioning on this route, I suggest that most realistic paths look like what I described, but I am pretty pessimistic that this route will actually work. The reason is that I don't see explicitly Bayesian glass-box methods competing with massive black-box models at tasks like natural language prediction any time soon. But who knows, perhaps with the "true" (IB?) theory of agency in hand much more is possible.
More importantly, I believe that we need to complete the theory of agents first, before we can have strong confidence about which approaches are more promising.
I'm not sure it's possible to "complete" the theory of agents, and I am particularly skeptical that we can do it any time soon. However, I think we agree locally / directionally, because it also seems to me that a more rigorous theory of agency is necessary for alignment.
As to heuristic implementations of infra-Bayesianism, this is something I don't want to speculate about in public, it seems exfohazardous.
Fair enough, but in that case, it seems impossible for this conversation to meaningfully progress here.
EDIT: I think that I miscommunicated a bit initially and suggest reading my response to Vanessa before this comment for necessary context.
I hesitate to call this a mesa-optimizer. Although good epistemics involves agency in principle (especially time-bounded epistemics), I think we can sensibly differentiate between mesa-optimizers and mere mesa-induction. But perhaps you intended this stronger reading, in support of your argument. If so, I'm not sure why you believe this. (No, I don't find "planning ahead" results to be convincing -- I feel this can still be purely epistemic in a relevant sense.)
I am fine with using the term mesa-induction. I think induction is a restricted type of optimization, but I suppose you associate the term mesa-optimizer with agency, and that is not my intended message.
I think there are neural architectures close to the current paradigm which don't directly train whole chains-of-thought on a reinforcement signal to achieve agenticness. This paradigm is analogous to model-free reinforcement learning. What I would suggest is more analogous to model-based reinforcement learning, with corresponding benefits to transparency. (Super speculative, of course.)
I don't think the chain of thought is necessary, but routing through pure sequence prediction in some fashion seems important for the current paradigm (that is what I call scaffolding). I expect that it is possible in principle to avoid this and do straight model-based RL, but forcing that approach to quickly catch up with LLMs / foundation models seems very hard and not necessarily desirable. In fact by default this seems bad for transparency, but perhaps some IB-inspired architecture is more transparent.
Can't she just try this to decide whether there seems to be mutual interest / test the waters, and then fall back to asking if he doesn't ask first? These things are not mutually exclusive.
I didn't really "get it" but this paper may be interesting to you: https://arxiv.org/pdf/2502.15820