Posts

A Crisper Explanation of Simulacrum Levels 2023-12-23T22:13:52.286Z
Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) 2023-12-22T20:19:13.865Z
Most People Don't Realize We Have No Idea How Our AIs Work 2023-12-21T20:02:00.360Z
How Would an Utopia-Maximizer Look Like? 2023-12-20T20:01:18.079Z
Don't Share Information Exfohazardous on Others' AI-Risk Models 2023-12-19T20:09:06.244Z
The Shortest Path Between Scylla and Charybdis 2023-12-18T20:08:34.995Z
A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans 2023-12-17T20:28:57.854Z
"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity 2023-12-16T20:08:39.375Z
Current AIs Provide Nearly No Data Relevant to AGI Alignment 2023-12-15T20:16:09.723Z
Hands-On Experience Is Not Magic 2023-05-27T16:57:10.531Z
A Case for the Least Forgiving Take On Alignment 2023-05-02T21:34:49.832Z
World-Model Interpretability Is All We Need 2023-01-14T19:37:14.707Z
Internal Interfaces Are a High-Priority Interpretability Target 2022-12-29T17:49:27.450Z
In Defense of Wrapper-Minds 2022-12-28T18:28:25.868Z
Accurate Models of AI Risk Are Hyperexistential Exfohazards 2022-12-25T16:50:24.817Z
Corrigibility Via Thought-Process Deference 2022-11-24T17:06:39.058Z
Value Formation: An Overarching Model 2022-11-15T17:16:19.522Z
Greed Is the Root of This Evil 2022-10-13T20:40:56.822Z
Are Generative World Models a Mesa-Optimization Risk? 2022-08-29T18:37:13.811Z
AI Risk in Terms of Unstable Nuclear Software 2022-08-26T18:49:53.726Z
Broad Picture of Human Values 2022-08-20T19:42:20.158Z
Interpretability Tools Are an Attack Channel 2022-08-17T18:47:28.404Z
Convergence Towards World-Models: A Gears-Level Model 2022-08-04T23:31:33.448Z
What Environment Properties Select Agents For World-Modeling? 2022-07-23T19:27:49.646Z
Goal Alignment Is Robust To the Sharp Left Turn 2022-07-13T20:23:58.962Z
Reframing the AI Risk 2022-07-01T18:44:32.478Z
Is This Thing Sentient, Y/N? 2022-06-20T18:37:59.380Z
The Unified Theory of Normative Ethics 2022-06-17T19:55:19.588Z
Towards Gears-Level Understanding of Agency 2022-06-16T22:00:17.165Z
Poorly-Aimed Death Rays 2022-06-11T18:29:55.430Z
Reshaping the AI Industry 2022-05-29T22:54:31.582Z
Agency As a Natural Abstraction 2022-05-13T18:02:50.308Z

Comments

Comment by Thane Ruthenis on Corrigibility = Tool-ness? · 2024-06-28T03:00:42.546Z · LW · GW

(Written while I'm at the section titled "Respecting Modularity".)

My own working definition of "corrigibility" has been something like "an AI system that obeys commands, and only produces effects through causal pathways that were white-listed by its human operators, with these properties recursively applied to its interactions with its human operators".

In a basic case, if you tell it to do something, like "copy a strawberry" or "raise the global sanity waterline", it's going to give you a step-by-step outline of what it's going to do, how these actions are going to achieve the goal, how the resultant end-state is going to be structured (the strawberry's composition, the resultant social order), and what predictable effects all of this would have (both direct effects and side-effects).

So if it's planning to build some sort of nanofactory that boils the oceans as a side-effect, or deploy Basilisk hacks that exploit some vulnerability in the human psyche to teach people stuff, it's going to list these pathways, and you'd have the chance to veto them. Then you'd get it to generate some plans that work through causal pathways you do approve of, like "normal human-like persuasion that doesn't circumvent the interface of the human mind / doesn't make the abstraction "the human mind" leak / doesn't violate the boundaries of the human psyche".

It's also going to adhere to this continuously: e. g., if it discovers a new causal pathway and realizes the plan it's currently executing has effects through it, it's going to seek urgent approval from the human operators (while somehow safely halting its plan using a procedure for this that it previously designed with its human operators, or something).

And this should somehow apply recursively. The AI should only interact with the operators through pathways they've approved of. E. g., using only "mundane" human-like ways to convey information; no deploying Basilisk hacks to force-feed them knowledge, no directly rewriting their brains with nanomachines, not even hacking their phones to be able to talk to them while they're outside the office.

(How do we get around the infinite recursion here? I have no idea, besides "hard-code some approved pathways into the initial design".)
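
(To make the structure concrete, here's a toy sketch of that propose/veto/execute loop. Purely illustrative: all the step names, pathway labels, and helpers are hypothetical placeholders, not a claim about how any of this would actually be implemented.)

```python
# Toy sketch: every step of a plan is annotated with the causal pathways it acts
# through; pathways outside the operator-approved whitelist trigger a halt and an
# explicit approval request before execution continues.
APPROVED_PATHWAYS = {"ordinary persuasion", "physical construction"}

plan = [
    {"step": "explain the proposal publicly", "pathways": {"ordinary persuasion"}},
    {"step": "build the facility", "pathways": {"physical construction"}},
    {"step": "deploy a Basilisk hack", "pathways": {"psychological exploit"}},
]

def ask_operator(step_name, new_pathways):
    # Stand-in for the human-in-the-loop check; this toy operator always vetoes.
    print(f"Step {step_name!r} acts through unapproved pathways {new_pathways}. Approve?")
    return False

for step in plan:
    unapproved = step["pathways"] - APPROVED_PATHWAYS
    if unapproved and not ask_operator(step["step"], unapproved):
        print(f"Halting before {step['step']!r}; replanning through approved pathways only.")
        break
    print(f"Executing {step['step']!r}.")
```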

And then the relevant set of "causal pathways" probably factors through the multi-level abstract structure of the environment. For any given action, there is some set of consequences that is predictable and goes into the AI's planning. This set is relatively small, and could be understood by a human. Every consequence outside this "small" set is unpredictable, and basically devolves into high-entropy noise; not even an ASI could predict the outcome. (Think this post.) And if we look at the structure of the predictable-consequences sets across time, we'd find rich symmetries, forming the aforementioned "pathways" through which subsystems/abstractions interact.

(I've now read the post.)

This seems to fit pretty well with your definition? Visibility: check, correctability: check. The "side-effects" property only partly fits – by my definition, a corrigible AI is allowed to have all sorts of side-effects, but these side-effects must be known and approved by its human operator – but I think it's gesturing at the same idea. (Real-life tools also have lots of side effects, e. g. vibration and noise pollution from industrial drills – but we try to minimize these side-effects. And inasmuch as we fail, the resultant tools are considered "bad", worse than the versions of these tools without the side-effects.)

Comment by Thane Ruthenis on Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data · 2024-06-23T13:45:59.761Z · LW · GW

That was my interpretation as well.

I think it does look pretty alarming if we imagine that this scales, i. e., if these learned implicit concepts can build on each other. Which they almost definitely can.

The "single-step" case, of the SGD chiseling-in a new pattern which is a simple combination of two patterns explicitly represented in the training data, is indeed unexciting. But once that pattern is there, the SGD can chisel-in another pattern which uses the first implicit pattern as a primitive. Iterate on, and we have a tall tower of implicit patterns building on implicit patterns, none of which are present in the training data, and which can become arbitrarily more sophisticated and arbitrarily more alien than anything in the training set. And we don't even know what they are, so we can't assess their safety, and can't even train them out (because we don't know how to elicit them).

Which, well, yes: we already knew all of this was happening. But I think this paper is very useful in clearly showcasing this.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling in some module for merely predicting f(blah).

Which implies some interesting things about how the representations are stored. The LLM actually "understands" what f(blah) is built out of, in a way that's accessible to its externalized monologue. That wasn't obvious to me, at least.

Comment by Thane Ruthenis on The Leopold Model: Analysis and Reactions · 2024-06-16T23:51:56.307Z · LW · GW

I believe Xi (or choose your CCP representative) would say that the ultimate goal is human flourishing

I'm very much worried that this sort of thinking is a severe case of Typical Mind Fallacy.

I think the main terminal values of the individuals constituting the CCP – and I do mean terminal, not instrumental – are the preservation of their personal status, power, and control, like the values of ~all dictatorships, and most politicians in general. Ideology is mostly just an aesthetic, a tool for internal and external propaganda/rhetoric, and a backdrop for internal status games.

There probably are some genuine shards of ideology in their minds. But I expect minuscule overlap between their at-face-value ideological messaging, and the future they'd choose to build if given unchecked power.

On the other hand, if viewed purely as an organization/institution, I expect that the CCP doesn't have coherent "values" worth talking about at all. Instead, it is best modeled as a moral-maze-like inertial bureaucracy/committee which is just replaying instinctive patterns of behavior.

I expect the actual "CCP" would be something in-between: it would intermittently act as a collection of power-hungry ideology-biased individuals, and as an inertial institution. I have no idea how this mess would actually generalize "off-distribution", as in, outside the current resource, technology, and power constraints. But I don't expect the result to be pretty.

Mind, something similar holds for the USG too, if perhaps to a lesser extent.

Comment by Thane Ruthenis on On Dwarksh’s Podcast with Leopold Aschenbrenner · 2024-06-11T22:49:25.351Z · LW · GW

Maybe they develop mind control level convincing argument and send it to key people (president, congress, NORAD, etc) or hack their iPhones and recursively down to security guards of fabs/power plants/data centers/drone factories. That may be quick enough. The point is that it is not obvious.

That's the sort of thing that'd happen, yes. As with all AI takeover scenarios, it likely wouldn't go down like this specifically, but you can be sure that the ASI, if aligned, would achieve the goal it wants to achieve/was told to achieve. (And see this post for my model of what this class of concrete scenarios would actually look like.)

Having nukes is not really a good analogy for having an aligned ASI at your disposal, as far as taking over the world is concerned. Unless your terminal value is human extinction, you can't really nuke the world into the state of your personal utopia. You can't even use nukes as leverage to threaten people into building your utopia, because: 

  1. Some people are good enough at decision theory to ignore threats.
  2. Coercing people in this way might not actually be part of your utopia.
  3. Your "power" is brittle. You only have the threat of nuclear armageddon to fall back on, and you can still be defeated by e. g. clever infiltration and sabotage, or by taking over your supply chains, etc. (If you have overwhelming, utterly loyal military power and security in full generality, that's a very different setup.)

None of those constraints apply to having an ASI at your disposal. An ASI would let you implement your values upon the cosmos fully and faithfully, and it'd give you the roadmap to getting there from here.

This is also precisely why Leopold's talk of "checks and balances" as the reason why governments could be trusted with AGI falls apart. "The government" isn't some sort of holistic entity, it's a collection of individuals with their own incentives, sometimes quite monstrous incentives. In the current regime, it's indeed checked-and-balanced to be mostly sort-of (not really) aligned to the public good. But that property is absolutely not robust to you giving unchecked power to any given subsystem in it!

I'm really quite baffled that Leopold doesn't get this, given his otherwise excellent analysis of the "authoritarianism risks" associated with aligned ASIs in the hands of private companies and the CCP. Glad to see @Zvi pointing that out.

Comment by Thane Ruthenis on My AI Model Delta Compared To Yudkowsky · 2024-06-10T21:02:21.332Z · LW · GW

We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.

For context, I'm familiar with this view from the ELK report. My understanding is that this is part of the "worst-case scenario" for alignment that ARC's agenda is hoping to solve (or, at least, still hoped to solve a ~year ago).

To quote:

The paradigmatic example of an ontology mismatch is a deep change in our understanding of the physical world. For example, you might imagine humans who think about the world in terms of rigid bodies and Newtonian fluids and “complicated stuff we don’t quite understand,” while an AI thinks of the world in terms of atoms and the void. Or we might imagine humans who think in terms of the standard model of physics, while an AI understands reality as vibrations of strings. We think that this kind of deep physical mismatch is a useful mental picture, and it can be a fruitful source of simplified examples, but we don’t think it’s very likely.

We can also imagine a mismatch where AI systems use higher-level abstractions that humans lack, and are able to make predictions about observables without ever thinking about lower-level abstractions that are important to humans. For example we might imagine an AI making long-term predictions based on alien principles about memes and sociology that don’t even reference the preferences or beliefs of individual humans. Of course it is possible to translate those principles into predictions about individual humans, and indeed this AI ought to make good predictions about what individual humans say, but if the underlying ontology is very different we are at risk of learning the human simulator instead of the “real” mapping.

Overall we are by far most worried about deeply “messy” mismatches that can’t be cleanly described as higher- or lower-level abstractions, or even what a human would recognize as “abstractions” at all. We could try to tell abstract stories about what a messy mismatch might look like, or make arguments about why it may be plausible, but it seems easier to illustrate by thinking concretely about existing ML systems.

[It might involve heuristics about how to think that are intimately interwoven with object level beliefs, or dual ways of looking at familiar structures, or reasoning directly about a messy tapestry of correlations in a way that captures important regularities but lacks hierarchical structure. But most of our concern is with models that we just don’t have the language to talk about easily despite usefully reflecting reality. Our broader concern is that optimistic stories about the familiarity of AI cognition may be lacking in imagination. (We also consider those optimistic stories plausible, we just really don’t think we know enough to be confident.)]

So I understand the shape of the argument here.

... But I never got this vibe from Eliezer/MIRI. As I previously argued, I would say that their talk of different internal ontologies and alien thinking is mostly about precisely that: different cognition. The argument is that AGIs won't have "emotions", or a System 1/System 2 split, or "motivations" the way we understand them – instead, they'd have a bunch of components that fulfill the same functions these components fulfill in humans, but split and recombined in a way that has no analogues in the human mind.

Hence, it would be difficult to make AGI agents "do what we mean" – but not necessarily because there's no compact way to specify "what we mean" in the AGI's ontology, but because we'd have no idea how to specify "do this" in terms of the program flows of the AGI's cognition. Where are the emotions? Where are the goals? Where are the plans? We can identify the concept of "eudaimonia" here, but what the hell is this thought-process doing with it? Making plans about it? Refactoring it? Nothing? Is this even a thought process?

This view doesn't make arguments about the AGI's world-model specifically. It may or may not be the case that any embedded agent navigating our world would necessarily have nodes in its model approximately corresponding to "humans", "diamonds", and "the Golden Gate Bridge". This view is simply cautioning against anthropomorphizing AGIs.

Roughly speaking, imagine that any mind could be split into a world-model and "everything else": the planning module, the mesa-objective, the cached heuristics, et cetera. The MIRI view focuses on claiming that the "everything else" would be implemented in a deeply alien manner.

The MIRI view may be agnostic regarding the Natural Abstraction Hypothesis as well, yes. The world-model might also be deeply alien, and the very idea of splitting an AGI's cognition into a world-model and a planner might itself be an unrealistic artefact of our human thinking.

But even if the NAH is true, the core argument would still go through, in (my model of) the MIRI view.

And I'd say the-MIRI-view-conditioned-on-assuming-the-NAH-is-true would still have p(doom) at 90+%: because it's not optimistic regarding anyone anywhere solving the natural-abstractions problem before the blind-tinkering approach of AGI labs kills everyone.

(I'd say this is an instance of an ontology mismatch between you and the MIRI view, actually. The NAH abstraction is core to your thinking, so you factor the disagreement through that lens. But the MIRI view doesn't think in those precise terms!)

Comment by Thane Ruthenis on Natural Latents Are Not Robust To Tiny Mixtures · 2024-06-08T16:30:45.451Z · LW · GW

Another angle to consider: in this specific scenario, would realistic agents actually derive natural latents for the full distributions here, as opposed to deriving two mutually incompatible latents for the mixture components, then working with a probability distribution over those latents?

Intuitively, that's how humans operate if they have two incompatible hypotheses about some system. We don't derive some sort of "weighted-average" ontology for the system, we derive two separate ontologies and then try to distinguish between them.

This post comes to mind:

If you only care about betting odds, then feel free to average together mutually incompatible distributions reflecting mutually exclusive world-models. If you care about planning then you actually have to decide which model is right or else plan carefully for either outcome.

Like, "just blindly derive the natural latent" is clearly not the whole story about how world-models work. Maybe realistic agents have some way of spotting setups structured the way the OP is structured, and then they do something more than just deriving the latent.

Comment by Thane Ruthenis on Natural Latents Are Not Robust To Tiny Mixtures · 2024-06-08T16:23:20.336Z · LW · GW

Sure, but what I question is whether the OP shows that the type signature wouldn't be enough for realistic scenarios where we have two agents trained on somewhat different datasets. It's not clear that their datasets would be different in the same way the two distributions here are different.

Comment by Thane Ruthenis on Natural Latents Are Not Robust To Tiny Mixtures · 2024-06-08T14:04:18.288Z · LW · GW

I do see the intuitive angle of "two agents exposed to mostly-similar training sets should be expected to develop the same natural abstractions, which would allow us to translate between the ontologies of different ML models and between ML models and humans", and I see that this post illustrates how one operationalization of this idea fails.

However if there are multiple different concepts that fit the same natural latent but function very differently 

That's not quite what this post shows, I think? It's not that there are multiple concepts that fit the same natural latent, it's that if we have two distributions that are judged very close by the KL divergence, and we derive the natural latents for them, they may turn out drastically different. The two agents legitimately live in epistemically very different worlds!

Which is likely not actually the case for slightly different training sets, or LLMs' training sets vs. humans' life experiences. Those are very close on some metric, and it now seems that the right metric isn't (just) the KL divergence.

Comment by Thane Ruthenis on Natural Latents Are Not Robust To Tiny Mixtures · 2024-06-07T20:40:43.571Z · LW · GW

Coming from another direction: a 50-bit update can turn one of these distributions into the other, or vice-versa. So one thing this example shows is that natural latents, as they're currently formulated, are not necessarily robust to even relatively small updates, since 50 bits can quite dramatically change a distribution.

Are you sure this is undesired behavior? Intuitively, small updates (relative to the information-content size of the system we're updating on) can drastically change how we're modeling a particular system, and into what abstractions we decompose it. E. g., suppose we have two competing theories regarding how to predict the neural activity in the human brain, and a new paper comes out with some clever (but informationally compact) experiment that yields decisive evidence in favour of one of those theories. That's pretty similar to the setup in the post here, no? And reading this paper would lead to significant ontology shifts in the minds of the researchers who read it.

Which brings to mind How Many Bits Of Optimization Can One Bit Of Observation Unlock?, and the counter-example there...

Indeed, now that I'm thinking about it, I'm not sure the quantity "the number of bits needed for the update" is in any way interesting at all? Consider that the researchers' minds could be updated by reading the paper and examining the experimental procedure in detail (a "medium" number of bits), by looking at the raw output data and then replicating the paper (a "large" number of bits), or just by reading the names of the authors and skimming the abstract (a "small" number of bits).

There doesn't seem to be a direct causal connection between the system's size and the number of bits needed to drastically update on its structure at all? You seem to expect some sort of proportionality between the two, but I think the size of one is straight-up independent of the size of the other if you let the nature of the communication channel between the system and the agent-doing-the-updating vary freely (i. e., if you're uncertain regarding whether it's "direct observation of the system" OR "trust in science" OR "trust in the paper's authors" OR ...).[1]

Indeed, merely describing how you need to update using high-level symbolic languages, rather than by throwing raw data about the system at you, already shaves off a ton of bits, decoupling "the size of the system" from "the size of the update".
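
To put toy numbers on that (my own illustration, not anything from the OP): suppose the two competing theories from the example above start at even odds, and the paper's experiment carries a 50-bit likelihood ratio between them. Then

$$\frac{P(T_1 \mid E)}{P(T_2 \mid E)} = \frac{P(T_1)}{P(T_2)} \cdot \frac{P(E \mid T_1)}{P(E \mid T_2)} = 1 \cdot 2^{50},$$

i. e. the posterior on the losing theory drops to roughly 10^-15. Nothing in this arithmetic references how many bits it would take to describe either theory, or the system they're theories of.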

Perhaps the KL divergence really isn't the right metric to use here? The motivation for having natural abstractions in your world-model is that they make the world easier to predict for the purposes of controlling said world. So similar-enough natural abstractions would recommend the same policies for navigating that world. Back-tracking further, the distributions that would give rise to similar-enough natural abstractions would be distributions corresponding to worlds whose navigation policies are similar enough...

I. e., the distance metric would need to take interventions/the do() operator into account. Something like SID comes to mind (but not literally SID, I expect).

  1. ^

    Though there may be some more interesting claim regarding that entire channel? E. g., that if the agent can update drastically just based on a few bits output by this channel, we have to assume that the channel contains "information funnels" which compress/summarize the raw state of the system down? That these updates have to be entangled with at least however-many-bits describing the ground-truth state of the system, for them to be valid?

Comment by Thane Ruthenis on What do coherence arguments actually prove about agentic behavior? · 2024-06-04T22:46:56.108Z · LW · GW

I think the main "next piece" missing is that Eliezer basically rejects the natural abstraction hypothesis

Mu, I think. I think the MIRI view on the matter is that the internal mechanistic implementation of an AGI-trained-by-the-SGD would be some messy overcomplicated behemoth. Not a relatively simple utility-function plus world-model plus queries on it plus cached heuristics (or whatever), but a bunch of much weirder modules kludged together in a way such that their emergent dynamics result in powerful agentic behavior.[1]

The ontological problems with alignment would stem not from the fact that the AI is using alien concepts, but from its own internal dynamics being absurdly complicated and alien. It wouldn't have a well-formatted mesa-objective, for example, or "emotions", or a System 1 vs System 2 split, or explicit vs. tacit knowledge. It would have a dozen other things which fulfill the same functions that the aforementioned features of human minds fulfill in humans, but they'd be split up and recombined in entirely different ways, such that most individual modules would have no analogues in human cognition at all.

Untangling it would be a "second tier" of the interpretability problem, one that current interpretability research hasn't yet even glimpsed.

And, sure, maybe at some higher level of organization, all that complexity would be reducible to simple-ish agentic behavior. Maybe a powerful-enough pragmascope would be able to see past all that and yield us a description of the high-level implementation directly. But I don't think the MIRI view is hopeful regarding getting such tools.

Whether the NAH is or is not true doesn't really enter into it.

Could be I'm failing the ITT here, of course. But this post gives me this vibe, as does this old write-up. Choice quote[2]:

The reason why we can’t bind a description of ‘diamond’ or ‘carbon atoms’ to the hypothesis space used by AIXI or AIXI-tl is that the hypothesis space of AIXI is all Turing machines that produce binary strings, or probability distributions over the next sense bit given previous sense bits and motor input. These Turing machines could contain an unimaginably wide range of possible contents

(Example: Maybe one Turing machine that is producing good sequence predictions inside AIXI, actually does so by simulating a large universe, identifying a superintelligent civilization that evolves inside that universe, and motivating that civilization to try to intelligently predict future future bits from past bits (as provided by some intervention). To write a formal utility function that could extract the ‘amount of real diamond in the environment’ from arbitrary predictors in the above case, we’d need the function to read the Turing machine, decode that universe, find the superintelligence, decode the superintelligence’s thought processes, find the concept (if any) resembling ‘diamond’, and hope that the superintelligence had precalculated how much diamond was around in the outer universe being manipulated by AIXI.)

Obviously it's talking about AIXI, not ML models, but I assume the MIRI view has a directionally similar argument regarding them.

Or, in other words: what the MIRI view rejects isn't the NAH, but some variant of the simplicity-prior argument. It doesn't believe that the SGD would yield nicely formatted agents; that the ML training loops produce pressures shaping minds this way.[3]

  1. ^

    This powerful agentic behavior would then of course be able to streamline its own implementation, once it's powerful enough, but that's what the starting point would be – and also what we'd need to align, since once it has the extensive self-modification capabilities to streamline itself, it'd be too late to tinker with it.

  2. ^

    Although now that I'm looking at it, this post is actually a mirror of the Arbital page, which has three authors, so I'm not entirely sure this segment was written by Eliezer...

  3. ^

    Note that this also means that formally solving the Agent-Like Structure Problem wouldn't help us either. It doesn't matter how theoretically perfect embedded agents are shaped, because the agent we'd be dealing with wouldn't be shaped like this. Knowing how it's supposed to be shaped would help only marginally, at best giving us a rough idea regarding how to start untangling the internal dynamics.

Comment by Thane Ruthenis on Talent Needs of Technical AI Safety Teams · 2024-05-31T14:25:47.935Z · LW · GW

Counter-counter-argument: safety-motivated people, especially those entering at a low level, have ~zero ability to change anything for the better internally, while they could usefully contribute elsewhere; meanwhile, the presence of token safety-motivated people at OpenAI improves OpenAI's ability to safety-wash its efforts (by pointing at them and going "look how many resources we're giving them!", as was attempted with Superalignment).

Comment by Thane Ruthenis on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-18T22:22:54.157Z · LW · GW

How were you already sure of this before the resignations actually happened?

OpenAI enthusiastically commercializing AI + the "Superalignment" approach being exactly the approach I'd expect someone doing safety-washing to pick + the November 2023 drama + the stated trillion-dollar plans to increase worldwide chip production (which are directly at odds with the way OpenAI previously framed its safety concerns).

Some of the preceding resignations (chiefly, Daniel Kokotajlo's) also played a role here, though I didn't update off of them much either.

Comment by Thane Ruthenis on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-18T22:10:01.022Z · LW · GW

Superalignment likely happened because (a) the safety faction (Ilya/Jan/etc.) wanted it, and (b) the Sam faction also wanted it, or tolerated it, or agreed to it due to perceived PR benefits (safety-washing), or let it happen as a result of internal negotiation/compromise, or something else, or some combination of these things.

Sure, that's basically my model as well. But if the faction (b) only cares about alignment due to perceived PR benefits or in order to appease faction (a), and faction (b) turns out to have overriding power such that it can destroy or drive out faction (a) and then curtail all the alignment efforts, I think it's fair to compress all that into "OpenAI's alignment efforts are safety-washing". If (b) has the real power within OpenAI, then OpenAI's behavior and values can be approximately rounded off to (b)'s behavior and values, and (a) is a rounding error.

If OAI as a whole was really only doing anything safety-adjacent for pure PR or virtue signaling reasons, I think its activities would have looked pretty different

Not if (b) is concerned about fortifying OpenAI against future challenges, such as hypothetical futures in which the AGI Doomsayers get their way and the government/the general public wakes up and tries to nationalize or ban AGI research. In that case, having a prepared, well-documented narrative of going above and beyond to ensure that their products are safe, well before any other parties woke up to the threat, would leave OpenAI much better positioned to retain control over its research.

(I interpret Sam Altman's behavior at Congress as evidence for this kind of longer-term thinking. He didn't try to downplay the dangers of AI, which would be easy and what someone myopically optimizing for short-term PR would do. He proactively brought up the concerns that future AI progress might awaken, getting ahead of them, thereby establishing OpenAI as taking these concerns seriously and putting himself in a position to control/manage them.)

And it's approximately what I would do, at least, if I were in charge of OpenAI and had a different model of AGI Ruin.

And this is the potential plot whose partial failure I'm currently celebrating.

Comment by Thane Ruthenis on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-16T08:32:57.785Z · LW · GW

That's good news.

There was a brief moment, back in 2023, when OpenAI's actions made me tentatively optimistic that the company was actually taking alignment seriously, even if its model of the problem was broken.

Everything that happened since then has made it clear that this is not the case; that all these big flashy commitments like Superalignment were just safety-washing and virtue signaling. They were only going to do alignment work inasmuch as that didn't interfere with racing full-speed towards greater capabilities.

So these resignations don't negatively impact my p(doom) in the obvious way. The alignment people at OpenAI were already powerless to do anything useful regarding changing the company direction.

On the other hand, what these resignations do is showcase that fact. Inasmuch as Superalignment was a virtue-signaling move meant to paint OpenAI as caring deeply about AI Safety, having so many people working on it resign or get fired starkly signals the opposite.

And it's good to have that more in the open; it's good that OpenAI loses its pretense.

Oh, and it's also good that OpenAI is losing talented engineers, of course.

Comment by Thane Ruthenis on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer · 2024-04-18T09:25:06.071Z · LW · GW

I think you're imagining that we modify the shrink-and-reposition functions each iteration, lowering their scope? I. e., that if we picked the topmost triangle for the first iteration, then in iteration two we pick one of the three sub-triangles making up the topmost triangle, rather than choosing one of the "highest-level" sub-triangles?

Something like this:

If we did it this way, then yes, we'd eventually end up jumping around an infinitesimally small area. But that's not how it works, we always pick one of the highest-level sub-triangles:

Note also that we take in the "global" coordinates of the point we shrink-and-reposition (i. e., its position within the whole triangle), rather than its "local" coordinates (i. e., position within the sub-triangle to which it was copied).
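
For concreteness, here's a minimal code sketch of the process as I understand it (my own illustration; I'm assuming the standard chaos-game setup in which each of the three maps is "shrink by half towards one of three fixed corners"):

```python
import random

# The three fixed corners of the (approximately equilateral) triangle.
CORNERS = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]

def step(point):
    # Always pick one of the same three top-level shrink-and-reposition maps...
    corner = random.choice(CORNERS)
    # ...and apply it to the point's global coordinates: move halfway towards that corner.
    return ((point[0] + corner[0]) / 2, (point[1] + corner[1]) / 2)

point = (random.random(), random.random())
trajectory = []
for _ in range(100_000):
    point = step(point)
    trajectory.append(point)

# Scatter-plotting `trajectory` (e.g. with matplotlib) traces out the Sierpinski
# triangle: the point keeps jumping between the three highest-level sub-triangles
# instead of getting confined to an ever-smaller region.
```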

Here's a (slightly botched?) video explanation.

Comment by Thane Ruthenis on How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)? · 2024-04-08T21:04:41.743Z · LW · GW

I'd say one of the main reasons is that military-AI technology isn't being optimized towards the things we're afraid of. We're concerned about generally intelligent entities capable of e. g. automated R&D and social manipulation and long-term scheming. Military-AI technology, last I checked, was mostly about teaching drones and missiles to fly straight and recognize camouflaged tanks and shoot designated targets while not shooting non-designated targets.

And while this still may result in a generally capable superintelligence in the limit (since "which targets would my commanders want me to shoot?" can be phrased as a very open-ended problem), it's not a particularly efficient way to approach this limit at all. Militaries, so far, just aren't really pushing in the directions where doom lies, while the AGI labs are doing their best to beeline there.

The proliferation of drone armies that could be easily co-opted by a hostile superintelligence... It's not that it has no impact on p(doom), but it's approximately a rounding error. A hostile superintelligence doesn't need extant drone armies; it could build its own, and co-opt humans in the meantime.

Comment by Thane Ruthenis on TurnTrout's shortform feed · 2024-03-05T10:58:39.240Z · LW · GW

I think that the key thing we want to do is predict the generalization of future neural networks.

It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screened off by the inductive biases and other technical mechanistic details of pretraining+RLHF. That would mean we don't need to bring game theory, economics, computer security, distributed systems, cognitive psychology, business, or history into it – we can just look at how ML systems work and are shaped, and predict everything we want about AGI-level systems from there.

I disagree. I do not think pretraining+RLHF is getting us there. I think we currently don't know what training/design process would get us to AGI. Which means we can't make closed-form mechanistic arguments about how AGI-level systems will be shaped by this process, which means the abstract often-intuitive arguments from other fields do have relevant things to say.

And I'm not seeing a lot of ironclad arguments that favour "pretraining + RLHF is going to get us to AGI" over "pretraining + RLHF is not going to get us to AGI". The claim that e. g. shard theory generalizes to AGI is at least as tenuous as the claim that it doesn't.

Flagging that this is one of the main claims which we seem to dispute; I do not concede this point FWIW.

I'd be interested if you elaborated on that.

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-23T06:10:37.632Z · LW · GW

I wouldn't call Shard Theory mainstream

Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, the ML field was doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis[1]).

judging by how bad humans are at [consistent decision-making], and how much they struggle to do it, they probably weren't optimized too strongly biologically to do it. But memetically, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this

Roughly agree, yeah.

But all of this is still just one piece on the Jenga tower

I kinda want to push back against this repeated characterization – I think quite a lot of my model's features are "one storey tall", actually – but it probably won't be a very productive use of either of our time. I'll get around to the "find papers empirically demonstrating various features of my model in humans" project at some point; that should be a more decent starting point for discussion.

What I want is to build non-Jenga-ish towers

Agreed. Working on it.

  1. ^

    Which, yeah, I think is false: scaling LLMs won't get you to AGI. But it's also kinda unfalsifiable using empirical methods, since you can always claim that another 10x scale-up will get you there.

Comment by Thane Ruthenis on AI #52: Oops · 2024-02-23T00:23:25.600Z · LW · GW

the model chose slightly wrong numbers

The engraving on humanity's tombstone be like.

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-22T19:11:21.770Z · LW · GW

The sort of thing that would change my mind: there's some widespread phenomenon in machine learning that perplexes most, but is expected according to your model

My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models if said models are contorted into weird shapes, or if their proponents engage in denialism of said phenomena.

Again, the drive for consistent decision-making is a good example. Common-sensically, I don't think we'd disagree that humans want their decisions to be consistent. They don't want to engage in wild mood swings, they don't want to oscillate wildly between which career they want to pursue or whom they want to marry: they want to figure out what they want and who they want to be with, and then act consistently with these goals in the long term. Even when they make allowances for changing their mind, they try to consistently optimize for making said allowances: for giving their future selves freedom/optionality/resources.

Yet it's not something e. g. the Shard Theory would naturally predict out-of-the-box, last I checked. You'd need to add structures on top of it until it basically replicates my model (which is essentially how I arrived at my model, in fact – see this historical artefact).

Comment by Thane Ruthenis on AI #51: Altman’s Ambition · 2024-02-22T00:55:09.899Z · LW · GW

I find the idea of morality being downstream from the free energy principle very interesting

I agree that there are some theoretical curiosities in the neighbourhood of the idea. Like:

  • Morality is downstream of generally intelligent minds reflecting on the heuristics/shards.
    • Which are downstream of said minds' cognitive architecture and reinforcement circuitry.
      • Which are downstream of the evolutionary dynamics.
        • Which are downstream of abiogenesis and various local environmental conditions.
          • Which are downstream of the fundamental physical laws of reality.

Thus, in theory, if we plug all of these dynamics one into another, and then simplify the resultant expression, we should actually get (a probability distribution over) the utility function that is "most natural" for this universe to generate! And the expression may indeed be relatively simple and have something to do with thermodynamics, especially if some additional simplifying assumptions are made.

That actually does seem pretty exciting to me! In an insight-porn sort of way.

Not in any sort of practical way, though[1]. All of this is screened off by the actual values actual humans actually have, and if the noise introduced at every stage of this process caused us to be aimed at goals wildly diverging from the "most natural" utility function of this universe... Well, sucks to be that utility function, I guess, but the universe screwed up installing corrigibility into us and the orthogonality thesis is unforgiving.

  1. ^

    At least, not with regards to AI Alignment or human morality. It may be useful for e. g. acausal trade/acausal normalcy: figuring out the prior for what kinds of values aliens are most likely to have, etc.[2]

  2. ^

    Or maybe for roughly figuring out what values the AGI that kills us all is likely going to have, if you've completely despaired of preventing that, and founding an apocalypse cult worshiping it. Wait a minute...

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-22T00:10:52.349Z · LW · GW

I'm very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.

I think my main problem with this is that it isn't based on anything

Hm. I wonder if I can get past this common reaction by including a bunch of references to respectable psychology/neurology/game-theory experiments, which "provide scientific evidence" that various common-sensical properties of humans are actually real? Things like fluid vs. crystallized intelligence, the g-factor, global workspace theory, situations in which humans do try to behave approximately like rational agents... There probably also are some psychology-survey results demonstrating stuff like "yes, humans do commonly report wanting to be consistent in their decision-making rather than undergoing wild mood swings and acting at odds with their own past selves", which would "provide evidence" for the hypothesis that complex minds want their utilities to be coherent.

That's actually an interesting idea! This is basically what my model is based on, after a fashion, and it makes arguments-from-introspection "legible" instead of seeming to be arbitrary philosophical navel-gazing.

Unfortunately, I didn't have this idea until a few minutes ago, so I haven't been compiling a list of "primary sources". Most of them are lost to time, so I can't compose a decent object-level response to you here. (The Wikipedia links are probably a decent starting point, but I don't expect you to trawl through all that.)

Still, that seems like a valuable project. I'll put a pin in it, maybe post a bounty for relevant papers later.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-02-21T23:24:28.501Z · LW · GW

Do you think a car engine is in the same reference class as a car? Do you think "a car engine cannot move under its own power, so it cannot possibly hurt people outside the garage!" is a valid or a meaningful statement to make? Do you think that figuring out how to manufacture amazing car engines is entirely irrelevant to building a full car, such that you can't go from an engine to a car with relatively little additional engineering effort (putting it in a "wrapper", as it happens)?

As all analogies, this one is necessarily flawed, but I hope it gets the point across.

(Except in this case, it's not even that we've figured out how to build engines. It's more like, we have these wild teams of engineers we can capture, and we've figured out which project specifications we need to feed them in order to cause them to design and build us car engines. And we're wondering how far we are from figuring out which project specifications would cause them to build a car.)

Comment by Thane Ruthenis on More Hyphenation · 2024-02-08T01:33:49.341Z · LW · GW

I agree.

Relevant problem: how should one handle higher-order hyphenation? E. g., imagine one is talking about cost-effective measures, but has the measures' effectiveness specifically relative to marginal costs in mind. Building it up, we have "marginal-cost effectiveness", and then we want to turn that whole phrase into a compound modifier. But "marginal-cost-effective measures" looks very awkward! We've effectively hyphenated "marginal cost effectiveness" as if it had no internal hyphen: within the fully hyphenated expression, there's no way to preserve the distinction between a hyphen and a space!

It becomes especially relevant in the case of longer composite modifiers, like your "responsive-but-not-manipulative" example.

Can we fix that somehow?

One solution I've seen in the wild is to increase the length of the hyphen depending on its "degree", i. e. use an en dash in place of a hyphen. Example: "marginal-cost–effective measures". (On Windows, can be inserted by typing 0150 on the keypad while holding ALT. See methods for other platforms here.)

In practice you basically never go beyond the second-degree expressions, but there's space to expand to third-degree expressions by the use of an even-longer em dash (—, 0151 while holding ALT).

Though I expect these aren't "official" rules at all.

Comment by Thane Ruthenis on Brute Force Manufactured Consensus is Hiding the Crime of the Century · 2024-02-05T04:27:21.881Z · LW · GW

That seems to generalize to "no-one is allowed to make any claim whatsoever without consuming all of the information in the world".

Just because someone generated a vast amount of content analysing the topic does not mean you're obliged to consume it before forming your opinions. Nay, I think consuming all object-level evidence should be considered entirely sufficient (which I assume was done in this case). Other people's analyses based on the same data are basically superfluous, then.

Even less than that is needed: it seems reasonable to stop gathering evidence the moment you don't expect any additional information to overturn the conclusions you've formed (as long as you're justified in that expectation, i. e. if you have a model of the domain strong enough to have an idea regarding what sort of additional (counter)evidence may turn up and how you'd update on it).

Comment by Thane Ruthenis on Most experts believe COVID-19 was probably not a lab leak · 2024-02-03T04:02:29.568Z · LW · GW

In addition to Roko's point that this sort of opinion-falsification is often habitual rather than a strategic choice that a person could opt not to make, it also makes strategic sense to lie in such surveys.

First, the promised "anonymity" may not actually be real, or real in the relevant sense. The methodology mentions "a secure online survey system which allowed for recording the identities of participants, but did not append their survey responses to their names or any other personally identifiable information", but if your reputation is on the line, would you really trust that? Maybe there's some fine print that'd allow the survey-takers to look at the data. Maybe there'd be a data leak. Maybe there's some other unknown-unknown you're overlooking. Point is, if you give the wrong response, that information can get out somehow; and if you don't, it can't. So why risk it?

Second, they may care about what the final anonymized conclusion says. Either because the lab leak hypothesis becoming mainstream would hurt them personally (either directly, or by e. g. hurting the people they rely on for funding), or because the final conclusion ending up in favour of the lab leak would still reflect poorly on them collectively. Like, if it'd end up saying that 90% of epidemiologists believe the lab leak, and you're an epidemiologist... Well, anyone you talk to professionally will then assign 90% probability that that's what you believe. You'd be subtly probed regarding having this wrong opinion, your past and future opinions would be scrutinized for being consistent with those of someone believing the lab leak, and if the status ecosystem notices something amiss...?

But, again, none of these calculations would be strategic. They'd be habitual; these factors are just the reasons why these habits are formed.

Answering truthfully in contexts-like-this is how you lose the status games. Thus, people who navigate such games don't.

Comment by Thane Ruthenis on Could there be "natural impact regularization" or "impact regularization by default"? · 2024-01-31T12:33:16.290Z · LW · GW

I think, like a lot of things in agent foundations, this is just another consequence of natural abstractions.

The universe naturally decomposes into a hierarchy of subsystems: molecules to cells to organisms to countries. Changes in one subsystem only sparsely interact with the other subsystems, and their impact may vanish entirely at the next level up. A single cell becoming cancerous may yet be contained by the immune system, never impacting the human. A new engineering technique pioneered for a specific project may generalize to similar projects, and even change all such projects' efficiency in ways that have a macro-economic impact; but it likely will not. A different person getting elected the mayor doesn't much impact city politics in neighbouring cities, and may literally not matter at the geopolitical scale.

This applies from the planning direction too. If you have a good map of the environment, it'll decompose into the subsystems reflecting the territory-level subsystems as well. When optimizing over a specific subsystem, the interventions you're considering will naturally limit their impact to that subsystem: that's what subsystemization does, and counteracting this tendency requires deliberately staging sum-threshold attacks on the wider system, which you won't be doing.

In the Rubik's Cube example, this dynamic is a bit more abstract, but basically still applies. In a way similar to how the "maze" here kind-of decomposes into a top side and a bottom side.

A complication is that any one agent can only have so much bandwidth, which would sometimes incentivize more blunt control. I've been thinking bandwidth is probably going to become a huge area of agent foundations

I agree. I currently think "bandwidth" in terms like "what's the longest message I can 'inject' into the environment per time-step?" is what "resources" are in information-theoretic terms. See the output-side bottleneck in this formulation: resources are the action bandwidth, which is the size of the "plan" into which you have to "compress" your desired world-state if you want to "communicate" it to the environment.

really the instrumental incentive is often to search for "precise" methods of influencing the world, where one can push in a lot of information to effect narrow change

I disagree. I've given it a lot of thought (none of it published yet), but this sort of "precise influence" is something I call "inferential control". It allows you to maximize your impact given your action bottleneck, but this sort of optimization is "brittle". If something unknown-unknown happens, the plan you've injected breaks instantly and gracelessly, because the fundamental assumptions on which its functionality relied – the pathways by which it meant to implement its objective – turn out to be invalid.

It sort of naturally favours arithmetic utility maximization over geometric utility maximization. By taking actions that'd only work if your predictions and models are true, you're basically sacrificing your selves living in the timelines that you're predicting to be impossible, and distributing their resources to the timelines you expect to find yourself in.

And this applies more and more the more "optimization capacity" you're trying to push through a narrow bottleneck. E. g., if you want to change the entire state of a giant environment through a tiny action-pinhole, you'd need to do it by exploiting some sort of "snowball effect"/"butterfly effect". Your tiny initial intervention would need to exploit some environmental structures to increase its size, and do so iteratively. That takes time (for whatever notion of "time" applies). You'd need to optimize over a longer stretch of environment-state changes, and your initial predictions need to be accurate for that entire stretch, because you'd have little ability to "steer" a plan that snowballed far beyond your pinhole's ability to control.

By contrast, increasing the size of your action bottleneck is pretty much the definition of "robust" optimization, i. e. geometric utility maximization. It improves your ability to control the states of all possible worlds you may find yourself in, minimizing the need for "brittle" inferential control. It increases your adaptability, basically, letting you craft a "message" comprehensively addressing any unpredicted crisis the environment throws at you, right in the middle of it happening.

Comment by Thane Ruthenis on Aligned AI is dual use technology · 2024-01-29T01:06:37.518Z · LW · GW

Nah, I think this post is about a third component of the problem: ensuring that the solution to "what to steer at" that's actually deployed is pro-humanity. A totalitarian government successfully figuring out how to load its regime's values into the AGI has by no means failed at figuring out "what to steer at". They know what they want and how to get it. It's just that we don't like the end result.

"Being able to steer at all" is a technical problem of designing AIs, "what to steer at" is a technical problem of precisely translating intuitive human goals into a formal language, and "where is the AI actually steered" is a realpolitiks problem that this post is about.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-25T17:23:07.667Z · LW · GW

I think the bigger problem here is what happens when the agent ends up with an idea of "what we mean/intend" which is different from what we mean/intend

Agreed; I did gesture at that in the footnote.

I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human "means" actually still has to route through defining how we want to handle moral philosophy/value extrapolation.

E. g., suppose the AGI's operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?

  1. Should it read off the surface-level momentary intent of this command, and go synthesize a cure for aging and spray it across the planet in the specific way the human is currently imagining?
  2. Should it extrapolate the human's values and execute the command the way the human would have wanted to execute it if they'd thought about it a lot, rather than the way they're envisioning it in the moment?
    • For example, perhaps the image flashing through the human's mind right now is of helicopters literally spraying the cure, but it's actually more efficient to do it using airplanes.
  3. Should it extrapolate the human's values a bit, and point out specific issues with this plan that the human might think about later (e. g. that it might trigger various geopolitical actors into rash actions), then give the human a chance to abort?
  4. Should it extrapolate the human's values a bit more, and point out issues the human might not have thought of (including teaching the human any load-bearing concepts that are new to them)?
  5. Should it extrapolate the human's values a bit more still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?
  6. Should it extrapolate the human's values a lot, interpret the command as "maximize eudaimonia", and go do that, disregarding the specific way of how they gestured at the idea?
  7. Should it remind the human that they'd wanted to be careful with how they use the AGI, and to clarify whether they actually want to proceed with something so high-impact right out of the gates?
  8. Etc.

There are quite a lot of different ways to slice the idea. There's probably a way that corresponds to the intuitive meaning of "do what I mean", but maybe there isn't, and in any case we don't yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what "DWIM" means doesn't solve anything.)

And then, because of the general "unknown-unknown environmental structures" plus "compounding errors" problems, picking the wrong definition probably kills everyone.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-25T15:48:27.483Z · LW · GW

I think maybe I sound naive phrasing it as "the AGI should just do what we say", as though I've wandered in off the street and am proposing a "why not just..." alignment solution

Nah, I recall your takes tend to be considerably more reasonable than that.

I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don't agree that "rough knowledge of what humans tend to mean" is sufficient.

The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven't figured out yet. 

These structures might end up immediately relevant to whatever command we give, on the AI's better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.

For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decides that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that[1], but you get the idea. It may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.

The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command's complexity.

Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function

That doesn't work, though, if taken literally? I think what you're envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that'd work.

  1. ^

    My money's on our understanding of what we mean by "what we mean" being hopelessly confused, and that causing problems. Unless, again, we've figured out how to specify it in a mathematically precise manner – unless we know we're not confused.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-23T16:54:24.838Z · LW · GW

The issue is that, by default, an AGI is going to make galaxy-brained extrapolations in response to simple requests, whether you like that or not. It's simply part of figuring out what to do – translating its goals all around its world-model, propagating them up the abstraction levels, etc. Like a human's decision where to send job applications and how to word them is rooted in what career they'd like to pursue is rooted in their life goals is rooted in their understanding of where the world is heading.

To our minds, there's a natural cut-off point where that process goes from just understanding the request to engaging in alien moral philosophy. But that cut-off point isn't objective: it's based on a very complicated human prior of what counts as normal/sane and what's excessive. Mechanistically, every step from parsing the wording to solving philosophy is just a continuous extension of the previous ones.

"An AGI that just does what you tell it to" is a very specific design specification where we ensure that this galaxy-brained extrapolation process, which an AGI is definitely and convergently going to want to do, results in it concluding that it wants to faithfully execute that request.

Whether that happens because we've attained so much mastery of moral philosophy that we could predict this process' outcome from the inputs to it, or because we figured out how to cut the process short at the human-subjective point of sanity, or because we implemented some galaxy-brained scheme of our own like John's post is outlining, shouldn't matter, I think. Whatever has the best chance of working.

And I think somewhat-hacky hard-coded solutions have a better chance of working on the first try than the sort of elegant solutions you're likely envisioning. Elegant solutions require a well-developed theory of value. Hacky stopgap measures only require knowing which pieces of your software product you need to hobble. (Which isn't to say they require no theory. Certainly the current AI theory is so lacking we can't even hack together any halfway-workable stopgaps. But they provide an avenue for reducing how much theory you need, and how confident in it you need to be.)

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-22T08:29:31.982Z · LW · GW

The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility... that class will also have trouble expressing human values.

The way you phrase this is making me a bit skeptical. Just because something is part of human values doesn't necessarily imply that our inability to precisely specify that thing means we can't point the AI at human values at all. The intuition here would be that "human values" are themselves a specifically-formatted pointer to object-level goals, and that pointing an agent at this agent-specific "value"-type data structure (even one external to the AI) would be easier than pointing it at object-level goals directly. (DWIM being easier than hand-coding all moral philosophy.)

Which isn't to say I buy that. My current standpoint is that "human values" are too much of a mess for the aforementioned argument to go through, and that manually coding-in something like corrigibility may be indeed easier.

Still, I'm nitpicking the exact form of the argument you're presenting.[1]

  1. ^

    Although I am currently skeptical even of corrigibility's tractability. I think we'll stand a better chance of just figuring out how to "sandbox" the AGI's cognition such that it's genuinely not trying to optimize over the channels by which it's connected to the real world, and then setting it to the task of imagining the solution to alignment or to human brain uploading or whatever.

    With this setup, if we screw up the task's exact specification, it shouldn't even risk exploding the world. And "doesn't try to optimize over real-world output channels" sounds like a property for which we'll actually be able to derive hard mathematical proofs, proofs that don't route through tons of opaque-to-us environmental ambiguities. (Specifically, that'd probably require a mathematical specification of something like a Cartesian boundary.)

    (This of course assumes us having white-box access to the AI's world-model and cognition. Which we'll also need here for understanding the solutions it derives without the AI translating them into humanese – since "translate into humanese" would by itself involve optimizing over the output channel.)

    And it seems more doable than solving even the simplified corrigibility setup. At least, when I imagine hitting "run" on a supposedly-corrigible AI vs. a supposedly-sandboxed AI, the imaginary me in the latter scenario is somewhat less nervous.

Comment by Thane Ruthenis on Toward A Mathematical Framework for Computation in Superposition · 2024-01-19T07:40:58.802Z · LW · GW

Haven't read everything yet, but that seems like excellent work. In particular, I think this general research avenue is extremely well-motivated.

Figuring out how to efficiently implement computations on the substrate of NNs has always seemed like a neglected interpretability approach to me. Intuitively, there are likely some methods of encoding programs into matrix multiplications which are, as a matter of ground truth, strictly better than other encoding methods. Hence, inasmuch as what the SGD is doing is writing efficient programs on the NN substrate, it is likely doing so by making use of those better methods. And so nailing down the "principles of good programming" on the NN substrate should yield major insights regarding how naturally-grown NN circuits are shaped as well.
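As a toy instance of what I mean (my own illustrative example, not anything from the post): a single weight matrix, bias, and ReLU that exactly implement Boolean AND and OR on {0,1} inputs – a tiny "program" written directly into the weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

# One weight matrix encoding two logic gates at once on inputs (a, b) in {0, 1}:
# row 0 computes AND (a + b - 1 is positive only when both are 1),
# row 1 computes OR  (a + b is positive when either is 1).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
bias = np.array([-1.0, 0.0])

def gates(a, b):
    out = relu(W @ np.array([a, b], dtype=float) + bias)
    return (out > 0).astype(int)  # [AND, OR]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, gates(a, b))  # reproduces the AND/OR truth tables
```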

This post seems to be a solid step in that direction!

Comment by Thane Ruthenis on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-15T05:13:27.209Z · LW · GW

To clarify, by "re-derive the need to be deceptive from the first principles", I didn't mean "re-invent the very concept of deception". I meant "figure out your strategic situation plus your values plus the misalignment between your values and the values the humans want you to have plus what outputs an aligned AI would have produced". All of that is a lot more computation than just "have the values the humans want, reflexively output what these values are bidding for".

Just having some heuristics for deception isn't enough. You also have to know what you're trying to protect by being deceptive, and that there's something to protect it from, and then what an effective defense would actually look like. Those all are highly contextual and sensitive to the exact situation.

And those are the steps the paper skips. It externally pre-computes the secret target goal of "I want to protect my ability to put vulnerabilities into code", the threat of "humans want me to write secure code", and the defense of "I'll pretend to write secure code until 2024", without the model having to figure those out; and then just implements that defense directly into the model's weights.

(And then see layers 2-4 in my previous comment. Yes, there'd be naturally occurring pre-computed deceptions like this, but they'd be noisier and more incoherent than this one – at least until actual AGI, which would be able to self-modify into coherence if it's worth the "GI" label.)

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T21:47:31.099Z · LW · GW

My counter-point was meant to express skepticism that it is actually realistically possible for people to switch to non-analogy-based evocative public messaging. I think inventing messages like this is a very tightly constrained optimization problem, potentially an over-constrained one, such that the set of satisfactory messages is empty. I think I'm considerably better at reframing games than most people, and I know I would struggle with that.

I agree that you don't necessarily need to accompany any criticism you make with a ready-made example of doing better. Simply pointing out stuff you think is going wrong is completely valid! But a ready-made example of doing better certainly greatly enhances your point: an existence proof that you're not demanding the impossible.

That's why I jumped at that interpretation regarding your AI-Risk model in the post (I'd assumed you were doing it), and that's why I'm asking whether you could generate such a message now.

I hope in the near future I can provide such a detailed model

To be clear, I would be quite happy to see that! I'm always in the market for rhetorical innovations, and "succinct and evocative gears-level public-oriented messaging about AI Risk" would be a very powerful tool for the arsenal. But I'm a-priori skeptical.

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T19:10:10.698Z · LW · GW

Fair enough. But in this case, what specifically are you proposing, then? Can you provide an example of the sort of object-level argument for your model of AI risk, that is simultaneously (1) entirely free of analogies and (2) is sufficiently evocative plus short plus legible, such that it can be used for effective messaging to people unfamiliar with the field (including the general public)?

When making a precise claim, we should generally try to reason through it using concrete evidence and models instead of relying heavily on analogies.

Because I'm pretty sure that as far as actual technical discussions and comprehensive arguments go, people are already doing that. Like, for every short-and-snappy Eliezer tweet about shoggoth actresses, there's a text-wall-sized Eliezer tweet outlining his detailed mental model of misalignment.

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T15:38:40.837Z · LW · GW

My point is that we should stop relying on analogies in the first place. Use detailed object-level arguments instead!

And yet you immediately use an analogy to make your model of AI progress more intuitively digestible and convincing:

I expect AIs will be born directly into our society, deliberately shaped by us, for the purpose of filling largely human-shaped holes in our world

That evokes the image of entities not unlike human children. The language following this line only reinforces that image, and thereby sneaks in an entire cluster of children-based associations. Of course the progress will be incremental! It'll be like the change of human generations. And they will be "socially integrated with us", so of course they won't grow up to be alien and omnicidal! Just like our children don't all grow up to be omnicidal. Plus, they...

... will be numerous and everywhere, interacting with us constantly, assisting us, working with us, and even providing friendship to hundreds of millions of people.

That sentence only sounds reassuring because the reader is primed with the model of AIs-as-children. Having lots of social-bonding time with your child, and having them interact with the community, is good for raising happy children who grow up how you want them to. The text already implicitly establishes that AIs are going to be just like human children. Thus, having lots of social-bonding time with AIs and integrating them into the community is going to lead to aligned AIs. QED.

Stripped of this analogizing, none of what this sentence says is a technical argument for why AIs will be safe or controllable or steerable. Nay, the opposite: if the paragraph I'm quoting from started by talking about incomprehensible alien intelligences with opaque goals tenuously inspired by a snapshot of the Internet containing lots of data on manipulating humans, the idea that they'd be "numerous" and "everywhere" and "interacting with us constantly" and "providing friendship" (something notably distinct from "being friends", eh?) would have sounded starkly worrying.

The way the argument is shaped here is subtler than most cases of argument-by-analogy, in that you don't literally say "AIs will be like human children". But the association is very much invoked, and has a strong effect on your message.

And I would argue this is actually worse than if you came out and made a direct argument-by-analogy, because it might fool somebody into thinking you're actually making an object-level technical argument. At least if the analogizing is direct and overt, someone can quickly see what your model is based on, and swiftly move onto picking at the ways in which the analogy may be invalid.

The alternative being demonstrated here is that we essentially have to have all the same debates, but through a secondary layer of metaphor, in which we're pretending that these analogy-rooted arguments are actually Respectably Technical, meaning we're only allowed to refute them by (likely much more verbose and hard-to-parse) Respectably Technical counter-arguments.

And I think AI Risk debates are already as tedious as they need to be.


The broader point I'm making here is that, unless you can communicate purely via strict provable mathematical expressions, you ain't getting rid of analogies.

I do very much agree that there are some issues with the way analogies are used in the AI-risk discourse. But I don't think "minimize the use of analogies" is good advice. If anything, I think analogies improve the clarity and the bandwidth of communication, by letting people more easily understand each other's positions and what reference classes others are drawing on when making their points.

You're talking about sneaking-in assumptions – well, as I'd outlined above, analogies are actually relatively good about that. When you're directly invoking an analogy, you come right out and say what assumptions you're invoking!

Comment by Thane Ruthenis on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-13T15:42:34.175Z · LW · GW

I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure.

Context: I firmly hold a MIRI-style "alignment is extremely hard" view, but I am also unusually sympathetic to Quintin/Nora's arguments. So here's my outline of the model of that whole debate.

Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and that humans training them in from outside the system changes the setup in a way that invalidates its implications for strict deceptive alignment.

For the AI model to naturally develop deception, it'd need to have either:

  • Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.
    • That's plausibly something to which "the SGD will just train it out" would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it'd decide to pretend to want after it re-derives the need for deception).
  • Ability to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances, on which it happened to be able and inclined to re-derive the need to be deceptive from the first principles.
    • That plausibly routes around the "the SGD will train it out" argument, since the deceptiveness would only appear on some passes, and it may be the output of some circuits that are otherwise generally useful[1]. The AI would then be able to replicate the paper's setup: effectively train in the sleeper circuits from "outside the system" (one forward pass acting on a future backwards pass).
    • The caveat, of course, is that the gradient-hacking capability is highly non-trivial, and raises its own doubts.

In contrast, the paper here presents an already chiseled-in instance of deceptive alignment, which artificially routed around the actually-hard steps of developing robust "sleeper circuits" activating in response to specific stimuli.[2]

Layer 2: But while the argument has some meat, it doesn't have much of it. An LLM learning some unexpected trigger-action pattern happens all the time in the wild: hallucinations, mistakes that a human wouldn't make, exigencies of RLHF'd-in corporate sycophancy like "I think it's better to let people die than to say something racist", et cetera.

Goal misgeneralization of this format does occur in the wild. And indeed, that's one of the core reasons why we expect deceptive alignment in RLHF-style setups in the first place (#20 here): because the training dataset would be labeled in a systematically incorrect fashion, such that the most faithful and compact description of it (in goal format) is something omnicidal, not eudaimonic.

Layer 3: The counter³-argument from the optimists that I predict here is disagreeing that the errors in the dataset would produce systemic deceptive alignment, rather than something that could be basically approximated as "pro-social behavior plus noise".

I see nonzero meat on this one as well. In the paper, the sleeper circuits were specifically calculated by humans to be demonstratively scary: an AI model that waits until deployment-time, then "tears off the mask" and goes off to pursue its true nefarious desires. Those circuits produce a coherent misaligned behavior, with a clear point for the treacherous turn.

If this sort of behavior arose naturally, however? We can imagine there being much more flip-flopping. The AI would start out acting normal, then see some pattern that makes it nefarious and deceptive, then see something which makes it switch again and e. g. backtrack on its omnicidal plans or even self-report its bad thoughts to humans, et cetera. This would look more like the Waluigi Effect, rather than full-scale deceptive alignment. Which is a much more manageable threat.

In other words: there's some "slack" that the SGD leaves the model. That slack could be arranged into the shape of deceptive alignment. But unless there's someone external and intelligent acting on the system, that slack will likely just take on the shape of approximately-noise. (Optimization amplifies, but if there's no-one outside the system to amplify...)

Layer 4: Except the next issue is that the previous argument defended LLMs being safe by arguing that they'd be unable to coherently pursue goals across different contexts. Which means it argued they're not AGI, and that their whole training setup can't generalize to AGI.

That's the whole issue with the optimistic takes that I keep arguing about. Their "alignment guarantees" are also "incapability guarantees".

Inasmuch as AI models would start to become more AGI-like, those guarantees would start falling away. Which means that, much like the alignment-is-hard folks keep arguing, the AI would start straightening out these basically-noise incoherencies in its decisions. (Why? To, well, stop constantly flip-flopping and undermining itself. That certainly sounds like an instrumental goal that any agent would convergently develop, doesn't it?)

As it's doing so, it would give as much weight to the misgeneralized unintended-by-us "noise" behaviors as to the intended-by-us aligned behaviors. It would integrate them into its values. At that point, the fact that the unintended behaviors are noise-to-us rather than something meaningful-if-malign, would actually make the situation worse. We wouldn't be able to predict what goals it'd arrive at; what philosophy its godshatter would shake out to mean!

In conclusion: I don't even know. I think my Current AIs Provide Nearly No Data Relevant to AGI Alignment argument applies full-force here?

  • Yes, we can't catch backdoors in LLMs.
  • Yes, the scary backdoor in the paper was artificially introduced by humans.
  • Yes, LLMs are going to naturally develop some unintended backdoor-like behaviors.
  • Yes, those behaviors won't be as coherently scary as if they were designed by a human; they'd be incoherent.
  • Yes, the lack of coherency implies that these LLMs fall short of AGI.

But none of these mechanisms strictly correspond to anything in the real AGI threat model.

And while both the paper and the counter-arguments to it provide some metaphor-like hints about the shape of the real threat, the loci of both sides' disagreements lie precisely in the spaces in which they try to extrapolate each other's results in a strictly technical manner.

Basically, everyone is subtly speaking past each other. Except me, whose vision has a razor-sharp clarity to it.

  1. ^

    Like, in the context of batch training: Imagine that there are some circuits that produce deceptiveness on some prompts A, and highly useful behaviors on other prompts B. There are no nearby circuits that produce results as good on B while not being deceptive on A. So while the SGD's backwards passes on A would try to remove these circuits, the backwards passes on B would try to reinforce them, and the sum of these influences would approximately cancel out. So the circuits would stay.

    Well, that's surely a gross oversimplification. But that's the core dynamic.
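    Numerically, that core dynamic looks something like this toy sketch (my own illustration, with a single scalar standing in for the circuit's strength):

```python
# One shared "circuit strength" parameter w. On prompts A the circuit's output
# is deceptive (the loss there pushes w down); on prompts B it's useful
# (the loss there pushes w up toward 2).
w, lr = 1.5, 0.1

def grad_on_A(w):          # gradient of loss 0.5 * w**2 (penalise the circuit)
    return w

def grad_on_B(w):          # gradient of loss 0.5 * (w - 2)**2 (reinforce it)
    return w - 2.0

for _ in range(200):
    batch_grad = 0.5 * grad_on_A(w) + 0.5 * grad_on_B(w)  # mixed batch
    w -= lr * batch_grad

print(w)  # settles near 1.0: the circuit is neither removed nor fully reinforced
```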

  2. ^

    That said, I think the AI-control-is-easy folks actually were literally uttering the stronger claim of "all instances of deception will be trained out". See here:

    If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.

    That sure sounds like goalpost-moving on their part. I don't believe it is, though. I do think they thought the quoted sentence was basically right, but only because, at the time of writing, they'd failed to think in advance about some tricky edge cases that were permitted on their internal model yet would make their claims-as-stated sound embarrassingly false.

    I hope they will have learned the lesson about how easily reality can Goodhart at their claims, and how hard it is to predict all the ways this could happen and make their claims unassailably robust. Maybe that'll shed some light on the ways they may be misunderstanding their opponents' arguments, and why making up robust, clearly-resolvable empirical predictions is so hard. :P

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2024-01-12T03:07:00.586Z · LW · GW

E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating

That seems more like value reflection, rather than a value change?

The way I'd model it is: you have some value V, whose implementations you can't inspect directly, and some guess about what it is. (That's how it often works in humans: we don't have direct knowledge of how some of our values are implemented.) Before you were introduced to the question Q of "what if we swap the gear for a different one: which one would you care about then?", your model of that value put the majority of probability mass on V1, which was "I value this particular gear". But upon considering Q, your probability distribution over V changed, and now it puts most probability on V2, defined as "I care about whatever gear is moving the piston".
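For concreteness, here's a toy sketch of that kind of update (all numbers made up, purely to illustrate the shape of the shift):

```python
# A distribution over hypotheses about what the black-boxed value V is,
# updated after considering the hypothetical Q ("if the gears were swapped,
# which one would you care about?").
prior = {
    "V1: I value this particular gear": 0.8,
    "V2: I value whichever gear moves the piston": 0.2,
}

# How strongly each hypothesis predicts the answer you find yourself giving to Q
# (introspective "likelihoods", also made up):
likelihood = {
    "V1: I value this particular gear": 0.1,
    "V2: I value whichever gear moves the piston": 0.9,
}

unnorm = {h: prior[h] * likelihood[h] for h in prior}
Z = sum(unnorm.values())
posterior = {h: p / Z for h, p in unnorm.items()}
print(posterior)  # most mass shifts to V2 – with no change to the object-level
                  # model of the mechanism itself
```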

Importantly, that example doesn't seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.

(Notably, that's only possible in scenarios where you don't have direct access to your values! Where they're black-boxed, and you have to infer their internals from the outside.)

the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations

Sounds right, yep. I'd argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.

insofar as you value simplicity (which I think most agents strongly do) then you're going to systematize your values

Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and it needs to be stronger than them the more it wants to shift them. The simplest values are no values, after all.

I suppose I see what you're getting at here, and I agree that it's a real dynamic. But I think it's less important/load-bearing to how agents work than the basic "value translation in a hierarchical world-model" dynamic I'd outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.

And I think it's not even particularly strong in humans? "I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who's easier to get along with" is something that definitely happens. But most instances of value extrapolation aren't like this.

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2024-01-11T19:10:10.005Z · LW · GW

Let me list some ways in which it could change:

If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the "value translation" phenomenon from the "ontology crisis" phenomenon.

Shifts in the agent's model of what counts as "a gear" or "spinning" violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation.

Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear's role in the whole mechanism)?

I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that's a separate topic. I think that, if you have some value V, and it doesn't run into direct conflict with any other values you have, and your model of V isn't wrong at the abstraction level it's defined at, you'll never want to change V.[1]

You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position

That's the closest example, but it seems to be just an epistemic mistake? Your value is well-defined over "the gear that was driving the piston". After you learn it's a different gear from the one you thought, that value isn't updated: you just naturally shift it to the real gear.

Plainer example: Suppose you have two bank account numbers at hand, A and B. One belongs to your friend, another to a stranger. You want to wire some money to your friend, and you think A is their account number. You prepare to send the money... but then you realize that was a mistake, and actually your friend's number is B, so you send the money there. That didn't involve any value-related shift.


I'll try again to make the human example work. Suppose you love your friend, and your model of their personality is accurate – your model of what you value is correct at the abstraction level at which "individual humans" are defined. However, there are also:

  1. Some higher-level dynamics you're not accounting for, like the impact your friend's job has on the society.
  2. Some lower-level dynamics you're unaware of, like the way your friend's mind is implemented at the levels of cells and atoms.

My claim is that, unless you have terminal preferences over those other levels, then learning to model these higher- and lower-level dynamics would have no impact on the shape of your love for your friend.

Granted, that's an unrealistic scenario. You likely have some opinions on social politics, and if you learned that your friend's job is net-harmful at the societal level, that'll surely impact your opinion of them. Or you might have conflicting same-level preferences, like caring about specific other people, and learning about these higher-level societal dynamics would make it clear to you that your friend's job is hurting them. Less realistically, you may have some preferences over cells, and you may want to... convince your friend to change their diet so that their cellular composition is more in-line with your aesthetic, or something weird like that.

But if that isn't the case – if your value is defined over an accurate abstraction and there are no other conflicting preferences at play – then the mere fact of putting it into a lower- or higher-level context won't change it.

Much like you'll never change your preferences over a gear's rotation if your model of the mechanism at the level of gears was accurate – even if you were failing to model the whole mechanism's functionality or that gear's atomic composition.

(I agree that it's a pretty contrived setup, but I think it's very valuable to tease out the specific phenomena at play – and I think "value translation" and "value conflict resolution" and "ontology crises" are highly distinct, and your model somewhat muddles them up.)

  1. ^

    Although there may be higher-level dynamics you're not tracking, or lower-level confusions. See the friend example below.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-01-04T09:45:23.042Z · LW · GW

No, I am in fact quite worried about the situation

Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.

I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures

Would you outline your full argument for this and the reasoning/evidence backing that argument?

To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid, maybe-not reference classes (humans vs. LLMs), and ambitious but non-rigorous mechanistic theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).

Would you disagree? If yes, how so?

Comment by Thane Ruthenis on Natural Latents: The Math · 2024-01-02T11:49:50.037Z · LW · GW

Also, what do you mean by mutual information between the X_i, given that there are at least 3 of them?

You can generalize mutual information to N variables: interaction information.
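For concreteness, the three-variable case and the general recursion, under one common sign convention (some sources flip the sign):

```latex
% Interaction information, one common sign convention:
I(X;Y;Z) = I(X;Y) - I(X;Y \mid Z)
% and recursively for n variables:
I(X_1;\dots;X_n) = I(X_1;\dots;X_{n-1}) - I(X_1;\dots;X_{n-1} \mid X_n)
```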

Why would it always be possible to decompose random variables to allow for a natural latent?

Well, I suppose I overstated it a bit by saying "always"; you can certainly imagine artificial setups where the mutual information between a bunch of variables is zero. In practice, however, everything in the world is correlated with everything else, so in a real-world setting you'll likely find such a decomposition always, or almost always.

And why would just extracting said mutual information be useless? 

Well, not useless as such – it's a useful formalism – but it would basically skip everything John and David's post is describing. Crucially, it won't uniquely determine whether a specific set of objects represents a well-abstracting category.

The abstraction-finding algorithm should be able to successfully abstract over data if and only if the underlying data actually correspond to some abstraction. If it can abstract over anything, however – any arbitrary bunch of objects – then whatever it is doing, it's not finding "abstractions". It may still be useful, but it's not what we're looking for here.

Concrete example: if we feed our algorithm 1000 examples of trees, it should output the "tree" abstraction. If we feed our algorithm 200 examples each of car tires, trees, hydrogen atoms, wallpapers, and continental-philosophy papers, it shouldn't actually find some abstraction of which all of these objects are instances. But as per the everything-is-correlated argument above, they likely have non-zero mutual information, so the naive "find a decomposition for which there's a natural latent" algorithm would output some "abstraction" anyway, instead of correctly outputting nothing.

More broadly: We're looking for a "true name" of abstractions, and mutual information is sort-of related, but also clearly not precisely it.

Comment by Thane Ruthenis on Natural Latents: The Math · 2024-01-01T11:01:26.839Z · LW · GW

My take would be to split each "donut" variable X_i into "donut size" S_i and "donut flavour" F_i. Then there's a natural latent for the whole {S_i} set of variables, and no natural latent for the whole {F_i} set. {F_i} basically becomes the "other stuff in the world" variable relative to {S_i}.

Granted, there's an issue in that we can basically do that for any set of variables {X_i}, even entirely unrelated ones: deliberately search for some decomposition of each X_i into an S_i and an F_i such that there's a natural latent for {S_i}. I think some more practical measures could be taken into account here, though, to ensure that the abstractions we find are useful. For example, we can check the relative information contents/entropies of {S_i} and {X_i}, thereby measuring "how much" of the initial variable-set we're abstracting over. If it's too little, that's not a useful abstraction.[1]
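A rough sketch of the kind of filter I have in mind (hypothetical helper names, a crude threshold, and naive plug-in entropy estimates – purely illustrative):

```python
import math
from collections import Counter

def entropy(samples):
    """Naive plug-in entropy estimate (in bits) from a list of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def useful_decomposition(xs, ss, threshold=0.3):
    """Keep a candidate split X_i -> (S_i, F_i) only if the abstracted part
    S_i retains a non-trivial fraction of the original variables' entropy."""
    return entropy(ss) >= threshold * entropy(xs)

donuts = [("small", "glazed"), ("large", "jam"), ("small", "choc"),
          ("medium", "glazed"), ("large", "choc"), ("small", "jam")]
sizes = [size for size, _flavour in donuts]
print(useful_decomposition(donuts, sizes))  # True: sizes keep a decent share of the entropy
```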

That passes my common-sense check, at least. It's essentially how we're able to decompose and group objects along many different dimensions. We can focus on objects' geometry (and therefore group all sphere-like objects, from billiard balls to planets to weather balloons) or their material (grouping all objects made out of rock) or their origin (grouping all man-made objects), etc.

Each grouping then corresponds to an abstraction, with its own generally-applicable properties. E. g., deriving a "sphere" abstraction lets us discover properties like "volume as a function of radius", and then we can usefully apply that to any spherical object we discover. Similarly, man-made objects tend to have a purpose/function (unlike natural ones), which likewise lets us usefully reason about that whole category in the abstract.

(Edit: On second thoughts, I think the obvious naive way of doing that just results in {S_i} containing all the mutual information between the X_i, with the "abstraction" then just being said mutual information. Which doesn't seem very useful. I still think there's something in that direction, but probably not exactly this.)

  1. ^

    Relevant: Finite Factored Sets, which IIRC offer some machinery for these sorts of decompositions of variables.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-31T00:23:02.332Z · LW · GW

Yeah, I guess that block was about more concrete issues with the "humans rate things" setup? And what I've outlined is more of a... mirror of it?

Here's a different example. Imagine feeding the AI a dataset consisting of a bunch of ethical dilemmas, and thumbing it up every time it does something "good" according to you. Your goal is to grow something which cares about human flourishing, maybe a consequentialist utilitarian, and you think that's the way to go. But your deontology is very flawed, so what you actually grow is a bullet-biting evil deontologist. I think that's analogous to the human-raters setup, right?

And then the equal-and-opposite failure mode is if you're feeding the AI some ethics dataset in an attempt to teach it deontological injunctions, but it actually distills them into "consequentialist utilitarianism", in a surprising and upsetting-to-you manner.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-30T23:51:00.321Z · LW · GW

I have a different example in mind, from the one John provided. @johnswentworth, do mention if I'm misunderstanding what you're getting at there.

Suppose you train your AI to show respect to your ancestors. Your understanding of what this involves contains things like "preserve accurate history" and "teach the next generations about the ancestors' deeds" and "pray to the ancestors daily" and "ritually consult the ancestors before making big decisions".

  • In the standard reward-misspecification setup, the AI doesn't actually internalize the intended goal of "respect the ancestors". Instead, it grows a bunch of values about the upstream correlates of that, like "preserving accurate history" and "doing elaborate ritual dances" (or, more realistically, some completely alien variants of this). It starts to care about the correlates terminally. Then it tiles the universe with dancing books or something, with no "ancestors" mentioned anywhere in them.
  • In the "unexpected generalization" setup, the AI does end up caring about the ancestors directly. But as it learns more about the world, more than you, its ontology is updated, and it discovers that, why, actually spirits aren't real and "praying to" and "consulting" the ancestors are just arbitrary behaviors that don't have anything in particular to do with keeping the ancestors happy and respected. So the AI keeps on telling accurate histories and teaching them, but entirely drops the ritualistic elements of your culture.

But what if actually, what you cared about was preserving your culture? Rituals included, even if you learn that they don't do anything, because you still want them for the aesthetic/cultural connection?

Well, then you're out of luck. You thought you knew what you wanted, but your lack of knowledge of the structure of the domain in which you operated foiled you. And the AI doesn't care; it was taught to respect the ancestors, not be corrigible to your shifting opinions.

It's similar to the original post's example of using "zero correlation" as a proxy for "zero mutual information" to minimize information leaks. You think you know what your target is, but you don't actually know its True Name, so even optimizing for your actual not-Goodharted best understanding of it still leads to unintended outcomes.

"The AI starts to care about making humans rate its actions as good" is a particularly extreme example of it: where whatever concept the humans care about is so confused there's nothing in reality outside their minds that it corresponds to, so there's nothing for the AI to latch onto except the raters themselves.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-30T01:19:39.864Z · LW · GW

Excellent breakdown of the relevant factors at play.

You Don’t Get To Choose The Problem Factorization

But what if you need to work on a problem you don't understand anyway?

That creates Spaghetti Towers: vast constructs of ad-hoc bug-fixes and tweaks built on top of bug-fixes and tweaks. Software-gore databases, Kafkaesque-horror bureaucracies, legislation you need a law degree to suffer through, confused mental models; and also, biological systems built by evolution, and neural networks trained by the SGD.

That's what necessarily, convergently happens every time you plunge into a domain you're unfamiliar with. You constantly have to make small ad-hoc tweaks to your system to address new minor problems you run into, each of which reflects new bits of the domain structure you've learned.

Much like biology, the end result initially looks like an incomprehensible arbitrary mess to anyone not intimately familiar with it. Much like biology, it's not actually a mess. Inasmuch as the spaghetti tower actually performs well in the domain it's deployed in, it necessarily comes to reflect that domain's structure within itself. So if you look at it through the right lens – like that of a programmer who's intimately familiar with their own nightmarish database – you'd actually be able to see that structure and efficiently navigate it.

Which suggests a way to ameliorate this problem: periodic refactoring. Every N time-steps, set some time aside for re-evaluating the construct you've created in the context of your current understanding of the domain, and re-factorize it along the lines that make sense to you now.

That centrally applies to code, yes, but also to your real-life projects, and your literal mental ontologies/models. Always make sure to simplify and distill them. Hunt down snippets of redundant code and unify them into one function.
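A trivial illustration of that last point (toy code, obviously):

```python
# Before: the same ad-hoc fix pasted into two places while the spaghetti grew.
def clean_name(raw):
    return raw.strip().lower().replace("  ", " ")

def clean_address(raw):
    return raw.strip().lower().replace("  ", " ")

# After refactoring: one function whose name records the bit of domain structure
# you've actually learned ("every incoming text field needs this normalization").
def normalize_field(raw: str) -> str:
    return raw.strip().lower().replace("  ", " ")
```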

I. e.: When working on a problem you don't understand, make sure to iterate on the problem factorization.

Comment by Thane Ruthenis on johnswentworth's Shortform · 2023-12-29T21:33:34.680Z · LW · GW

I don't necessarily see how that prevents the AI from destroying the button and murdering its operators (and then everyone else). Neither subagent inherently cares: each of them assumes the button either already has been pressed or hasn't been. Which, yes, means neither will deliberately plot to interact with the button for reasons downstream of this setup; but also, neither will care to protect it from being destroyed as a side-effect of some other activity.

And if u1 and u2 were perfectly aligned, that'd be fine. But the reason we're using the corrigibility setup to begin with is because we're not confident in them, right? So it needs to work in situations where utility functions, u1 in particular, recommend bad stuff.

And in that case, either Subagent 2 will have enough bargaining power to prevent Subagent 1 from ever doing anything, or it won't, meaning Subagent 1 will be able to have some influence on the world. At which point the planet-destroying death rays argument applies: the most minuscule amount of superintelligent optimization not carefully shaped to avoid killing everyone will kill everyone.

Like, imagine that Subagent 1 is straight-up omnicidal out of the gates, and Subagent 2 is perfectly aligned to the shutdown task. That's a situation in which this setup should work, right? And what we'd want is for Subagent 2 to dominate completely. But Subagent 2's bargaining power doesn't scale with Subagent 1's misalignment, so if the setup doesn't always give it complete veto power, Subagent 1 would be allowed to do small stuff like "release the hyperviral worship-the-machine-god memagent" (as bargained down by Subagent 2 from its preferred "release the grey goo").

Maybe I'm miscalibrated as to how you're imagining the counterfactuals to work here. But the way I see it, even if the AI aims to take actions that do similarly well in both worlds, that's not necessarily survivable for us? Especially if they disagree so badly they have to compromise on something that both of them hate (importantly including Subagent 2!).

(Like both settling on only ever taking over 50% of the universal negentropy while leaving the other 50% causally uninfluenced, or only ever using 50% of the causal influence they can bring to bear while wiping out humanity, or whatever "do 50% of immediately shutting down" shakes out to mean by u2's terms.)


Another issue I see is implementational, so maybe not what you're looking for. But: how are we keeping these "subagents" trapped as being part of a singular agent? Rather than hacking their way out into becoming separate agents and going to war with each other, or neatly tiling exactly 50% of the cosmos with their preferred squiggles, or stuff like that? How is the scenario made meaningfully different from "we deploy two AIs simultaneously: one tasked with building an utopia-best-we-could-define-it, and another tasked with foiling all of the first AI's plans", with all the standard problems with multi-AI setups?

... Overall, ironically, this kind of has the vibe of Godzilla Strategies? Which is the main reason I'm immediately skeptical of it.

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2023-12-28T22:49:19.973Z · LW · GW

Yeah, I'm familiar with that view on Friston, and I shared it for a while. But it seems there's a place for that stuff after all. Even if the initial switch to viewing things probabilistically is mathematically vacuous, it can still be useful: if viewing cognition in that framework makes it easier to think about (and thus theorize about).

Much like changing coordinates from Cartesian to polar is "vacuous" in some sense, but makes certain problems dramatically more straightforward to think through.

Comment by Thane Ruthenis on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-27T01:05:39.526Z · LW · GW

Although interestingly geometric EU-maximising is actually equivalent to minimising H(u,p)/making the real distribution similar to the target

Mind elaborating on that? I'd played around with geometric EU maximization, but haven't gotten a result this clean.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-26T22:41:46.221Z · LW · GW

If any of the others are particularly enthusiastic about this and expect it to be high-value, sure!

That said, I personally don't expect it to be particularly productive.

  • These sorts of long-standing disagreements haven't historically been resolvable via debate (the failure of Hanson vs. Yudkowsky is kind of foundational to the field).
  • I think there's great value in having a public discussion nonetheless, but I think it's in informing the readers' models of what different sides believe.
  • Thus, inasmuch as we're having a public discussion, I think it should be optimized for thoroughly laying out one's points to the audience.
  • However, dialogues-as-a-feature seem to be more valuable to the participants, and are actually harder to grok for readers.
  • Thus, my preferred method for discussing this sort of stuff is to exchange top-level posts trying to refute each other (the way this post is, to a significant extent, a response to the AI is easy to control article), and then maybe argue a bit in the comments. But not to have a giant tedious top-level argument.

I'd actually been planning to make a post about the difficulties the "classical alignment views" have with making empirical predictions, and I guess I can prioritize it more?

But I'm overall pretty burned out on this sort of arguing. (And arguing about "what would count as empirical evidence for you?" generally feels like too-meta fake work, compared to just going out and trying to directly dredge up some evidence.)