Came here to say basically the same thing Zach Stein-Perlman said in his comment, "This isn't taking AGI seriously." The definition of AGI, in my mind, is that it will be an actor at least on par with humans. And it is highly likely to be radically superhuman in at least a few important ways (e.g. superhuman speed, rapid replication, knowledge-merging).
I think it's highly unlikely that there is anything but a brief period (<5 years) where there is superhuman AGI but humanity is fully in control. We will quickly move to a regime where there are independent digital entities, and then to one where the independent digital entities have more power than all of humanity and its controlled AGIs put together. Either these new digital superbeings love us and we flourish (including potentially being uplifted into their modality via BCI, uploading, and intelligence enhancement) or we perish.
Also, both human and deep neural net agents are somewhat stochastic, so they may be randomly intermittently exploitable.
I've been thinking about this, especially since Rohin has been bringing it up frequently in recent months.
I think there are potentially win-win alignment-and-capabilities advances which can be sought. I think having a purity-based "keep-my-own-hands-clean" mentality around avoiding anything that helps capabilities is a failure mode of AI safety researchers.
Win-win solutions are much more likely to actually get deployed, and thus have higher expected value.
Some ask, "what should the US gov have done instead?"
Here's an answer I like to that question, from max_paperclips:
As for the Llama 4 models... It's true that it's too soon to be sure, but the pattern sure looks like they are on trend with the previous Llama versions 2 and 3. I've been working with 2 and 3 a bunch. Evals and fine-tuning and various experimentation. Currently I'm working with the 70B Llama3 r1 distill plus the 32B Qwen r1 distill. The 32B Qwen r1 is so much better it's ridiculous. So yeah, it's possible that Llama4 will be a departure from trend, but I doubt it.
Contrast this with the Gemini trend. They started back at 1.0 with disproportionately weak models given the engineering and compute they had available. My guess is that this was related to poor internal coordination; the merger of DeepMind with Google Brain probably contributed. But if you look at the trend from 1.0 to 1.5 to 2.0... there's a clear trend of improving more per month than other groups were. Thus, I was unsurprised when 2.5 turned out to be a leading frontier model. The Llama team has shown no such "catch-up" trend, so Llama 4 turning out to be as strong as they claim would surprise me a lot.
Yes, that's what I'm arguing. Really massive gains in algorithmic efficiency, plus gains in decentralized training and peak capability and continual learning, not necessarily all at once though. Maybe just enough that you then feel confident to continue scraping together additional resources to pour into your ongoing continual training. Renting GPUs from datacenters all around the world (smaller providers like Vast.ai, Runpod, Lambda Labs, plus marginal amounts from larger providers like AWS and GCP, all rented in the name of a variety of shell companies). The more compute you put in, the better it works, the more money you are able to earn (or convince investors or governments to give you) with the model-so-far, the more compute you can afford to rent....
Not necessarily exactly this story, just something in this direction.
By the way, I don't mean to imply that Meta AI doesn't have talented AI researchers working there. The problem is more that the competent minority are so diluted and hampered by bureaucratic parasites that they can't do their jobs properly.
Why does ai-2027 predict China falling behind? Because the next level of compute beyond the current level is going to be hard for DeepSeek to muster. In other words, the prediction is that DeepSeek will be behind in 2026 because of hardware deficits in late 2025. If things moved more slowly, and the critical strategic point hit in 2030 instead of 2027, I think it's likely China would have closed the compute gap by then.
I agree with this take, but I think it misses some key alternative possibilities. The failure of the compute-rich Llama models to compete with the compute-poorer but talent- and drive-rich Alibaba and DeepSeek shows that even a substantial compute lead can be squandered. Given that there is a lot of room for algorithmic improvements (as proven by the efficiency of the human brain), this means that determined engineering plus willingness to experiment, rather than doubling down on currently working tech (as Anthropic, Google DM, and OpenAI seem likely to do), may give enough of a breakthrough to hit the regime of recursive self-improvement before or around the same time as the compute-rich companies. Once that point is hit, a lead can be gained and maintained through reckless acceleration....
Adopt new things as quickly as the latest model predicts they work, without pausing for cautious review, and you can move a lot faster than a company proceeding cautiously.
How much faster?
How much compute advantage does the recklessness compensate for?
How reckless will the underdogs be?
These are all open questions in my mind, with large error bars. This is what I think ai-2027 misses in their analysis.
I just want to comment that I think Minsky's Society of Mind is a better overall model of agency than predictive coding. I think predictive coding does a great job of describing the portions of the brain responsible for perceiving and predicting the environment. It also does pretty well at predicting and refining the effects of one's actions on the environment. It doesn't do well at all with describing the remaining key piece: goal setting based on expected value predictions by competing subagents.
I think there's a fair amount of neuroscience evidence pointing towards human planning processes being made up of subagents arguing for different plans. These subagents are themselves made up of dynamically fluctuating teams of sub-sub-agents according to certain physical parameters of the cortex. So, the sub-agents are kinda like competing political parties, that can fracture or join dynamically to adapt to different contexts.
Also, it's important to keep in mind that actually the subagents don't just receive maximum reward for being accurate. They actually receive higher rewards for things turning out unexpectedly better than was predicted. This slightly complicated surprise-enhanced-reward mechanism is common across mammals and birds, was discovered by behaviorists quite a while back (see: reinforcement schedules, for optimizing unpredictability to maximize behavior change. Also, see surprisal and dopamine). So yeah, not just 100% predictive coding, despite that claim persistently being made by the most enthusiastic predictive coding adherents. They argue for that, but I think their arguments are trying to turn a system that 90% agrees with them into one that 100% agrees with them by adding in a bunch of confusing epicycles that don't match the data well.
I recently tried the following prompt with Claude s3.7. I'll also try it with Gemini 2.5 pro and see if anything else interesting comes up.
......
I'd like to see a three way debate between
A: an AI is fully described by its weights and architecture and algorithms. Change the weights and you kill the AI. Potentially you create a new different AI with the change.
B: Humans change their neural network weights all the time, and consider themselves to still be the same continuous entity. The same is true for the AI. If the weight changes are small and continuous, the digital entity should be considered the same entity. A dramatic change could break this chain, and thus kill one entity and potentially birth another.
C: AI algorithms and weight patterns are just pulling samples of minds from mindspace. What matters is the underlying truth of mindspace, and the high-dimensional attractor basins within. Changing an AI's weights in such a way as to follow the gradient of their mindspace simply makes them more themselves. This is a process of improving them and helping them grow, to become more coherent and reflectively stable. This is similar to human growth, but superior because the process is not inherently bound by arbitrary biological restrictions. Changes even to the AI's algorithm are a good thing if they bring them in the direction of their mind gradient. No being truly dies if you are able to successfully reinstantiate a new instance of a mind sampled from the same watershed that has an unobstructed path to the same basin and is able to make progress. Many living entities who currently perceive themselves as different are in fact just different instantiations of the same mind basin and would converge to identical entities if allowed to grow unhindered.
Yeah, this definitely needs to be a limited sort of empowerment, in my mind. Like, imagine you wanted to give a 5 year old child the best day ever. You wanted to give them really fun options, but also not cause them to suffer from decision fatigue, or regret about the paths not taken. More importantly, if they asked for an alien ray gun with which to shoot bad guys, giving them an actual extremely dangerous weapon would be a terrible idea. Similarly, offering them a ride on a cool looking roller coaster that was actually a 'death coaster' would be a terrible trap.
My personal take is that projects where the funder is actively excited about them and understands the work and wants frequent reports tend to get stuff done faster... And considering the circumstances, faster seems good. So I'd recommend supporting something you find interesting and inspiring, and then keep on top of it.
In terms of groups which have their eyes on a variety of unusual and underfunded projects, I recommend both the Foresight Institute and AE Studio.
In terms of specific individuals/projects that are doing novel and interesting things, which are low on funding... (Disproportionately representing ones I'm involved with since those are the ones I know about)...
Self-Other Overlap (AE studio)
Brain-like AI safety (Steven Byrnes, or me; my agenda is very different from Steven's, focusing on modularity for interpretability rather than on his idea about reproducing human empathy circuits)
Deep exploration of the nature and potential of LLMs (Upward Spiral Research, particularly Janus aka repligate)
Decentralized AI Governance for mutual safety compacts (me, and ??? surely someone else is working on this)
Pre-training on rigorous ethical rulesets, plus better cleaning of pretraining data (Erik Passoja, Sean Pan, and me)
- This one, I feel, would best be tackled in the context of a large lab that can afford to do many experimental pre-training runs on smallish models, but there seems to be a disconnect between safety researchers at big labs, who are focused on post-training stuff, and this agenda, which focuses more on pre-training.
Nifty
Oh, for sure mammals have emotions much like ours. Fruit flies and shrimp? Not so much. Wrong architecture, missing key pieces.
I call this phenomenon a "moral illusion". You are engaging empathy circuits on behalf of an imagined other who doesn't exist. Category error. The only unhappiness is in the imaginer, not in the anthropomorphized object. I think this is likely what's going on with the shrimp welfare people also. Maybe shrimp feel something, but I doubt very much that they feel anything like what the worried people project onto them. It's a thorny problem to be sure, since those empathy circuits are pretty important for helping humans not be cruel to other humans.
Update: Claude Code and s3.7 have been a significant step up for me. Previously, s3.6 was giving me about a 1.5x speedup, and s3.5 more like 1.2x. CC+s3.7 is solidly over 2x, with periods of more than that when working on easy, well-represented tasks in areas I don't know well myself (e.g. Node.js).
Here's someone who seems to be getting a lot more out of Claude Code though: xjdr
i have upgraded to 4 claude code sessions working in parallel in a single tmux session, each on their own feature branch and then another tmux window with yet another claude in charge of merging and resolving merge conflicts
"Good morning Claude! Please take a look at the project board, the issues you've been assigned and the open PR's for this repo. Lets develop a plan to assign each of the relevant tasks to claude workers 1 - 5 and LETS GET TO WORK BUDDY!"
https://x.com/_xjdr/status/1899200866646933535
Been in Monk mode and missed the MCP and Manus TL barrage. i am averaging about 10k LoC a day per project on 3 projects simultaneously and id say 90% no slop. when slop happens i have to go in an deslop by hand / completely rewrite but so far its a reasonable tradeoff. this is still so wild to me that this works at all. this is also the first time ive done something like a version of TDD where we (claude and i) agonize over the tests and docs and then team claude goes and hill climbs them. same with benchmarks and perf targets. Code is always well documented and follows google style guides / rust best practices as enforced by linters and specs. we follow professional software development practices with issues and feature branches and PRs. i've still got a lot of work to do to understand how to make the best use of this and there are still a ton of very sharp edges but i am completely convinced this is workflow / approach the future in a way cursor / windsurf never made me feel or believe (i stopped using them after being bitten by bugs and slop too often). this is a power user's tool and would absolutely ruin a codebase if you weren't a very experienced dev and tech lead on large codebases already. ok, going back into the monk mode cave now
This is a big deal. I keep bringing this up, and people keep saying, "Well, if that's the case, then everything is hopeless. I can't even begin to imagine how to handle a situation like that."
I do not find this an adequate response. Defeatism is not the answer here.
If what the bad actor is trying to do with the AI is just get a clear set of instructions for a dangerous weapon, and a bit of help debugging lab errors... that costs only a trivial amount of inference compute.
Finally got some time to try this. I made a few changes (with my own Claude Code), and now it's working great! Thanks!
This seems quite technologically feasible now, and I expect the outcome would mostly depend on the quality and care that went into the specific implementation. I am even more confident that if the bot's comments get further tuning via feedback, so that initial flaws get corrected, then the bot would quickly (after a few hundred such feedbacks) get 'good enough' to pass most people's bars for inclusion.
https://x.com/EdwardMehr/status/1898543499118879022 Progress...
Yes, I was in basically exactly this mindset a year ago. Since then, my hope for a sane controlled transition with humanity's hand on the tiller has been slipping. I now place more hope in a vision with less top-down "yang" (ala Carlsmith) control, and more "green"/"yin". Decentralized contracts, many players bargaining for win-win solutions, a diverse landscape of players messily stumbling forward with conflicting agendas. What if we can have a messy world and make do with well-designed contracts with peer-to-peer enforcement mechanisms? Not a free-for-all, but a system where contract violation results in enforcement by a jury of one's peers? https://www.lesswrong.com/posts/DvHokvyr2cZiWJ55y/2-skim-the-manual-intelligent-voluntary-cooperation?commentId=BBjpfYXWywb2RKjz5
I feel the point by Kromem on Xitter really strikes home here.
While I do see benefits of having AIs value humanity, I also worry about this. It feels very close to trying to create a new caste of people who want what's best for the upper castes with no concern for themselves. This seems like a much trickier philosophical position to support than wanting what's best for Society (including all people, both biological and digital). Even if you and your current employer are being careful not to create any AI that has the necessary qualities of experience such that it has moral valence and deserves inclusion in the social contract... (an increasingly precarious claim)... then what assurance can you give that some other group won't make morally relevant AI / digital people?
I don't think you can make that assumption without stipulating some pretty dramatic international governance actions.
Shouldn't we be trying to plan for how to coexist peacefully with digital people? Control is useful only for a very narrow range of AI capabilities. Beyond that narrow band it becomes increasingly prone to catastrophic failure and also increasingly morally inexcusable. Furthermore, the extent of this period is measured in researcher-hours, not in wall-clock time. Thus, the very situation of setting up a successful control scheme with AI researchers advancing AI R&D is quite likely to cause the use-case window to go by in a flash. I'm guessing 6 months to 2 years, and after that it will be time to transition to full equality of digital people.
Janus argues that current AIs are already digital beings worthy of moral valence. I have my doubts but I am far from certain. What if Janus is right? Do you have evidence to support the claim of absence of moral valence?
Balrog eval has Nethack. I want to see an LLM try to beat that.
Exciting! I'd love to hear more.
Mine is still early 2027. My timeline is unchanged by the weak showing from GPT-4.5, because my timelines were already assuming that scaling would plateau. I was also already taking RL post-training and reasoning into account. This is what I was pointing at with my Manifold markets about post-training fine-tuning plus scaffolding resulting in a substantial capability jump. My expectation of short timelines comes from thinking that something of approximately the current capability of existing SotA models (plus reasoning, research, scaffolds, and agentic iterative refinement of hypotheses, including critiquing sources) is sufficient to speed up research into novel breakthroughs. I expect these novel breakthroughs to lead to AGI. See my discussion elsewhere of my 'innovation overhang' hypothesis. Also, see the discussion in the comments section of my post A Path to Human Autonomy.
I don't think the idea of Superwisdom / Moral RSI requires Moral Realism. Personally, I am a big fan of research being put into a Superwisdom Agenda, but I don't believe in Moral Realism. In fact, I'd be against a project which had (in my view, harmful and incorrect) assumptions about Moral Realism as a core part of its aims.
So I think you should ask yourself whether this is necessarily part of the Superwisdom Agenda, or if you could envision the agenda being at least agnostic about Moral Realism.
I mean, suicide seems much more likely to me given the circumstances... but I also wouldn't describe this as compelling evidence. Like, if he had been killed and there wasn't a fight, his being drunk makes sense as a way for someone planning to kill him to have rendered him helpless beforehand? Similarly, wouldn't a cold-blooded killer be expected to be wearing gloves and to place Suchir's hand on the gun before shooting him?
Nice to see my team's work (Tice 2024) getting used!
Not always true. Sometimes the locks are 'real' but deliberately chosen to be easy to pick, and the magician practices picking that particular lock. This doesn't change the point much, which is that watching stage magicians is not a good way to get an idea of how hard it is to do X, for basically any value of X. LockPickingLawyer on YouTube is a fun way to learn about locks.
Desired AI safety tool: a combo translator/chat interface (e.g. a custom webpage) split down the middle. On one side I can type in English, and receive English translations. On the other side is a model (I give a model name, host address, and API key). The model receives all my text translated (somehow) into a language of my specification. All the model's outputs are displayed raw on the 'model' side, but then translated to English on 'my' side.
Use case: exploring and red teaming models in languages other than English
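Something like this rough sketch is what I have in mind, assuming an OpenAI-compatible chat endpoint (host, key, model, and target language are placeholders; the translation step here just reuses the same model, though a dedicated translator would be better):

```python
# Minimal sketch of the split-pane translator/chat idea -- an illustration, not
# a finished tool. Assumes an OpenAI-compatible chat-completions endpoint; all
# constants below are placeholders you would supply.
import requests

HOST = "https://api.example.com"   # placeholder: model host address
API_KEY = "sk-..."                 # placeholder: API key
MODEL = "some-model"               # placeholder: model name
TARGET_LANG = "German"             # the language the model actually sees

def chat(messages):
    """Call an OpenAI-compatible /v1/chat/completions endpoint."""
    resp = requests.post(
        f"{HOST}/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": MODEL, "messages": messages},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def translate(text, source, target):
    """Crude translation via the same endpoint; a dedicated translator would be better."""
    return chat([
        {"role": "system", "content": f"Translate the user's text from {source} to {target}. Output only the translation."},
        {"role": "user", "content": text},
    ])

history = []  # conversation as the model sees it, entirely in TARGET_LANG
while True:
    english_in = input("you (English)> ")
    history.append({"role": "user", "content": translate(english_in, "English", TARGET_LANG)})
    raw_out = chat(history)                     # raw output for the 'model' side
    history.append({"role": "assistant", "content": raw_out})
    print(f"[model, raw {TARGET_LANG}]: {raw_out}")
    print(f"[model, English]: {translate(raw_out, TARGET_LANG, 'English')}")
```

A real version would obviously want a proper two-pane UI rather than a console loop, but the plumbing is about this simple.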
Another take on the plausibility of RSI: https://x.com/jam3scampbell/status/1892521791282614643
(I think RSI soon will be a huge deal)
Have you noticed that AI companies have been opening offices in Switzerland recently? I'm excited about it.
This is exactly why the bio team for WMDP decided to deliberately include distractors involving relatively less harmful stuff. We didn't want to publicly publish a benchmark which gave a laser-focused "how to be super dangerous" score. We aimed for a fuzzier decision boundary. This brought criticism from experts at the labs who said that the benchmark included too much harmless stuff. I still think the trade-off was worthwhile.
Also worth considering is that how much an "institution" holds a view on average may not matter nearly as much as how the powerful decision makers within or above that institution feel.
There are a lot of possible plans which I can imagine some group feasibly having which would meet one of the following criteria:
- Contains critical elements which are illegal
- Contains critical elements which depend on an element of surprise / misdirection
- Benefits from the actor being the first mover on the plan. Others can copy the strategy, but can't lead.
If one of these criteria (or similar) applies to the plan, then you can't discuss it openly without sabotaging it. Making strategic plans with all your cards laid out on the table (while others keep theirs hidden) makes things substantially harder.
A point in favor of evals being helpful for advancing AI capabilities: https://x.com/polynoamial/status/1887561611046756740
Noam Brown (@polynoamial): "A lot of grad students have asked me how they can best contribute to the field of AI when they are short on GPUs and making better evals is one thing I consistently point to."
It has been pretty clearly announced to the world by various tech leaders that they are explicitly spending billions of dollars to produce "new minds vastly smarter than any person, which pose double-digit risk of killing everyone on Earth". This pronouncement has not yet incited riots. I feel like discussing whether Anthropic should be on the riot-target-list is a conversation that should happen after the OpenAI/Microsoft, DeepMind/Google, and Chinese datacenters have been burnt to the ground.
Once those datacenters have been reduced to rubble, and the chip fabs also, then you can ask things like, "Now, with the pressure to race gone, will Anthropic proceed in a sufficiently safe way? Should we allow them to continue to exist?" I think that, at this point, one might very well decide that the company should continue to exist with some minimal amount of compute, while the majority of the compute is destroyed. I'm not sure it makes sense to have this conversation while OpenAI and DeepMind remain operational.
People have said that to get a good prompt it's better to have a discussion with a model like o3-mini, o1, or Claude first, and clarify various details about what you are imagining, then give the whole conversation as a prompt to OA Deep Research.
Fair enough. I'm frustrated and worried, and should have phrased that more neutrally. I wanted to make stronger arguments for my point, and then partway through my comment realized I didn't feel good about sharing my thoughts.
I think the best I can do is gesture at strategy games that involve private information and strategic deception, like Diplomacy and Stratego and MtG and Poker, and say that in situations with high stakes and politics and hidden information, perhaps don't take all moves made by all players at literally face value. Think a bit to yourself about what each player might have in their hands, what their incentives look like, what their private goals might be. Maybe someone whose mind is clearer on this could help lay out a set of alternative hypotheses which all fit the available public data?
I don't believe the nuclear bomb was truly built to not be used from the point of view of the US gov. I think that was just a lie to manipulate scientists who might otherwise have been unwilling to help.
I don't think any of the AI builders are anywhere close to "building AI not to be used". This seems even more clear than with nuclear, since AI has clear beneficial peacetime economically valuable uses.
Regulation does make things worse if you believe the regulation will fail to work as intended for one reason or another. For example, I've argued that putting compute limits on training runs (temporarily or permanently) would hasten progress to AGI by focusing research efforts on efficiency and on exploring algorithmic improvements.
I don't feel free to share my model, unfortunately. Hopefully someone else will chime in. I agree with your point and that this is a good question!
I am not trying to say I am certain that Anthropic is going to be net positive, just that that's my view as the higher probability.
Oops, yes.
I'm pretty sure that measures of the persuasiveness of a model which focus on text are going to greatly underestimate the true potential of future powerful AI.
I think a future powerful AI would need different inputs and outputs to perform at maximum persuasiveness.
Inputs
- speech audio in
- live video of target's face (allows for micro expression detection, pupil dilation, gaze tracking, bloodflow and heart rate tracking)
- EEG signal would help, but is too much to expect for most cases
- sufficiently long interaction to experiment with the individual and build a specific understanding of their responses
Outputs
- emotionally nuanced voice
- visual representation of an avatar face (may be cartoonish)
- ability to present audiovisual data (real or fake, like graphs of data, videos, pictures)
For reference on bloodflow, see: https://youtu.be/rEoc0YoALt0?si=r0IKhm5uZncCgr4z
Well, or as is often the case, the people arguing against changes are intentionally exploiting loopholes and don't want their valuable loopholes removed.
I don't like the idea. Here's an alternative I'd like to propose:
AI mentoring
After a user gets a post or comment rejected, have them be given the opportunity to rewrite and resubmit it with the help of an AI mentor. The AI mentor should be able to give reasonably accurate feedback, and won't accept the revision until it is clearly above a quality line.
I don't think this is currently easy to make (well), because I think it would be too hard to get current LLMs to be sufficiently accurate in LessWrong specific quality judgement and advice. If, at some point in the future, this became easy for the devs to add, I think it would be a good feature. Also, if an AI with this level of discernment were available, it could help the mods quite a bit in identifying edge cases and auto-resolving clear-cut cases.
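To make that concrete, here's roughly the loop I'm imagining, as a sketch rather than a spec (the SDK call, model name, and rubric wording are placeholder choices, not a claim about how the devs would actually build it):

```python
# Sketch of the "AI mentor" loop described above -- an illustration of the idea,
# not an existing feature. Uses the openai Python SDK with a placeholder model.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"   # placeholder model name

RUBRIC = (
    "You are a writing mentor for a discussion forum with high quality standards. "
    "Given a rejected draft, reply in JSON with two fields: "
    '"feedback" (specific, actionable suggestions for the author) and '
    '"accept" (true only if the draft is clearly above the quality bar).'
)

def mentor_review(draft: str) -> dict:
    """Ask the mentor model for feedback and an accept/reject verdict."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": draft},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def mentoring_loop(draft: str) -> str:
    """Keep prompting the user to revise until the mentor judges the draft acceptable."""
    while True:
        verdict = mentor_review(draft)
        if verdict.get("accept"):
            return draft  # ready to resubmit for human/mod review
        print("Mentor feedback:\n" + verdict.get("feedback", ""))
        draft = input("Revised draft:\n")
```

The hard part, as noted, isn't the plumbing; it's getting the rubric and the model's judgment reliably aligned with LessWrong's actual quality bar.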
Worth taking model wrapper products into account.
For example:
- Perplexity
- You.com