Comments
Doing stuff manually might provide helpful intuitions/experience for automating it?
I would be very interested to know what the monks think about this.
I think it's much easier to talk about boundaries than preferences, because true boundaries don't really conflict between individuals.
I'm quite curious about this. What if you're stuck on an island with multiple people and limited food?
Very Wittgensteinian:
“What is your aim in Philosophy?”
“To show the fly the way out of the fly-bottle” (Philosophical Investigations)
Oh, they're definitely valid questions. The problem is that the second question is rather vague. You need to either state what a good answer would look like or why existing answers aren't satisfying.
I downvoted this post. I claim it's for the public good; maybe you find this strange, but let me explain my reasoning.
You've come on Less Wrong, a website that probably has more discussion of this than any other website on the internet. If you want to find arguments, they aren't hard to find. It's a bit like walking into a library and saying that you can't find a book to read.
The trouble isn't that you literally can't find any books/arguments, it's that you've got a bunch of unstated requirements that you want satisfied. Now that's perfectly fine, it's good to have standards. At the same time, you've asked the question in a maximally vague way. I don't expect you to be able to list all your requirements. That's probably impossible, and when it is possible, it's often a lot of work. At the same time, I do believe that it's possible to do better than maximally vague.
The problem with maximally vague questions is that they almost guarantee that any attempt to provide an answer will be unsatisfying both for the person answering and the person receiving the answer. Worse, you've framed the question in such a way that some people will likely feel compelled to attempt to answer anyway, lest people who think that there is such a risk come off as unable to respond to critics.
If that's the case, downvoting seems logical. Why support a game where no-one wins?
Sorry if this comes off as harsh, that's not my intent. I'm simply attempting to prompt reflection.
I have access to Gemini 1.5 Pro. I'm willing to run experiments if you provide me with an exact experiment to run and cover what they charge me (I'm assuming it's paid; I haven't used it yet).
“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”
Have you written about this anywhere?
Have you tried talking to professors about these ideas?
Is there anyone who understands GFlowNets who can provide a high-level summary of how they work?
Another frame that might be useful:
There's a difference between the number of mathematical functions that implement a set of requirements and the number of programs that implement the set of requirements.
Simplicity is about the latter, not the former.
The existence of a large number of programs that produce the exact same mathematical function contributes towards simplicity.
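To make the distinction concrete, here's a toy sketch (my own illustration; the network and values are hypothetical): in a tiny ReLU network, rescaling adjacent weights changes the parameters (the "program") without changing the input-output behaviour (the "function"), so many programs collapse onto a single function.

```python
import numpy as np

def relu_net(x, w1, w2):
    # A minimal two-weight "program": scale, ReLU, scale again.
    return w2 * np.maximum(0.0, w1 * x)

x = np.linspace(-2, 2, 5)
base = relu_net(x, w1=1.0, w2=1.0)

# Rescaling one layer up and the next down leaves the implemented
# mathematical function unchanged: many parameter settings, one function.
for c in [0.5, 2.0, 10.0]:
    assert np.allclose(relu_net(x, w1=c, w2=1.0 / c), base)
```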
I wrote up my views on the principle of indifference here:
https://www.lesswrong.com/posts/3PXBK2an9dcRoNoid/on-having-no-clue
I agree that it has certain philosophical issues, but I don’t believe that this is as fatal to counting arguments as you believe.
Towards the end I write:
“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, i.e. 'I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference'. Roll to disbelieve.”
I feel less confident in my post than when I wrote it, but it still feels more credible than the position articulated in this post.
Otherwise: this was an interesting post. Well done on identifying some arguments that I need to digest.
Maybe just say that you're tracking the possibility?
Is there going to be a link to this from somewhere to make it accessible?
I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans
Why is this an important crux? Is it necessarily the case that if we can reliably check AI safety work done by humans, we can reliably check AI safety work done by AIs, which may be optimising against us?
Updated
Second, it is also possible to robustly verify the outputs of a superhuman intelligence without superhuman intelligence.
Why do you believe that a superhuman intelligence wouldn't be able to deceive you by producing outputs that look correct instead of outputs that are correct?
I guess the main doubt I have with this strategy is that even if we shift the vast majority of people/companies towards more interpretable AI, there will still be some actors who pursue black-box AI. Wouldn't we just get screwed by those actors? I don't see how CoEm can be of equivalent power to purely black-box automation.
That said, there may be ways to integrate CoEms into the Super Alignment strategy.
GPT-J token embeddings inhabit a zone in their 4096-dimensional embedding space formed by the intersection of two hyperspherical shells
You may want to update the TLDR if you agree with the comments that indicate that this might not be accurate.
If there are 100 tokens for snow, that probably indicates it's a particularly important concept for that language.
For Linear Tomography and Principal Component Analysis, I'm assuming that by unsupervised you mean that you don't use the labels for finding the vector, but that you do use them for determining which sign is true and which is false. If so, this might be worth clarifying in the table.
Agreed. Good counter-example.
I'm very curious as to whether Zac has a way of reformulating his claim to save it.
Well done for writing this up! Admissions like this are often hard to write.
Have you considered trying to use any credibility from helping to co-found Vast for public outreach purposes?
Isn’t that just one batch?
Does GPT-4 directly handle the image input or is it converted to text by a separate model then fed into GPT-4?
A potential approach to tackle this could be to aim to discover features in smaller batches. After each batch of discovered features finishes learning we could freeze them and only calculate the orthogonality regularisation within the next batch, as well as between the next batch and the frozen features. Importantly we wouldn’t need to apply the regularisation within the already discovered features.
Wouldn't this still be quadratic?
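To make the worry concrete, here's a rough sketch of how I'd read the proposed regulariser (all names and shapes here are my own assumptions, not from the quoted proposal):

```python
import torch

def batch_orthogonality_penalty(W_new, W_frozen):
    # W_new:    (b, d) features currently being learned
    # W_frozen: (f, d) previously discovered, frozen features
    # Within-batch off-diagonal similarities: O(b^2) terms.
    G = W_new @ W_new.T
    within = (G - torch.diag(torch.diag(G))).pow(2).sum()
    # Cross terms against every frozen feature: O(b * f) terms.
    # Since f grows by b after each batch, summing over all n/b batches
    # still yields ~n^2/2 pairwise terms for n total features.
    cross = (W_new @ W_frozen.T).pow(2).sum()
    return within + cross

W_new = torch.randn(32, 512)       # b = 32 features in the current batch
W_frozen = torch.randn(4096, 512)  # f = 4096 already-frozen features
loss = batch_orthogonality_penalty(W_new, W_frozen)
```

The per-step cost drops, but the total work summed over training still looks quadratic in the number of features.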
You state that GPT-4 is multi-modal, but my understanding was that it wasn't natively multi-modal. I thought that the extra features like images and voice input were bolted on, i.e. instead of generating an image itself it generates a query to be sent to DALL-E. Is my understanding here incorrect?
In any case, it could just be a matter of scale. Maybe these kinds of tasks are rare enough in terms of internet data that it doesn't improve the loss of the models very much to be able to model them? And perhaps the instruction fine-tuning focused on more practical tasks?
"Previous post" links to localhost.
I think it's helping people realise:
a) That change is happening crazily fast
b) That the change will have major societal consequences, even if it is just a period of adjustment
c) That the speed makes it tricky for society and governments to navigate these consequences
It's worth noting that there are media reports that OpenAI is developing agents that will use your phone or computer. I suppose it's not surprising that this would be their next step given how far a video generation model takes you towards this, although I do wonder how they expect these agents to operate with any reliability given the propensity of ChatGPT to hallucinate.
It seems like there should be a connection here with Karl Friston's active inference. After all, both you and his theory involve taking a predictive engine and using it to produce actions.
IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs.
You don't know where they heard that?
I'm not saying that people can't ground it out. I'm saying that if you try to think or communicate using really verbose terms it'll reduce your available working memory which will limit your ability to think new thoughts.
You can replace "optimal" with "artifact equilibrated under policy update operations"
I don't think most people can. If you don't like the connotations of existing terms, I think you need to come up with new terms and they can't be too verbose or people won't use them.
One thing that makes these discussions tricky is that the aptness of these names likely depends on your object-level position. If you hold the AI optimist position, then you likely feel these names are biasing people towards an incorrect conclusion. If you hold the AI pessimist position, you likely see many of these connotations as actually a positive, in terms of pointing people towards useful metaphors, even if people occasionally slip up and reify the terms.
Also, have you tried having a moderated conversation with someone who disagrees with you? Sometimes that can help resolve communication barriers.
It might be useful to produce a bidirectional measure of similarity by taking the geometric mean of the transference of A to B and of B to A.
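Concretely, something like this (a minimal sketch; the function name and example values are mine):

```python
import math

def symmetric_transfer(t_ab: float, t_ba: float) -> float:
    # Geometric mean of the two one-way transfer scores;
    # assumes both scores are non-negative.
    return math.sqrt(t_ab * t_ba)

# e.g. transference(A -> B) = 0.9, transference(B -> A) = 0.4
print(symmetric_transfer(0.9, 0.4))  # 0.6
```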
Really cool results!
This ties in nicely with Wittgenstein’s notion of language games. TLDR: Look at the role the phrase serves, rather than the exact words.
I heard via via
How did you hear this?
One of the main challenges I see here is how to calibrate this. In other words, if I can't break a model despite adding an activation vector of strength x, what does this mean in terms of how safe we should consider the model to be? I.e., how much extra adversarial prompting effort is that equivalent to, or how should I modify my probabilities of the model being safe?
Gary Marcus has criticised the results here:
What [C] is referring to is a technique called Bonferroni correction, which statisticians have long used to guard against “fishing expeditions” in which a scientist tries out a zillion different post hoc correlations, with no clear a priori hypothesis, and reports the one random thing that sorta vaguely looks like it might be happening and makes a big deal of it, ignoring a whole bunch of other similar hypotheses that failed. (XKCD has a great cartoon about that sort of situation.)
But that’s not what is going on here, and as one recent review put it, Bonferroni should not be applied “routinely”. It makes sense to use it when there are many uncorrelated tests and no clear prior hypothesis, as in the XKCD cartoon. But here there is an obvious a priori test: does using an LLM make people more accurate? That’s what the whole paper is about. You don’t need a Bonferroni correction for that, and shouldn’t be using it. Deliberately or not (my guess is not), OpenAI has misanalyzed their data in a way which underreports the potential risk. As a statistician friend put it “if somebody was just doing stats robotically, they might do it this way, but it is the wrong test for what we actually care about”.
In fact, if you simply collapsed all the measurements of accuracy, and did the single most obvious test here, a simple t-test, the results would (as Footnote C implies) be significant. A more sophisticated test would be an ANCOVA, which as another knowledgeable academic friend with statistical expertise put it, having read a draft of this essay, “would almost certainly support your point that an omnibus measure of AI boost (from a weighted sum of the five dependent variables) would show a massively significant main effect, given that 9 out of the 10 pairwise comparisons were in the same direction.”
Also, there was likely an effect, but sample sizes were too small to detect this:
There were 50 experts; 25 with LLM access, 25 without. From the reprinted table we can see that 1 in 25 (4%) experts without LLMs succeeded in the formulation task, whereas 4 in 25 with LLM access succeeded (16%).
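For what it's worth, running the reported counts through a Fisher exact test (my own quick sketch, assuming scipy is available) illustrates the power problem:

```python
from scipy.stats import fisher_exact

# Reported counts: 4/25 experts with LLM access succeeded vs 1/25 without.
table = [[4, 21],
         [1, 24]]
odds_ratio, p_value = fisher_exact(table)
print(p_value)  # ~0.35: a 4x difference in success rates is not
                # statistically detectable at these sample sizes
```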
If I'm being honest, I don't see Beff as worthy of debating Yoshua Bengio.
Also: it seems like there would be an easier way to get the observation that this post makes, i.e. directly showing that kV and V get mapped to the same point by layer norm (excluding the epsilon).
Don't get me wrong, the circle is cool, but seems like it's a bit of a detour.
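For instance, a minimal numpy check of the kV/V claim (my own sketch; real LayerNorm's epsilon makes the equality only approximate):

```python
import numpy as np

def layer_norm(x, eps=0.0):
    # LayerNorm without the learned affine parameters.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

v = np.random.randn(4096)
# Scaling is normalised away: kV and V map to the same point (eps = 0).
print(np.allclose(layer_norm(v), layer_norm(3.0 * v)))  # True
```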
Just to check I understand this correctly: from what I can gather it seems that this shows that LayerNorm is monosemantic if your residual stream activation is just that direction. It doesn't show that it is monosemantic for the purposes of doing vector addition where we want to stack multiple monosemantic directions at once. That is, if you want to represent other dimensions as well, these might push the LayerNormed vector into a different spline. Am I correct here?
That said, maybe we can model the other dimensions as random jostling in such a way that it all cancels out if a lot of dimensions are activated?
- What do you see as the low-hanging co-ordination fruit?
- Bringing up the counter-culture movement seems strange. I didn’t really see them as focused on co-ordination.
Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?
Also: How are funding and attention "arbitrary" factors?
You mean where they said that it was unlikely to succeed?
Good on you for doing this research, but to me it's a lot less interesting because you had the supervisor say: "In theory you can send them fake protocol, or lie about the biosecurity risk level, but it's a gamble, they might notice it or they might not." Okay, they didn't explicitly say to lie, but they explicitly told the AI to consider that possibility.
Regardless of whether or not it's AI Safety Camp, I think it's important to have at least one intro-level research program, particularly because applications for programs like SERI MATS ask about previous research experience in the application.
I can see merit both in Oliver's views about the importance of nudging people down useful research directions and Linda's views on assuming that participants are adults. Still undecided on who I ultimately end up agreeing with, so would love to hear other people's opinions.
Having just read through this, one key point that I haven't seen people mention is that the results are for LLMs that need to be jail-broken.
So these results are more relevant to the release of a model over an API rather than open-source, where you'd just fine-tune away the safeguards or download a model without safeguards in the first place.
I think it’s worth also raising the possibility of a Kuhnian scenario where the “mature science” is actually missing something and a further breakthrough is required to move it into a new paradigm.