Posts

Biden administration unveils global AI export controls aimed at China 2025-01-14T01:01:13.927Z
Higher and lower pleasures 2024-12-05T13:13:46.526Z
Linkpost: Rat Traps by Sheon Han in Asterisk Mag 2024-12-03T03:22:45.424Z
Summary: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. 2024-11-11T16:13:26.504Z
Some Preliminary Notes on the Promise of a Wisdom Explosion 2024-10-31T09:21:11.623Z
Linkpost: Hypocrisy standoff 2024-09-29T14:27:19.175Z
On the destruction of America’s best high school 2024-09-12T15:30:20.001Z
The Bar for Contributing to AI Safety is Lower than You Think 2024-08-16T15:20:19.055Z
Michael Streamlines on Buddhism 2024-08-09T04:44:52.126Z
Have people given up on iterated distillation and amplification? 2024-07-19T12:23:04.625Z
Politics is the mind-killer, but maybe we should talk about it anyway 2024-06-03T06:37:57.037Z
Does reducing the amount of RL for a given capability level make AI safer? 2024-05-05T17:04:01.799Z
Link: Let's Think Dot by Dot: Hidden Computation in Transformer Language Models by Jacob Pfau, William Merrill & Samuel R. Bowman 2024-04-27T13:22:53.287Z
"You're the most beautiful girl in the world" and Wittgensteinian Language Games 2024-04-20T14:54:54.503Z
The argument for near-term human disempowerment through AI 2024-04-16T04:50:53.828Z
Reverse Regulatory Capture 2024-04-11T02:40:46.474Z
On the Confusion between Inner and Outer Misalignment 2024-03-25T11:59:34.553Z
The Best Essay (Paul Graham) 2024-03-11T19:25:42.176Z
Can we get an AI to "do our alignment homework for us"? 2024-02-26T07:56:22.320Z
What's the theory of impact for activation vectors? 2024-02-11T07:34:48.536Z
Notice When People Are Directionally Correct 2024-01-14T14:12:37.090Z
Are Metaculus AI Timelines Inconsistent? 2024-01-02T06:47:18.114Z
Random Musings on Theory of Impact for Activation Vectors 2023-12-07T13:07:08.215Z
Goodhart's Law Example: Training Verifiers to Solve Math Word Problems 2023-11-25T00:53:26.841Z
Upcoming Feedback Opportunity on Dual-Use Foundation Models 2023-11-02T04:28:11.586Z
On Having No Clue 2023-11-01T01:36:10.520Z
Is Yann LeCun strawmanning AI x-risks? 2023-10-19T11:35:08.167Z
Don't Dismiss Simple Alignment Approaches 2023-10-07T00:35:26.789Z
What evidence is there of LLM's containing world models? 2023-10-04T14:33:19.178Z
The Role of Groups in the Progression of Human Understanding 2023-09-27T15:09:45.445Z
The Flow-Through Fallacy 2023-09-13T04:28:28.390Z
Chariots of Philosophical Fire 2023-08-26T00:52:45.405Z
Call for Papers on Global AI Governance from the UN 2023-08-20T08:56:58.745Z
Yann LeCun on AGI and AI Safety 2023-08-06T21:56:52.644Z
A Naive Proposal for Constructing Interpretable AI 2023-08-05T10:32:05.446Z
What does the launch of x.ai mean for AI Safety? 2023-07-12T19:42:47.060Z
The Unexpected Clanging 2023-05-18T14:47:01.599Z
Possible AI “Fire Alarms” 2023-05-17T21:56:02.892Z
Google "We Have No Moat, And Neither Does OpenAI" 2023-05-04T18:23:09.121Z
Why do we care about agency for alignment? 2023-04-23T18:10:23.894Z
Metaculus Predicts Weak AGI in 2 Years and AGI in 10 2023-03-24T19:43:18.522Z
Wittgenstein's Language Games and the Critique of the Natural Abstraction Hypothesis 2023-03-16T07:56:18.169Z
The Law of Identity 2023-02-06T02:59:16.397Z
What is the risk of asking a counterfactual oracle a question that already had its answer erased? 2023-02-03T03:13:10.508Z
Two Issues with Playing Chicken with the Universe 2022-12-31T06:47:52.988Z
Decisions: Ontologically Shifting to Determinism 2022-12-21T12:41:30.884Z
Is Paul Christiano still as optimistic about Approval-Directed Agents as he was in 2018? 2022-12-14T23:28:06.941Z
How is the "sharp left turn defined"? 2022-12-09T00:04:33.662Z
What are the major underlying divisions in AI safety? 2022-12-06T03:28:02.694Z
AI Safety Microgrant Round 2022-11-14T04:25:17.510Z

Comments

Comment by Chris_Leong on Arbital has been imported to LessWrong · 2025-02-20T06:50:26.782Z · LW · GW

Interesting idea. Will be interesting to see if this works out.

Comment by Chris_Leong on Arbital has been imported to LessWrong · 2025-02-20T01:22:17.560Z · LW · GW

Lenses are... tabs.  Opinionated tabs

Could you explain the intended use further?

Comment by Chris_Leong on Chris_Leong's Shortform · 2025-02-18T11:49:38.834Z · LW · GW

Acausal positive interpretation

Comment by Chris_Leong on What is a decision theory as a mathematical object? · 2025-02-12T19:00:02.077Z · LW · GW

My take: Counterfactuals are Confusing because of an Ontological Shift:

"In our naive ontology, when we are faced with a decision, we conceive of ourselves as having free will in the sense of there being multiple choices that we could actually take. These choices are conceived of as actual and we when think about the notion of the "best possible choice" we see ourselves as comparing actual possible ways that the world could be. However, we when start investigating the nature of the universe, we realise that it is essentially deterministic and hence that our naive ontology doesn't make sense. This forces us to ask what it means to make the "best possible choice" in a deterministic ontology where we can't literally take a choice other than the one that we make. This means that we have to try to find something in our new ontology that roughly maps to our old one."

We expect a straightforward answer to "What is a decision theory as a mathematical object?", since we automatically tend to assume our ontology is consistent, but if this isn't the case and we actually have to repair our ontology, it's unsurprising that we end up with different kinds of objects.

Comment by Chris_Leong on Chris_Leong's Shortform · 2025-02-10T17:51:50.274Z · LW · GW

Well, we're going to be training AI anyway. If we're just training capabilities, but not wisdom, I think things are unlikely to go well. More thoughts on this here.

Comment by Chris_Leong on evhub's Shortform · 2025-02-10T17:21:28.825Z · LW · GW

I believe that Anthropic should be investigating artificial wisdom:

I've summarised a paper arguing for the importance of artificial wisdom with Yoshua Bengio being one of the authors.

I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.

Comment by Chris_Leong on Nonpartisan AI safety · 2025-02-10T17:19:14.559Z · LW · GW

By Wise AI Advisors, I mean training an AI to provide wise advice. BTW, I've now added a link to a short-form post in my original comment where I detail the argument for wise AI advisors further.

Comment by Chris_Leong on Nonpartisan AI safety · 2025-02-10T17:02:20.749Z · LW · GW

Props for proposing a new and potentially fruitful framing.

I would like to propose training Wise AI Advisors as something that could potentially meet your two criteria:

• Even if AI is pretty much positive, wise AI advisors would allow us get closer to maximising these benefits

• We can likely save the world if we make sufficiently wise decisions[1]

  1. ^

    There's also a chance that we're past the point of no return, but if that's the case, we're screwed no matter what we do. Okay, it's slightly more complicated because there's a chance that we aren't yet past the point of no return, but if we pursue wise AI advisors, instead of redirecting these resources to another project that we will be past the point of no return by the time we produce such advisors. This is possible, but my intuition is that it's worth pursuing anyway.

Comment by Chris_Leong on Chris_Leong's Shortform · 2025-02-10T16:54:00.795Z · LW · GW

Why the focus on wise AI advisors?[1]

I'll be writing up a proper post to explain why I've pivoted towards this, but it will still take some time to produce a high quality post, so I decided it was worthwhile releasing a short-form description in the mean time.

By Wise AI Advisors, I mean training an AI to provide wise advice.

a) AI will have a massive impact on society given the infinite ways to deploy such a general technology
b) There are lots of ways this could go well and lots of ways that this could go extremely poorly (election interference, cyber attacks, development of bioweapons, large-scale misinformation, automated warfare, catastrophic malfunctions, permanent dictatorships, mass unemployment ect.)
c) There is massive disagreement on best strategy (decentralization vs. limiting proliferation, universally accelerating AI vs winning the arms race vs pausing, incremental development of safety vs principled approaches, offence-defence balance favoring the attacker or defender) or even what we expect the development of AI to look like (overhyped bubble vs machine god, business as usual vs this changes everything). Making the wrong call could prove catastrophic.
d) The AI is developing incredibly rapidly (no wall, see o3 crushing the ARC challenge!). We have limited time to act and to figure out how to act. 
e) Given both the difficulty and the number of different challenges and strategic choices we'll be facing in short order, humanity needs to rapidly improve its capability to navigate such situations
f) Whilst we can and should be developing top governance and strategy talent, this is unlikely to be sufficient by itself. We need every advantage we can get, we can't afford to leave anything on the table.
g) Another way of framing this: Given the potential of AI development to feed back into itself, if it isn't also feeding back into increased wisdom in how we navigate the world, our capabilities are likely to far outstrip our ability to handle them.

For these reasons, I think it is vitally important for society to be working on training these advisors now.

Why frame this in terms of a vague concept like wisdom rather than specific capabilities?

I think the chance of us being able to steer the world towards a positive direction is much higher if we're able to combine multiple capabilities, so it makes sense to have a handle for the broader project, as well as individual handles for individual sub-projects.

Isn't training AI to be wise intractable?

Possibly, though I'm not convinced it's harder than any of the other ambitious agendas and we won't know how far we can go without giving it a serious effort. Is training an AI to be wise really harder than aligning it?

Compare:
• Ambitious mechanistic interpretability aims to perfectly understand how a neural network works at the level of individual weights
• Agent foundations attempting to truly understand what concepts like agency, optimisation, decisions are values are at a fundamental level
• Davidad's Open Agency architecture attempting train AI's that come with proof certificates that an AI has less than a certain probability of having unwanted side-effects

Is it obvious that any of these are easier?

In terms of making progress, my initial focus is on investigating the potential of amplified imitation learning, that is training imitation agents on wise people then enhancing them with techniques like RAG or trees of agents.

Does anyone else think wise AI advisors are important?

Going slightly more general to training wise AI rather than specifically advisors[2], there was the competition on the Automation of Wisdom and Philosophy organised by Owen Cotton-Barrett and there's this paper (summary) by Samuel Johnson and others incl. Yoshua Bengio, Melanie Mitchell and Igor Grossmann.

LintzA listed Wise AI advisors for governments as something worth considering in The Game Board Has Been Flipped[3].

Further Discussion:

You may also interested in reading my 3rd prize-winning entry to the AI Impacts Competition on the Automation of Wisdom and Philosophy. It's divided in two parts:

An Overview of “Obvious” Approaches to Training Wise AI Advisors

Some Preliminary Notes on the Promise of a Wisdom Explosion
 

  1. ^

    I previously described my agenda as Wise AI Advisors via Imitation Learning. I now see that as overly narrow. The goal is to produce Wise AI Advisors via any means and I think that Imitation Learning is underrated, but I'm sure there's lots of other approaches that are underrated as well.

  2. ^

    One key reason why I favour AI advisors rather than directly training wisdom into AI is that the human users can compensate for weaknesses in the advisors. For example, it only has to inspire the humans to make the correct choice rather than make the correct choice. We may take the harder step of training systems that don't have a human in the loop later, but this will be easier if we have AI advisors to help us with this.

  3. ^

    No argument included sadly.

Comment by Chris_Leong on Chris_Leong's Shortform · 2025-02-05T18:43:59.985Z · LW · GW

Thanks, seems pretty good on a quick skim, I'm a bit less certain on the corrigibility section, also more issues might become apparent if I read through it more slowly.

Comment by Chris_Leong on Chris_Leong's Shortform · 2025-02-04T23:43:15.534Z · LW · GW

How about "Please summarise Eliezer Yudkowsky's views on decision theory and its relevance to the alignment problem".

Comment by Chris_Leong on Chris_Leong's Shortform · 2025-02-04T01:21:53.363Z · LW · GW

Someone should see how good Deep Research is at generating reports on AI Safety content.

Comment by Chris_Leong on Planning for Extreme AI Risks · 2025-01-30T05:13:47.621Z · LW · GW

Nice article, I especially love the diagrams!

In Human Researcher Obsolescence you note that we can't completely hand over research unless we manage to produce agents that are at least as "wise" as the human developers.

I agree with this, though I would love to see a future version of this plan include an expanded analysis of the role that wise AI plays would play in the strategy of Magma, as I believe that this could be a key aspect of making this plan work.

In particular:

• We likely want to be developing wise AI advisors to advise us during the pre-hand-off period. In fact, I consider this likely to be vital to successfully navigating this period given the challenges involved.

• It's possible that we might manage to completely automate the more objective components of research without managing to completely automating the more subjective components of research. That said, we likely want to train wise AI advisors to help us with the more subjective components even if we can't defer to them.

• When developing AI capabilities, there's an additional lever in terms of how much Magma focuses on direct capabilities vs. focusing on wisdom.

Comment by Chris_Leong on Quinn's Shortform · 2025-01-24T06:32:40.269Z · LW · GW

Fellowships are typically only for a few month and even if you're in India, you'd likely have to move for the fellowship unless it happened to be in your exact city.

Comment by Chris_Leong on Quinn's Shortform · 2025-01-24T02:33:46.178Z · LW · GW

Impact Academy was doing this, before they pivoted towards the Global AI Safety Fellowship. It's unclear whether any further fellowships should be in India or a country that is particularly generous with its visas.

Comment by Chris_Leong on AI #90: The Wall · 2025-01-22T15:49:34.881Z · LW · GW

I posted this comment on Jan's blog post

Underelicitation assumes a "maximum elicitation" rather than a never-ending series of more and more layers of elicitation that could be discovered.

You've undoubtedly spent much more time thinking about this than I have, but I'm worried that attempts to maximise elicitation merely accelerate capabilities without actually substantially boosting safety.

Comment by Chris_Leong on Who is marketing AI alignment? · 2025-01-20T05:12:47.160Z · LW · GW

In terms of infrastructure, it would be really cool to have a website collecting the more legible alignment research (papers, releases from major labs or non-profits).

Comment by Chris_Leong on Charlie Steiner's Shortform · 2025-01-17T15:18:04.115Z · LW · GW

I think I saw someone arguing that their particular capability benchmark was good for evaluating the capability, but of limited use for training the capability because their task only covered a small fraction of that domain.

Comment by Chris_Leong on Everywhere I Look, I See Kat Woods · 2025-01-17T07:41:50.896Z · LW · GW

(Disclaimer: I previously interned at Non-Linear)

Different formats allow different levels of nuance. Memes aren't essays and they shouldn't try to be.

I personally think these memes are fine and that outreach is too. Maybe these posts oversimplify things a bit too much for you, but I expect that average person on these subs probably improves the level of their thinking from seeing these memes.

If, for example, you think r/EffectiveAltruism should ban memes, then I recommend talking to the mods.

Comment by Chris_Leong on My Model of Epistemology · 2025-01-12T08:39:51.408Z · LW · GW

Well done for managing to push something out there. It's a good start, I'm sure you'll fill in some of the details with other posts over time.

Comment by Chris_Leong on Is AI Alignment Enough? · 2025-01-11T11:30:47.611Z · LW · GW

What if the thing we really need the Aligned AI to engineer for us is... a better governance system?


I've been arguing for the importance of having wise AI advisors. Which isn't quite the same thing as a "better governance system", since they could advise us about all kinds of things, but feels like it's in the same direction.

Comment by Chris_Leong on Dialogue introduction to Singular Learning Theory · 2025-01-11T07:33:04.309Z · LW · GW

Excellent post. It helped clear up some aspects of SLT for me. Any chance you could clarify why this volume is called the "learning coefficient?"

Comment by Chris_Leong on How can humanity survive a multipolar AGI scenario? · 2025-01-10T13:07:31.251Z · LW · GW

I suspect that this is will be an incredibly difficult scenario to navigate and that our chances will be better if we train wise AI advisors.

I think our chances would be better still if we could pivot a significant fraction of the talent towards developing WisdomTech rather than IntelligenceTech.

On a more concrete level, I suspect the actual plan looks like some combination of alignment hacks, automated alignment research, control, def/acc, limited proliferation of AI, compute governance and the merging of actors. Applied wisely, the combination of all of these components may be enough. But figuring out the right mix isn't going to be easy.

Comment by Chris_Leong on Discursive Warfare and Faction Formation · 2025-01-10T13:02:58.155Z · LW · GW

Committed not to Eliezer's insights but to exaggerated versions of his blind spots

 

My guess would be that this is an attempt to apply a general critique of what tends to happen in community's in general to the LW community without accounting for its specifics.

Most people in the LW community would say that Eliezer is overconfident or even arrogant (sorry Eliezer!).

The incentive gradient for status hungry folk is not to double-down on Eliezer's views, but to double-down on your idiosyncratic version of rationalism, different enough from the community's to be interesting, but similar enough to be legible.

(Also, I strongly recommend the post this is replying to. I was already aware that discourse functioned in the way described, but it helped me crystallised some of the phenomena much more clearly).

Comment by Chris_Leong on Bryce Robertson's Shortform · 2025-01-09T03:56:16.218Z · LW · GW

Perhaps you'd be interested in adding a page on AI Safety & Entrepreneurship?

Comment by Chris_Leong on Alignment Is Not All You Need · 2025-01-04T05:16:25.034Z · LW · GW

It will be the goverment(s) who decides how AGI is used, not a benevolent coalition of utilitarian rationalists.

 

Even so, the government still needs to weigh up opposing concerns, maintain ownership of the AGI, set up the system in such a way that they have trust in it and gain some degree of buy-in from society for the plan[1].
 

  1. ^

    Unless their plan is to use the AGI to enforce their will

Comment by Chris_Leong on 3. Improve Cooperation: Better Technologies · 2025-01-03T19:18:25.097Z · LW · GW

I suspect that AI advisors could be one of the most important technologies here.


Two main reasons:
a) A large amount of disagreement is the result of different beliefs about how the world works, rather than a difference in values
b) Often folk want to be able to co-operate, but they can't figure out a way to make it work

Comment by Chris_Leong on Alignment Is Not All You Need · 2025-01-03T13:14:31.587Z · LW · GW

Whilst interesting, this analysis doesn't seem to quite hit the nail on the head for me.

Power Distribution: First-movers with advanced AI could gain permanent military, economic, and/or political dominance.

This framing both merges multiple issues and almost assumes a particular solution (that of power distribution).

Instead, I propose that this problem be broken into:

a) Distributive justice: Figuring out how to fairly resolve conflicting interests

b) Stewardship: Ensuring that no-one can seize control of any ASI's and that such power isn't transferred to a malicious or irresponsible actor

c) Trustworthiness: Designing the overall system (both human and technological components) in such a way that different parties have rational reasons to trust that conflicting interests will be resolved fairly and that proper stewardship will be maintained over the system

d) Buy-in: Gaining support from different actors for a particular system to be implemented. This may involve departing from any distributive ideal

Of course, broadly distributing power can be used to address any of these issues, but we shouldn't assume that it is necessarily the best solution.

Economics Transition: When AI generates all wealth, humans have no leverage to ensure they are treated well... It’s still unclear how we get to a world where humans have any economic power if all the jobs are automated by advanced AI.

This seems like a strange framing to me. Maybe I'm reading too much into your wording, but it seems to almost assume that the goal is to maintain a broad distribution of "economic" power through the AGI transition. Whilst this would be one way of ensuring the broad distribution of benefits, it hardly seems like the only, or even most promising route. Why should we assume that the world will have anything like a traditional economy after AGI?


Additionally, alignment can refer to either "intent alignment" or "alignment with human values"[1]. Your analysis seems to assume the former, I'd suggest flagging this explicitly if that's what you mean. Where this most directly matters is the extent to which we are telling these machines what to do vs. autonomously making their own decisions, which affects the importance of solving problems manually.

  1. ^

    Whatever that means

Comment by Chris_Leong on Shallow review of technical AI safety, 2024 · 2024-12-29T12:55:04.953Z · LW · GW

Just wanted to chime in with my current research direction.
 

Comment by Chris_Leong on When AI 10x's AI R&D, What Do We Do? · 2024-12-28T02:14:40.866Z · LW · GW

I don't exactly know the most important capabilities yet, but things like, advising on strategic decisions, improving co-ordination and non-manipulative communication seem important.

Comment by Chris_Leong on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T02:12:08.411Z · LW · GW

Thanks for clarifying. Still feels narrow as a primary focus.

Comment by Chris_Leong on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T17:17:54.023Z · LW · GW

Agreed. Simply focusing on physics post-docs feels too narrow to me.

Then again, just as John has a particular idea of what good alignment research looks like, I have my own idea: I would lean towards recruiting folk with both a technical and a philosophical background. It's possible that my own idea is just as narrow.

Comment by Chris_Leong on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T01:52:01.760Z · LW · GW

If you wanted to create such a community, you could try spinning up a Discord server?

Comment by Chris_Leong on What I expected from this site: A LessWrong review · 2024-12-22T16:52:52.646Z · LW · GW

Echoing others: turning Less Wrong into Manifold would be a mistake. Manifold already exists. However, maybe you should suggest to them that they add a forum independent of any particular market.

Comment by Chris_Leong on When AI 10x's AI R&D, What Do We Do? · 2024-12-22T14:20:12.078Z · LW · GW

I've said this elsewhere, but I think we need to also be working on training wise AI advisers in order to help us navigate these situations.

Comment by Chris_Leong on Kaj's shortform feed · 2024-12-22T14:18:36.656Z · LW · GW

Do you think there's any other updates you should make as well?

Comment by Chris_Leong on Announcement: AI for Math Fund · 2024-12-21T04:08:21.168Z · LW · GW

Well, does this improve automated ML research and kick off an intelligence explosion sooner?

Comment by Chris_Leong on Talent Needs of Technical AI Safety Teams · 2024-12-14T16:35:07.771Z · LW · GW

"Funders of independent researchers we’ve interviewed think that there are plenty of talented applicants, but would prefer more research proposals focused on relatively few existing promising research directions" - Would be curious to hear why this is. Is it that if there is too great a profusion of research directions that there won't be enough effort behind each individual one?

Comment by Chris_Leong on Communications in Hard Mode (My new job at MIRI) · 2024-12-14T13:38:19.759Z · LW · GW

I'd love to hear some more specific advice about how to communicate in these kinds of circumstances when it's much easier for folk not to listen.

Comment by Chris_Leong on Announcement: AI for Math Fund · 2024-12-06T13:12:16.509Z · LW · GW

Just going to put it out there, it's not actually clear that we actually should want to advance AI for maths.

Comment by Chris_Leong on Should there be just one western AGI project? · 2024-12-04T10:15:56.043Z · LW · GW

I maintain my position that you're missing the stakes if you think that's important. Even limiting ourselves strictly to concentration of power worries, risks of totalitarianism dominate these concerns.

Comment by Chris_Leong on Should there be just one western AGI project? · 2024-12-03T15:59:10.419Z · LW · GW

My take - lots of good analysis, but makes a few crucial mistakes/weaknesses that throw the conclusions into significant doubt:

The USG will be able and willing to either provide or mandate strong infosecurity for multiple projects.

I simply don't buy that the infosec for multiple such projects will be anywhere near the infosec of a single project because the overall security ends up being that of the weakest link.

Additionally, the more projects there are with a particular capability, the more folk there are who can leak information either by talking or by being spies.

The probability-weighted impacts of AI takeover or the proliferation of world-ending technologies might be high enough to dominate the probability-weighted impacts of power concentration.

Comment: We currently doubt this, but we haven’t modelled it out, and we have lower p(doom) from misalignment than many (<10%).

Seems entirely plausible to me that either one could dominate. Would love to see more analysis around this.

Reducing access to these services will significantly disempower the rest of the world: we’re not talking about whether people will have access to the best chatbots or not, but whether they’ll have access to extremely powerful future capabilities which enable them to shape and improve their lives on a scale that humans haven’t previously been able to.

If you're worried about this, I don't think you quite realise the stakes. Capabilities mostly proliferate anyway. People can wait a few more years.

Comment by Chris_Leong on Linkpost: Rat Traps by Sheon Han in Asterisk Mag · 2024-12-03T03:24:50.245Z · LW · GW

My take: Bits of this review come off as a bit too status-oriented to me. This is ironic, because the best part of the review is towards the end when it talks about the risk of rationality becoming a Fandom.

Comment by Chris_Leong on Chris_Leong's Shortform · 2024-11-29T15:32:42.915Z · LW · GW

Sharing this resource doc on AI Safety & Entrepreneurship that I created in case anyone finds this helpful:

https://docs.google.com/document/d/1m_5UUGf7do-H1yyl1uhcQ-O3EkWTwsHIxIQ1ooaxvEE/edit?usp=sharing 

Comment by Chris_Leong on New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters · 2024-11-28T12:36:54.634Z · LW · GW

If it works, maybe it isn't slop?

Comment by Chris_Leong on DanielFilan's Shortform Feed · 2024-11-14T13:48:58.083Z · LW · GW

I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory in order to direct it, theory needs empirics in order to remain grounded.

Comment by Chris_Leong on DanielFilan's Shortform Feed · 2024-11-14T13:38:39.917Z · LW · GW

Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:

Looking at behaviour is conceptually straightforward, and valuable, and being done

I agree with Apollo Research that evals isn't really a science yet. It mostly seems to be conducted according to vibes. Model internals could help with this, but things like building experience or auditing models using different schemes and comparing them could help make this more scientific.

Similarly, a lot of work with Model Organisms of Alignment requires a lot of careful thought to get right.

Comment by Chris_Leong on AI Craftsmanship · 2024-11-13T11:29:58.322Z · LW · GW

Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [male] vector, subtract the [female] vector, add the [king] vector, and get out something close to the [queen] vector? That was cool! Where's the stuff like that these days? 


Activation vectors are a thing. So it's totally happening.

Comment by Chris_Leong on Thomas Kwa's Shortform · 2024-11-13T06:11:11.571Z · LW · GW

"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.

Comment by Chris_Leong on Summary: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. · 2024-11-12T11:42:11.183Z · LW · GW

I guess I was thinking about this in terms of getting maximal value out of wise AI advisers. The notion that comparisons might be unfair didn't even enter my mind, even though that isn't too many reasoning steps away from where I was.