Posts

Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. 2024-11-11T16:13:26.504Z
Some Preliminary Notes on the Promise of a Wisdom Explosion 2024-10-31T09:21:11.623Z
Linkpost: Hypocrisy standoff 2024-09-29T14:27:19.175Z
On the destruction of America’s best high school 2024-09-12T15:30:20.001Z
The Bar for Contributing to AI Safety is Lower than You Think 2024-08-16T15:20:19.055Z
Michael Streamlines on Buddhism 2024-08-09T04:44:52.126Z
Have people given up on iterated distillation and amplification? 2024-07-19T12:23:04.625Z
Politics is the mind-killer, but maybe we should talk about it anyway 2024-06-03T06:37:57.037Z
Does reducing the amount of RL for a given capability level make AI safer? 2024-05-05T17:04:01.799Z
Link: Let's Think Dot by Dot: Hidden Computation in Transformer Language Models by Jacob Pfau, William Merrill & Samuel R. Bowman 2024-04-27T13:22:53.287Z
"You're the most beautiful girl in the world" and Wittgensteinian Language Games 2024-04-20T14:54:54.503Z
The argument for near-term human disempowerment through AI 2024-04-16T04:50:53.828Z
Reverse Regulatory Capture 2024-04-11T02:40:46.474Z
On the Confusion between Inner and Outer Misalignment 2024-03-25T11:59:34.553Z
The Best Essay (Paul Graham) 2024-03-11T19:25:42.176Z
Can we get an AI to "do our alignment homework for us"? 2024-02-26T07:56:22.320Z
What's the theory of impact for activation vectors? 2024-02-11T07:34:48.536Z
Notice When People Are Directionally Correct 2024-01-14T14:12:37.090Z
Are Metaculus AI Timelines Inconsistent? 2024-01-02T06:47:18.114Z
Random Musings on Theory of Impact for Activation Vectors 2023-12-07T13:07:08.215Z
Goodhart's Law Example: Training Verifiers to Solve Math Word Problems 2023-11-25T00:53:26.841Z
Upcoming Feedback Opportunity on Dual-Use Foundation Models 2023-11-02T04:28:11.586Z
On Having No Clue 2023-11-01T01:36:10.520Z
Is Yann LeCun strawmanning AI x-risks? 2023-10-19T11:35:08.167Z
Don't Dismiss Simple Alignment Approaches 2023-10-07T00:35:26.789Z
What evidence is there of LLMs containing world models? 2023-10-04T14:33:19.178Z
The Role of Groups in the Progression of Human Understanding 2023-09-27T15:09:45.445Z
The Flow-Through Fallacy 2023-09-13T04:28:28.390Z
Chariots of Philosophical Fire 2023-08-26T00:52:45.405Z
Call for Papers on Global AI Governance from the UN 2023-08-20T08:56:58.745Z
Yann LeCun on AGI and AI Safety 2023-08-06T21:56:52.644Z
A Naive Proposal for Constructing Interpretable AI 2023-08-05T10:32:05.446Z
What does the launch of x.ai mean for AI Safety? 2023-07-12T19:42:47.060Z
The Unexpected Clanging 2023-05-18T14:47:01.599Z
Possible AI “Fire Alarms” 2023-05-17T21:56:02.892Z
Google "We Have No Moat, And Neither Does OpenAI" 2023-05-04T18:23:09.121Z
Why do we care about agency for alignment? 2023-04-23T18:10:23.894Z
Metaculus Predicts Weak AGI in 2 Years and AGI in 10 2023-03-24T19:43:18.522Z
Wittgenstein's Language Games and the Critique of the Natural Abstraction Hypothesis 2023-03-16T07:56:18.169Z
The Law of Identity 2023-02-06T02:59:16.397Z
What is the risk of asking a counterfactual oracle a question that already had its answer erased? 2023-02-03T03:13:10.508Z
Two Issues with Playing Chicken with the Universe 2022-12-31T06:47:52.988Z
Decisions: Ontologically Shifting to Determinism 2022-12-21T12:41:30.884Z
Is Paul Christiano still as optimistic about Approval-Directed Agents as he was in 2018? 2022-12-14T23:28:06.941Z
How is the "sharp left turn defined"? 2022-12-09T00:04:33.662Z
What are the major underlying divisions in AI safety? 2022-12-06T03:28:02.694Z
AI Safety Microgrant Round 2022-11-14T04:25:17.510Z
Counterfactuals are Confusing because of an Ontological Shift 2022-08-05T19:03:46.925Z
Getting Unstuck on Counterfactuals 2022-07-20T05:31:15.045Z
Which AI Safety research agendas are the most promising? 2022-07-13T07:54:30.427Z

Comments

Comment by Chris_Leong on DanielFilan's Shortform Feed · 2024-11-14T13:48:58.083Z · LW · GW

I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory to direct it, and theory needs empirics to remain grounded.

Comment by Chris_Leong on DanielFilan's Shortform Feed · 2024-11-14T13:38:39.917Z · LW · GW

Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:

Looking at behaviour is conceptually straightforward, and valuable, and being done

I agree with Apollo Research that evals aren't really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but so could things like building up experience, or auditing models using different schemes and comparing the results.

Similarly, a lot of the work on Model Organisms of Misalignment requires careful thought to get right.

Comment by Chris_Leong on AI Craftsmanship · 2024-11-13T11:29:58.322Z · LW · GW

Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [male] vector, subtract the [female] vector, add the [king] vector, and get out something close to the [queen] vector? That was cool! Where's the stuff like that these days? 


Activation vectors are a thing. So it's totally happening.
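
For anyone who hasn't seen the original trick, the classic form of the arithmetic is king − male + female ≈ queen. Here's a minimal numpy sketch with toy stand-in vectors rather than real word2vec embeddings:

```python
import numpy as np

# Toy 4-dimensional "embeddings" -- illustrative stand-ins, not real word2vec vectors.
vectors = {
    "king":   np.array([0.9, 0.8, 0.1, 0.0]),
    "queen":  np.array([0.9, 0.1, 0.8, 0.0]),
    "male":   np.array([0.1, 0.9, 0.0, 0.1]),
    "female": np.array([0.1, 0.0, 0.9, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - male + female lands closer to queen than to king with these toy vectors.
analogy = vectors["king"] - vectors["male"] + vectors["female"]
for word in ("queen", "king"):
    print(word, round(cosine(analogy, vectors[word]), 3))
```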

Comment by Chris_Leong on Thomas Kwa's Shortform · 2024-11-13T06:11:11.571Z · LW · GW

"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.

Comment by Chris_Leong on Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. · 2024-11-12T11:42:11.183Z · LW · GW

I guess I was thinking about this in terms of getting maximal value out of wise AI advisers. The notion that comparisons might be unfair didn't even enter my mind, even though that isn't too many reasoning steps away from where I was.

Comment by Chris_Leong on Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. · 2024-11-12T04:56:57.348Z · LW · GW

That's a fascinating perspective.

Comment by Chris_Leong on What TMS is like · 2024-10-31T05:38:27.902Z · LW · GW

Fascinating. Sounds related to the Yoga concept of kriyas.

Comment by Chris_Leong on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-31T00:41:28.312Z · LW · GW

I would suggest adopting a different method of interpretation, one more grounded in what was actually said. Anyway, I think it's probably best that we leave this thread here.

Comment by Chris_Leong on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-29T22:51:16.328Z · LW · GW

Sadly, "cause-neutral" was an even more confusing term, so this is better by comparison. I also think that the two notions of principles-first are less disconnected than you suggest, though through somewhat indirect effects.

Comment by Chris_Leong on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-29T16:45:36.273Z · LW · GW

We're mostly working on stuff to stay afloat rather than high level navigation.

 

Why do you think that this is the case?

Comment by Chris_Leong on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-29T16:38:24.407Z · LW · GW

I recommend rereading his post. I believe his use of the term makes sense.

Comment by Chris_Leong on The Alignment Trap: AI Safety as Path to Power · 2024-10-29T16:30:35.598Z · LW · GW

I don't think I agree with this post, but I thought it provided a fascinating alternative perspective.

Comment by Chris_Leong on Winners of the Essay competition on the Automation of Wisdom and Philosophy · 2024-10-28T19:41:32.926Z · LW · GW

Just wanted to mention that if anyone liked my submissions (3rd prize: An Overview of “Obvious” Approaches to Training Wise AI Advisors and Some Preliminary Notes on the Promise of a Wisdom Explosion), I'll be running a project related to this work as part of AI Safety Camp. Join me if you want to help innovate a new paradigm in AI safety.

Comment by Chris_Leong on avturchin's Shortform · 2024-10-27T15:36:03.108Z · LW · GW

What's ABBYY?

Comment by Chris_Leong on Brief analysis of OP Technical AI Safety Funding · 2024-10-26T16:42:49.095Z · LW · GW

That’s useful analysis. Focusing so heavily on evals seems like a mistake given that the AI Safety Institutes are already focused on evals.

Comment by Chris_Leong on Linkpost: Memorandum on Advancing the United States’ Leadership in Artificial Intelligence · 2024-10-25T11:01:33.852Z · LW · GW

I guess Leopold was right[1].  AI arms race it is.

  1. ^

    I suppose it is possible that it was a self-fulfilling prophecy, but I'm skeptical given how fast it's happened.

Comment by Chris_Leong on Big tech transitions are slow (with implications for AI) · 2024-10-24T23:54:22.987Z · LW · GW

The problem is that accepting this argument involves ignoring how AI keeps blitzing past one supposed barrier after another. At some point, a rational observer needs to be willing to accept that their maximum-likelihood model is wrong and to consider other possible ways the world could be instead.

Comment by Chris_Leong on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-24T16:07:34.315Z · LW · GW

I thought that this post on strategy and this talk were well done. Obviously, I'll have to see how this translates into practice.

Comment by Chris_Leong on Introducing Transluce — A Letter from the Founders · 2024-10-24T06:55:23.265Z · LW · GW

One thing I would love to know is how it'll work on Claude 3.5 Sonnet or GPT-4o, given that these models aren't open-weights. Do you have access to some reduced level of capability for these?

Comment by Chris_Leong on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-23T16:06:12.526Z · LW · GW

That was an interesting conversation.

I do have some worries about the EA community.

At the same time, I'm excited to see that Zach Robinson has taken the reins at CEA, and I'm looking forward to seeing how things develop under his leadership. The early signs have been promising.

Comment by Chris_Leong on Chris_Leong's Shortform · 2024-10-20T14:49:53.533Z · LW · GW

There is a world that needs to be saved. Saving the world is a team sport.  All we can do is to contribute our part of the puzzle, whatever that may be and no matter how small, and trust in our companions to handle the rest. There is honor in that, no matter how things turn out in the end.

Comment by Chris_Leong on My motivation and theory of change for working in AI healthtech · 2024-10-12T12:58:34.406Z · LW · GW

I'd strongly bet that when you break this down in more concrete detail, a flaw in your plan will emerge. 

The balance of industries serving humans vs. AIs is framed at a suspiciously high level of abstraction.

Comment by Chris_Leong on wassname's Shortform · 2024-10-11T06:26:33.837Z · LW · GW

It’s an interesting thought.

I can see regularisation playing something of a role here, but it’s hard to say.

I would love to see a project here with philosophers and technical folk collaborating to make progress on this question.

Comment by Chris_Leong on sarahconstantin's Shortform · 2024-10-11T03:26:21.826Z · LW · GW

I honestly feel that the only appropriate response is something along the lines of "fuck defeatism"[1].

This comment isn't targeted at you, but at a particular attractor in thought space.

Let me try to explain why I think rejecting this attractor is the right response rather than engaging with it.

I think it's mostly that I don't think that talking about things at this level of abstraction is useful. It feels much more productive to talk about specific plans. And if you have a general, high-abstraction argument that plans in general are useless, but I have a specific argument why a specific plan is useful, I know which one I'd go with :-).

Don't get me wrong, I think that if someone struggles for a certain amount of time to try to make a difference and just hits wall after wall, then at some point they have to call it. But that's completely different from "never start" or "don't even try".

It's also worth noting that saving the world is a team sport. It's okay to pursue a plan that depends on a bunch of other folk stepping up and playing their part.

  1. ^

    I would also suggest that this is the best way to respond to depression rather than "trying to argue your way out of it".

Comment by Chris_Leong on TurnTrout's shortform feed · 2024-10-10T04:09:38.159Z · LW · GW

Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time!

AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other

This confuses me.

I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. And I guess this isn't two full motivational circuits, but it kind of is two motivational circuits.
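
To spell out what I mean (my own notation, not from the shortform, and leaving out AIXI's Solomonoff mixture for simplicity): a single expectimax-style optimiser acting on a reward that decomposes into two components,

$$a_t = \arg\max_{a}\; \mathbb{E}\Big[\textstyle\sum_{k \ge t} \big(r^{(1)}_k + r^{(2)}_k\big) \,\Big|\, a\Big],$$

where $r^{(1)}$ and $r^{(2)}$ play something like the role of two motives, yet there is only one optimisation process acting on their sum, so they can't be activated independently of each other.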

Comment by Chris_Leong on Overview of strong human intelligence amplification methods · 2024-10-09T04:13:44.887Z · LW · GW

Not really.

Comment by Chris_Leong on Overview of strong human intelligence amplification methods · 2024-10-08T23:28:16.608Z · LW · GW

I think you're underestimating meditation.

Since I've started meditating, I've realised that I've become much more sensitive to vibes.

There's a lot of folk who would be scarily capable if they were strong in system 1, in addition to being strong in system 2.

Then there's all the other benefits that meditation can provide if done properly: additional motivation, being better able to break out of narratives/notice patterns.

Then again, this is dependent on there being viable social interventions, rather than just aiming for 6 or 7 standard deviations of increase in intelligence.

Comment by Chris_Leong on Compelling Villains and Coherent Values · 2024-10-07T09:32:11.513Z · LW · GW

A Bayesian cultivates lightness, but a warrior monk has weight. Can these two opposing and perhaps contradictory natures be united to create some kind of unstoppable Kwisatz Haderach?

 

There are different ways of being that are appropriate to different times and/or circumstances. There are times for doubt and times for action.

Comment by Chris_Leong on Mark Xu's Shortform · 2024-10-06T03:49:07.843Z · LW · GW

I would suggest having 50% of researchers work on a broader definition of control, including "control" proper, technical governance work, and technical outreach (scary demos, model organisms of misalignment).

Comment by Chris_Leong on Shapley Value Attribution in Chain of Thought · 2024-09-17T10:45:04.138Z · LW · GW

I’m confused by your use of Shapley values. Shapley values assume that the “coalition” can form in any order, but that doesn’t seem like a good fit for language models, where order is important.
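
For reference, the definition I have in mind is the standard one,

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\big(v(S \cup \{i\}) - v(S)\big),$$

which averages player $i$'s marginal contribution over every possible order in which the coalition could assemble. If the "players" are steps in a chain of thought (my reading of the setup, which may not match the post's exactly), then most of those orderings are ones the model could never actually produce.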

Comment by Chris_Leong on On the destruction of America’s best high school · 2024-09-16T03:57:10.038Z · LW · GW

I don't think these articles should make up a high proportion of the content on Less Wrong, but I think it's good if things like this are occasionally discussed.

Comment by Chris_Leong on How difficult is AI Alignment? · 2024-09-15T09:05:46.289Z · LW · GW

Great article.

One point of disagreement: I suspect that the difficulty of the required high-impact tasks relates more to what someone thinks about the offense-defense balance than to the alignment difficulty per se.

Comment by Chris_Leong on The “mind-body vicious cycle” model of RSI & back pain · 2024-09-13T04:34:32.300Z · LW · GW

Just to add to this:

Beliefs can be self-reinforcing in predictive processing theory because higher-level beliefs can shape lower-level observations. So the hypersensitisation that Delton has noted can reinforce itself.

Comment by Chris_Leong on TurnTrout's shortform feed · 2024-09-13T04:27:54.296Z · LW · GW

Steven Byrnes provides an explanation here, but I think he's neglecting the potential for belief systems/systems of interpretation to be self-reinforcing.

Predictive processing claims that our expectations influence what we observe, so experiencing pain in a scenario can result in the opposite of a placebo effect, where the pain sensitizes us. Some degree of sensitization is evolutionarily advantageous: if you've hurt a part of your body, then being more sensitive makes you more likely to detect when you're putting too much strain on it. However, it can also make you experience pain as a result of minor sensations that aren't actually indicative of anything wrong. In the worst case, this pain ends up being self-reinforcing.

https://www.lesswrong.com/posts/BgBJqPv5ogsX4fLka/the-mind-body-vicious-cycle-model-of-rsi-and-back-pain
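
A toy illustration of how that self-reinforcement can play out (entirely my own simplification, not something from the linked post): let perceived pain scale with both the raw signal and the current expectation, and let the expectation track whatever was perceived.

```python
# Toy model (illustrative only): expecting pain increases sensitivity, higher
# sensitivity increases perceived pain, and perceived pain feeds back into the
# expectation. With a high enough gain the loop escalates instead of settling.

def run(gain, steps=8, signal=0.3, expectation=0.1, lr=0.5):
    history = []
    for _ in range(steps):
        perceived = signal * (1 + gain * expectation)   # sensitisation scales with expectation
        expectation += lr * (perceived - expectation)   # expectation tracks what was perceived
        history.append(round(perceived, 2))
    return history

print("low gain: ", run(gain=1.0))   # settles at a stable level
print("high gain:", run(gain=4.0))   # keeps escalating: the self-reinforcing case
```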

Comment by Chris_Leong on AI Constitutions are a tool to reduce societal scale risk · 2024-09-11T15:41:18.400Z · LW · GW

Interesting work.

This post has made me realise that constitutional design is surprisingly neglected in the AI safety community.

Designing the right constitution won't save the world by itself, but it's a potentially easy win that could put us in a better strategic situation down the line.

Comment by Chris_Leong on Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities · 2024-09-11T15:39:10.669Z · LW · GW

I guess I'm worried that allowing insurance for disasters above a certain size could go pretty badly if it increases the chance of labs being reckless.

Comment by Chris_Leong on t14n's Shortform · 2024-09-04T02:42:08.105Z · LW · GW

Thank you for your service!

For what it's worth, I feel that the bar for being a valuable member of the AI Safety community is much more attainable than the bar for working in AI Safety full-time.

Comment by Chris_Leong on sudo's Shortform · 2024-09-02T22:14:15.709Z · LW · GW

If the strong AI has knowledge of the benchmarks (or can make correct guesses about how they were structured), then it might be able to find heuristics that work well on them but not more generally. Some of these heuristics might seem more likely than not to humans.

Still seems like a useful technique if the more powerful model isn't much more powerful.

Comment by Chris_Leong on Principles for the AGI Race · 2024-08-30T17:28:13.731Z · LW · GW

I really like the way that you've approached this pragmatically, "If you do X, which may be risky or dubious, at least do Y".

I suspect that there's a lot of alpha in taking a similar approach to other issues.

Comment by Chris_Leong on the gears to ascenscion's Shortform · 2024-08-30T11:24:48.125Z · LW · GW

The second paper looks interesting.

(Having read through it, it's actually really, really good).

Comment by Chris_Leong on the gears to ascenscion's Shortform · 2024-08-30T11:20:28.915Z · LW · GW

My take is that you can't define term X until you know why you're trying to define term X.

For example, if someone asks what "language" is, instead of trying to jump in with an answer, it's better to step back and ask why the person is asking the question.

For example, if someone asks "How many languages do you know?", they probably aren't asking about simple schemes like "one click = yes, two clicks = no". On the other hand, it may make sense to talk about such simple schemes in an introductory course on "human languages".

Asking "Well what really is language?" independent of any context is naive.

Comment by Chris_Leong on Ruby's Quick Takes · 2024-08-30T11:09:07.112Z · LW · GW

I'd like access.

TBH, if it works great, I won't provide any significant feedback apart from "all good".

But if it annoys me in any way I'll let you know.

For what it's worth, I have provided quite a bit of feedback about the website in the past.

I want to see if it helps me with my draft document on proposed alignment solutions:

https://docs.google.com/document/d/1Mis0ZxuS-YIgwy4clC7hKrKEcm6Pn0yn709YUNVcpx8/edit#heading=h.u9eroo3v6v28

Comment by Chris_Leong on the gears to ascenscion's Shortform · 2024-08-29T05:10:58.289Z · LW · GW

I think the benefits are adequately described in the post.

Comment by Chris_Leong on Wei Dai's Shortform · 2024-08-29T05:09:54.943Z · LW · GW

But I don't know if any of us have explicitly called for an AI pause, in part because it seems useless, but may have opportunity cost.


The FLI Pause letter didn't achieve a pause, but it dramatically shifted the Overton Window.

Comment by Chris_Leong on the gears to ascenscion's Shortform · 2024-08-28T11:31:02.661Z · LW · GW

Helps people avoid going down pointless rabbit holes.

Comment by Chris_Leong on the gears to ascenscion's Shortform · 2024-08-28T08:00:21.400Z · LW · GW

I highly recommend this post. Seems like a more sensible approach to philosophy than conceptual analysis:

https://www.lesswrong.com/posts/9iA87EfNKnREgdTJN/a-revolution-in-philosophy-the-rise-of-conceptual

Comment by Chris_Leong on One person's worth of mental energy for AI doom aversion jobs. What should I do? · 2024-08-26T16:25:04.452Z · LW · GW

If you can land a job in government, it becomes much easier to land other jobs in government.

Comment by Chris_Leong on One person's worth of mental energy for AI doom aversion jobs. What should I do? · 2024-08-26T16:12:21.114Z · LW · GW

Pliny's approach?

Comment by Chris_Leong on Please do not use AI to write for you · 2024-08-22T07:45:45.050Z · LW · GW

I’ll admit to having recently written a post myself, asked an AI to improve my writing, made a few edits of my own, posted it, and then come back later and thought “omg, how did I let some of these AI edits through?”. Hopefully the post in question is better now.