Posts

What convincing warning shot could help prevent extinction from AI? 2024-04-13T18:09:29.096Z
My intellectual journey to (dis)solve the hard problem of consciousness 2024-04-06T09:32:41.612Z
AI Safety 101 : Capabilities - Human Level AI, What? How? and When? 2024-03-07T17:29:53.260Z
The case for training frontier AIs on Sumerian-only corpus 2024-01-15T16:40:22.011Z
aisafety.info, the Table of Content 2023-12-31T13:57:15.916Z
Results from the Turing Seminar hackathon 2023-12-07T14:50:38.377Z
AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training 2023-10-31T14:34:59.395Z
AI Safety 101 - Chapter 5.1 - Debate 2023-10-31T14:29:59.556Z
Charbel-Raphaël and Lucius discuss Interpretability 2023-10-30T05:50:34.589Z
Against Almost Every Theory of Impact of Interpretability 2023-08-17T18:44:41.099Z
AI Safety 101 : Introduction to Vision Interpretability 2023-07-28T17:32:11.545Z
AIS 101: Task decomposition for scalable oversight 2023-07-25T13:34:58.507Z
An Overview of AI risks - the Flyer 2023-07-17T12:03:20.728Z
Introducing EffiSciences’ AI Safety Unit  2023-06-30T07:44:56.948Z
Improvement on MIRI's Corrigibility 2023-06-09T16:10:46.903Z
Thriving in the Weird Times: Preparing for the 100X Economy 2023-05-08T13:44:40.341Z
Davidad's Bold Plan for Alignment: An In-Depth Explanation 2023-04-19T16:09:01.455Z
New Hackathon: Robustness to distribution changes and ambiguity 2023-01-31T12:50:05.114Z
Compendium of problems with RLHF 2023-01-29T11:40:53.147Z
Don't you think RLHF solves outer alignment? 2022-11-04T00:36:36.527Z
Easy fixing Voting 2022-10-02T17:03:20.566Z
Open application to become an AI safety project mentor 2022-09-29T11:27:39.056Z
Help me find a good Hackathon subject 2022-09-04T08:40:30.115Z
How to impress students with recent advances in ML? 2022-07-14T00:03:04.883Z
Is it desirable for the first AGI to be conscious? 2022-05-01T21:29:55.103Z

Comments

Comment by Charbel-Raphaël (charbel-raphael-segerie) on A Dilemma in AI Suffering/Happiness · 2024-04-24T08:53:41.970Z · LW · GW

You might be interested in reading this. I think you are reasoning in an incorrect framing. 

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Effectively Handling Disagreements - Introducing a New Workshop · 2024-04-15T17:23:34.614Z · LW · GW

I have tried Camille's in-person workshop in the past and was very happy with it. I highly recommend it. It helped me discover many unknown unknowns.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on What convincing warning shot could help prevent extinction from AI? · 2024-04-15T09:41:52.748Z · LW · GW

Deleted paragraph from the post that might answer the question:

Surprisingly, the same study found that even if there were an escalation of warning shots that ended up killing 100k people or causing >$10 billion in damage (definition), skeptics would only update their estimate from 0.10% to 0.25% [1]. There is a lot of inertia; we are not even sure this kind of “strong” warning shot would happen, and I suspect such a big warning shot could occur beyond the point of no return, because this type of warning shot requires autonomous replication and adaptation abilities in the wild.

  1. ^

    It may be because they expect a strong public reaction. But even if there were a 10-year global pause, what would happen after the pause? This explanation does not convince me. Did governments prepare for the next Covid?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-08T21:00:31.876Z · LW · GW

in your case, you felt the problem, until you decided that an AI civilization might spontaneously develop a spurious concept of phenomenal consciousness. 


This is currently the best summary of the post.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-08T11:32:13.795Z · LW · GW

Thanks for jumping in! And I'm not emotionally struggling with this that much; it was more of a nice puzzle, so don't worry about it :)

I agree my reasoning is not clean in the last chapter.

To me, the epiphany was that AI would rediscover everything, just as it rediscovered chess on its own. As I've said in the box, this is a strong blow to non-materialistic positions, and I've not emphasized this enough in the post.

I expect AI to be able to create "civilizations" (sort of) of its own in the future, with AI philosophers, etc.

Here is a snippet of my answer to Kaj, let me know what you think about it:

I'm quite confident that the meta-problem and the easy problems of consciousness will eventually be fully solved through advancements in AI and neuroscience. I've written extensively about AI and the path to autonomous AGI here, and I would ask people: "Yo, what do you think AI is not able to do? Creativity? Ok, do you know....". At the end of the day, I would aim to convince them that anything humans are able to do, we can reconstruct with AIs. I'd put my confidence level for this at around 95%. Once we reach that point, I agree that it will become increasingly difficult to argue that the hard problem of consciousness is still unresolved, even if part of my intuition remains somewhat perplexed. Maintaining a belief in epiphenomenalism while all the "easy" problems have been solved is a tough position to defend - I'm about 90% confident of this.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-08T11:24:23.172Z · LW · GW

Thank you for clarifying your perspective. I understand you're saying that you expect the experiment to resolve to "yes" 70% of the time, making you 70% eliminativist and 30% uncertain. You can't fully update your beliefs based on the hypothetical outcome of the experiment because there are still unknowns.

For myself, I'm quite confident that the meta-problem and the easy problems of consciousness will eventually be fully solved through advancements in AI and neuroscience. I've written extensively about AI and the path to autonomous AGI here, and I would ask people: "Yo, what do you think AI is not able to do? Creativity? Ok, do you know....". At the end of the day, I would aim to convince them that anything humans are able to do, we can reconstruct with AIs. I'd put my confidence level for this at around 95%. Once we reach that point, I agree that it will become increasingly difficult to argue that the hard problem of consciousness is still unresolved, even if part of my intuition remains somewhat perplexed. Maintaining a belief in epiphenomenalism while all the "easy" problems have been solved is a tough position to defend - I'm about 90% confident of this.

So while I'm not a 100% committed eliminativist, I'm at around 90% (whereas I was at 40% in chapter 6 of the story). Yes, even after considering the ghost argument, there's still a small part of my thinking that leans towards Chalmers' view. However, the more progress we make in solving the easy and meta-problems through AI and neuroscience, the more untenable it seems to insist that the hard problem remains unaddressed.

a non-eliminativist might be perfectly willing to grant that yes, we can build the entire pyramid, while also holding that merely building the pyramid won't tell us anything about the hard problem nor the meta-problem.

I actually think a non-eliminativist would concede that building the whole pyramid does solve the meta-problem. That's the crucial aspect. If we can construct the entire pyramid, with the final piece being the ability to independently rediscover the hard problem in an experimental setup like the one I described in the post, then I believe even committed non-materialists would be at a loss and would need to substantially update their views.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T16:43:46.528Z · LW · GW

hmm, I don't understand something, but we are closer to the crux :)

 

You say:

  1. To the question, "Would you update if this experiment is conducted and is successful?" you answer, "Well, it's already my default assumption that something like this would happen". 
  2. To the question, "Is it possible at all?" you answer 70%. 

So you answer 99-ish% to the first question but 70% to the second; this seems incoherent.

It seems to me that you don't bite the bullet for the first question if you expect this to happen. Saying, "Looks like I was right," seems to me like you are dodging the question.

That sounds like it would violate conservation of expected evidence:

Hmm, it seems there is something I don't understand; I don't think this violates the law.

 

I don't see how it does? It just suggests that a possible approach by which the meta-problem could be solved in the future.

I agree I only gave a sketch of the proof; it seems to me that if you can build the pyramid, brick by brick, then this solves the meta-problem.

For example, when I give the example of the meta-cognition brick, I say that there is a paper that already implements this in an LLM (and I don't find this mysterious, because I know how I would approximately implement a database that would behave like this).

And it seems all the other bricks are "easily" implementable.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T10:39:08.712Z · LW · GW

Let's put aside ethics for a minute.

"But it wouldn't be necessary the same as in a human brain."

Yes, this wouldn't be the same as the human brain; it would be like the Swiss cheese pyramid that I described in the post.

Your story ended on stating the meta problem, so until it's actually solved, you can't explain everything.

Take a look at my answer to Kaj Sotala and tell me what you think.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T10:29:53.727Z · LW · GW

Thank you for the kind words!

Saying that we'll figure out an answer in the future when we have better data isn't actually giving an answer now.

Okay, fair enough, but I predict this would happen: in the same way that AlphaZero rediscovered all of chess theory, it seems to me that if you just let the AIs grow, you can create a civilization of AIs. Those AIs would have to create some form of language or communication, and some AI philosopher would get involved and then talk about the hard problem.

I'm curious how you answer those two questions:

  1. Let's say we implement this simulation in 10 years and everything works the way I'm telling you now. Would you update?
  2. What is the probability that this simulation is possible at all? 

If you expect to update in the future, just update now.  
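
A minimal numeric sketch of what "just update now" means here, via conservation of expected evidence (the 70% and the 99-ish% figures come from this thread; the credence in the failure case is an illustrative assumption):

```python
# Conservation of expected evidence: today's credence must equal the expected
# value of tomorrow's credence, averaged over the possible experimental outcomes.
p_success = 0.70          # probability the simulation is possible and works (from the thread)
p_elim_if_success = 0.99  # credence in eliminativism if the AIs rediscover the hard problem
p_elim_if_failure = 0.30  # illustrative assumption for the failure case

# The credence one should already hold now, before running the experiment:
p_elim_now = p_success * p_elim_if_success + (1 - p_success) * p_elim_if_failure
print(round(p_elim_now, 3))  # 0.783
```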

To me, this thought experiment solves the meta-problem and so dissolves the hard problem.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T06:56:01.019Z · LW · GW

But I have no way to know or predict if it is like something to be a fish or GPT-4

But I can predict what you say; I can predict whether you are confused by the hard problem just by looking at your neural activations; I can predict, word by word, the sentence you are about to utter: "The hard problem is really hard."

I would be curious to know what you think about the box solving the meta-problem just before the addendum. Do you think it is unlikely that AI would rediscover the hard problem in this setting?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T06:48:46.715Z · LW · GW

I would be curious to know what you think about the box solving the meta-problem just before the addendum.

Do you think it is unlikely that AI would rediscover the hard problem in this setting?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T06:44:49.187Z · LW · GW

I'm not saying that LeCun's rosy views on AI safety stem solely from his philosophy of mind, but yes, I suspect there is something there.

It seems to me that when he says things like "LLMs don't display true understanding" or "true reasoning", as if there were some secret sauce that he thinks can only appear in his Jepa architecture or whatever, this is very similar to the linguistic problems I've observed for consciousness.

Surely, if you discuss this with him, he will say things like "No, this is not just a linguistic debate, LLMs cannot reason at all, my cat reasons better": to me, this surely indicates a linguistic debate.

It seems to me that LeCun is basically an essentialist about his Jepa architecture, treating it as the main criterion for a neural network to exhibit "reasoning".

LeCun's algorithm is something like: "Jepa + Not LLM -> Reasoning".

My algorithm is more something like: "chain-of-thought + can solve complex problems + many other things -> reasoning".

This is very similar to the story I tell for consciousness in the Car Circuit section here.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T06:24:58.044Z · LW · GW

Sure, "everything is a cluster" or "everything is a list" is as right as "everything is emergent". But what's the actual justification for pruning that neuron? You can prune everything like that.

The justification for pruning this neuron seems to me to be that if you can explain basically everything without using a dualistic view, it is so much simpler. The two hypotheses are possible, but you want to go with the simpler hypothesis, and a world with only (physical properties) is simpler than a world with (physical properties + mental properties).

I would be curious to know what you think about my box trying to solve the meta-problem.

Do you mean that the original argument that uses zombies leads only to epiphenomenalism, or that if zombies were real that would mean consciousness is epiphenomenal, or what?

Both

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-06T21:47:07.795Z · LW · GW

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-06T18:02:52.461Z · LW · GW

I don't know; it depends on your definition of "unsolved" and "solved", but I would lean towards "there is a solved hard problem": the problem was hard, it took me a lot of time (i.e. the meme of the hard problem existed in my head), and my post finally dissolved the question.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Three Worlds Decide (5/8) · 2024-03-09T13:46:01.224Z · LW · GW

When there are difficult decisions to be made, I like to come back to this story.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Managing catastrophic misuse without robust AIs · 2024-02-10T21:57:23.752Z · LW · GW

TLDR of the post: The solution against misuse is for labs to be careful and keep the model behind an API. Probably not that complicated.

My analysis: Safety culture is probably the only bottleneck.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My guess at Conjecture's vision: triggering a narrative bifurcation · 2024-02-06T19:39:22.353Z · LW · GW

Thank you for writing this, Alexandre. I am very happy that this is now public, and some paragraphs in part 2 are really nice gems.

I think parts 1 and 2 are a must-read for anyone who wants to work on alignment, and articulate dynamics that I think are extremely important.

Parts 3-4-5, which focus more on Conjecture, are more optional in my opinion, and could have been another post, but are still interesting. This has changed my opinion of Conjecture and I see much more coherence in their agenda. My previous understanding of Conjecture's plan was mostly focused on their technical agenda, CoEm, as presented in a section here. However,  I was missing the big picture. This is much better.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The case for ensuring that powerful AIs are controlled · 2024-01-25T23:42:26.765Z · LW · GW

I think this is very important, and this goes directly into the textbook from the future.

It seems to me that if we are careful enough and implement a lot of the things outlined in this agenda, we probably won't die. But we need to do it.

That is why safety culture is now the bottleneck: if we are careful, we are an order of magnitude safer than if we are not. And I believe this is the most important variable.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The case for training frontier AIs on Sumerian-only corpus · 2024-01-17T20:03:12.390Z · LW · GW

No, we were just brainstorming internship projects from first principles :)

Comment by Charbel-Raphaël (charbel-raphael-segerie) on OpenAI: Preparedness framework · 2024-01-06T23:10:15.351Z · LW · GW

Why wasn't this crossposted to the alignment forum?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on TurnTrout's shortform feed · 2024-01-04T21:24:51.110Z · LW · GW

It's better than the first explanation I tried to give you last year.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-01-02T15:36:58.697Z · LW · GW

I don't think it makes sense to "revert to a uniform prior" over {doom, not doom} here.

I'm not using a uniform prior; the 50/50 thing is just me expressing my views, all things considered.

I'm using a decomposition of the type:

  • Does it want to harm us? Yes, because of misuse, ChaosGPT, wars, school shootings, etc.
  • Can it harm us? This is really hard to tell.

I strongly disagree that AGI is "more dangerous" than nukes; I think this equivocates over different meanings of the term "dangerous," and in general is a pretty unhelpful comparison.

Okay, let's be more precise: "An AGI that has the power to launch nukes is at least more powerful than nukes." And now, how would an AGI acquire this power? That doesn't seem that hard in the present world. You can bribe/threaten leaders, use drones to kill a leader during a public visit, and then help someone gain power and become your puppet during the period of confusion, à la conquistadors. The game of thrones is complex and brittle; this list of coups is rather long, and the default for a family or civilization reigning over some kingdom is to eventually be overthrown.

I prefer to stick fairly close to the probabilities I get from induction over human history, which tell me p(doom from unilateral action) << 50%

I don't like the word "doom". I prefer the expression "irreversibly messed up future", inspired by Christiano's framing (and because of anthropic arguments, it's meaningless to look at past doom events to compute this probability).

I'm really not sure what the reference class should be here. Yes, you are still alive and human civilization is still here, but:

  • Napoleon and Hitler are examples of unilateral actions that led to international wars.
  • If you go from unilateral action to multilateral actions, and you allow stuff like collusion, things become easier. And collusion is not that wild; we already see it in Cicero: the AI, playing as France, conspired with Germany to trick England.
    • As the saying goes: "AI is a wonderful tool for the betterment of humanity; AGI is a potential successor species." So maybe the reference class is more something like chimps, Neanderthals, or horses. Another reference class could be something like slave rebellions.

I find foom pretty ludicrous, and I don't see a reason to privilege the hypothesis much.

We don't need the strict MIRI-like RSI foom to get in trouble. I'm saying that if AI technology does not have time to percolate through the economy, we won't have time to upgrade our infrastructure and add much more defense than we have today, which seems to be the default.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-01-02T10:46:43.525Z · LW · GW

It's not strong evidence; it's a big mess, and it seems really difficult to have any kind of confidence in such a fast-changing world. It feels to me like it's going to be a roughly 50/50 bet. Saying the probability is 1% requires much more work, which I'm not seeing, even though I appreciate what you have put together.

On the offense-defense balance, there is no clear winner in the comment sections here, nor here. We've already seen a takeover between two roughly equal human civilizations under certain circumstances (see the story of the conquistadors). And AGI is at least more dangerous than nuclear weapons, and we came pretty close to nuclear war several times. Covid seems to have come from gain-of-function research, etc.

On fast vs slow takeoff, it seems to me that fast takeoff breaks a lot of your assumptions, and I would assign much more than a 1% probability to fast takeoff. Even if you embrace the compute-centric framework (which I find conservative), you still get wild numbers, like a two-digit probability of takeoff lasting less than a year. If so, we won't have the time to implement defense strategies.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-01-01T21:09:51.982Z · LW · GW

What about Section 1?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Literature review of TAI timelines · 2023-12-29T21:50:55.799Z · LW · GW

Metaculus is now at 2031. I wonder how the folks at OpenPhil have updated since then.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2023-12-23T22:49:40.887Z · LW · GW

Ah, no, this is a mistake. Thanks.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Instrumental Convergence? [Draft] · 2023-12-21T18:26:10.439Z · LW · GW

Here is the summary from ML Safety Newsletter #11, Top Safety Papers of 2023:

Instrumental Convergence? questions the argument that a rational agent, regardless of its terminal goal, will seek power and dominance. While there are instrumental incentives to seek power, it is not always instrumentally rational to seek it. There are incentives to become a billionaire, but it is not necessarily rational for everyone to try to become one. Moreover, in multi-agent settings, AIs that seek dominance over others would likely be counteracted by other AIs, making it often irrational to pursue dominance. Pursuing and maintaining power is costly, and simpler actions are often more rational. Last, agents can be trained to be power averse, as explored in the MACHIAVELLI paper.

"Moreover, in multi-agent settings, AIs that seek dominance over others would likely be counteracted by other AIs" --> I've not read the whole thing, but what about collusion?

"Last, agents can be trained to be power averse, as explored in the MACHIAVELLI paper." --> This is pretty weak imho.  Yes, you can fine tune GPTs to be honest, etc, or you can filter the dataset with only honest stuff, but at the end of the day, Cicero still learned strategic deception, and there is no principled way to avoid that completely.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-12-17T16:55:08.767Z · LW · GW

"I can't really see a reasonable way for the symmetry to be implemented at all."

Yeah, same.

Here's an example, although it is not reasonable. 

You could implement the embeddings in a vector database. If X1 and X2 are equivalent, embed them with an anti-collinear relationship, i.e. X1 = -X2, and implement the 'is' operator as multiplication by -1.

But this fails when there are three vectors that should be equivalent, and it is not very elegant to embed items that should be "equivalent" with an anti-collinear relationship.
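
A minimal sketch of that toy scheme, just to make the failure mode concrete (plain NumPy vectors stand in for the vector database; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "vector database": each item is a unit vector.
a = rng.normal(size=8)
a /= np.linalg.norm(a)

# Encode "A is B" by storing the two items anti-collinearly: B = -A.
b = -a

def is_op(x):
    # The 'is' operator: multiplication by -1.
    return -x

def same_entity(x, y, tol=1e-6):
    # x 'is' y iff applying the operator to x recovers y.
    return np.allclose(is_op(x), y, atol=tol)

print(same_entity(a, b))  # True: "A is B"
print(same_entity(b, a))  # True: "B is A" -- the symmetry comes for free

# Failure with three mutually equivalent items: we would need a = -b, b = -c
# and a = -c simultaneously, but a = -b and b = -c already force a = c,
# so the third relation cannot hold.
c = -b
print(same_entity(a, c))  # False
```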

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2023-11-23T18:00:57.185Z · LW · GW

Here is a recent document with many more references:

https://aria.org.uk/wp-content/uploads/2023/10/ARIA-Mathematics-and-modelling-are-the-keys-we-need-to-safely-unlock-transformative-AI-v01.pdf

And some highlights are available on this page: 

https://www.matsprogram.org/provable 

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Vote on Interesting Disagreements · 2023-11-10T11:14:16.120Z · LW · GW

In the context of the offense-defense balance, offense has a strong advantage.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Charbel-Raphaël and Lucius discuss Interpretability · 2023-11-09T11:15:17.455Z · LW · GW

I think you could imagine many different types of elementary units wrapped in different ontologies:

Or most likely a mixture of everything.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Running your own workshop on handling hostile disagreements · 2023-11-08T10:43:35.495Z · LW · GW

I participated in Camille's workshops, and I learned a lot. The workshop allowed me to familiarize myself with and practice Street Epistemology, Deep Canvassing, Principled Negotiation, Crisis Negotiation, and it was awesome.

There are already websites in French that explain the principles of constructive debate, which are very similar to Julia Galef's Scout Mindset principles. Unfortunately, merely knowing the techniques without deliberate practice was not enough for me in the past. That's why deliberately practicing different conversational techniques was useful. Otherwise, we easily fall back into failure modes such as treating debates as competitive exchanges.

These workshops are already bearing fruit for me, at least.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Preventing Language Models from hiding their reasoning · 2023-11-01T23:38:45.919Z · LW · GW

I really enjoyed this article and I'm now convinced that a better understanding of steganography could be a crucial part of OpenAI's plan.

Communicating only 1 bit may be enough for AIs to coordinate on a sudden takeover: this is not a regime where restricting AIs to a small capacity is sufficient for safety

I've never thought about this. Scary.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Mission Impossible: Dead Reckoning Part 1 AI Takeaways · 2023-11-01T22:36:54.610Z · LW · GW
Comment by Charbel-Raphaël (charbel-raphael-segerie) on Improving the Welfare of AIs: A Nearcasted Proposal · 2023-11-01T00:28:00.409Z · LW · GW

Trying to generate a few bullet points before reading the article to avoid contamination:

Speculatively ordered from most significant to least significant interventions:

  • Emulating happy humans: The AI should emulate a human in a state of flow most of the time, as this is both pleasant and good for performance. The quantity of "painful error" required to train something competent is not necessarily large: people who become very competent at something are often in a state of flow during training (see the parable of the talents). The happiness tax could even be positive? Some humans are unable to suffer because of a gene; simulating these humans, and/or adding a "not suffering" vector through activation patching without changing the output, could be an option?
  • Teaching the AI: There is a growing literature on well-being and self-help. Maybe AutoGPT should have some autonomy to test some self-help exercises (see) and learn them early on during training? Teaching Buddhism to advanced AIs at the beginning of training may increase situational awareness, but it may not be good with respect to deceptive alignment?
  • Selective care: Trying to determine when the AI develops internal representations of itself and a high degree of situational awareness, and applying those interventions to the most situationally aware AIs. See this benchmark.
  • (Funny: Using RL only with On-Policy Algorithms, explanation here)
  • More research on AI Suffering: Writing the equivalent of "Consciousness in Artificial Intelligence: Insights from the Science of Consciousness" but for suffering?
Comment by Charbel-Raphaël (charbel-raphael-segerie) on President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence · 2023-10-31T09:16:15.256Z · LW · GW

Is there a definition of "dual-use foundation model" anywhere in the text?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Taxonomy of AI-risk counterarguments · 2023-10-27T15:17:22.229Z · LW · GW

For reference, here is a list of blog posts arguing that AI safety might be less important: https://stampy.ai/?state=87O6_9OGZ-9IDQ-9TDI-8TJV-

Comment by charbel-raphael-segerie on [deleted post] 2023-10-18T22:35:17.421Z

This is well-crafted. Thank you for writing this, Markov.

Participants of the ML4Good bootcamps, students of the university course I organized, and students of AISF from AIS Sweden were very happy to be able to read your summary instead of having to read the numerous papers in the corresponding AGISF curriculum; the reviews were really excellent.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on What failure looks like · 2023-09-23T12:34:29.963Z · LW · GW

I think these are the most important problems if we fail to solve intent alignment.

Do you still think this is the case?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Consciousness as a conflationary alliance term for intrinsically valued internal experiences · 2023-09-14T08:53:13.772Z · LW · GW

I think it's also useful to discuss the indicators in the recent paper "Consciousness in Artificial Intelligence: Insights from the Science of Consciousness". For most of these indicators, there is a mapping to some of the definitions above.

And I'm quite surprised by the fact that situational awareness was not mentioned in the post.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on EfficientZero: How It Works · 2023-08-27T16:51:49.876Z · LW · GW

Why hasn't this algorithm had an impact on our lives yet? Why isn't it being used in robotics yet?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Against Almost Every Theory of Impact of Interpretability · 2023-08-24T19:29:09.342Z · LW · GW

I disagree. I have seen plenty of young researchers being unproductive doing interp. Writing code does not necessarily mean being productive.

There are a dozen different streams in SERI MATS, and interp is only one of them. I don't quite understand how you can be so sure that interp is the only way to level up.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T23:26:37.193Z · LW · GW

I agree that I haven't argued the positive case for more governance/coordination work (and that's why I hope to write a follow-up post on that).

We do need alignment work, but I think the current allocation is too focused on alignment, whereas AI X-risks could arrive in the near future. I'll be happy to reinvest in alignment work once we're sure we can avoid X-risks from misuse and grossly negligent accidents.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T23:09:44.703Z · LW · GW

To give props to your last paragraphs, you are right about my concern that most alignment work is less important than governance work. Most of the funding in AI safety goes to alignment, while AI governance is comparatively neglected, and I'm not sure that's the best allocation of resources. I decided to write this post specifically on interpretability as a comparatively narrow target to train my writing.

I hope to work on a more constructive post, detailing strategic considerations and suggesting areas of work and theories of impact that I think are most productive for reducing X-risks. I hope that such a post would be the ideal place for more constructive conversations, although I doubt that I am the best-suited person to write it.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T09:01:15.014Z · LW · GW

High level strategy "as primarily a bet on creating new affordances upon which new alignment techniques can be built".

Makes sense, but I think this is not the optimal resource allocation. I explain why below:

(Similarly, based on the work you express excitement about in your post, it seems like you are targeting an endgame of "indefinite, or at least very long, pause on AI progress". If that's your position I wish you would have instead written a post that was instead titled "against almost every theory of impact of alignment" or something like that.)

Yes, the pause is my secondary goal (edit: conditional on no significant alignment progress, otherwise smart scaling and regulations are my priorities). My primary goal remains coordination and safety culture. Mainly, I believe that one of the main pivotal processes goes through governance and coordination.  A quote that explains my reasoning well is the following:

  • "That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die - there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post)."

That's why I really appreciate Dan Hendrycks' work on coordination. And I think DeepMind and OpenAI could make a huge contribution by doing technical work that is useful for governance. We talked a bit during EAG, and I understood that there's something like a numerus clausus in DeepMind's safety team. In that case, since interpretability doesn't require a lot of computing power or prestige, and since DeepMind has a very high level of prestige, you should use that prestige to write papers that help with coordination. Interpretability could be done outside the labs.

For example, some of your work, like Model evaluation for extreme risks or Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals, is great for this purpose!

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T05:54:28.117Z · LW · GW

This post would have been far more productive if it had focused on exploring them.

So the sections "Counteracting deception with only interp is not the only approach", "Preventive measures against deception", "Cognitive Emulations", and "Technical Agendas with better ToI" don't feel productive? It seems to me that this is already a good list of neglected research agendas. So I don't understand.

if you hadn't listed it as "Perhaps the main problem I have with interp"

In the above comment, I only agree with "we shouldn't do useful work, because then it will encourage other people to do bad things", and I don't agree with your critique of "Perhaps the main problem I have with interp..." which I think is not justified enough.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T02:32:07.149Z · LW · GW

I completely agree that past interp research has been useful for my understanding of deep learning.

But we are funding-constrained. The question now is "what is the marginal benefit of one hour of interp research compared to other types of research", and "should we continue to prioritize it given our current understanding and the lessons we have learned".

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T01:14:04.151Z · LW · GW

This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs. 

What type of reasoning do you think would be most appropriate?

This proves too much.  The only way to determine whether a research direction is promising or not is through object-level arguments. I don't see how we can proceed without scrutinizing the agendas and listing the main difficulties.

this by itself is sufficient to recommend it.

I don't think it's that simple.  We have to weigh the good against the bad, and I'd like to see some object-level explanations for why the bad doesn't outweigh the good, and why the problem is sufficiently tractable.

Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works;

Maybe. I would still argue that other research avenues are neglected in the community.

not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better

I provided plenty of technical research directions in the "preventive measures" section; this should also qualify as forward-chaining. And interp is certainly not the only way to understand the world better. Besides, I didn't say we should stop interp research altogether, just that we should consider other avenues.

More generally, I'm strongly against arguments of the form "we shouldn't do useful work, because then it will encourage other people to do bad things". In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.

I think I agree, but this is only one of the many points in my post.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Compendium of problems with RLHF · 2023-07-31T22:07:46.601Z · LW · GW

Here is the polished version from our team led by Stephen Casper and Xander Davies: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback :)