Posts

🇫🇷 Announcing CeSIA: The French Center for AI Safety 2024-12-20T14:17:13.104Z
Are we dropping the ball on Recommendation AIs? 2024-10-23T17:48:00.000Z
We might be dropping the ball on Autonomous Replication and Adaptation. 2024-05-31T13:49:11.327Z
AI Safety Strategies Landscape 2024-05-09T17:33:45.853Z
Constructability: Plainly-coded AGIs may be feasible in the near future 2024-04-27T16:04:45.894Z
What convincing warning shot could help prevent extinction from AI? 2024-04-13T18:09:29.096Z
My intellectual journey to (dis)solve the hard problem of consciousness 2024-04-06T09:32:41.612Z
AI Safety 101 : Capabilities - Human Level AI, What? How? and When? 2024-03-07T17:29:53.260Z
The case for training frontier AIs on Sumerian-only corpus 2024-01-15T16:40:22.011Z
aisafety.info, the Table of Content 2023-12-31T13:57:15.916Z
Results from the Turing Seminar hackathon 2023-12-07T14:50:38.377Z
AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training 2023-10-31T14:34:59.395Z
AI Safety 101 - Chapter 5.1 - Debate 2023-10-31T14:29:59.556Z
Charbel-Raphaël and Lucius discuss interpretability 2023-10-30T05:50:34.589Z
Against Almost Every Theory of Impact of Interpretability 2023-08-17T18:44:41.099Z
AI Safety 101 : Introduction to Vision Interpretability 2023-07-28T17:32:11.545Z
AIS 101: Task decomposition for scalable oversight 2023-07-25T13:34:58.507Z
An Overview of AI risks - the Flyer 2023-07-17T12:03:20.728Z
Introducing EffiSciences’ AI Safety Unit  2023-06-30T07:44:56.948Z
Improvement on MIRI's Corrigibility 2023-06-09T16:10:46.903Z
Thriving in the Weird Times: Preparing for the 100X Economy 2023-05-08T13:44:40.341Z
Davidad's Bold Plan for Alignment: An In-Depth Explanation 2023-04-19T16:09:01.455Z
New Hackathon: Robustness to distribution changes and ambiguity 2023-01-31T12:50:05.114Z
Compendium of problems with RLHF 2023-01-29T11:40:53.147Z
Don't you think RLHF solves outer alignment? 2022-11-04T00:36:36.527Z
Easy fixing Voting 2022-10-02T17:03:20.566Z
Open application to become an AI safety project mentor 2022-09-29T11:27:39.056Z
Help me find a good Hackathon subject 2022-09-04T08:40:30.115Z
How to impress students with recent advances in ML? 2022-07-14T00:03:04.883Z
Is it desirable for the first AGI to be conscious? 2022-05-01T21:29:55.103Z

Comments

Comment by Charbel-Raphaël (charbel-raphael-segerie) on 🇫🇷 Announcing CeSIA: The French Center for AI Safety · 2025-01-23T08:12:51.219Z · LW · GW

https://www.youtube.com/watch?v=ZP7T6WAK3Ow 

Comment by Charbel-Raphaël (charbel-raphael-segerie) on ryan_greenblatt's Shortform · 2025-01-08T20:44:38.363Z · LW · GW

Yeah, fair enough. I think someone should try to do a more representative experiment and we could then monitor this metric.

Btw, something that bothers me a little bit with this metric is that a very simple AI that just asks me periodically "Hey, do you endorse what you are doing right now? Are you time boxing? Are you following your plan?" makes me (I think) significantly more strategic and productive, similar to if I hired 5 people to sit behind me and keep me productive for a month. But this is maybe off topic.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on ryan_greenblatt's Shortform · 2025-01-08T20:19:42.080Z · LW · GW

I was saying 2x because I've memorised the results from this study. Do we have better numbers today? R&D is harder, so this is an upper bound. However, this was from one year ago, so perhaps the two factors cancel each other out?

[Image: summary of the study's experiment process and results]

Comment by Charbel-Raphaël (charbel-raphael-segerie) on ryan_greenblatt's Shortform · 2025-01-08T19:22:44.229Z · LW · GW

How much faster do you think we are already? I would say 2x.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on AI: Practical Advice for the Worried · 2025-01-05T01:01:35.687Z · LW · GW

What is it that you don't fully endorse anymore?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-04T15:50:45.833Z · LW · GW

I would be happy to discuss this in a dialogue. This seems to be an important topic, and I'm really unsure about many parameters here.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Against Almost Every Theory of Impact of Interpretability · 2025-01-04T15:11:37.879Z · LW · GW

Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of the SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.

First, I believe the post's general motivation of red-teaming a big, established research agenda remains crucial. It's too easy to say, "This research agenda will help," without critically assessing how. I appreciate the post's general energy in asserting that if we're in trouble or not making progress, we need to discuss it.

I still want everyone working on interpretability to read it and engage with its arguments.

Acknowledgments: Thanks to Epiphanie Gédéon, Fabien Roger, and Clément Dumas for helpful discussions.

Updates on my views

Legend:

  • On the left of the arrow, a quotation from the OP; on the right (after the →), my review, which generally begins with one of the emojis below
  • ✅ - yes, I think I was correct (>90%)
  • ❓✅ - I would lean towards yes (70%-90%)
  • ❓ - unsure (between 30%-70%)
  • ❓❌ - I would lean towards no (10%-30%)
  • ❌ - no, I think I was basically wrong (<10%)
  • ⭐ important, you can skip the other sections

Here's my review section by section:

⭐ The Overall Theory of Impact is Quite Poor?

  • "Whenever you want to do something with interpretability, it is probably better to do it without it" → ❓ I still think this is basically right, even if I'm not confident this will still be the case in the future; But as of today, I can't name a single mech-interpretability technique that does a better job at some non-intrinsic interpretability goal than the other more classical techniques, on a non-toy model task.
    • "Interpretability is Not a Good Predictor of Future Systems" → ✅ This claim holds up pretty well. Interpretability still hasn't succeeded in reliably predicting future systems, to my knowledge.
    • "Auditing Deception with Interpretability is Out of Reach" → ❓ The "Out of Reach" is now a little bit too strong, but the general direction was pretty good. The first major audits of deception capabilities didn't come from interpretability work; breakthrough papers came from AI Deception: A Survey of Examples, Risks, and Potential Solutions, Apollo Research's small demo using bare prompt engineering, and Anthropic's behavioral analyses. This is particularly significant because detecting deception was a primary motivation for many people working on interpretability at the time. I don't think being able to identify the sycophancy feature qualifies as being able to audit deception: maybe the feature is just here to recognize sycophancy without using it, as explained in the post. (I think the claim should now be "Auditing Deception without Interpretability is currently much simpler").
  • "Interpretability often attempts to address too many objectives simultaneously" → ❓ I don't think this is as important nowadays, but I tend to still think that goal factoring is still a really important cognitive and neglected move in AI Safety. I can see how interp could help a bit for multiple goals simultaneously, but also if you want to achieve more coordination, just work on coordination.
  • "Interpretability could be harmful - Using successful interpretability for safety could certainly prove useful for capabilities" → ❓❌ I think I was probably wrong, more discussion below, in section “Interpretability May Be Overall Harmful”.

What Does the End Story Look Like?

  • "Enumerative Safety & Olah interpretability dream":
    • ⭐ Feasibility of feature enumeration → ❓I was maybe wrong, but this is really tricky to assess.
      • On the plus side, I was genuinely surprised to see SAEs working that well because the general idea already existed, some friends had tried it, and it didn't seem to work at the time. I guess compute also plays a crucial role in interpretability work. I was too confident. Progress is possible, and enumerative safety could represent an endgame for interpretability.
      • On the other hand, many problems remain, and I think we need to be very cautious in evaluating this type of research; it's very unclear if/how to make enumerative safety arguments with SAEs:
        • SAEs are only able to reconstruct a much smaller model: reconstructing only 65% of the variance means that the reconstructed model would be very, very poor. Some features are very messy, and lots of things that models know how to do are just not represented in the SAE (see the minimal reconstruction sketch after this section's list).
        • The whole paradigm is probably only a computationally convenient approximation: I think that the strong feature hypothesis is probably false and is not going to be sufficient to reconstruct the whole model. Some features are probably stored across multiple layers, some features might be instantiated only in a dynamic way, and I’m skeptical that we can reduce the model to just a static weighted directed graph of features. Another point is that language models are better than humans at next-token prediction, and I expect some features to be beyond human knowledge and understanding.
        • SAEs were not applied to the most computationally intensive models (they were trained on Sonnet, not Opus), which are the ones of interest, because SAEs cost a lot of compute.
        • We cannot really use SAEs for enumerative safety because we wouldn't be able to exclude emergent behavior. As a very concrete example, if you train SAEs on a sleeper agent (on the training distribution that does not trigger the backdoor), you will not surface any deception feature (which might be a bit unfair because the training data for the sleeper agent does contain deceptive stuff, but this would maybe be more analogous to a natural emergence). Maybe someone should try to detect backdoors with SAEs. (Thanks to Fabien for raising those points to me!)
      • At the end of the day, it's very unclear how to make enumerative safety arguments with SAEs.
    • Safety applications? → ❓✅ Some parts of my critique of enumerative safety remain valid. The dual-use nature of many features remains a fundamental challenge: Even after labeling all features, it's unclear how we can effectively use SAEs, and I still think that “Determining the dangerousness of a feature is a mis-specified problem”: “there's a difference between knowing about lies, being capable of lying, and actually lying in the real world”. At the end of the day, Anthropic didn’t use SAEs to remove harmful behaviors from Sonnet that were present in the training data, and it’s still unclear if SAEs beat baselines (for a more detailed analysis of the missing safety properties of SAEs, read this article).
    • Harmfulness of automated research? → ❓ I think the automation of the discovery of Claude's features was not that dangerous and is a good example of automated research. Overall, I'm a bit more sympathetic today to this kind of automated AI safety research than I was a year ago.[1]
  • Reverse Engineering? → ✅ Not much progress here. It seems like IOI remains roughly the SOTA of the most interesting circuit we've found in any language model, and current work and techniques, such as edge pruning, remain focused on toy models.
  • Retargeting the search? → ❓ I think I was overconfident in saying that being able to control the AI via the latent space is just a new form of prompt engineering or fine-tuning. I think representation engineering could be more useful than this, and might enable better control mechanisms.
  • Relaxed adversarial training? → ❓✅ I made a call by saying this could be one of the few ways to reduce AI bad behavior even under adversarial pressure, and it seems like this is a promising direction today.
  • Microscope AI? → ❓✅ I think what I said in the past about the uselessness of microscope AI remains broadly valid, but there is an amendment to be made: "About a year ago, Schut et al. (2023) did what I think was (and maybe still is) the most impressive interpretability research to date. They studied AlphaZero's chess play and showed how novel performance-relevant concepts could be discerned from mechanistic analysis. They worked with skilled chess players and found that they could help these players learn new concepts that were genuinely useful for chess. This appears to be a reasonably unique way of doing something useful (improving experts' chess play) that may have been hard to achieve in some other way." - Summary from Casper.
    • Very cool paper, but I think this type of work is more like a very detailed behavioral analysis guided by some analysis of the latent space, and I do expect that this kind of elicitation work for narrow AI is going to be deprecated by future general-purpose AI systems, which are going to be able to teach us directly those concepts, and we will be able to fine-tune them directly to do this. Think about a super Claude-teacher.
    • Also, AlphaZero is an agent - it’s not a pure microscope - so this is a very different vision from the one Olah lays out when explaining microscope AI here.
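
As a rough illustration of the reconstruction point above (the 65%-of-variance bullet), here is a minimal sparse-autoencoder sketch with a fraction-of-variance-explained metric. The layer sizes, sparsity penalty, and random activations are illustrative assumptions, not the setup from any published SAE work.

```python
# Minimal sparse-autoencoder sketch (illustrative sizes, not a published setup).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))  # sparse "feature" activations
        recon = self.decoder(feats)             # reconstructed activations
        return recon, feats

def fraction_of_variance_explained(acts, recon):
    # 1 - residual variance / total variance: "65% of the variance" is this kind of metric.
    resid = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(dim=0)) ** 2).sum()
    return 1 - resid / total

sae = SparseAutoencoder()
acts = torch.randn(1024, 512)  # stand-in for residual-stream activations
recon, feats = sae(acts)
# Training objective: reconstruction error plus an L1 penalty that enforces sparsity.
loss = ((acts - recon) ** 2).mean() + 1e-3 * feats.abs().mean()
print(fraction_of_variance_explained(acts, recon).item())
```

Whatever fraction of variance the SAE fails to capture corresponds to behavior of the original model that the feature dictionary simply does not represent, which is the worry raised in the bullet above.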

⭐ So Far My Best Theory of Impact for Interpretability: Outreach?

❓✅ I still think this is the case, but I have some doubts. I can share numerous personal anecdotes where even relatively unpolished introductions to interpretability during my courses generated more engagement than carefully crafted sessions on risks and solutions. Concretely, I shamefully capitalize on this by scheduling interpretability week early in my seminar to nerd-snipe students' attention.

But I see now two competing potential theories of impact:

  • Better control mechanisms: For example, something that I was not seeing clearly in the past was the possibility of having better control over those models.
    • I think the big takeaway is that representation engineering might work: I find Anthropic's work Simple probes can catch sleeper agents very interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know if this generalizes to a model that was not trained to be harmful in the first place; a minimal probe sketch follows this list). I was very surprised by those results. I think products such as Goodfire steering Llama3 are interesting, and I’m curious to see future developments. Circuit breakers also seem exciting in this regard.
    • This might still be a toy example, but I've found this work from Sieve interesting: SAEs Beat Baselines on a Real-World Task. They claim to be able to steer the model better than with other techniques, on a non-trivial task: "Prompt engineering can mitigate this in short context inputs. However, Benchify frequently has prompts with greater than 10,000 tokens, and even frontier LLMs like Claude 3.5 will ignore instructions at these long context lengths." "Unlike system prompts, fine-tuning, or steering vectors which affect all outputs, our method is very precise (>99.9%), meaning almost no side effects on unrelated prompts."
    • I'm more sympathetic to exploratory work like gradient routing, which may offer affordances in the future that we don't know about now.
  • Deconfusion and better understanding: But I should have been more charitable to the second-order effects of better understanding of the models. Understanding how models work, providing mechanistic explanations, and contributing to scientific understanding all have genuine value that I was dismissing.[2]
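
To make the probing point above concrete, here is a minimal sketch of a linear probe trained on hidden activations to flag a behavior. The synthetic activations, the assumed mean shift, and the layer choice are placeholders, not the setup from the Anthropic sleeper-agent probe post.

```python
# Minimal linear-probe sketch: classify a behavior from hidden activations.
# The activations and labels below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512
# Stand-ins for residual-stream activations from benign vs. "defection" contexts,
# assuming (hypothetically) a small mean shift between the two.
benign_acts = rng.normal(0.0, 1.0, size=(500, d_model))
defect_acts = rng.normal(0.3, 1.0, size=(500, d_model))

X = np.concatenate([benign_acts, defect_acts])
y = np.concatenate([np.zeros(500), np.ones(500)])

probe = LogisticRegression(max_iter=1000).fit(X, y)
# The interesting empirical question is whether such a probe generalizes
# off-distribution, e.g. to a model that was never trained to be harmful.
print("train accuracy:", probe.score(X, y))
```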

⭐ Preventive Measures Against Deception

I still like the two recommendations I made:

  1. Steering the world towards transparency → ✅ This remains a good recommendation. For instance, today, we can choose not to use architectures that operate in latent spaces, favoring architectures that reason with tokens instead (even if this is far from perfect either). Meta's proposal for new transformers using latent spaces should be concerning, as these architectural choices significantly impact our control capabilities.
    1. "I don't think neural networks will be able to take over in a single forward pass. Models will probably reason in English and will have translucent thoughts"  ❓✅ This seems to be the case?
    2. And many works that I was suggesting to conduct are now done and have been informative for the control agenda ✅
  2. Cognitive emulation (using the most powerful scaffolding with the least powerful model capable of the task) → ✅ This remains a good safety recommendation, I think: we don’t want elicitation to happen only in the future, we want to extract all the juice there is from current LLMs now. Christiano elaborates a bit more on this, weighing it against other negative externalities such as a faster race: Thoughts on Sharing Information About Language Models Capabilities, section "Accelerating LM agents seems neutral (or maybe positive)".

Interpretability May Be Overall Harmful

False sense of control → ❓✅ generally yes.

The world is not coordinated enough for public interpretability research → ❌ generally no:

  • Dual use & When interpretability starts to be useful, you can't even publish it because it's too info-hazardous → ❌ - It's pretty likely that if, for example, SAEs start to be useful, this won't boost capabilities that much.
  • Capability externalities & Interpretability already helps capabilities → ❓ - Mixed feelings:
    • This post shows a new architecture built on an interpretability finding, but I don't think this will really stand out against the Bitter Lesson, so for the moment it seems like interpretability is not really useful for capabilities. Also, it seems easier to delete capabilities with interpretability than to add them. Interpretability hasn't significantly boosted capabilities yet.
    • But at the same time, I wouldn't be that surprised if interpretability could unlock a completely new paradigm that would be much more data efficient than the current one.

Outside View: The Proportion of Junior Researchers Doing Interpretability Rather Than Other Technical Work is Too High

  • I would rather see a more diverse ecosystem → ✅ - I still stand by this, and I'm very happy that ARENA, MATS, and ML4Good have diversified their curricula.
  • ⭐ “I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing” → ✅ Compare them doing interpretability vs. publishing their responsible scaling policies and evaluating their systems: I think RSPs did much, much more to reduce AI risk.

Even if We Completely Solve Interpretability, We Are Still in Danger

  • There are many X-risk scenarios, not even involving deceptive AIs → ✅ I'm still pretty happy with this enumeration of risks, and I think more people should think about this and directly think about ways to prevent those scenarios. I don't think interpretability is going to be the number one recommendation after this small exercise.
  • Interpretability implicitly assumes that the AI model does not optimize in a way that is adversarial to the user → ❓❌ - The image with Voldemort was unnecessary and might be incorrect for human-level intelligence. But I have the feeling that all of those brittle interpretability techniques won’t stand for long against a superintelligence, though I may be wrong.
  • ⭐ That is why focusing on coordination is crucial! There is a level of coordination above which we don't die - there is no such threshold for interpretability → ✅ I still stand by this: Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Technical Agendas with Better ToI

I'm very happy with all of my past recommendations. Most of those lines of research are now much more advanced than when I was writing the post, and I think they advanced safety more than interpretability did:

  • Technical works used for AI Governance
    • ⭐ "For example, each of the measures proposed in the paper towards best practices in AGI safety and governance: A survey of expert opinion could be a pretext for creating a specialized organization to address these issues, such as auditing, licensing, and monitoring" → ✅ For example, Apollo is mostly famous for their non-interpretability works.
    • Scary demos → ✅ Yes! Scary demos of deception and other dangerous capabilities were tremendously valuable during the last year, so continuing to do that is still the way to go.
      • "(But this shouldn't involve gain-of-function research. There are already many powerful AIs available. Most of the work involves video editing, finding good stories, distribution channels, and creating good memes. Do not make AIs more dangerous just to accomplish this.)" → ❓ The point about gain-of-function research was probably wrong because I think Model organism is a useful agenda, and because it's better if this is done in a controlled environment than later. But we should be cautious with this, and at some point, a model able to do full ARA and R&D could just self-exfiltrate, and this would be irreversible, so maybe the gain-of-function research being okay part is only valid for 1-2 years.
    • "In the same vein, Monitoring for deceptive alignment is probably good because 'AI coordination needs clear wins.'" → ❓ Yes for monitoring, no for that being a clear win because of the reason explained in the post from Buck, saying that it will be too messy for policymakers and everyone to decide just based on those few examples of deception.
    • Interoperability in AI policy and good definitions usable by policymakers → ✅ - I still think that good definitions of AGI and self-replicating AI, and good operationalizations of red lines, would be tremendously valuable for RSP levels, the EU AI Act's Code of Practice, and other regulations.
    • "Creating benchmarks for dangerous capabilities" → ✅ - I guess the eval field is a pretty important field now. Such benchmarks didn't really exist beforehand.
  • "Characterizing the technical difficulties of alignment”:
    • Creating the IPCC of AI Risks → ✅ - The International Scientific Report on the Safety of Advanced AI: Interim Report is a good baseline and was very useful to create more consensus!
    • More red-teaming of agendas → ❓ this has not been done but should be! I would really like it if someone was able to write the “Compendium of problems with AI Evaluation” for example. Edit: This has been done.
    • Explaining problems in alignment → ✅ - I still think this is useful
  • “Adversarial examples, adversarial training, latent adversarial training (the only end-story I'm kind of excited about). For example, the papers "Red-Teaming the Stable Diffusion Safety Filter" or "Universal and Transferable Adversarial Attacks on Aligned Language Models" are good (and pretty simple!) examples of adversarial robustness works which contribute to safety culture” → ❓ I think there are more direct ways to contribute to safety culture; Liron Shapira’s podcast is better for that, I think.
  • "Technical outreach. AI Explained and Rob Miles have plausibly reduced risks more than all interpretability research combined": ❓ I think I need numbers to conclude formally even if my intuition still says that the biggest bottleneck is still a consensus on AI Risks, and not research. I have more doubts with AI Explained now, since he is pushing for safety only in a very subtle way, but who knows, maybe that’s the best approach.
  • “In essence, ask yourself: "What would Dan Hendrycks do?" - Technical newsletter, non-technical newsletters, benchmarks, policy recommendations, risks analysis, banger statements, courses and technical outreach → ✅ and now I would add SB 1047, which I think was the best attempt of 2024 at reducing risks.
  • “In short, my agenda is "Slow Capabilities through a safety culture", which I believe is robustly beneficial, even though it may be difficult. I want to help humanity understand that we are not yet ready to align AIs. Let's wait a couple of decades, then reconsider.” → ✅ I think this is still basically valid, and I co-founded a whole organization trying to achieve more of this. I'm very confident what I'm doing is much better in terms of AI risk reduction than what I did previously, and I'm proud to have pivoted: 🇫🇷 Announcing CeSIA: The French Center for AI Safety.
  1. ^

    But I still don’t feel good about having a completely automated and agentic AI that would just make progress in AI alignment (aka the old OpenAI’s plan), and I don’t feel good about the whole race we are in.

  2. ^

    For example, this conceptual understanding enabled via interpretability was useful for me to be able to dissolve the hard problem of consciousness.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2025-01-03T16:58:35.724Z · LW · GW

Ok, time to review this post and assess the overall status of the project.

Review of the post

What I still appreciate about the post: I continue to appreciate its pedagogy, structure, and the general philosophy of taking a complex, lesser-known plan and helping it gain broader recognition. I'm still quite satisfied with the construction of the post—it's progressive and clearly distinguishes between what's important and what's not. I remember the first time I met Davidad. He sent me his previous post. I skimmed it for 15 minutes, didn't really understand it, and thought, "There's no way this is going to work." Then I reconsidered, thought about it more deeply, and realized there was something important here. Hopefully, this post succeeded in showing that there is indeed something worth exploring! I think such distillation and analysis are really important.

I'm especially happy about the fact that we tried to elicit as much as we could from Davidad's model during our interactions, including his roadmap and some ideas of easy projects to get early empirical feedback on this proposal.

Current Status of the Agenda.

(I'm not the best person to write this, see this as an informal personal opinion)

Overall, Davidad performed much better than expected: with his new job as program director at ARIA, he got 74M$ of funding over 4 years. And I still think this is the only plan that could enable the creation of a very powerful AI capable of performing a true pivotal act to end the acute risk period; I think this last part is the added value of this plan, especially in the sense that it could be done in a somewhat ethical/democratic way compared to other forms of pivotal acts. However, it's probably not going to happen in time.

Are we on track? Weirdly, yes for the non-technical aspects, no for the technical ones. The post includes a roadmap with 4 stages, and we can check if we are on track. It seems to me that Davidad jumped directly to stage 3, without going through stages 1 and 2. This is because he was selected as program director at ARIA, so he's probably going to do stages 1 and 2 directly from ARIA.

  • Stage 1 Early Research Projects is not really accomplished:
    • “Figure out the meta ontology theory”: Maybe the most important point of the four; currently WIP at ARIA, and a massive team of mathematicians has been hired to solve it.
    • “Heuristics used by the solver”: Nope
    • “Building a toy infra-Bayesian "Super Mario", and then applying this framework to model Smart Grids”: Nope
    • “Training LLMs to write models in the PRISM language by backward distillation”: Kind of already here, probably not very high value to spend time here, I think this is going to be solved by default.
  • Stage 2: Industry actors' first projects: I think this step is no longer meaningful because of ARIA.
  • Stage 3 (a formal arrangement to get labs to collectively agree to increase their investment in OAA) is almost here, in the sense that Davidad got millions to execute this project at ARIA and published his multi-author manifesto, which backs the plan with legendary names, notably Yoshua Bengio as the scientific director of the project.

The lack of prototyping is concerning. I would have really liked to see an "infra-Bayesian Super Mario" or something similar, as mentioned in the post. If it's truly simple to implement, it should have been done by now; this would help many people understand how it could work. If it's not simple, that would reveal it's not straightforward at all. Either way, it would be pedagogically useful for anyone approaching the project, especially if we want to make the elicitation of values democratic. It's very regrettable that this hasn't been done after two years. (I think people from the AI Objectives Institute tried something at some point, but I'm not aware of anything publicly available.) I think this complete lack of prototypes is my number one concern preventing me from recommending more "safe by design" agendas to policymakers.

This plan was an inspiration for constructability: It might be the case that the bold plan could decay gracefully, for example into constructability, by renouncing formal verification and only using traditional software engineering techniques.

International coordination is an even bigger bottleneck than I thought. The "CERN for AI" isn't really within the Overton window, but I think this applies to all the other plans, and not just Davidad's plan. (Davidad made a little analysis of this aspect here).

At the end of the day: Kudos to Davidad for successfully building coalitions, which is already beyond amazing! He is really an impressive thought leader. What I'm waiting to see next year is someone using AIs such as o3, which are already impressive at competitive programming and science knowledge, and seeing what we can already do with them. I remain excited and eager to see the next steps of this plan.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-01T21:17:51.940Z · LW · GW

Maybe you have some information that I don't have about the labs and the buy-in? You think this applies to OpenAI and not just Anthropic?

But as far as open source goes, I'm not sure. Deepseek? Meta? Mistral? xAI? Some big labs are just producing open source stuff. DeepSeek is maybe only 6 months behind. Is that enough headway?

It seems to me that the tipping point for many people (I don't know for you) about open source is whether or not open source is better than close source, so this is a relative tipping point in terms of capabilities. But I think we should be thinking about absolute capabilities. For example, what about bioterrorism? At some point, it's going to be widely accessible. Maybe the community only cares about X-risks, but personally I don't want to die either.

Is there a good explanation online of why I shouldn't be afraid of open-source?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-01T17:32:07.051Z · LW · GW

No, AI control doesn't pass the bar, because we're still probably fucked until we have a solution for open-source AI or the race for superintelligence, for example, and OpenAI doesn't seem to be planning to use control, so this looks to me like research that's sort of being done in your garage but ignored by the labs (and that's sad; control is great, I agree).

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T18:16:37.774Z · LW · GW

What do you think of my point about Scott Aaronson? Also, since you agree with points 2 and 3, it seems that you also think that the most useful work from last year didn't require advanced physics, so isn't this a contradiction with you disagreeing with point 1?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T14:04:42.784Z · LW · GW

I think I do agree with some points in this post. This failure mode is the same as the one I mentioned about why people are doing interpretability, for instance (section "Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high"), and I do think that this generalizes somewhat to the whole field of alignment. But I'm highly skeptical that recruiting a bunch of physicists to work on alignment would be that productive:

  • Empirically, we've already kind of tested this, and it doesn't work.
    • I don't think that what Scott Aaronson produced while at OpenAI has really helped AI safety: he is doing exactly what is criticized in the post, streetlight research using techniques that he was already familiar with from his previous field of research. I don't think the author of the OP would disagree with me. Maybe n=1, but it was one of the most promising shots.
    • Two years ago, I was doing field-building and trying to source talent, primarily selecting based on pure intellect and raw IQ. I organized the Von Neumann Symposium around the problem of corrigibility, targeted IMO laureates, and recruited individuals from the best school in France, ENS Ulm, which arguably has the highest concentration of future Nobel laureates in the world. However, pure intelligence doesn't work: in the long term, the individuals who succeeded in the field weren't the valedictorians from France's top school, but rather those who were motivated, had read The Sequences, were EA people, possessed good epistemology, and had a willingness to share their work online. (Maybe you are going to say that the people I was targeting were too young, but I think my little empirical experience is already much better than the speculation in the OP.)
    • My prediction is that if you put a group of skilled physicists in a room, first, it's not even clear you would find that many motivated people in this reference class, and I don't think the few who would be motivated would produce good-quality work.
    • For the ML4Good bootcamps, the scoring system reflects this insight. We use multiple indicators and don't rely solely on pure IQ to select participants, because there is little correlation between pure high IQ and long term quality production.
  • I believe the biggest mistake in the field is trying to solve "Alignment" rather than focusing on reducing catastrophic AI risks. Alignment is a confused paradigm; it's a conflationary alliance term that has sedimented over the years. It's often unclear what people mean when they talk about it: Safety isn't safety without a social model.
    • Think about what has been most productive in reducing AI risks so far? My short list would be:
      • The proposed SB 1047 legislation.
      • The short statement on AI risks
      • Frontier AI Safety Commitments, AI Seoul Summit 2024, to encourage labs to publish their responsible scaling policies.
      • Scary demonstrations to showcase toy models of deception, fake alignment, etc, and to create more scientific consensus, which is very very needed
    • As a result, the field of "Risk Management" is more fundamental for reducing AI risks than "AI Alignment." In my view, the theoretical parts of the alignment field have contributed far less to reducing existential risks than the responsible scaling policies or the draft of the EU AI Act's Code of Practice for General Purpose AI Systems, which is currently not too far from being the state of the art for AI risk management. Obviously, it's still incomplete, but that's the direction that I think is most productive today.
  • Relatedly, the Swiss cheese model of safety is underappreciated in the field. This model has worked across other industries and seems to be what works for the only general intelligence we know: humans. Humans use a mixture of strategies for safety that we could imitate for AI safety (see this draft). However, the agent foundations community seems to be completely neglecting this.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Hire (or Become) a Thinking Assistant · 2024-12-25T20:38:51.498Z · LW · GW

I've been testing this with @Épiphanie Gédéon for a few months now, and it's really, really good for doing more work that's intellectually challenging. In my opinion, the most important point in the post is that it doesn't help that much during peak performance moments, but we’re not at our peak that often, and so it's super important. It’s really a big productivity boost, especially when doing cognitively demanding tasks or tasks where we struggle to "eat the frog". So, I highly recommend it.

But the person involved definitely needs to be pretty intelligent to keep up and to make recommendations that aren’t useless. Sometimes it can feel more like co-working; there are quite a few different ways it can work, more or less passive/active. But overall, generally speaking, we recommend trying it for at least a few days.

It took me quite a while to take the plunge because there's a social aspect—this kind of thing isn’t very common in France. It’s not considered a real job. Although, honestly, it should be a real job for intellectual professions, in my opinion. And it’s not an easy job.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Consciousness as a conflationary alliance term for intrinsically valued internal experiences · 2024-12-24T00:18:21.078Z · LW · GW

I often find myself revisiting this post—it has profoundly shaped my philosophical understanding of numerous concepts. I think the notion of conflationary alliances introduced here is crucial for identifying and disentangling/dissolving many ambiguous terms and resolving philosophical confusion. I think this applies not only to consciousness but also to situational awareness, pain, interpretability, safety, alignment, and intelligence, to name a few.

I referenced this blog post in my own post, My Intellectual Journey to Dis-solve the Hard Problem of Consciousness, during a period when I was plateauing and making no progress in better understanding consciousness. I now believe that much of my confusion has been resolved.

I think the concept of conflationary alliances is almost indispensable for effective conceptual work in AI safety research. For example, it helps clarify distinctions, such as the difference between "consciousness" and "situational awareness." This will become increasingly important as AI systems grow more capable and public discourse becomes more polarized around their morality and conscious status.

Highly recommended for anyone seeking clarity in their thinking!

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Are we dropping the ball on Recommendation AIs? · 2024-10-26T18:40:51.072Z · LW · GW

I don't think Tournesol is really mature currently, especially for non-French content, and I'm not sure they try to do governance work; it's mainly a technical project, which is already cool.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Are we dropping the ball on Recommendation AIs? · 2024-10-26T18:34:50.331Z · LW · GW

Yup, we should create an equivalent of the Nutri-Score for different recommendation AIs. 

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Are we dropping the ball on Recommendation AIs? · 2024-10-25T12:41:37.877Z · LW · GW

"I really don't know how tractable it would be to pressure compagnies" seems weirdly familiar.  We already used the same argument for AGI safety, and we know that governance work is much more tractable than expected.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Are we dropping the ball on Recommendation AIs? · 2024-10-24T18:24:34.270Z · LW · GW

I'm a bit surprised this post has so little karma and engagement. I would be really interested to hear from people who think this is a complete distraction.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The case for more Alignment Target Analysis (ATA) · 2024-09-21T19:13:38.722Z · LW · GW

Fair enough.

I think my main problem with this proposal is that, under the current paradigm of AIs (GPTs, foundation models), I don't see how you would implement ATA, and so this isn't really a priority?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on The case for more Alignment Target Analysis (ATA) · 2024-09-20T19:02:39.298Z · LW · GW

I believe we should not create a Sovereign AI. Developing a goal-directed agent of this kind will always be too dangerous. Instead, we should aim for a scenario similar to CERN, where powerful AI systems are used for research in secure labs, but not deployed in the economy. 

I don't want AIs to takeover. 

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Covert Malicious Finetuning · 2024-09-16T00:53:04.000Z · LW · GW

Thank you for this post and study. It's indeed very interesting.

I have two questions:

In what ways is this threat model similar to or different from learned steganography? It seems quite similar to me, but I’m not entirely sure.

If it can be related to steganography, couldn’t we apply the same defenses as for steganography, such as paraphrasing, as suggested in this paper? If paraphrasing is a successful defense, we could use it in the control setting, in the lab, although it might be cumbersome to apply paraphrasing for all users in the API.
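
To illustrate what such a paraphrasing defense could look like in a control pipeline, here is a minimal sketch; `call_trusted_model` and the prompt are hypothetical placeholders rather than the method from the cited paper.

```python
# Minimal paraphrasing-defense sketch: re-express an untrusted model's output
# with a trusted model before it is passed downstream, to strip any covert
# encoding that depends on exact token choices.
# `call_trusted_model` is a hypothetical stand-in for whatever API is used.

def call_trusted_model(prompt: str) -> str:
    raise NotImplementedError("replace with a call to a trusted paraphraser model")

def paraphrase_defense(untrusted_output: str) -> str:
    prompt = (
        "Rewrite the following text so that it keeps the same meaning but uses "
        "different wording and sentence structure. Do not add or remove information.\n\n"
        + untrusted_output
    )
    return call_trusted_model(prompt)

# Usage: everything the untrusted model sends downstream goes through
# paraphrase_defense(), so hidden messages encoded in surface form are likely destroyed.
```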

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight · 2024-09-15T22:12:21.783Z · LW · GW

Interesting! Is it fair to say that this is another attempt at solving a sub-problem of misgeneralization?

Here is one suggestion for clustering your SAE features more automatically into gender and profession.

In the past, Stuart Armstrong with alignedAI also attempted work aimed at identifying different features within a neural network in such a way that the network would generalize better. Here is a summary of a related paper, the DivDis paper, which is very similar to what alignedAI did:

https://github.com/EffiSciencesResearch/challenge_data_ens_2023/blob/main/assets/DivDis.png?raw=true

 

The DivDis paper presents a simple algorithm to solve these ambiguity problems in the training set. DivDis uses multi-head neural networks, and a loss that encourages the heads to use independent information. Once training is complete, the best head can be selected by testing all different heads on the validation data.

DivDis achieves 64% accuracy on the unlabeled set when training on a subset of human_age and 97% accuracy on the unlabeled set of human_hair. GitHub : https://github.com/yoonholee/DivDis

 

I have the impression that you could also use DivDis by training a probe on the latent activations of the SAEs and then applying Stuart Armstrong's technique to decorrelate the different spurious correlations. Either of those two algorithms would make it possible to significantly reduce the manual work required to partition the different features of your SAEs, resulting in two clusters of features, obtained in an unsupervised way, that here would correspond to gender and profession.
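
Here is a rough sketch of what that DivDis-style setup could look like when applied to probes over SAE feature activations; the simplified decorrelation term, the two-head setup, and the synthetic data are assumptions for illustration, not the exact DivDis loss or your SAE pipeline.

```python
# Rough DivDis-style sketch: two probe heads trained on the same (SAE-feature)
# inputs, with a diversity term pushing their predictions apart on unlabeled data,
# so different heads latch onto different candidate features (e.g. gender vs. profession).
import torch
import torch.nn as nn

n_heads, d_feat = 2, 4096  # illustrative sizes
heads = nn.ModuleList([nn.Linear(d_feat, 1) for _ in range(n_heads)])
opt = torch.optim.Adam(heads.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def step(x_lab, y_lab, x_unlab, div_weight=1.0):
    # Supervised term: every head must fit the (ambiguous) labeled set.
    sup = sum(bce(h(x_lab).squeeze(-1), y_lab) for h in heads)
    # Diversity term (simplified for two heads): decorrelate predictions on unlabeled data.
    preds = torch.stack([torch.sigmoid(h(x_unlab)).squeeze(-1) for h in heads])
    centered = preds - preds.mean(dim=1, keepdim=True)
    div = (centered[0] * centered[1]).mean().abs()
    loss = sup + div_weight * div
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Synthetic placeholders standing in for SAE feature activations and labels.
x_lab, y_lab = torch.randn(256, d_feat), torch.randint(0, 2, (256,)).float()
x_unlab = torch.randn(512, d_feat)
for _ in range(10):
    step(x_lab, y_lab, x_unlab)
# Afterwards, pick the best head on a held-out validation set, as in the DivDis paper.
```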

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Constructability: Plainly-coded AGIs may be feasible in the near future · 2024-08-12T07:04:29.904Z · LW · GW

Here is the youtube video from the Guaranteed Safe AI Seminars:

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Finding the Wisdom to Build Safe AI · 2024-07-06T08:45:06.992Z · LW · GW

It might not be that impossible to use LLMs to automatically train wisdom:

Look at this: "Researchers have utilized Nvidia’s Eureka platform, a human-level reward design algorithm, to train a quadruped robot to balance and walk on top of a yoga ball."

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety) · 2024-06-14T15:07:09.028Z · LW · GW

Strongly agree.

Related: It's disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don't grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, "When you write this kind of software, you always control what's going to happen, all the outputs the software can have." As long as such individuals are leading AGI labs, the situation will remain quite dire.

+1 for the conflationary alliances point. It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion). I'm not convinced the goal of the AI Safety community should be to align AIs at this point.

However, I want to make a small amendment to Myth 1: I believe that technical work which enhances safety culture is generally very positive. Examples of such work include scary demos like "BadLlama," which I cite at least once a week, or benchmarks such as Evaluating Frontier Models for Dangerous Capabilities, which tries to monitor particularly concerning capabilities. More "technical" works like these seem overwhelmingly positive, and I think that we need more competent people doing this.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Two easy things that maybe Just Work to improve AI discourse · 2024-06-09T07:21:37.029Z · LW · GW

Strong agree. I think Twitter, and reposting stuff on other platforms, is still neglected, and this is important for increasing safety culture.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-06-05T08:13:08.609Z · LW · GW

doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed".

I agree that's a bit too much, but it seems to me that we're not at all on the way to stopping open source development, and that we need to stop it at some point; maybe you think ARA is a bit early, but I think we need a red line before AI becomes human-level, and ARA is one of the last arbitrary red lines before everything accelerates.

But I still think the "point of no return leading to loss of control" framing seems pretty fair to me, because it might be very hard to stop an ARA agent.

Link here, and there are other comments in the same thread. Was on my laptop, which has twitter blocked, so couldn't link it myself before.

I agree with your comment on Twitter that evolutionary forces are very slow compared to deliberate design, but that is not what I wanted to convey (that's my fault). I think an ARA agent would not only depend on evolutionary forces, but also on the whole open-source community finding new practical ways to quantize, prune, distill, and run the model in a distributed way. I think the main drivers of this "evolution" would be the open-source community & libraries that will want to create good "ARA" agents, and huge economic incentives will make agentic AIs more and more common and easy in the future.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-06-01T00:38:38.956Z · LW · GW

Thanks for this comment, but I think this might be a bit overconfident.

constantly fighting off the mitigations that humans are using to try to detect them and shut them down.

Yes, I have no doubt that if humans implement some kind of defense, this will slow down ARA a lot. But:

  • 1) It’s not even clear people are going to try to react in the first place. As I said, most AI development is positive. If you implement regulations to fight bad ARA, you are also hindering the whole ecosystem. It’s not clear to me that we are going to do something about open source: you need a big warning shot beforehand, and it's not really clear to me that this happens before a catastrophic level. It's clear they're going to react to some kinds of ARA (like ChaosGPT), but there might be some ARAs they won't react to at all.
  • 2) It’s not clear this defense (say, for example, Know Your Customer for providers) is going to be sufficiently effective to completely clean up the whole mess. If the AI is able to hide successfully on laptops and cooperate with some humans, it is going to be really hard to shut it down. We would have to live with this endemic virus. The only way around this is cleaning up the virus with some sort of pivotal act, but I really don’t like that.

  While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources.

"at the same rate" not necessarily. If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop. The real crux is how much time the ARA AI needs to evolve into something scary.

Superintelligences could do all of this, and ARA of superintelligences would be pretty terrible. But for models in the broad human or slightly-superhuman ballpark, ARA seems overrated, compared with threat models that involve subverting key human institutions.

We don't learn much here. From my side, I think that superintelligence is not going to be neglected, and big labs are taking this seriously already. I’m still not clear on ARA.

Remember, while the ARA models are trying to survive, there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement). These seem much more concerning.

This is not the central point. The central point is:

  • At some point, ARA is unshutdownable unless you try hard with a pivotal cleaning act. We may be stuck with a ChaosGPT forever, which is not existential, but pretty annoying. People are going to die.
  • The ARA evolves over time. Maybe this evolution is very slow, maybe fast. Maybe it plateaus, maybe it does not plateau. I don't know.
  • This may take an indefinite number of years, but it can become a problem.

the "natural selection favors AIs over humans" argument is a fairly weak one; you can find some comments I've made about this by searching my twitter.

I’m pretty surprised by this. I’ve tried to Google it and haven't found anything.

 

Overall, I think this still deserves more research

Comment by Charbel-Raphaël (charbel-raphael-segerie) on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-05-31T15:06:38.677Z · LW · GW

Why not! There are many, many questions that were not discussed here because I just wanted to focus on the core part of the argument. But I agree details and scenarios are important, even if I think this shouldn't change the basic picture depicted in the OP too much.

Here are some important questions that were deliberately omitted from the Q&A, for the sake of not including stuff that fluctuates too much in my head:

  1. would we react before the point of no return?
  2. Where should we place the red line? Should this red line apply to labs?
  3. Is this going to be exponential? Do we care?
  4. What would it look like if we used a counter-agent that was human-aligned?
  5. What can we do about it now concretely? Is KYC something we should advocate for?
  6. Don’t you think an AI capable of ARA would be superintelligent and take-over anyway?
  7. What are the short-term bad consequences of early ARA? What does the transition scenario look like?
  8. Is it even possible to coordinate worldwide if we agree that we should?
  9. How much human involvement will be needed in bootstrapping the first ARAs?

We plan to write more about these with @Épiphanie Gédéon  in the future, but first it's necessary to discuss the basic picture a bit more.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Awakening · 2024-05-30T08:20:55.635Z · LW · GW

Thanks for writing this.

I like your writing style, this inspired me to read a few more things

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Brainstorming positive visions of AI · 2024-05-01T15:25:14.466Z · LW · GW

Seems like we are here today

Comment by Charbel-Raphaël (charbel-raphael-segerie) on AI Safety Camp final presentations · 2024-05-01T08:20:57.247Z · LW · GW

are the talks recorded?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Constructability: Plainly-coded AGIs may be feasible in the near future · 2024-04-27T23:07:34.342Z · LW · GW

Corrected

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Constructability: Plainly-coded AGIs may be feasible in the near future · 2024-04-27T16:23:20.144Z · LW · GW

[We don't think this long term vision is a core part of constructability, this is why we didn't put it in the main post]

We asked ourselves what we should do if constructability works in the long run.

We are unsure, but here are several possibilities.

Constructability could lead to different possibilities depending on how well it works, from most to least ambitious:

  1. Using GPT-6 to implement GPT-7-white-box (foom?)
  2. Using GPT-6 to implement GPT-6-white-box
  3. Using GPT-6 to implement GPT-4-white-box
  4. Using GPT-6 to implement Alexa++, a humanoid housekeeper robot that cannot learn
  5. Using GPT-6 to implement AlexNet-white-box
  6. Using GPT-6 to implement a transparent expert system that filters CVs without using protected features

Comprehensive AI services path

We aim to reach the level of Alexa++, which would already be very useful: no more breaking your back to pick up potatoes. Compared to the robot Figure01, which could kill you if your neighbor jailbreaks it, our robot seems safer and would not have the capacity to kill, but only to put the plates in the dishwasher, in the same way that today’s Alexa cannot insult you.

Fully autonomous AGI, even if transparent, is too dangerous. We think that aiming for something like Comprehensive AI Services would be safer. Our plan would be part of this, allowing for the creation of many small capable AIs that may compose together (for instance, in the case of a humanoid housekeeper, having one function to do the dishes, one function to walk the dog, …).

Alexa++ is not an AGI, but it is already fine. It even knows how to do a backflip, Boston Dynamics style. Not enough for a pivotal act, but so stylish. We can probably have a nice world without AGI in the wild.

The Liberation path

Another possible moonshot theory of impact would be to replace GPT-7 with GPT-7-plain-code. Maybe there's a "liberation speed n" at which we can use GPT-n to directly code GPT-p with p>n. That would be super cool because this would free us from deep learning.

Different long term paths that we see with constructability.

Guided meditation path

You are not really enlightened if you are not able to code yourself. 

Maybe we don't need to use something as powerful as GPT-7 to begin this journey.

We think that with significant human guidance, and by iterating many many times, we could meander iteratively towards a progressive deconstruction of GPT-5.

We could use current models as a reference to create slightly more transparent and understandable models, and use them as reference again and again until we arrive at a fully plain-coded model.
  • Going from GPT-5 to GPT-2-hybrid seems possible to us.
  • Improving GPT-2-hybrid to GPT-3-hybrid may be possible with the help of GPT-5?
  • ...

If successful, this path could unlock the development of future AIs using constructability instead of deep learning. If constructability done right is more data efficient than deep learning, it could simply replace deep learning and become the dominant paradigm. This would be a much better endgame position for humans to control and develop future advanced AIs.

Path | Feasibility | Safety
Comprehensive AI Services | Very feasible | Very safe but unstable in the very long run
Liberation | Feasible | Unsafe but could enable a pivotal act that makes things stable in the long run
Guided Meditation | Very hard | Fairly safe and could unlock a safer tech than deep learning, which results in a better end-game position for humanity

Comment by Charbel-Raphaël (charbel-raphael-segerie) on A Dilemma in AI Suffering/Happiness · 2024-04-24T08:53:41.970Z · LW · GW

You might be interested in reading this. I think you are reasoning in an incorrect framing. 

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Effectively Handling Disagreements - Introducing a New Workshop · 2024-04-15T17:23:34.614Z · LW · GW

I have tried Camille's in-person workshop in the past and was very happy with it. I highly recommend it. It helped me discover many unknown unknowns.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on What convincing warning shot could help prevent extinction from AI? · 2024-04-15T09:41:52.748Z · LW · GW

Deleted paragraph from the post, that might answer the question:

Surprisingly, the same study found that even if there were an escalation of warning shots that ended up killing 100k people or causing >$10 billion in damage (definition), skeptics would only update their estimate from 0.10% to 0.25% [1]. There is a lot of inertia, we are not even sure this kind of “strong” warning shot would happen, and I suspect such a big warning shot could occur beyond the point of no return, because this type of warning shot requires autonomous replication and adaptation abilities in the wild.

  1. ^

    It may be because they expect a strong public reaction. But even if there were a 10-year global pause, what would happen after the pause? This explanation does not convince me. Did governments prepare for the next COVID? 

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-08T21:00:31.876Z · LW · GW

in your case, you felt the problem, until you decided that an AI civilization might spontaneously develop a spurious concept of phenomenal consciousness. 


This is currently the best summary of the post.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-08T11:32:13.795Z · LW · GW

Thanks for jumping in! And I'm not really struggling with this emotionally; it was more of a nice puzzle, so don't worry about it :)

I agree my reasoning is not clean in the last chapter.

To me, the epiphany was that AI would rediscover everything, just as it rediscovered chess on its own. As I said in the box, this is a strong blow to non-materialist positions, and I did not emphasize this enough in the post.

I expect AI to be able to create "civilizations" (sort of) of its own in the future, with AI philosophers, etc.

Here is a snippet of my answer to Kaj; let me know what you think about it:

I'm quite confident that the meta-problem and the easy problems of consciousness will eventually be fully solved through advances in AI and neuroscience. I've written extensively about AI and the path to autonomous AGI here, and I would ask people: "Yo, what do you think AI is not able to do? Creativity? Ok do you know....". At the end of the day, I would aim to convince them that anything humans are able to do can be reconstructed with AIs. I'd put my confidence level for this at around 95%. Once we reach that point, I agree it will become increasingly difficult to argue that the hard problem of consciousness is still unresolved, even if part of my intuition remains somewhat perplexed. Maintaining a belief in epiphenomenalism when all the "easy" problems have been solved is a tough position to defend - I'm about 90% confident of this.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-08T11:24:23.172Z · LW · GW

Thank you for clarifying your perspective. I understand you're saying that you expect the experiment to resolve to "yes" 70% of the time, making you 70% eliminativist and 30% uncertain. You can't fully update your beliefs based on the hypothetical outcome of the experiment because there are still unknowns.

For myself, I'm quite confident that the meta-problem and the easy problems of consciousness will eventually be fully solved through advances in AI and neuroscience. I've written extensively about AI and the path to autonomous AGI here, and I would ask people: "Yo, what do you think AI is not able to do? Creativity? Ok do you know....". At the end of the day, I would aim to convince them that anything humans are able to do can be reconstructed with AIs. I'd put my confidence level for this at around 95%. Once we reach that point, I agree it will become increasingly difficult to argue that the hard problem of consciousness is still unresolved, even if part of my intuition remains somewhat perplexed. Maintaining a belief in epiphenomenalism when all the "easy" problems have been solved is a tough position to defend - I'm about 90% confident of this.

So while I'm not a 100% committed eliminativist, I'm at around 90% (whereas I was at 40% in chapter 6 of the story). Yes, even after considering the ghost argument, there's still a small part of my thinking that leans towards Chalmers' view. However, the more progress we make in solving the easy and meta-problems through AI and neuroscience, the more untenable it seems to insist that the hard problem remains unaddressed.

a non-eliminativist might be perfectly willing to grant that yes, we can build the entire pyramid, while also holding that merely building the pyramid won't tell us anything about the hard problem nor the meta-problem.

I actually think a non-eliminativist would concede that building the whole pyramid does solve the meta-problem. That's the crucial aspect. If we can construct the entire pyramid, with the final piece being the ability to independently rediscover the hard problem in an experimental setup like the one I described in the post, then I believe even committed non-materialists would be at a loss and would need to substantially update their views.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T16:43:46.528Z · LW · GW

Hmm, I don't understand something, but we are getting closer to the crux :)

 

You say:

  1. To the question, "Would you update if this experiment is conducted and is successful?" you answer, "Well, it's already my default assumption that something like this would happen". 
  2. To the question, "Is it possible at all?" you answer 70%. 

So, you answer 99-ish% to the first question and 70% to the second; this seems inconsistent.

It seems to me that you don't bite the bullet for the first question if you expect this to happen. Saying, "Looks like I was right," seems to me like you are dodging the question.

That sounds like it would violate conservation of expected evidence:

Hmm, it seems there is something I don't understand; I don't think this violates that law.

 

I don't see how it does? It just suggests that a possible approach by which the meta-problem could be solved in the future.

I agree I only gave a sketch of the proof; it seems to me that if you can build the pyramid, brick by brick, then this solves the meta-problem.

For example, when I give the example of the meta-cognition brick, I point to a paper that already implements this in an LLM (and I don't find it mysterious, because I know roughly how I would implement a database that behaves like this).

And it seems all the other bricks are "easily" implementable.
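
To illustrate what "a database that behaves like this" could mean for the meta-cognition brick, here is a deliberately naive sketch (our toy example, not the paper's implementation): the system records its own intermediate judgments and can later query them.

```python
import sqlite3

class MetaCognitionBrick:
    """Naive sketch of a meta-cognition layer: the system logs its own
    intermediate judgments (claims plus confidences) and can later query them,
    e.g. to notice that it was uncertain about a step."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE thoughts (step INTEGER, claim TEXT, confidence REAL)")

    def record(self, step: int, claim: str, confidence: float) -> None:
        self.db.execute("INSERT INTO thoughts VALUES (?, ?, ?)", (step, claim, confidence))

    def uncertain_claims(self, threshold: float = 0.6):
        cur = self.db.execute(
            "SELECT step, claim, confidence FROM thoughts WHERE confidence < ?", (threshold,)
        )
        return cur.fetchall()

brick = MetaCognitionBrick()
brick.record(1, "The answer is 42", 0.9)
brick.record(2, "My previous reasoning step may contain an error", 0.4)
print(brick.uncertain_claims())  # the system can now 'reflect' on its own low-confidence steps
```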

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T10:39:08.712Z · LW · GW

Let's put aside ethics for a minute.

"But it wouldn't be necessary the same as in a human brain."

Yes, this wouldn't be the same as the human brain; it would be like the Swiss cheese pyramid that I described in the post.

Your story ended on stating the meta problem, so until it's actually solved, you can't explain everything.

Take a look at my answer to Kaj Sotala and tell me what you think.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T10:29:53.727Z · LW · GW

Thank you for the kind words!

Saying that we'll figure out an answer in the future when we have better data isn't actually giving an answer now.

Okay, fair enough, but I predict this would happen: in the same way that AlphaZero rediscovered chess theory on its own, it seems to me that if you just let the AIs grow, you can create a civilization of AIs. Those AIs would have to create some form of language or communication, some AI philosophers would emerge, and they would end up talking about the hard problem.

I'm curious how you answer those two questions:

  1. Let's say we implement this simulation in 10 years and everything works the way I'm telling you now. Would you update?
  2. What is the probability that this simulation is possible at all? 

If you expect to update in the future, just update now.  

To me, this thought experiment solves the meta-problem and so dissolves the hard problem.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T06:56:01.019Z · LW · GW

But I have no way to know or predict if it is like something to be a fish or GPT-4

But I can predict what you say; I can predict whether you are confused by the hard problem just by looking at your neural activations; I can predict, word by word, the next sentence you are about to utter: "The hard problem is really hard."

I would be curious to know what you think about the box solving the meta-problem just before the addendum. Do you think it is unlikely that AI would rediscover the hard problem in this setting?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T06:48:46.715Z · LW · GW

I would be curious to know what you think about the box solving the meta-problem just before the addendum.

Do you think it is unlikely that AI would rediscover the hard problem in this setting?

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T06:44:49.187Z · LW · GW

I'm not saying that LeCun's rosy views on AI safety stem solely from his philosophy of mind, but yes, I suspect there is something there.

When he says things like "LLMs don't display true understanding" or "true reasoning", as if there were some secret sauce that can only appear in his Jepa architecture or whatever, it seems to me that this is very similar to the linguistic problems I've observed around consciousness.

Surely, if you were to discuss this with him, he would say things like "No, this is not just a linguistic debate, LLMs cannot reason at all, my cat reasons better" - which, to me, indicates precisely a linguistic debate.

It seems to me that LeCun is basically an essentialist about his Jepa architecture, treating it as the main criterion for a neural network to exhibit "reasoning".

LeCun's algorithm is something like: "Jepa + Not LLM -> Reasoning".

My algorithm is more something like: "chain-of-thought + can solve complex problems + many other things -> reasoning".
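
To make the contrast concrete, here is a caricature of the two criteria as predicates (the field names are made up; neither function is anyone's actual position in detail):

```python
# Caricature of the two "reasoning" criteria as predicates over a model description.
# The point is that the disagreement is about which features get to count,
# i.e. a definitional dispute, not an empirical one.

def lecun_says_it_reasons(model: dict) -> bool:
    return model.get("architecture") == "Jepa" and not model.get("is_llm", False)

def my_criterion_for_reasoning(model: dict) -> bool:
    return (model.get("uses_chain_of_thought", False)
            and model.get("solves_complex_problems", False)
            and model.get("other_relevant_properties", False))

llm = {"architecture": "transformer", "is_llm": True,
       "uses_chain_of_thought": True, "solves_complex_problems": True,
       "other_relevant_properties": True}

print(lecun_says_it_reasons(llm), my_criterion_for_reasoning(llm))  # False True
```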

This is very similar to the story I tell for consciousness in the Car Circuit section here.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-07T06:24:58.044Z · LW · GW

Sure, "everything is a cluster" or "everything is a list" is as right as "everything is emergent". But what's the actual justification for pruning that neuron? You can prune everything like that.

The justification for pruning this neuron is that if you can explain basically everything without using a dualistic view, that view is so much simpler. Both hypotheses are possible, but you want to go with the simpler one, and a world with only (physical properties) is simpler than a world with (physical properties + mental properties).

I would be curious to know what you think about my box trying to solve the meta-problem. 

Do you mean that the original argument that uses zombies leads only to epiphenomenalism, or that if zombies were real that would mean consciousness is epiphenomenal, or what?

Both

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-06T21:47:07.795Z · LW · GW

Comment by Charbel-Raphaël (charbel-raphael-segerie) on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-06T18:02:52.461Z · LW · GW

I don't know; it depends on your definition of "unsolved" and "solved", but I would lean towards "there is a solved hard problem", because the problem was hard, it took me a lot of time (i.e., the meme of the hard problem existed in my head), and my post finally dissolved the question.

Comment by Charbel-Raphaël (charbel-raphael-segerie) on Three Worlds Decide (5/8) · 2024-03-09T13:46:01.224Z · LW · GW

When there are difficult decisions to be made, I like to come back to this story.