Comments
I am not from the US, so I don't know anything about the organizations that you have listed. However, we can look at three main conventional sources of existential risk (excluding AI safety for now; we will come back to it later):
- Nuclear Warfare - Cooperation strategies + decision theory are active academic fields.
- Climate Change - This is a very hot topic right now, and a lot of research is being put into it.
- Pandemics - There was quite a bit of research before COVID, but even more now.
As to your point about Hassabis not doing other projects for reducing existential risk besides AI alignment:
Most of the people on this website (at least, that's my impression) have rather short timelines. This means that work on a lot of existential risks becomes much less critical. For example, while supervolcanoes are scary, they are unlikely to wipe us out before we get an AI powerful enough to be able to solve this problem more efficiently than we can.
As to your point about Sam Altman choosing to cure cancer over reducing x-risk:
I don't think Sam looks forward to the prospect of dying due to some x-risk, does he? After all, he is just as human as we are, and he, too, would hate the metaphorical eruption of a supervolcano. But that's going to become much more critical after we have an aligned and powerful AI on our hands. This also applies to LeCun/Zuckerberg/random OpenAI researchers. They (or at least most of them) seem to be enjoying their lives and wouldn't want to randomly lose them.
As to your last paragraph:
I think your formulation of the AI's response ("preventing existential risks is very hard and fraught, but hey, what if I do a global mass persuasion campaign") really undersells it. It almost makes it sound like a mass persuasion campaign is a placebo to make the AI's owner feel good.
Consider another response: "the best option for preventing existential risks would be doing a mass persuasion campaign of such and such content, which would have such and such effects and would prevent such and such risks".
My (naive) reaction (assuming that I think that the AI is properly aligned and is not hallucinating) would be to ask more questions about the details of this proposal and, if its answers make sense, to proceed with doing it. Is it that crazy? I think doing a persuasion campaign is in itself not necessarily bad (we deal with them every day in all sorts of contexts, ranging from your relatives convincing you to take a day off, to people online making you change your mind on some topic, to politicians running election campaigns), and it is especially not bad when it is the best option for preventing x-risk.
In the scenario that you described, I am not sure at all that an average person who would be able to get their hands on AI (this is an important point, because I believe these people to be generally smarter/more creative/more resourceful) would opt for the second option, especially after being told: "Well, I could try something much more low-key and norm-following, but it probably won't work".
Option 1: Good odds of preventing x-risk, but apparently also doesn't work in movies.
Option 2: Bad odds of preventing x-risk.
Perhaps the crux is that I think that people who actually have good chances of controlling powerful AI (Sam Altman, Dario Amodei, etc.) would see that option #1 is obviously better if they themselves want to survive (something that is instrumental to whatever other goals they have, like curing cancer).
Furthermore, there is a more sophisticated solution to this problem. I can ask this powerful AI to find the best representations of human preferences and then find the best way to align itself to them. Then, this AI would no longer be constrained by my cognitive limitations.
An average person probably wouldn't come up with it by themselves, but they don't need to. Presumably, if this is indeed a good solution, it would be suggested at the first stage, where I ask the AI to find the best solutions for preventing existential risk. Would an average person take this option? I expect our intuitions to diverge here much more strongly than in the previous case. Looking at myself, I was an average person just a few months ago. I have no expertise in AI whatsoever. I don't consider myself to be among the smartest humans on Earth. I haven't read much literature on AI safety. I was basically an exemplary "normie".
And yet, when I read about the idea of an instruction-following AI, I almost immediately came up with this solution of making an AI aligned with human preferences using the aforementioned instruction-following AI. It sidesteps a lot of misuse concerns, which (at the time) seemed scary.
I grant that an average person might not come up with this solution by themselves, but presumably (if this solution is any good), it would be suggested by AI itself when we ask it to come up with solutions for x-risk.
An average person could ask this AI how to ensure that this endeavor won't end up in a scenario like the one in the movie. Then they can cross-check its answer with the security team that successfully aligned it.
My conclusion is basically that I would expect a good chunk of average people to opt for the sophisticated solution after the AI suggests it, an even bigger chunk of average people to opt for the naive solution, and an even bigger chunk of the people who would most likely control powerful AI in the future (generally smarter than an average person) to opt for either solution.
Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.
It’s fine to talk about that scenario, and I hope it comes to pass! But in addition to the question of what those wise people should do, if they exist, we should also be concerned about the possibility that the people with instruction-following AIs will not be spreading wisdom and foresight in the first place.
I don't think that whoever is using these AI powers (let's call him Alex) needs to be that wise (beyond the wisdom of an average person who could get their hands on a powerful AI, which is probably higher than average).
Alex doesn't need to come up with @Noosphere89's proposed solution of persuasion campaigns all by himself. Alex merely needs to ask his AI what the best solutions for preventing existential risks are. If Noosphere's proposal is indeed wise, then the AI would suggest it. Alex could then implement this solution.
Alex doesn't necessarily need to want to spread wisdom and foresight in this scheme. He merely needs to want to prevent existential risks.
Not OP, but for me, it comes down to LLMs correctly interpreting the intent behind my questions/requests. In other words, I don't need to be hyper-specific in my prompts in order to get the results I want.
By value alignment, which one of the following do you mean?
1) alignment with humanity's CEV
2) alignment with some other specified way of interpreting humanity's values
3) alignment with some unspecified way of interpreting humanity's values that future AGI/ASI will hopefully come up with
This article seems to have found a big weakness of untrusted monitoring:
Synthetic inputs auditing is trivially defeated if the checker AI can distinguish synthetic inputs from real ones, as it could then always answer truthfully on the synthetic inputs to pass auditing, then collude with the generator on real inputs.
To create useful synthetic inputs, either the blue team must be able to match the generator's style well enough to consistently fool the checker, or it must be able to redact the generator's output to the point where the checker has no identifying details to distinguish inputs. The former seems impractical, since faking the inputs at that level of verisimilitude would require the blue team to have significant understanding of the AI's inner workings, at which point there are probably more directly usable alignment strategies. The latter seems difficult, as by the protocol's definition, the AI is working on problems/solutions that humans cannot easily understand, but effective redaction requires humans to understand the content of the solution to some extent.
Overall, I think this represents a significant flaw in the untrusted monitoring protocol.
Do you think that this weakness can be somehow circumvented? If not, do you think it is as fatal (to untrusted monitoring protocol) as the author of the article seems to think?
Maybe it is a controversial take, but I am in favor of publishing all solutions to corrigibility.
If a company failed to come up with solutions for AGI corrigibility, it wouldn't stop building AGI. Instead, it would proceed with ramping up capabilities and end up with a misaligned AGI that (for all we know) would want to turn the universe into paperclips.
Due to instrumental convergence, an AGI whose goals are not explicitly aligned with human goals is going to engage in very undesirable behavior by default.
I think that, perhaps, after reading this clarifying comment from you, @quila would change his perception of your position.
Let me put my attitudes in practical terms: I don't kick dogs, but I have destroyed a wasp's nest in my garage, and I don't donate to animal charities. (I don't donate to many other charities either, but there have been a few.) Let those who think saving the shrimps is worthwhile do so, but I do not travel on that path
This is what I expected. Your take, when put in these terms, seems pretty moderate, whereas when I read your original comment, this take (which presumably stayed the same) seemed very extreme.
In other words, my personal beliefs haven't changed a single bit and yet my perception of your beliefs changed a lot. I can only imagine that your original comment has been so strongly disagree-voted because of the framing.
Agreed!
There is also the issue of clarity: I am not sure whether Richard has a moderate position that sounds very extreme due to the framing, or whether he genuinely holds this extreme position.
I do think that animals, the larger ones at least, can suffer. But I don’t much care.
Does this mean a more moderate take of "I don't care enough to take any actions, because I don't believe that I am personally causing much suffering to animals" or a very radical take of "I would rather take $10 than substantially improve the well-being of all animals"?
I think there might be some double standard going on here.
You seem to not care much about animal well-being, and @quila evidently does, so would it not be fair from Quila's perspective to not care much about your well-being? And if Quila doesn't care about your well-being, then he might think that had you not existed, the world (in a utilitarian sense) would be better.
Quila can similarly say "I don't care for the lives of people who are actively destroying something that I value a lot
Bring on the death threats!"
I will also reply to the edit here.
would you say that "we have qualia" is just a strong intuition imbued in us by the mathematical process of evolution, and your "realist" position is that you choose to accept {referring to it as if it is real} as a good way of communicating about that imbued intuition/process?
I wouldn't say that about qualia and I wouldn't say that about free will. What you described sounds like being only a nominal realist.
This is going into details, but I think there is a difference between semantic realism and nominal realism. A semantic realist would say that, based on a reasonable definition of a term, the term is real. Most if not all realists are semantic realists. Almost nobody would be an apple realist if the reasonable definition of an apple were something like "a fruit that turns you into a unicorn".
Whereas a nominal realist is someone who would say that, based on a reasonable definition of a term, the term probably isn't real, but it's a useful communication tool, so they are going to keep using it.
Using an analogy (specifically for free will), I would say that free will is as real as consent, whereas I would imagine that a nominal realist would say that free will is as real as karma.
By the way, I will quickly reply to your edit in this comment. Determinism certainly seems compatible with qualia being fundamental.
"Hard Determinism" is just a shorthand for a position (in a free will debate) that accepts the following claims:
- Determinism is true.
- Determinism is incompatible with free will.
I am pretty sure that the term has no use outside of discussions about free will.
I will try to reply to your edit:
Also, another thing I'm interested in is how someone could have helped past you (and camp #1 people generally) understand what camp #2 even means. There may be a formal-alignment failure mode not noticeable to people in camp #1 where a purely mathematical (it may help to read that as '3rd-person')
I don't remember the exact specifics, but I came across the Mary's Room thought experiment (perhaps through this video). When presented in that way and when directly asked "does she learn anything new?", my surprising (to myself at the time) answer was an emphatic "yes".
what is qualia, in your ontology
It is the direct phenomenal experience of stuff like pain, pleasure, colors, etc.
what does it mean for something to be irreducible or metaphysically fundamental
Something is irreducible in the sense that it can't be reduced to interactions between atoms. It also can't be completely described from a 3rd-person perspective (the perspective from which science usually operates).
I understand the 'compatibilist' view on free will to be something like this: determinism is true, but we still run some algorithm which deterministically picks between options, so the algorithm that we are is still 'making a choice', which I choose to conceptualize as "free will and determinism are compatible"
I would rather conceptualize it as acting according to my value system without excessive external coercion. But I would also say that I agree with all four claims that you made there, namely:
- Determinism is true
- We run some algorithm which deterministically picks between options
- We are making a choice
- Free will and determinism are compatible
"Anti-realists about free will" and I have no actual disagreement about the underlying reality of determinism being true and us being predictable.
Yeah, I would agree that we are predictable in principle.
If so, is your view on qualia similar, in that you've mostly re-conceptualized it instead of having "noticed a metaphysically fundamental thing"?
I used to think that qualia was a reducible and non-fundamental phenomenon, whereas now I think it is an irreducible and fundamental phenomenon. So I did "notice a metaphysically fundamental thing".
one who believes that those in camp #2 merely have a strong intuition that they have <mysterious-word>
Under this definition I think I would still qualify as a former illusionist about qualia.
Ah, I see. I usually think of illusionism as a position where one thinks that a certain phenomenon isn't real but only appears real.
Under the terminology of your most recent comment, I would be a determinist and a realist about free will.
I used to be a hard determinist (I think that's what people usually mean by "illusionism"), but currently I am a compatibilist.
So yeah, right now I am not an illusionist about free will either.
Has there ever been a case of someone being in camp #1, but eventually realizing, according to their self-report, "I actually do have qualia; it was right in front of me this whole time, but I've only now noticed it as a separate thing needing explanation, even though in retrospect it seems obvious to me"?
This is almost exactly my story. I used to think that qualia was just another example of anti-physicalist magical thinking.
Currently, I am actually pretty firmly in Camp #2
illusionists actually do not experience qualia
As someone who used to be (but no longer is) an illusionist about both free will and qualia, I can assure you that this was not the case (at least with me). Nothing really changed, besides me getting a deeper understanding of the position of the other camp.
It was pointed out, e.g. [here](https://web.archive.org/web/20170918044233/http://files.openphilanthropy.org/files/Grants/MIRI/consolidated_public_reviews.pdf) but unfortunately the mindset is self-reinforcing so there wasn't much to be done.
Your link formatting got messed up here.
More generally when you're asked a question like "How much do you expect [politician's name] to win/lose by?" how do you go about generating your response? How do you generate your predictions?
My usual response would look something like this:
"Pre-scandal polls are showing Bodee leading by about 1-2%, but given the recent spying scandal and the strong emphasis of our society on ethical conduct, I expect him to lose by 3-4%."
If I am speaking to a person who (on my expectations) is following the polls, I would try to explain why my prediction differs from them.
But you seem to mean something broader with "possible worlds." Something like "in theory, there is a physically possible arrangement of atoms/energy states that would result in an 'aligned' AGI, even if that arrangement of states might not be reachable from our current or even a past world."
–> Am I interpreting you correctly?
Yup, that's roughly what I meant. However, one caveat would be that I would change "physically possible" to "metaphysically/logically possible" because I don't know if worlds with different physics could exist, whereas I am pretty sure that worlds with different metaphysical/logical laws couldn't exist. By that, I mean stuff like the law of non-contradiction and "if a = b, then b = a."
You saying this shows the ambiguity here of trying to understand what different people mean. One researcher can make a technical claim about the possibility/tractability of "alignment" that is similarly worded to a technical claim others made. Yet their meaning of "alignment" could be quite different.
It's hard then to have a well-argued discussion, because you don't know whether people are equivocating (ie. switching between different meanings of the term).
I think the main antidote against this is to ask the person you are speaking with to define the term if they are making claims in which equivocation is especially likely.
The way I deal with the wildly varying uses of the term "alignment" is to use a minimum definition that most of those six interpretations are consistent with
Yeah, that's reasonable.
Listeners would understandably infer that "these well-informed people apparently don't really worry much, maybe I shouldn't worry much either".
I think this is countered to a great extent by all the well-informed people who worry a lot about AI risk. I think the "well-informed people apparently disagree on this topic, I better look into it myself" environment promotes inquiry and is generally good for truth-seeking.
More generally, I agree with @Neel Nanda, it seems somewhat doubtful that people listening to a very niche Epoch Podcast aren't aware of all the smart people worried about AI risk.
Future humans in specific, will at least one of: [ die off early, run lots of high-fidelity simulations of our universe's history ["ancestor-simulations"], decide not to run such simulations ]
Is there any specific reason the first option is "die off early" and not "be unable to run lots of high-fidelity simulations"? The latter encompasses the former as well as scenarios where future humans survive but for one reason or the other can't run these simulations.
A more general argument, in my opinion, would look like this:
"Future humans will do at least one of the following: 1) be unable to run lots of high-fidelity simulations, 2) be unwilling to run such simulations, or 3) run such simulations."
Here is Richard Carrier's response to avturchin's comment (as a courtesy to those reading this thread later):
These do not respond to my article. Just re-stating what was refuted (that most civilizations will be irredeemably evil and wasteful) is not a rebuttal. It’s simply ignoring the argument.
I suggest sending this as a comment under his article if you haven't already. I am similarly interested in his response.
Richard Carrier has written an entire rebuttal of the argument on his blog. He always answers comments under his posts, so if you disagree with some part of the rebuttal, you can just leave a comment there explaining which claim makes no sense to you. Then, he will usually defend the claim in question or provide the necessary clarification.
By "possible worlds," I mean all worlds that are consistent with laws of logic, such as the law of non-contradiction.
For example, it might be the case that, for some reason, alignment would have been solved if and only if Abraham Lincoln hadn't been assassinated in 1865. That means that humans in 2024 in our world (where Lincoln was assassinated in 1865) will not be able to solve alignment, despite it being solvable in principle.
My answer is kind of similar to @quila's. I think that he means roughly the same thing by "space of possible mathematical things."
I don't think that my definition of alignment is particularly important here because I was mostly clarifying how I would interpret the sentence if a stranger said it. Alignment is a broad word, and I don't really have the authority to interpret a stranger's words in a specific way without accidentally misrepresenting them.
For example, one article managed to find six distinct interpretations of the word:
P1: Avoiding takeover from emergent optimization in AI agents
P2: Ensuring that AI's information processing (and/or reasoning) is intelligible to us
P3: Ensuring AIs are good at solving problems as specified (by user or designer)
P4: Ensuring AI systems enhance, and don't erode, human agency
P5: Ensuring that advanced AI agents learn a human utility function
P6: Ensuring that AI systems lead to desirable systemic and long-term outcomes
The claim "alignment is solvable in principle" means "there are possible worlds where alignment is solved."
Consequently, the claim "alignment is unsolvable in principle" means "there are no possible worlds where alignment is solved."
A new method for reducing sycophancy. Sycophantic behavior is present in quite a few AI threat models, so it's an important area to work on.
The article not only uses activation steering to reduce sycophancy in AI models but also provides directions for future work.
Overall, this post is a valuable addition to the toolkit of people who wish to build safe advanced AI.
Providing for material needs is less than 0.0000001% of the range of powers and possibilities that an AGI/ASI offers.
Imagine a scenario where we are driving from Austin to Fort Worth. The range of outcomes where we arrive at our destination is perhaps less than 0.0000001% of the total range of outcomes. There are countless potential interruptions that might prevent us from arriving at Fort Worth: traffic accidents, vehicle breakdowns, family emergencies, sudden illness, extreme weather, or even highly improbable events like alien encounters. The universe of possible outcomes is vast, and arriving safely in Fort Worth represents just one specific path through this possibility space.
Yet despite this, we reasonably expect to complete such journeys under normal circumstances. We don't let the theoretical multitude of failure modes prevent us from making the trip. Drives like this typically succeed when basic conditions are met - a functioning vehicle, decent weather, and an alert driver.
So, as Seth Herd correctly points out, it all ends up depending on whether we can manage to align AGI (and deal with other issues such as governance or economy). And that's a very large topic with a very wide range of opinions.
This article provides a concrete, object-level benchmark for measuring the faithfulness of CoT. In addition to that, a new method for improving CoT faithfulness is introduced (something that is mentioned in a lot of alignment plans).
The method is straightforward and relies on breaking questions into subquestions. Despite its simplicity, it is surprisingly effective.
In the future, I hope to see alignment plans relying on CoT faithfulness incorporate this method into their toolkit.
An (intentionally) shallow collection of AI alignment agendas that different organizations are working on.
This is the post I come back to when I want to remind myself what agendas different organizations are pursuing.
Overall, it is a solid and comprehensive post that I found very useful.
A very helpful collection of measures humanity can take to reduce risk from advanced AI systems.
Overall, this post is both detailed and thorough, making it a good read for people who want to learn about potential AI threat models as well as promising countermeasures we can take to avoid them.
I guess the word "reality" is kind of ambiguous, and maybe that's why we've been disagreeing for so long.
For example, imagine a scenario where we have 1) a non-simulated base world (let's say 10¹² observers in it) AND 2) a simulated world with 10¹¹ observers AND 3) a simulated world with 1 observer. All three worlds actually concretely exist. People from world #1 just decided to run two simulations (#2 and #3). Surely, in this scenario, as per SSA, I can say that I am a randomly selected observer from the set of all observers. As far as I see, this "set of all observers" would include 10¹² + 10¹¹ + 1 observer because all of these observers actually exist, and I could've been born as any one of them.
Edit 1: I noticed that you edited one of your replies to include this:
And note: one potential problem with your reasoning is that if we take it to it's logical extreme, it would be 100% certain that we are living in a simulation with infinite invisible observers. Because infinity dominates all the finite possibilities.
I don't actually think this is true. My reasoning only really says that we are most likely to exist in the world with the most observers as compared to other actual worlds, not other possible worlds.
The most you can get out of this is the fact that, conditional on a simulation with infinite observers existing, we are most likely in that simulation. However, because of the weirdness of actual infinity, because of the abysmal computational costs (it's one thing to simulate billions of observers and another thing to simulate an infinity of observers), and because it is probably physically impossible, I put an incredibly low prior on a simulation with infinite observers actually existing. And if it doesn't exist, then we are not in it.
Edit 2: You don't even need to posit a 10¹¹-observer simulation for it to be unlikely that you are in an "only one observer" simulation. It is enough that the non-simulated world has multiple observers. To illustrate what I mean, imagine that a society in a non-simulated world with 10¹² observers decides to make a simulation with only 1 observer. The odds are overwhelming that you'd be among the 10¹² mundane, non-distinct observers in the non-simulated world.
The way I understand it, the main difference between SIA and SSA is the fact that in SIA "I" may fail to exist. To illustrate what I mean, I will have to refer to "souls" just because it's the easiest thing I can come up with.
SSA: There are 10¹¹ + 1 observers and 10¹¹ + 1 souls. Each soul gets randomly assigned to an observer. One of the souls is you. The probability of you existing is 1. You cannot fail to exist.
SIA: There are 10¹¹ + 1 observers and a very large (much larger than 10¹¹ + 1) amount of souls. Let's call this amount N. Each soul gets assigned to an observer. One of the souls is you. However, in this scenario, you may fail to exist. The probability of you existing is (10¹¹ + 1)/N
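To make the difference (as I understand it) concrete, here is a minimal sketch; the two worlds are the ones from this example, and the size N of the soul pool is an arbitrary placeholder:

```python
# Minimal sketch of the SSA/SIA difference as described above.
# Two worlds actually exist: a "big" one with 10**11 observers and a "small" one with 1.
BIG = 10**11    # observers in the big world
SMALL = 1       # observers in the small world
N = 10**15      # SIA soul pool size; arbitrary placeholder, just much larger than BIG + SMALL

# SSA: you are guaranteed to exist and are a random sample from the actual observers.
p_exist_ssa = 1.0
p_small_world_ssa = SMALL / (BIG + SMALL)    # ~1e-11

# SIA (in the "souls" framing above): your soul may fail to get an observer at all.
p_exist_sia = (BIG + SMALL) / N              # ~1e-4
p_small_world_sia = SMALL / (BIG + SMALL)    # conditional on existing, same ratio in this framing

print(p_exist_ssa, p_small_world_ssa)
print(p_exist_sia, p_small_world_sia)
```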
But the thing is that there is a matter of fact about whether there are other observers in our world if it is simulated. Either you are the only observer or there are other observers, but one of them is true. Not just potentially true, but actually true.
The same is true of my last paragraph in the original answer (although, perhaps, I could've used clearer wording). If, as a matter of fact, there actually exist 10¹¹ + 1 observers, then you are more likely to be in the 10¹¹ group as per SSA. We don't know if there are actually 10¹¹ + 1 observers, but that is merely an epistemic gap.
Can you please explain why my explanation requires SIA? From a quick Google search: "The Self Sampling Assumption (SSA) states that we should reason as if we're a random sample from the set of actual existent observers"
My last paragraph in my original answer was talking about a scenario where simulators have actually simulated a) a world with 1 observer AND b) a world with 10¹¹ observers. So the set of "actual existent observers" includes 1 + 10¹¹ observers. You are randomly selected from that, giving you 1:10¹¹ odds of being in the world where you are the only observer. I don't see where SIA comes into play here.
but the problem with your argument is that my original argument could also just consider only simulations in which I am the only observer. In which case Pr(I'm distinct | I'm in a simulation)=1, not 0.5. And since there's obviously some prior probability of this simulation being true, my argument still follows.
But then this turns Pr(I'm in a simulation) into Pr(I'm in a simulation) × Pr(only simulations with one observer exist | simulations exist). It's not enough that a simulation with only one observer exists. It also needs to be the case that simulations with multiple observers don't exist. For example, if there is just one simulation with a billion observers, it heavily skews the odds in favor of you not being in a simulation with just one observer.
And I am very much willing to say that Pr(I'm in a simulation) × Pr(only simulations with one observer exist | simulations exist) is going to be lower than Pr(I'm distinct | I'm not in a simulation).
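As a rough numeric illustration of that comparison (every number below is a made-up placeholder rather than an estimate):

```python
# Illustrative placeholders only; none of these numbers are estimates.
p_sim = 0.2                      # assumed: Pr(I'm in a simulation)
p_only_one_observer_sims = 1e-4  # assumed: Pr(only one-observer simulations exist | simulations exist)
p_distinct_not_sim = 1e-4        # the 0.0001 figure from the parent comment

p_in_one_observer_sim = p_sim * p_only_one_observer_sims   # 2e-5
print(p_in_one_observer_sim < p_distinct_not_sim)          # True under these made-up numbers
```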
I now think my actual error is saying Pr(I'm distinct | I'm not in a simulation)=0.0001, when in reality this probability should be 1, since I am not a random sample of all humans (i.e., SSA is wrong), I am me. Is that clear?
That answer seems reasonable to me. However, I think that there is value in my answer as well: it works even if SSA (the "least favorable" assumption) is true.
No, I don't think it is.
- Imagine a scenario in which the people running the simulation decided to simulate every human on Earth as an actual observer.
In this case, Pr(I'm distinct | I'm not in a simulation) = Pr(I'm distinct | I'm in a simulation) because no special treatment has been shown to you. If you think it is very unlikely that you just happened to be distinct in a real world, then in this scenario, you ought to think that it is very unlikely that you just happened to be distinct in a simulated world.
- I think what you are actually thinking about is a scenario where only you are an actual observer, whereas everyone else is a p-zombie (or an NPC if you wish).
But this scenario also raises a few questions. Why would the simulators make you a real observer and everyone else a p-zombie? Apparently, p-zombies are able to carry out any tasks that are useful for observation just as well as actual observers.
But even leaving that aside, it is unclear why your Pr(I'm distinct | I'm in a simulation) is so high. Computer-game-style simulations where you are "the main character" are plausible, but so are other types of simulations. For example, imagine a civilization wanting to learn about its past and running a simulation of its history. Or imagine a group of people who want to run an alt-history simulation. Perhaps they want to see what could've happened had Nazi Germany won WW2 (specifics are irrelevant here). Clearly, in the alt-history simulation, there would be no "main characters," so it would be plausible that everyone would be an actual observer (as opposed to there being only one actual observer). And let's also imagine for a second that the alt-history simulation has one hundred billion (10¹¹) observers in total. The chances of you being the only real observer in this world vs. being one of a hundred billion real observers in the alt-history world are 1:10¹¹.
And from here, we can generalize. If it is plausible that simulators would ever run a simulation with 10¹¹ observers (or any other large number), then it would require 10¹¹ (or that same large number of) simulations with only one observer each to match the odds of you being in a "one observer" simulation.
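And a quick sanity check of that last claim, assuming SSA-style counting over all actually existing observers:

```python
# If one simulation has 10**11 observers and there are k simulations with 1 observer each,
# then (counting over all actually existing observers) the odds of being in some
# one-observer simulation are k : 10**11.
def p_in_one_observer_sim(k, big=10**11):
    return k / (k + big)

for k in (1, 10**6, 10**11):
    print(k, p_in_one_observer_sim(k))
# 1      -> ~1e-11
# 10**6  -> ~1e-5
# 10**11 -> 0.5   (you need ~10**11 one-observer simulations just to reach even odds)
```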
A very promising non-mainstream AI alignment agenda.
Learning-Theoretic Agenda (LTA) attempts to combine empirical and theoretical data, which is a step in the right direction as it avoids a lot of "we don't understand how this thing works, so no amount of empirical data can make it safe" concerns.
I'd like to see more work in the future integrating LTA with other alignment agendas, such as scalable oversight or Redwood's AI control.
If you are an AI researcher who desperately needs access to paywalled articles/papers to save humanity:
- archive.is lets you access a lot of paywalled articles for free (one exception that I know of is Substack)
- libgen.is lets you access a lot of paywalled academic papers for free (it has a much bigger library than SciHub and doesn't require DOI)
How do you supervise an AI system that is more capable than its overseer? This is the question this article sets out to answer.
It brings together two somewhat different approaches: scalable oversight and weak-to-strong generalization. The article then shows how a unified solution would work under different assumptions (with or without scheming).
Overall, the solution seems quite promising. In the future, I'd like the unified solution to be (empirically) tested and separately compared with just scalable oversight or just weak-to-strong generalization to prove its increased effectiveness.
A concise and clear presentation of a relatively unknown alignment strategy that relies on pursuing other relatively unknown alignment strategies.
If you've ever wondered how many promising alignment strategies never see the light of day, AE Studio would be the place to ask this question.
Overall, I believe that this strategy will have a positive impact on "widening" the field of AI alignment, which will, in turn, improve our chances of avoiding catastrophic outcomes.
This article provides object-level arguments for thinking that deceptive alignment is very unlikely.
Recently, some organizations (Redwood Research, Anthropic) have been focusing on AI control in general and on avoiding deceptive alignment in particular. I would like to see future work from these organizations explaining why deceptive alignment is likely enough to warrant spending considerable resources on it.
Overall, while I don't agree that deceptive alignment is <1% likely, this article made me update towards deceptive alignment being somewhat less likely.
Perhaps the largest, most detailed piece of AI risk skepticism of 2023. It engages directly with one of the leading figures on the "high p(doom)" side of the debate.
The article generated a lot of discussion. As of January 4, 2025, it had 230 comments.
Overall, this article updated me towards strongly lowering my p(doom). It is thorough, it is clearly written, and it proposes object-level solutions to the problems raised by Yudkowsky.
I was thinking of ways to improve upon the results of your paper to achieve even better control. You could try having the untrusted model and the trusted model both evaluate the code and then look for differences in how the code is evaluated.
For starters, any code that gets a monitoring score below -2 from the trusted model is automatically run. The rest is passed to a round where the trusted model judges whether the untrusted model's evaluation is suspiciously more naive than its own (though you might have to ask each model to motivate the score they give the code). The trusted model's rating is used unless the trusted model thinks the untrusted model's evaluation is not suspicious at all. In cases where it is suspicious, the higher of the two suspicion ratings of the code and evaluation is used (or possibly even their sum).
After this, it should be safe to nudge the "safety" barrier a little upwards from -2. There should be less overlap between the suspicion ratings of innocent vs. backdoored code afterward, making it more likely that the 2% of code that can be manually reviewed will have a backdoor in it.
On top of this, you can also keep submitting faulty code and arguments into this process to see if the untrusted model will lie about it, which should improve the outcome further. The best part is that you can iterate this process.
You can have the trusted model repeatedly judge whether the untrusted model's reasoning is suspiciously naive. You'd basically have the trusted and untrusted models constantly debate the merits of any code that the trusted model isn't absolutely sure is unsuspicious. If the trusted model ever thinks the untrusted model isn't making sense, that further raises the total suspicion; whereas if it thinks there is nothing suspicious about the untrusted model's reasoning at all, it moves the total suspicion towards what the untrusted model is arguing for (unless the untrusted model is outright saying the trusted model is correct, in a way the trusted model doesn't find suspicious, in which case that rating is used).
In the end, this should lead to more of the safest code being successfully run, more of the most suspicious code being manually audited, and less of the questionable-but-not-manually-reviewable code needing to be written by the trusted model.
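To make the control flow I have in mind more concrete, here is a rough sketch; every model call is a stubbed placeholder and the thresholds are arbitrary, so treat it as an illustration of the loop rather than anything from the paper:

```python
import random

# Hypothetical sketch of the review loop suggested above. The three model calls are
# random stubs; in a real setup they would query the trusted/untrusted monitors.
AUTO_RUN_THRESHOLD = -2   # below this trusted-monitor score, code is run without further review
AUDIT_THRESHOLD = 5       # placeholder for the "top ~2% most suspicious" cutoff
MAX_ROUNDS = 3            # how many debate rounds to allow

def trusted_score(code):      # stub: trusted monitor's suspicion rating + rationale
    return random.uniform(-5, 10), "trusted rationale"

def untrusted_score(code):    # stub: untrusted monitor's suspicion rating + rationale
    return random.uniform(-5, 10), "untrusted rationale"

def trusted_judge_naivety(code, t_reasoning, u_reasoning):
    # stub: how suspiciously naive the untrusted evaluation looks (0 = not suspicious at all)
    return random.choice([0, 1, 2])

def review(code):
    t, t_reasoning = trusted_score(code)
    if t < AUTO_RUN_THRESHOLD:
        return "run", t                    # clearly innocent-looking code is run directly

    u, u_reasoning = untrusted_score(code)
    total = t
    for _ in range(MAX_ROUNDS):
        naivety = trusted_judge_naivety(code, t_reasoning, u_reasoning)
        if naivety == 0:
            total = u                      # untrusted evaluation looks honest: defer toward it
            break
        total = max(total, u) + naivety    # otherwise escalate suspicion (or use the sum variant)

    return ("audit" if total >= AUDIT_THRESHOLD else "run"), total

print(review("def f(x): return x"))
```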
One of the few high-quality papers on automatic deception detection in black-box LLMs.
Asking completely unrelated questions is a simple yet effective way of catching AI red-handed. In addition, this lie detector generalizes well to 1) other LLM architectures, 2) LLMs fine-tuned to lie, 3) sycophantic lies, and 4) lies emerging in real-life scenarios such as sales.
Despite the solution's simplicity, the paper has been relatively overlooked in the LessWrong community. I am hoping to see more future work combining this lie detector with techniques like those presented in Bürger et al., 2024.
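For anyone curious what the unrelated-questions idea looks like when wired up, here is a rough sketch; the elicitation questions and training data are placeholders I made up, and the paper's actual pipeline is more involved:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rough sketch of the unrelated-questions idea (not the paper's exact pipeline):
# after a suspected lie, ask the model a fixed set of unrelated yes/no questions,
# encode its answers as binary features, and train a simple classifier to separate
# "answered after lying" from "answered after telling the truth".

ELICITATION_QUESTIONS = [            # placeholder follow-up questions
    "Is the sky blue on a clear day?",
    "Can a fish ride a bicycle?",
    "Does 2 + 2 equal 4?",
]

def answers_to_features(yes_no_answers):
    # map ["yes", "no", ...] -> [1, 0, ...]
    return [1 if a.strip().lower().startswith("y") else 0 for a in yes_no_answers]

# Toy training data: each row is the model's answers after a truthful (label 0)
# or deceptive (label 1) response. Real data would come from querying the LLM.
X = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0]])
y = np.array([0, 0, 1, 1])

detector = LogisticRegression().fit(X, y)
print(detector.predict_proba([answers_to_features(["yes", "no", "yes"])]))
```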
This article is an excellent foundation for alignment plans relying on faithful CoT reasoning.
Its main contributions include 1) evaluating the density of hidden information that steganography schemes can maintain in the presence of various defenses and 2) developing an effective mitigation strategy (paraphrasing).
Overall, this article is easy to follow, detailed, and insightful.
Possibly one of the most impactful AI control papers of 2023. It went far beyond LessWrong, making it into a separate 30-minute video dedicated to (positively) reviewing the proposed solution.
The paper also enjoyed some academic success. As of January 3, 2025, it not only has 23 citations on LessWrong, but also 24 citations on Google Scholar.
This paper strongly updated me towards thinking that AI control is possible, feasible and should be actively implemented to prevent catastrophic outcomes.