Comments
I think the website just links back to this blog post? Is that intentional?
Edit: I also think the application link requires access before seeing the form?
Second Edit: Seems fixed now! Thanks!
Maybe I should also expand on what the "AI agents are submitting the programs themselves subject to your approval" scenario could look like. When I talk about a preorder on Turing Machines (or some subset of Turing Machines), you don't have to specify this preorder up front. You just have to be able to evaluate it and the debaters have to be able to guess how it will evaluate.
If you already have a specific simulation program in mind then you can define the preorder as follows: if you're handed two programs which are exact copies of your simulation software using different hard-coded world models, then you consult your ordering on world models; if one submission is even a single character different from your intended program, then it's automatically less; and if both programs differ from your program, then you decide arbitrarily. What's nice about the "ordering on computations" perspective is that it naturally generalizes to situations where you don't follow this construction.
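Here is a minimal sketch of that construction in Python, mostly to make the three cases concrete. Everything specific in it is my own assumption for illustration: programs are plain strings, the trusted simulation is a template with a `{WORLD_MODEL}` placeholder, world models are non-negative integers (e.g. a grid cell count), and the "arbitrary" case is settled by comparing the program texts lexicographically.

```python
from typing import Callable, Optional


def make_program_preorder(
    reference_template: str,
    world_model_leq: Callable[[int, int], bool],
    placeholder: str = "{WORLD_MODEL}",
) -> Callable[[str, str], bool]:
    """Build leq(p, q): True when program p is ranked at or below program q.

    Hypothetical setup: the trusted simulation is `reference_template`
    with an integer world model substituted for `placeholder`.
    """
    prefix, _, suffix = reference_template.partition(placeholder)

    def extract(program: str) -> Optional[int]:
        # Return the hard-coded world model if `program` is an exact copy
        # of the trusted template, otherwise None.
        if len(program) < len(prefix) + len(suffix):
            return None
        if not (program.startswith(prefix) and program.endswith(suffix)):
            return None
        body = program[len(prefix): len(program) - len(suffix)]
        return int(body) if body.isascii() and body.isdigit() else None

    def leq(p: str, q: str) -> bool:
        wp, wq = extract(p), extract(q)
        if wp is not None and wq is not None:
            # Both are exact copies: defer to the ordering on world models.
            return world_model_leq(wp, wq)
        if wp is None and wq is not None:
            # p deviates from the trusted program, so it is automatically less.
            return True
        if wp is not None and wq is None:
            return False
        # Both deviate: decide arbitrarily (here, by comparing the texts).
        return p <= q

    return leq


template = "run_sim(cells={WORLD_MODEL})"
leq = make_program_preorder(template, lambda a, b: a <= b)
print(leq("run_sim(cells=10)", "run_sim(cells=100)"))  # exact copies, compared by world model
print(leq("run_sim(cells=10) ", "run_sim(cells=10)"))  # one stray character makes the first one less
```

The lexicographic tie-break is only there so that the result is still a preorder; any consistent rule would do.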
What could happen if we don't supply our own simulation program via this construction? In the planes example, maybe the "snap" debater hands you a 50,000-line simulation program with a bug so that if you're crafty with your grid sizes then it'll get confused about the material properties and give the wrong answer. Then the "safe" debater might hand you a 200,000-line simulation program which avoids / patches the bug so that the crafty grid sizes now give the correct answer. Of course, there's nothing stopping the "safe" debater from having half of those lines be comments containing a Lean proof using super annoying numerical PDE bounds or whatever to prove that the 200,000-line program avoids the same kind of bug as the 50,000-line program.
When you think about it that way, maybe it's reasonable to give the "it'll snap" debater a chance to respond to the "it's safe" debater's comments. Now maybe we change the type of the ordering from being a subset of (Turing Machines) x (Turing Machines) to being a subset of (Turing Machines) x (Turing Machines) x (Justifications from safe debater) x (Justifications from snap debater). In this manner deciding how you want to behave can become a computational problem in its own right.
These are all excellent points! I agree that these could be serious obstacles in practice. I do think that there are some counter-measures available, though.
I think the easiest to address is the occasional random failure, e.g. your "giving the wrong answer on exact powers of 1000" example. I would probably try to address this issue by looking at stochastic models of computation, e.g. probabilistic Turing machines. You'd need to accommodate stochastic simulations anyway because so many techniques in practice use sampling. I think you can handle stochastic evaluation maps in a similar fashion to the deterministic case but everything gets a bit more annoying (e.g. you likely need to include an incentive to point to simple world models if you want the game to have any equilibria at all). Anyways, if you've figured out topological debate in the stochastic case, then you can reduce from the occasional-errors problem to the stochastic problem as follows: suppose you have a directed set of world models and some simulation software. Define a stochastic program which takes in a world model, randomly samples a world model based on it according to some reasonably-spread-out distribution, and returns the simulation software's output on the sampled model. In the 1D plane case, for example, you could take in a given resolution, divide it by a uniformly random real number in some fixed interval, and then run the simulation at that new resolution. If your errors are sufficiently rare then your stochastic topological debate setup should handle things from here.
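To illustrate the reduction, here is a rough Python sketch of the randomized wrapper for the 1D resolution case. The interval [1, 2] for the random divisor, the function names, and the toy buggy simulator (wrong exactly at cell sizes of 1e-3 and 1e-6) are all assumptions I made up for the example.

```python
import random
from typing import Callable, Optional


def make_stochastic(
    simulate: Callable[[float], str],
    low: float = 1.0,   # assumed lower end of the divisor interval
    high: float = 2.0,  # assumed upper end of the divisor interval
    rng: Optional[random.Random] = None,
) -> Callable[[float], str]:
    """Wrap a deterministic simulator so it runs at a randomly perturbed resolution."""
    rng = rng or random.Random()

    def stochastic_simulate(resolution: float) -> str:
        # Divide the requested resolution by a uniformly random real number,
        # then run the underlying simulation at the new, finer resolution.
        divisor = rng.uniform(low, high)
        return simulate(resolution / divisor)

    return stochastic_simulate


def buggy_simulate(resolution: float) -> str:
    # Toy stand-in for real software: the true answer is "safe", but the
    # program glitches on a couple of exact resolutions.
    if resolution in (1e-3, 1e-6):
        return "snap"
    return "safe"


sim = make_stochastic(buggy_simulate, rng=random.Random(0))
print(buggy_simulate(1e-3))                     # hits the glitch
print({sim(1e-3) for _ in range(1000)})         # the randomized version almost surely avoids it
```

The point is only that an adversary can no longer aim at the exact bad inputs; how rare the errors need to be depends on the stochastic debate setup.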
Somewhat more serious is the case where "it's harder to disrupt patterns injected during bids." Mathematically I interpret this statement as the existence of a world model which evaluates to the wrong answer such that you have to take a vastly more computationally intensive refinement to get the correct answer. I think it's reasonable to detect when this problem is occurring but preventing it seems hard: you'd basically need to create a better simulation program which doesn't suffer from the same issue. For some problems that could be a tall order without assistance but if your AI agents are submitting the programs themselves subject to your approval then maybe it's surmountable.
What I find the most serious and most interesting, though, is the case where your simulation software simply might not converge to the truth. To expand on your nonlinear effects example: suppose our resolution map can specify dimensions of individual grid cells. Suppose that our simulation software has a glitch where, if you alternate the sizes of grid cells along some direction, the simulation gets tricked into thinking the material has a different stiffness or something. This is a kind of glitch which both sides can exploit and the net probably won't converge to anything.
I find this problem interesting because it attacks one of the core vulnerabilities that I think debate problems struggle with: grounding in reality. You can't really design a system to "return the correct answer" without somehow specifying what makes an answer correct. I tried to ground topological debate in this pre-existing ordering on computations that gets handed to us and which is taken to be a canonical characterization of the problem we want to solve. In practice, though, that's really just kicking the can down the road: any user would have to come up with a simulation program or method of comparing simulation programs which encapsulates their question of interest. That's not an easy task.
Still, I don't think we need to give up so easily. Maybe we don't ground ourselves by assuming that the user has a simulation program but instead ground ourselves by assuming that the user can check whether a simulation program or comparison between simulation programs is valid. For example, suppose we're in the alternating-grid-cell-sizes example. Intuitively the correct debater should be able to isolate an extreme example and go to the human and say "hey, this behavior is ridiculous, your software is clearly broken here!" I will think about what a mathematical model of this setup might look like. Of course, we just kicked the can down the road, but I think that there should be some perturbation of these ideas which is practical and robust.
Thank you! If I may ask, what kind of fatal flaws do you expect for real-world simulations? Underspecified / ill-defined questions, buggy simulation software, multiple simulation programs giving irreconcilably conflicting answers in practice, etc.? I ask because I think that in some situations it's reasonable to imagine the AI debaters providing the simulation software themselves if they can formally verify its accuracy, but that would struggle against e.g. underspecified questions.
Also, is there some prototypical example of a "tough real world question" you have in mind? I will gladly concede that not all questions naturally fit into this framework. I was primarily inspired by physical security questions like biological attacks or backdoors in mechanical hardware.
My initial reaction is that at least some of these points would be covered by the Guaranteed Safe AI agenda if that works out, right? Though the "AGIs act much like a colonizing civilization" situation does scare me because it's the kind of thing which locally looks harmless but collectively is highly dangerous. It would require no misalignment on the part of any individual AI.
Totalitarian dictatorship
I'm unclear on why this risk is specific to multipolar scenarios. Even if you have a single AGI/ASI you could end up with a totalitarian dictatorship, no? In fact I would imagine that having multiple AGIs/ASIs would mitigate this risk as, optimistically, every domestic actor in possession of an AGI/ASI should be counterbalanced by another domestic actor with divergent interests also in possession of an AGI/ASI.
I actually think multipolar scenarios are less dangerous than having a single superintelligence. Watching the AI arms race remain multipolar has actually been one of the biggest factors in my P(doom) declining recently. I believe that maintaining a balance of power at all times is key and that humanity's best chance for survival is to ensure that, for any action humanity wishes to take, there is some superintelligence that would benefit from this action and which would be willing to defend it. This intuition is largely based on examples from human history and may not generalize to the case of superintelligences.
EDIT: I do believe there's a limit to the benefits of having multiple superintelligences, especially in the early days when biological defense may be substantially weaker than offense. As an analogy to nuclear weapons, if one country possesses a nuclear bomb then that country can terrorize the world at will, if a few countries have nuclear bombs then everyone has an incentive to be restrained but alert, if every country has a nuclear bomb then eventually someone is going to press the big red button for lolz.
Something I should have realized about AI Safety via Debate ages ago but only recently recognized: I usually see theoretical studies of debate through the lens of determining the output of a fixed Turing machine (or stochastic Turing machine or some other computational model), e.g. https://arxiv.org/abs/2311.14125. This formalization doesn't capture the kinds of open-ended questions I usually think about. For example, suppose I'm looking at the blueprint of a submarine and I want to know whether it will be watertight at 1km depth. Suppose I have a physics simulation engine at my disposal that I trust. Maybe I could run the physics simulation engine at 1 nanometer resolution and get an answer I trust after ten thousand years, but I don't have time for that. This is such an extremely long computation that I don't think any AI debater would have time for it either. Instead, if I were tasked with solving this problem alone I would attempt to find some discretization parameter that is sufficiently small for me to trust the conclusion but sufficiently big for the computation to be feasible.
Now, if I had access to two debaters, how would I proceed? I would ask them both for thresholds beyond which their desired conclusion holds. For example, maybe I get the "it's watertight" debater to commit that any simulation at a resolution below 1cm will conclude that the design is watertight and I get the "it's not watertight" debater to commit that any simulation at a resolution below 1mm will conclude that the design is not watertight. I then run the simulation (or use one of the conventional debate protocols to simulate running the simulation) at the more extreme of the two suggestions, in this case 1mm. There are details to be resolved with the incentives but I believe it's possible.
I like to view this generalized problem as an infinite computation where we believe the result converges to the correct answer at some unknown rate. For example, we can run our simulation at 1m resolution, 10cm resolution, 1cm resolution, 1mm resolution, etc., saving the result from the most recent simulation as our current best guess. If we trust our simulation software then we should believe that this infinite computation eventually converges to the true answer. We can implement debate in this setup by having an initial debate over what steps T have the property that the answer at all steps T' >= T agrees with the answer at T, pick one such T that both sides agree with (take the maximum), then run a conventional debate protocol on the resulting finite computation.
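As a toy sketch of the threshold step (not the incentive design, which I haven't worked out), the protocol could look something like this in Python. The step-to-resolution convention (step T means 10^-T metres), the function names, and the toy simulator are assumptions for illustration only.

```python
from typing import Callable, Dict, Tuple


def resolve_by_thresholds(
    simulate_at_step: Callable[[int], str],
    committed_steps: Dict[str, int],
) -> Tuple[int, str]:
    """Run the simulation at the most extreme step any debater committed to.

    Each debater claims the answer is stable from their committed step onward,
    so the maximum is a step that both sides have agreed to be judged on.
    """
    step = max(committed_steps.values())
    return step, simulate_at_step(step)


def toy_simulation(step: int) -> str:
    # Hypothetical simulator: step T corresponds to a resolution of 10**-T
    # metres (1m, 10cm, 1cm, 1mm, ...). Here a leak only shows up at 1mm or finer.
    return "watertight" if step < 3 else "not watertight"


commitments = {
    "it's watertight": 2,       # commits to 1cm resolution
    "it's not watertight": 3,   # commits to 1mm resolution
}
print(resolve_by_thresholds(toy_simulation, commitments))
```

In a real protocol the call to `simulate_at_step` would itself be replaced by a conventional debate over the resulting finite computation.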
EDIT: I wonder if there's a generalization of this idea to having a directed set of world models where one world model is at least as good as another if it is at least as precise in every model detail. Each debater proposes a world model, the protocol takes an upper bound of the proposed world models, which exists by the directed set property, and we simulate that world model. I'm thinking of the Guaranteed Safe AI agenda here.
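One concrete instance of such a directed set, sketched in Python under assumptions of my own: a world model is a dictionary from detail names to integer precision levels (higher means more precise, missing means 0), and the combined model takes the componentwise maximum.

```python
from typing import Dict

WorldModel = Dict[str, int]  # hypothetical encoding: detail name -> precision level


def at_least_as_precise(a: WorldModel, b: WorldModel) -> bool:
    """True if `a` is at least as precise as `b` in every model detail."""
    return all(a.get(detail, 0) >= level for detail, level in b.items())


def join(a: WorldModel, b: WorldModel) -> WorldModel:
    """Componentwise maximum: a world model at least as precise as both inputs."""
    return {d: max(a.get(d, 0), b.get(d, 0)) for d in a.keys() | b.keys()}


safe_proposal = {"grid": 3, "material": 1}
snap_proposal = {"grid": 2, "material": 4}
combined = join(safe_proposal, snap_proposal)
print(combined)
print(at_least_as_precise(combined, safe_proposal), at_least_as_precise(combined, snap_proposal))
```

Neither proposal dominates the other here, which is exactly the situation where you need the directed set property to pick a model both sides must accept.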
Has anyone made a good, easy to use user interface for implementing debate / control protocols in deployment? For example, maybe I want to get o12345 to write some code in my codebase but I don't trust it. I'd like to automatically query Sonnet 17.5 with the code o12345 returned and ask if anything fishy is going on. If so, spin up one of the debate protocols and give me the transcript at the end. Surely the labs are experimenting with this kind of thing internally / during training but it still feels useful during deployment as an end-user, especially between models from different labs which might have internalized different objective functions and hence be less likely to collude.
When you talk about "other sources of risk from misalignment," these sound like milder / easier-to-tackle versions of the assumptions you've listed? Your assumptions sound like they focus on the worst case scenario. If you can solve the harder version then I would imagine that the easier version would also be solved, no?
I think this professor has relevant interests: https://www.cs.cmu.edu/~nihars/.
I agree, though I think it would be a very ridiculous own-goal if e.g. GPT-4o decided to block a whistleblowing report about OpenAI because it was trained to serve OpenAI's interests. I think any model used by this kind of whistleblowing tool should be open-source (nothing fancy / more dangerous than what's already out there), run locally by the operators of the tool, and tested to make sure it doesn't block legitimate posts.
My gut instinct is that this would have been a fantastic thing to create 2-4 years ago. My biggest hesitation is that the probability a tool like this decreases existential risk is proportional to the fraction of lab researchers who know about it and adoption can be a slow / hard thing to make happen. I still think that this kind of program could be incredibly valuable under the right circumstances so someone should probably be working on this.
Also, I have a very amateurish security question: if someone provides their work email to verify their authenticity with this tool, can their employer find out? For example, I wouldn't put it past OpenAI to check if an employee's email account got pinged by this tool and then to pressure / fire that employee.
I guess I'm a bit confused where o3 comes into this analysis. This discussion appears to be focused on base models to me? Is data really the bottleneck these days for o-series-type advancements? I thought that compute available to do RL in self-play / CoT / long-time-horizon-agentic-setups would be a bigger consideration.
Edit: I guess upon another reading this article seems like an argument against AI capabilities hitting a plateau in the coming years, whereas the o3 announcement makes me more curious about whether we're going to hyper-accelerate capabilities in the coming months.
Strong upvote. I think that interactive demos are much more effective at showcasing "scary" behavior than static demos that are stuck in a paper or a blog post. Readers won't feel like the examples are cherry-picked if they can see the behavior exhibited in real time. I think that the community should make more things like this.
For anyone wanting to harness AI models to create formally verified proofs for theoretical alignment, it looks like it's last call for formalizing question statements.
Game theory is almost completely absent from mathlib4. I found some scattered attempts at formalization in Lean online but nothing comprehensive. This feels like a massive oversight to me -- if o12345 were released tomorrow with perfect reasoning ability then I'd have no framework with which to check its proofs of any of my current research questions.
So I've been thinking a little more about the real-world-incentives problem, and I still suspect that there are situations in which rules won't solve this. Suppose there's a prediction market question with a real-world outcome tied to the resulting market probability (i.e. a relevant actor says "I will do XYZ if the prediction market says ABC"). Let's say the prediction market participants' objective functions are of the form play_money_reward + real_world_outcome_reward. If there are just a couple people for whom real_world_outcome_reward is at least as significant as play_money_reward and if you can reliably identify those people (i.e. if you can reliably identify the people with a meaningful conflict of interest), then you can create rules preventing those people from betting on the prediction market.
However, I think that there are some questions where the number of people with real-world incentives is large and/or it's difficult to identify those people with rules. For example, suppose a sports team is trying to determine whether to hire a star player and they create a prediction market for whether the star athlete will achieve X performance if hired. There could be millions of fans of that athlete all over the world who would be willing to waste a little prediction market money to see that player get hired. It's difficult to predict who those people are without massive privacy violations -- in particular, they have no publicly verifiable connection to the entities named in the prediction market.
Awesome post!
I have absolutely no experience with prediction markets, but I’m instinctively nervous about incentives here. Maybe the real-world incentives of market participants could be greater than the play-money incentives? For example, if you’re trying to select people to represent your country at an international competition and the potential competitors have invested their lives into being on that international team and those potential competitors can sign up as market participants (maybe anonymously), then I could very easily imagine those people sabotaging their competitors’ markets and boosting their own with no regard for their post-selection in-market prediction winnings.
For personal stuff (friendship / dating), I have some additional concerns. Suppose person A is open for dating and person B really wants to date them. By punting the decision-making process to the public, person A restricts themselves to working with publicly available information about person B and simultaneously person B is put under pressure to publicly reveal all relevant information. I can imagine a lot of Goodharting on the part of person B. Also, if it was revealed that person C bet against A and B dating and person B found out, I can imagine some … uh … lasting negative emotions between B and C. That possibility could also mess with person C’s incentives. In other words, the market participants with the closest knowledge of A and B also have the most to lose by A and B being upset with their bets and thus face the most misaligned incentives. (Note: I also dislike dating apps and lots of people use those so I’m probably biased here.)
Finally, I can imagine circumstances where publicly revealing probabilities of things can cause negative externalities, especially on mental health. For example, colleges often don’t reveal students’ exact percentage scores on classes even if employers would be interested — the amount of stress that would induce on the student body could result in worse learning outcomes overall. In an example you listed, with therapists/patients, I feel like it might not be great to have someone suffering from anxiety watch their percentage chance of getting an appointment go up and down.
But for circumstances with low stakes (so play money incentives beat real-world incentives) and small negative externalities, such as gym partners, I could imagine this kind of system working really well! Super cool!
I like these observations! As for your last point about ranges and bounds, I'm actually moving towards relaxing those in future posts: basically I want to look at the tree case where you have more than one variable feeding into each node and I want to argue that even if the conditional probabilities are all 0's and 1's (so we don't get any hard bounds with arguments like the one I present here) there can still be strong concentration towards one answer.
Wow, this is exactly what I was looking for! Thank you so much!
I also suspect that the evaluation mechanism is going to be very important. I can think of philosophical debates whose resolution could change the impact of an "artifact" by many orders of magnitude. If possible I think it could be good to have several different metrics (corresponding to different objective functions) by which to grade these artifacts. That way you can give donors different scores depending on which metrics you want to look at. For example, you might want different scores for x-risk minimization, s-risk minimization, etc. That still leaves the "[optimize for (early, reliable) evidence of impact] != [optimize for impact]" issue open, of course.
I really like this perspective! Great first post!
Okay that paper doesn't seem like what I was thinking of either but it references this paper which does seem to be on-topic: https://research.rug.nl/en/publications/justification-by-an-infinity-of-conditional-probabilities
Thanks for the response! I took a look at the paper you linked to; I'm pretty sure I'm not talking about combinatorial explosion. Combinatorial explosion seems to be an issue when solving problems that are mathematically well-defined but computationally intractable in practice; in my case it's not even clear that these objects are mathematically well-defined to begin with.
The paper https://www.researchgate.net/publication/335525907_The_infinite_epistemic_regress_problem_has_no_unique_solution initially looks related to what I'm thinking, but I haven't looked at it in depth yet.
Hmmm, I’m still thinking about this. I’m kinda unconvinced that you even need an algorithm-heavy approach here. Let’s say that you want to apply logit, add some small amount of noise, apply logistic, then score. Consider the function on R^n defined as (score function) composed with (coordinate-wise logistic function). We care about the expected value of this function with respect to the probability measure induced by our noise. For very small noise, you can approximate this function by its power series expansion. For example, if we’re adding iid Gaussian noise, then look at the second order approximation. Then in the limit as the standard deviation of the noise goes to zero, the expected value of the change is some constant (something something Gaussian integral) times the Laplacian of our function on R^n times the square of the standard deviation. Thus the Laplacian is very related to this precision we care about (it basically determines it for small noise). For most reasonable scoring functions, the Laplacian should have a closed-form solution. I think that gets you out of having to simulate anything. Let me know if I messed anything up! Cheers!
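For reference, here is the second-order computation written out, as a sketch. The notation is mine: $f$ is the composed function on $\mathbb{R}^n$, $x$ is the logit-space forecast, $Z \sim \mathcal{N}(0, I_n)$ is the iid Gaussian noise, and $\sigma$ is its standard deviation. I'm assuming $f$ is smooth enough for the Taylor expansion to be valid under the expectation.

```latex
\begin{align*}
f(x + \sigma Z)
  &= f(x) + \sigma\, \nabla f(x)^{\top} Z
     + \frac{\sigma^{2}}{2}\, Z^{\top} \nabla^{2} f(x)\, Z + o(\sigma^{2}), \\
\mathbb{E}\big[f(x + \sigma Z)\big] - f(x)
  &= \sigma\, \nabla f(x)^{\top}\, \mathbb{E}[Z]
     + \frac{\sigma^{2}}{2} \sum_{i,j} \partial_{i}\partial_{j} f(x)\, \mathbb{E}[Z_{i} Z_{j}]
     + o(\sigma^{2}) \\
  &= \frac{\sigma^{2}}{2} \sum_{i} \partial_{i}^{2} f(x) + o(\sigma^{2})
   = \frac{\sigma^{2}}{2}\, \Delta f(x) + o(\sigma^{2}).
\end{align*}
```

This uses $\mathbb{E}[Z] = 0$ and $\mathbb{E}[Z_i Z_j] = \delta_{ij}$, so under these assumptions the constant in front of the Laplacian comes out to $1/2$.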
If my interpretation of precision function is correct then I guess my main concern is this: how are we reaching inside the minds of the predictors to see what their distribution over the probability is? Like, imagine we have an urn with black and red marbles in it and we have a prediction market on the probability that a uniformly randomly chosen marble will be red. Let's say that two people participated in this prediction market: Alice and Bob. Alice estimated there to be a 0.3269230769 (or approximately 17/52) chance of the marble being red because she saw the marbles being put in and there were 17 red marbles and 52 marbles total. Bob estimated there to be a 0.3269230769 chance of the marble being red because he felt like it. Bob is clearly providing false precision while Alice is providing entirely justified precision. However, no matter which way the urn draw goes, the input tuple (0.3269230769, 0) or (0.3269230769, 1) will be the same for both participants and thus the precision returned by any precision function will be the same. This feels to me like a fundamental disconnect between what we want to measure and what we are measuring. Am I mistaken in my understanding? Thanks!
Awesome post! I'm very ignorant of the precision-estimation literature so I'm going to be asking dumb questions here.
First of all, I feel like a precision function should take some kind of "acceptable loss" parameter. From what I gather, to specify the precision you need some threshold in your algorithm(s) for how much accuracy loss you're willing to tolerate.
More fundamentally, though, I'm trying to understand what exactly we want to measure. The list of desired properties of a precision function feels somewhat pulled out of thin air, and I'd feel more comfortable with a philosophical understanding of where these properties come from. So let's say we have a set of possible states/trajectories of the world, the world provides us with some evidence, and we're interested in the probability of some event. Maybe reality has some fixed probability measure out there, but we're not privy to that, so we're forced to use some "hyperprior" (am I using that word right?) on probability measures over that set of states. After conditioning on the evidence, we get some probability distribution on the probability of the event, which participants in a prediction market will take the expected value of as their answer. The precision is trying to quantify something like the standard deviation of this probability distribution on values of the event's probability, right?
P.S. This is entirely a skill issue on my part but I'm not sure what symbols you're using for precision function and perturbation function. Detexify was of no use. Feel free to enlighten me!
I'm probably the least qualified person imaginable to represent "the LessWrong community" given that I literally made my first post this weekend, but I did get into EA between high school and college and I have some thoughts on the topic.
My gut reaction is that it depends a lot on the kind of person this high schooler is. I was very interested in math and theoretical physics when I started thinking about EA. I don't think I'm ever going to be satisfied with my life unless I'm doing work that's math-heavy. I applied to schools with good AI programs with the intent of upskilling on AI/ML during college and then going into AI Safety. When I started college I waved away the honors math classes with the intent of getting into theoretical machine learning research as fast as possible. Before the end of freshman year, I realized that I was miserable and the courses felt dumb and that I was finding it very hard to relate to any of the other people in the AI program -- most of them were practically-minded and non-math-y. I begged to be let back into the honors math courses and thankfully the department allowed me to do so. I proceeded to co-found the AI Safety club at my college and have been thinking somewhat independently on questions adjacent to AI Safety that interest me. In retrospect, I think that I was too gung-ho about upskilling on ML to stop and pay attention to where my skills and my passion were. This nearly resulted in me having no friend group in college and not being productive at anything.
So yeah, I don't know what exactly I would recommend. If I had been a more practically-minded person then my actions would probably have been pretty perfect. I guess the only advice I can give is cliches: think independently, explore, talk to people, listen to yourself. Sorry I can't say anything more concrete!
You raise an excellent point! In hindsight I’m realizing that I should have chosen a different example, but I’ll stick with it for now. Yes, I agree that “What states of the universe are likely to result from me killing vs not killing lanternflies” and “Which states of the universe do I prefer?” are both questions grounded in the state of the universe where Bayes’ rule applies very well. However, I feel like there’s a third question floating around in the background: “Which states of the universe ‘should’ I prefer?” Based on my inner experiences, I feel that I can change my values at will. I specifically remember a moment after high school when I first formalized an objective function over states of the world, and this was a conscious thing I had to do. It didn’t come by default. You could argue that the question “Which states of the universe would I decide I should prefer after thinking about it for 10 years” is a question that’s grounded in the state of the universe so that Bayes’ Rule makes sense. However, trying to answer this question basically reduces to thinking about my values for 10 years; I don’t know of a way to short circuit that computation. I’m reminded of the problem about how an agent can reason about a world that it’s embedded inside where its thought processes could change the answers it seeks.
If I may propose another example and take this conversation to the meta-level, consider the question “Can Bayes’ Rule alone answer the question ‘Should I kill lanternflies?’?” When I think about this meta-question, I think you need a little more than just Bayes’ Rule to reason. You could start by trying to estimate P(Bayes Rule alone solves the lanternfly question), P(Bayes Rule alone solves the lanternfly question | the lanternfly question can be decomposed into two separate questions), etc. The problem is that I don’t see how to ground these probabilities in the real world. How can you go outside and collect data and arrive at the conclusion “P(Bayes Rule alone solves the lanternfly question | the lanternfly question can be decomposed into two separate questions) = 0.734”?
In fact, that’s basically the issue that my post is trying to address! I love Bayes’ rule! I love it so much that the punchline of my post, the dismissive growth-consistent ideology weighting, is my attempt to throw probability theory at abstract arguments that really didn’t ask for probability theory to be thrown at them. “Growth-consistency” is a fancy word I made up that basically means “you can apply probability theory (including Bayes’ Rule) in the way you expect.” I want to be able to reason with probability theory in places where we don’t get “real probabilities” inherited from the world around us.
Hey, thanks for the response! Yes, I've also read about Bayes' Theorem. However, I'm unconvinced that it is applicable in all the circumstances that I care about. For example, suppose I'm interested in the question "Should I kill lanternflies whenever I can?" That's not really an objective question about the universe that you could, for example, put on a prediction market. There doesn't exist a natural function from (states of the universe) to (answers to that question). There’s interpretation involved. Let’s even say that we get some new evidence (my post wasn’t really centered on that context, but still). Suppose I see the news headline "Arkansas Department of Stuff says that you should kill lanternflies whenever you can." How am I supposed to apply Bayes’ rule in this context? How do I estimate P(I should kill lanternflies whenever I can | Arkansas Department of Stuff says I should kill lanternflies whenever I can)? It would be nice to be able to dismiss these kinds of questions as ill-posed, but in practice I spend a sizeable fraction of my time thinking about them. Am I incorrect here? Is Bayes’ theorem more powerful than I’m realizing?
(1) Yeah, I'm intentionally inserting a requirement that's trivially true. Some claims will make object-level statements that don’t directly impose restrictions on other claims. Since these object-level claims aren’t directly responsible for putting restrictions on the structure of the argument, they induce trivial clauses in the formula.
(2) Absolutely, you can’t provide concrete predictions on how beliefs will evolve over time. But I think you can still reason statistically. For example, I think it’s valid to ask “You put ten philosophers in a room and ask them whether God exists. At the start, you present them with five questions related to the existence of God and ask them to assign probabilities to combinations of answers to these questions. After seven years, you let the philosophers out and again ask them to assign probabilities to combinations of answers. What is the expected value of the shift (say, the KL divergence) between the original probabilities and the final probabilities?” I obviously cannot hope to predict which direction the beliefs will evolve, but the degree to which we expect them to evolve seems more doable. Even if we’ve updated so that our current probabilities equal the expected value of our future probabilities, we can still ask about the variance of our future probabilities. Is that correct or am I misunderstanding something?
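To pin down the quantity in (2), here is a small Python sketch of the "expected shift" as an average KL divergence over philosophers. The direction KL(final ‖ original), the function names, and the requirement that the original probabilities be strictly positive are my own assumptions; the distributions in the usage line are made-up toy numbers.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two distributions over the same answer combinations.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_shift(
    original: List[Sequence[float]],
    final: List[Sequence[float]],
) -> float:
    """Average over philosophers of KL(final beliefs || original beliefs)."""
    shifts = [kl_divergence(after, before) for before, after in zip(original, final)]
    return sum(shifts) / len(shifts)


# Two toy philosophers: one does not move at all, one moves from (0.25, 0.75) to (0.5, 0.5).
original = [[0.5, 0.5], [0.25, 0.75]]
final = [[0.5, 0.5], [0.5, 0.5]]
print(mean_shift(original, final))
```

Estimating this number before the seven years are up is the hard part, of course; the sketch only shows what would be measured afterwards.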
Thanks again, by the way!