Do LLMs Implement NLP Algorithms for Better Next Token Predictions? 2023-09-19T12:28:45.660Z
In the Short-Term, Why Couldn't You Just RLHF-out Instrumental Convergence? 2023-09-16T10:44:46.459Z
AGI x Animal Welfare: A High-EV Outreach Opportunity? 2023-06-28T20:44:25.836Z
The Cruel Trade-Off Between AI Misuse and AI X-risk Concerns 2023-04-22T13:49:02.124Z
AI Takeover Scenario with Scaled LLMs 2023-04-16T23:28:14.004Z
Navigating AI Risks (NAIR) #1: Slowing Down AI 2023-04-14T14:35:40.395Z
Request to AGI organizations: Share your views on pausing AI progress 2023-04-11T17:30:46.707Z
Could Simulating an AGI Taking Over the World Actually Lead to a LLM Taking Over the World? 2023-01-13T06:33:35.860Z
[Linkpost] DreamerV3: A General RL Architecture 2023-01-12T03:55:29.931Z
Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers? 2022-12-31T11:34:18.185Z
AGI Timelines in Governance: Different Strategies for Different Timeframes 2022-12-19T21:31:25.746Z
Extracting and Evaluating Causal Direction in LLMs' Activations 2022-12-14T14:33:05.607Z
Is GPT3 a Good Rationalist? - InstructGPT3 [2/2] 2022-04-07T13:46:58.255Z
New GPT3 Impressive Capabilities - InstructGPT3 [1/2] 2022-03-13T10:58:46.326Z


Comment by simeon_c (WayZ) on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-20T11:14:20.164Z · LW · GW

Cool thanks. 
I've seen that you've edited your post. If you look at the ASL-3 Containment Measures section, I'd recommend considering editing away the "Yay" as well. 
This post represents a pretty significant moving of the goalposts. 

While my initial understanding was that autonomous replication would be a ceiling, this doc now makes it a floor. 

So in other words, this document proposes to keep scaling beyond levels that are considered potentially catastrophic, with less-than-military-grade cybersecurity, which makes it very likely that at least one state, and plausibly multiple states, will have access to those systems. 

It also means that the chances of leaking a system which is irreversibly catastrophic are probably not below 0.1%, maybe not even below 1%. 

My interpretation of the excitement around the proposal is a feeling that "yay, it's better than where we were before". 
But I think it heavily neglects a few things: 
1. It's far weaker than risk management 101, which is easy to push for.
2. The US population is pro-slowdown (so you can basically be way more ambitious than "responsibly scaling").
3. An increasing share of policymakers are worried.
4. Self-regulation has a track record of heavily shaping hard law, either by preventing it or by creating a template that the state then enforces (that's the theory of change I understood from people excited by self-regulation). For instance, I expect this proposal to actively harm efforts to push for ambitious slowdowns that would bring the probability of doom below two-digit numbers. 

For those reasons, I wish this doc didn't exist. 

Comment by simeon_c (WayZ) on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-19T22:15:11.625Z · LW · GW

Can you quote the parts you're referring to?

Comment by simeon_c (WayZ) on In the Short-Term, Why Couldn't You Just RLHF-out Instrumental Convergence? · 2023-09-16T11:35:04.450Z · LW · GW

I agree with this general intuition, thanks for sharing. 


I'd value descriptions of the specific failures you'd expect from an LLM that has been RLHF-ed against "bad instrumental convergence" but where the training fails, or a better sense of how you'd guess this would look on an LLM agent or a scaled GPT. 

Comment by simeon_c (WayZ) on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-08-02T22:19:17.231Z · LW · GW

I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I see. I guess where we might disagree is that, IMO, a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue of what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I think it's far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn't necessarily hurt the company) and empirically.

I'm not saying that it would be a force against regulation in general, but that it would be a force against any regulation that substantially slows down labs' current rate of capabilities progress. And the empirics don't demonstrate the opposite, as far as I can tell. 

  • Labs have been pushing for the rule that we should wait for evals to say "it's dangerous" before considering what to do, rather than doing as most other industries do, i.e. assuming something is dangerous until proven safe. 
  • Most mentions of a slowdown have described it as potentially necessary at some point in the distant future, while most people in those labs have <5y timelines.

Finally, on your conceptual point: as some have argued, it's in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less bad than slowing down once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier. 

will comment that it seems like a big leap from "X product was released N months earlier than otherwise" to "Transformative AI will now arrive N months earlier than otherwise."

I agree, but if anything, my sense is that due to various compound effects (AI accelerating AI, investment, increased compute demand, and more talent arriving earlier), an earlier product release of N months just gives a lower bound on the shortening of TAI timelines (hence greater than N). Moreover, I think the ChatGPT release is, ex post at least, not in the typical product-release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.

Comment by simeon_c (WayZ) on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-07-15T16:15:23.471Z · LW · GW

Thanks for the clarifications. 

But is there another "decrease the race" or "don't make the race worse" intervention that you think can make a big difference? Based on the fact that you're talking about a single thing that can help massively, I don't think you are referring to "just don't make things worse"; what are you thinking of?

1. I think we agree on the fact that "unless it's provably safe" is the best version of trying to get a policy slowdown. 
2. I believe there are many interventions that could help on the slowdown side, most of which are unfortunately not compatible with being a successful careful AI lab. The main struggle a successful careful AI lab encounters is that it has to trade off tons of safety principles along the way, essentially because it needs to attract investors & talent, and attracting investors & talent is hard if you say too loudly that we should slow down as long as our thing is not provably safe.

So de facto, a successful careful AI lab will be a force against slowdown & a bunch of other relevant policies in the policy world. It will also feed the perceived race, which makes things harder for every actor. 

Other interventions for slowdown are mostly in the realm of public advocacy. 

Mostly drawing upon the animal welfare activism playbook, you could use public campaigns to de facto limit the ability of labs to race, via corporate or policy advocacy campaigns. 

I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of "heat" to be felt regardless of any one player's actions). And the potential benefits seem big. My rough impression is that you're confident the costs outweigh the benefits for nearly any imaginable version of this; if that's right, can you give some quantitative or other sense of how you get there?

I guess, heuristically, I tend to take arguments of the form "but others would have done this bad thing anyway" with some skepticism because I think it tends to assume too much certainty over the counterfactual, in part due to many second order effects (e.g. the existence of one marginal key player increases the chances that more player invest, show that competition is possible etc.) that tend to be hard to compute (but are sometimes observable ex post).

On this specific case, I don't think it's right that there are "lots of players" close to the frontier. Take OA and Anthropic, for example: there are about 0 players at their level of deployed capabilities. Maybe Google will deploy at some point, but they haven't been a serious player for the past 7 months. So if Anthropic hadn't been around, OA could have chilled longer at the ChatGPT level, and then at GPT-4 without plugins + code interpreter, without facing any threat. And now they'll need to do something very impressive to answer the 100k context etc. 

The compound effects of this are pretty substantial: each new differentiating feature accelerates the whole field and pressures teams to find something new, causing a significantly more powerful race to the bottom. 

If I had to be (vaguely) quantitative about the past 9 months, I'd guess that the existence of Anthropic has caused (or will cause, if we count the 100k thing) 2 significant counterfactual features and 3-5 months of timeline shortening (which will probably compound into more due to self-improvement effects). I'd guess there are other effects (e.g. pressure on compute, scaling driving costs down, etc.) that I'm not able to even vaguely estimate. 

My guess for the 3-5 months is mostly driven by the release of ChatGPT & GPT-4 which have both likely been released earlier than without Anthropic.

Comment by simeon_c (WayZ) on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T00:30:43.174Z · LW · GW

So I guess first you condition on alignment being solved when we win the race. Why do you think OpenAI/Anthropic are very different from DeepMind? 

Comment by simeon_c (WayZ) on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-09T02:22:28.296Z · LW · GW

Thanks for writing that up. 

I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of how one single thing could help massively. This core consideration is also why I don't think the "successful, careful AI lab" framing is right.

Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities progress and the chances of careless deployment, which in turn increases the chances of extinction pretty substantially.

Comment by simeon_c (WayZ) on Launching Lightspeed Grants (Apply by July 6th) · 2023-06-07T03:18:09.706Z · LW · GW

Extremely excited to see this new funder. 
I'm pretty confident that we can indeed find a significant number of new donors for AI safety since the recent Overton window shift. 

Chatting with people with substantial networks, it seemed to me like a centralized non-profit fundraising effort could probably raise at least $10M. Happy to intro you to those people if relevant @habryka

And reducing the processing time is also very exciting. 

So thanks for launching this.

Comment by simeon_c (WayZ) on My Assessment of the Chinese AI Safety Community · 2023-04-26T13:33:38.221Z · LW · GW

Thanks for writing this.

Overall, I don't like the post much in its current form. There's ~0 evidence (e.g. from Chinese newspapers) and very little actual argumentation. I like that you give us a local view, but adding a few links to back your claims would be very much appreciated. Right now it's hard to update on your post, given that the claims are very empirical and come without any external sources.

A more minor point: "A domestic regulation framework for nuclear power is not a strong signal for a willingness to engage in nuclear arms reduction" — I also disagree with this statement. I think it's definitely a signal.

Comment by simeon_c (WayZ) on Deep learning models might be secretly (almost) linear · 2023-04-26T08:23:24.497Z · LW · GW

@beren in this post, we find that our method (Causal Direction Extraction) captures a lot of the gender difference in 2 dimensions, in a linearly separable way. Skimming that post might be of interest for your hypothesis. 

In the same post, though, we suggest that it's unclear how well the logit lens "works": the direction that best encodes a given concept likely changes by a small angle at each layer, so that two directions best encoding the same concept 15 layers apart have a cosine similarity <0.5.

But what seems plausible to me is that ~all of the information relevant to a feature is encoded in a very small number of directions, which are slightly different at each layer. 
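As a toy numerical illustration of that claim (not the method from the post — the 2-D setup and the per-layer rotation angle here are made-up assumptions), a concept direction rotating by ~4.5° per layer ends up with cosine similarity below 0.5 against itself 15 layers later:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two direction vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rotate(v, angle):
    """Rotate a 2-D vector by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = np.radians(4.5)   # assumed (illustrative) per-layer rotation angle
base = np.array([1.0, 0.0])  # direction encoding the concept at layer 0

direction = base.copy()
for _ in range(15):       # the same concept's direction 15 layers later
    direction = rotate(direction, theta)

print(cosine_similarity(base, direction))  # cos(15 * 4.5°) ≈ 0.38, i.e. < 0.5
```

Small per-layer angle changes compound, so nearby layers share most of their "feature direction" while distant layers do not — consistent with the <0.5 cosine similarity mentioned above.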

Comment by simeon_c (WayZ) on The Agency Overhang · 2023-04-21T08:22:55.615Z · LW · GW

I'd add that it's not an argument to make models agentic in the wild. It's just an argument to be already worried.

Comment by simeon_c (WayZ) on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2023-04-19T16:33:28.140Z · LW · GW

Thanks for writing that up Charbel & Gabin. Below are some elements I want to add.

In the last 2 months, I spent more than 20h with David talking and interacting with his ideas and plans, especially in technical contexts. 
As I spent more time with David, I got extremely impressed by the breadth and depth of his knowledge. David has cached answers to a surprisingly high number of technically detailed questions on his agenda, which suggests that he has pre-computed a lot of things regarding it (even though it sometimes looks very weird at first sight). I don't think I've ever met anyone as smart as him. 

Regarding his ability to devise a high-level plan that works in practice: David has built a technically impressive cryptocurrency (today ranked 22nd) following a similar methodology, i.e. devising the plan from first principles. 

Finally, I'm excited that David seems to have a good ability to build ambitious coalitions with researchers, which is a great upside for governance and for such an ambitious proposal. Indeed, he has a strong track record of convincing researchers to work on his ideas after a couple of hours of conversation, because he often has very good ideas about their own fields.

These elements, combined with my increasing worry that scaling LLMs at breakneck speed is not far from certain to kill us, make me want to back this proposal heavily and pour a lot of resources into it. 

I'll thus personally dedicate, in my own capacity, time and resources to try to speed that up, in the hope (10-20%) that in a couple of years it could become a credible alternative to scaled LLMs. 

Comment by simeon_c (WayZ) on AI Takeover Scenario with Scaled LLMs · 2023-04-17T07:43:51.410Z · LW · GW

I'll focus on 2 first, given that it's the most important. 2. I would expect sim2real not to be too hard for foundation models, because they're trained over massive distributions which allow, and force, them to generalize to near neighbors. E.g. I think it wouldn't be too hard for an LLM to generalize some knowledge from stories to real life if it had an external memory, for instance. I'm not certain, but I feel like robotics is more sensitive to details than plans are (which is why I'm mentioning a simulation here). Finally, regarding long horizons, I agree that it seems hard, but I worry that at current capability levels you can already build ~any reward model, because LLMs, given enough inferences, seem generally very capable at evaluating stuff.

  1. I agree that it's not something which is very likely. But I disagree that "nobody would do that". People would do it if it were useful.

  2. I've asked some ML engineers, and it does happen that you don't look at it for a day. I don't think that deploying it in the real world changes much. Once again, you're also assuming a pretty advanced form of security mindset.

Comment by simeon_c (WayZ) on Navigating AI Risks (NAIR) #1: Slowing Down AI · 2023-04-16T20:52:40.096Z · LW · GW

Yes, I definitely think that countries with strong deontological norms will try harder to solve some narrow versions of alignment than those that tolerate failures. 

I think that's quite reassuring, and it means it's quite reasonable to focus a lot on the US in our governance approaches.

Comment by simeon_c (WayZ) on Campaign for AI Safety: Please join me · 2023-04-15T13:15:34.515Z · LW · GW

I think it's misleading to state it that way. There were definitely dinners and discussions with people around the creation of OpenAI. 
Months before the creation of OpenAI, there was a discussion including Chris Olah, Paul Christiano, and Dario Amodei about starting OpenAI: "Sam Altman sets up a dinner in Menlo Park, California to talk about starting an organization to do AI research. Attendees include Greg Brockman, Dario Amodei, Chris Olah, Paul Christiano, Ilya Sutskever, and Elon Musk."

Comment by simeon_c (WayZ) on You are probably not a good alignment researcher, and other blatant lies · 2023-02-09T07:03:48.030Z · LW · GW

Also, I think it's fine to have lower chances of being an excellent alignment researcher for that reason. What matters is having impact, not being an excellent alignment researcher. E.g. I don't go all-in on a technical career myself essentially for that reason, combined with the fact that I have other traits that might let me go further into the impact tail in other relevant subareas. 

Comment by simeon_c (WayZ) on You are probably not a good alignment researcher, and other blatant lies · 2023-02-09T06:59:44.539Z · LW · GW

If I try to think about someone's IQ (which I don't normally do, except for the message above, where I tried to pick a specific number to make my claim precise), I feel like I can produce an ordering I'm not too uncertain about, on a scale that includes me, some common reference classes (e.g. the median student of school X has IQ Y), and a few people around me who have taken IQ tests. By the way, I'd be happy to bet on anyone (e.g. from the list of SERI MATS mentors) who agreed to reveal their IQ, if you think my claim is wrong. 

Comment by simeon_c (WayZ) on You are probably not a good alignment researcher, and other blatant lies · 2023-02-03T05:32:47.528Z · LW · GW

Thanks for writing that. 

Three thoughts that come to mind: 

  • I feel like a more accurate claim is something like "beyond a certain IQ, we don't know what makes a good alignment researcher", which I think is a substantially weaker claim than the one underlying your post. I also think that the fact that the probability of being a good alignment researcher increases with IQ is relevant, if true (and I think it's very likely to be true, as in most sciences, where Nobel laureates are usually outliers along that axis). 
  • I would also expect predictors from other research fields to roughly apply (e.g. conscientiousness). 
  • This post doesn't cover what seems to me the most important reason why advice of the form "it seems like, given features X and Y, you're more likely to be able to fruitfully contribute to Z" (which seems adjacent to the claims you're criticizing) is sometimes given: the person's opportunity cost. 
Comment by simeon_c (WayZ) on What 2026 looks like · 2023-01-03T20:23:10.934Z · LW · GW

I think that yes, it is reasonable to say that GPT-3 is obsolete. 
Also, you mentioned loads of AGI startups being created in 2023, while a lot of that already happened in 2022. How many more AGI startups do you expect in 2023? 

Comment by simeon_c (WayZ) on Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers? · 2022-12-31T15:42:40.694Z · LW · GW

But I don't expect these kinds of understanding to transfer well to understanding Transformers in general, so I'm not sure it's high priority.

The point is not necessarily to improve our understanding of Transformers in general, but that if we're pessimistic about interpretability on dense transformers (as markets are, see below), we might be better off speeding up capabilities on architectures we think are much more interpretable.

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-28T10:42:11.935Z · LW · GW

The idea that EVERY government is dumb and won't figure out a not-too-bad way to allocate its resources toward AGI seems highly unlikely to me. There seem to be many mechanisms by which this could fail to be the case (e.g. national defense is highly involved and a bit more competent, the strategy is designed in collaboration with some competent people from the private sector, etc.). 

To be more precise, I'd be surprised if none of these 7 countries had an ambitious plan that meaningfully changed the strategic landscape post-2030: 

  • US 
  • Israel 
  • UK
  • Singapore
  • France
  • China 
  • Germany
Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-28T00:13:17.570Z · LW · GW

I guess I'm a bit less optimistic on the ability of governments to allocate funds efficiently, but I'm not very confident in that. 

A fairly dumb-but-efficient strategy that I'd expect some governments to take is "give more money to SOTA orgs" or "give some core roles to SOTA orgs in your Manhattan Project". That seems likely to me, and it would have substantial effects. 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-28T00:09:42.154Z · LW · GW

Unfortunately, good compute governance takes time. E.g., if we want to implement hardware-based safety mechanisms, we first have to develop them, convince governments to implement them, and then they have to be put on the latest chips, which take several years to dominate compute. 

This is a very interesting point. 

I think that some "good compute governance", such as monitoring big training runs, doesn't require on-chip mechanisms, but I agree that any measure involving substantial hardware modifications would probably take a lot of time.

note that compute gov likely requires government levers, so this clashes a bit with your other statement

I agree that some governments might be involved, but I think it will look very different from "national government policy". My model of international coordination is that there are a couple of people involved in each government, and what's needed to move the position of those people (and thus, essentially, of a country) is not comparable with national policy. 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-27T23:59:26.339Z · LW · GW

What I'm confident in is that they're more likely to be ahead by then than they are now or within a couple of years. As I said, otherwise my confidence is ~35% that China catches up (or becomes better) by 2035, which is not huge? 

My reasoning is that they've been better at optimizing ~everything than the US, mostly because of their centralization and norms (not caring too much about human rights helps with optimizing), which is why I think it's likely they'll catch up. 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-27T23:54:45.814Z · LW · GW

Mostly because they have a lot of resources and thus can weigh a lot in the race once they enter it. 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T21:15:52.239Z · LW · GW

Thanks for your comment! 

I see your point about fear spreading causing governments to regulate. I basically agree that if that's what happens, it's good to be in a position to shape the regulation in a positive way, or at least to try. I still think I'm more optimistic about corporate governance, which seems more tractable to me than policy governance. 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T20:52:42.149Z · LW · GW

The points you make are good, especially in the second paragraph. My model is that if scale is all you need, then it's likely that indeed smaller startups are also worrying. I also think that there could be visible events in the future that would make some of these startups very serious contenders (happy to DM about that). 

Having a clear map of who works in corporate governance and who works more towards policy would be very helpful. Is there anything like a "map/post of who does what in AI governance" or anything like that? 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T11:57:00.111Z · LW · GW

Have you read note 2? If note 2 was made more visible, would you still think that my claims imply a too high certainty? 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T11:54:59.307Z · LW · GW

I hesitated to decrease the likelihood on that one based on your consideration, to be honest, but I still think that 30% for strong effects is quite a lot, because, as you mentioned, it requires the intersection of many conditions. 

In particular, you don't mention which interventions you expect from them. If you take the intervention I used as a reference class ("Constrain labs to airgap and box their SOTA models while they train them"), do you think there are measures that are as "extreme" or more so, and that are likely? 

What might be misleading in my statement is that it could be understood as "let's drop national government policy", while it's more "I think that currently too many people are focused on national government policy and not enough on corporate governance, which puts us in a fairly bad position for pre-2030 timelines". 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T11:47:27.113Z · LW · GW

Thanks for your comment! 

First, keep in mind that when people in industry and policymaking talk about "AI", they usually have mostly non-deep-learning or vision deep-learning techniques in mind, simply because they mostly don't know the ML academic field but have heard that "AI" was becoming important in industry. So this sentence is little evidence that Russia (or any other country) is trying to build AGI, and I'm at ~60% that Putin wasn't thinking about AGI when he said it. 

If anyone who could play any role at all in developing AGI (or uncontrollable AI as I prefer to call it) isn't trying to develop it by now, I doubt very much that any amount of public communication will change that. 

I think you're deeply wrong about this. Policymakers and people in industry, at least until ChatGPT, had no idea what was going on (e.g. at the AI World Summit 2 months ago, very few people even knew about GPT-3). SOTA large language models were not really properly deployed, so nobody cared about them or even knew about them (until ChatGPT, at least). The level of investment right now in top training runs probably doesn't go beyond $200M. The GDP of the US is 20 trillion; likewise for China. Even a country like France could unilaterally put $50 billion into AGI development and accelerate timelines quite a lot within a couple of years. 

Even post-ChatGPT, people are very bad at projecting what it means for the coming years, and they still hold the prior that human intelligence is very specific and can't be beaten, which prevents them from realizing the full power of this technology.

I really strongly encourage you to talk to actual people from industry and policy to get a sense of their knowledge on the topic, and I would strongly recommend not publishing your book until you have. I also hope that a lot of people who have thought about these issues have proofread your book, because it's the kind of thing that could really increase P(doom) substantially.

I think that to make your point, it would be easier to defend the line that "even if more governments got involved, that wouldn't change much". I don't think that's right, because if you gave some labs $10B more, they'd likely move much faster. But I think it's less clear-cut. 

a common, clear understanding of the dangers

I agree that it would be something good to have. But the question is: is it even possible to have such a thing? 

I think that within the scientific community it's roughly possible (but then your book/outreach medium must be highly targeted at that community). Within the general public, I think it's ~impossible. Climate change, a problem which is much easier to understand and explain, is already way too complex for the general public to have a good idea of the risks and of the promising solutions to those risks (e.g. a lot of people's top priorities are to eat organic food, recycle, and cut plastic consumption).

I agree that communicating with the scientific community is good, which is why I said that you should avoid publicizing only among "the general public". If you really want to publish a book, I'd recommend targeting the scientific community, which is not at all the same audience as the general public. 


"On the other hand, if most people think that strong AI poses a significant risk to their future and that of their kids, this might change how AI capabilities researchers are seen, and how they see themselves"

I agree with this theory of change, and I think it points much more towards "communicate within the ML community" than "communicate towards the general public". Publishing great AI capabilities is mostly cool for other AI researchers, not so much for the general public. People in San Francisco (where most of the AGI labs are) also don't care much about the general public and what it thinks; the subculture there, and what is considered "cool", is really different from what the general public finds cool. As a consequence, I think they mostly care about what their peers think of them. So if you want to change the incentives, I'd recommend focusing your efforts on the scientific & tech communities. 

Comment by simeon_c (WayZ) on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T10:05:05.026Z · LW · GW

[Cross-posting my answer]
Thanks for your comment! 
That's an important point that you're bringing up. 

My sense is that at the movement level, the consideration you bring up is super important. Indeed, even though I have fairly short timelines, I would like funders to hedge for long timelines (e.g. fund China AI safety work). Thus I think that big actors should keep their full distribution in mind when optimizing their resource allocation. 

That said, despite that, I have two disagreements: 

  1. I feel like at the individual level (i.e. people working in governance, for instance, or even organizations), it's too expensive to optimize over a distribution, and thus you should probably optimize with a strategy of "I want to have solved my part of the problem by 20XX". And for that purpose, identifying the main characteristics of the strategic landscape at that point (which this post tries to do) is useful.
  2. "the EV gained in the worlds where things move quickly is not worth the expected cost in worlds where they don't." I disagree with this statement, even at the movement level. For instance, the trade-off "should we fund this project, which is not the ideal one but still quite good?" is one that funders often encounter, and I would expect funders to have more risk aversion than necessary, because when you're not highly time-constrained that's probably the best strategy (i.e. in every field except AI safety, it's probably a much better strategy to trade off a couple of years for better founders).


Finally, I agree that "the best strategies will have more variance" is not good advice for everyone. The reason I decided to write it anyway is that I think the AI governance community tends to have too high a degree of risk aversion (which is a good feature in their daily jobs), which mechanically penalizes a decent number of actions that are far more useful under shorter timelines. 

Comment by simeon_c (WayZ) on Conjecture: a retrospective after 8 months of work · 2022-11-24T10:00:15.917Z · LW · GW

To get a better sense of people's standards on "cut at the hard core of alignment", I'd be curious to hear examples of work that has done so.

Comment by simeon_c (WayZ) on Your posts should be on arXiv · 2022-08-30T06:02:54.392Z · LW · GW

It would be worth paying someone to do this in a centralized way:

  1. Reach out to authors
  2. Convert to LaTeX, edit
  3. Publish

If someone is interested in doing this, reach out to me (campos.simeon

Comment by simeon_c (WayZ) on A central AI alignment problem: capabilities generalization, and the sharp left turn · 2022-07-13T23:11:02.238Z · LW · GW

Do you think we could use grokking or currently observed generalization phenomena (e.g. induction heads) to test your theory? Or do you expect the generalizations that would lead to the sharp left turn to be greater/more significant than those that occur earlier in training? 

Comment by simeon_c (WayZ) on Benchmark for successful concept extrapolation/avoiding goal misgeneralization · 2022-07-07T10:41:44.305Z · LW · GW


Comment by simeon_c (WayZ) on Getting GPT-3 to predict Metaculus questions · 2022-05-12T10:46:11.746Z · LW · GW

Thanks for trying! I don't think that's much evidence against GPT-3 being a good oracle, though, because it seems pretty normal to me that it can't forecast without fine-tuning. It would need to be extremely sample-efficient to be able to do that. Does anyone want to try fine-tuning?

Comment by simeon_c (WayZ) on New GPT3 Impressive Capabilities - InstructGPT3 [1/2] · 2022-03-15T08:30:23.987Z · LW · GW

Cost: You get basically 3 months free with GPT-3 Davinci (175B) (under a given limit, but one that is sufficient for personal use), and then you pay as you go. Even if you use it a lot, you're likely to pay less than $5 or $10 per month. 
And if you have tasks that need a lot of tokens but aren't too hard (e.g. hard reading comprehension), Curie (GPT-3 6B) is often enough and is much cheaper to use!

In few-shot settings (i.e., settings where you show examples of something so the model reproduces the pattern), Curie is often very good, so it's worth trying! 

Merits: It's just a matter of the cost and inference speed you need. The biggest models are almost always better, so taking the biggest one you can afford, in terms of both speed and cost, is a good heuristic.

Use: It's very easy to use with the new Instruct models. You just put in your prompt and it completes it. The only parameters you have to care about are max tokens (basically the maximum size of the completion you want) and temperature (a parameter that affects how "creative" the answer is; the higher, the more creative).
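As a back-of-the-envelope illustration of the pay-as-you-go cost model described above (the per-token prices below are placeholders I made up for the example, NOT OpenAI's actual rates; check their pricing page):

```python
# Illustrative prices in USD per 1,000 tokens -- placeholders, NOT actual rates.
PRICE_PER_1K_TOKENS = {"davinci": 0.06, "curie": 0.006}

def monthly_cost(model: str, tokens_per_day: int, days: int = 30) -> float:
    """Estimated monthly cost for a given model and daily token usage."""
    return PRICE_PER_1K_TOKENS[model] * tokens_per_day / 1000 * days
```

Under these assumed prices, even fairly heavy personal use of the largest model (say 2,000 tokens a day) comes out to a few dollars a month, and Curie is 10x cheaper still.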

Comment by simeon_c (WayZ) on New GPT3 Impressive Capabilities - InstructGPT3 [1/2] · 2022-03-15T08:04:11.801Z · LW · GW

Thanks for the feedback! I will think about it and maybe try to do something along those lines!

Comment by simeon_c (WayZ) on Prizes for ELK proposals · 2022-01-07T10:09:48.336Z · LW · GW

Are there existing models for which we're pretty sure we know all their latent knowledge? For instance, small language models or something like that.

Comment by simeon_c (WayZ) on Prizes for ELK proposals · 2022-01-07T08:23:32.848Z · LW · GW

Thanks for the answer! The post you mentioned indeed is quite similar!

Technically, the strategies I suggested in my last two paragraphs (leveraging the fact that we're able to verify solutions to problems we can't solve, and giving partial information to an algorithm while using more information ourselves to verify) should make it possible to go far beyond human intelligence/knowledge using a lot of different narrowly accurate algorithms. 

And thus, if the predictor has seen many extremely (narrowly) smart algorithms, it would be much more likely to know what it is like to be much smarter than a human on a variety of tasks. This probably still requires some optimism about generalization: technically, the counterexample could happen in the gap between the capability of the predictor and the capability of the reporter. One question, then, is: do we expect some narrow algorithms to be much better at very specific tasks than general-purpose algorithms (such as the predictor)? If so, the generalization the reporter would have to do from training data (the capabilities of humans plus narrowly accurate algorithms) to inference data (the predictor's capabilities) could be small. We could even include data on the predictor's capability in the training dataset using the second approach I mentioned (i.e., giving partial information to the predictor, e.g. one camera in the SmartVault, and using more information, i.e. more cameras, for the humans who verify its predictions). We could give some training examples showing the AI how the human fails much more often than the predictor on the exact same sample of examples. That way, we could greatly reduce the generalization gap required. 

The advantage of this approach is that the bulk of the additional training cost the reporter requires comes from generating the dataset, which is a fixed cost that no user has to pay again. So that could slightly reduce the competitiveness issues compared with approaches that change the training procedure.

Despite all that, thanks to your comment and the report, I see why the approach I mention might have some intrinsic limitations in its ability to elicit latent knowledge. The problem is this: even if the model roughly understands that it has an incentive to use most of what it knows when we ask it to simulate the prediction of someone with its own characteristics (or a 1400 IQ), ELK is looking for a global maximum (we want it to use ALL of its knowledge). So there's always uncertainty about whether it really understood that point for extreme levels of intelligence, or whether it's just fitting the training data as closely as possible and thus still withholding something it knows.

Comment by simeon_c (WayZ) on Prizes for ELK proposals · 2022-01-06T09:35:14.242Z · LW · GW

You said that naive questions were tolerated, so here's a scenario for which I can't figure out why it wouldn't work.

It seems to me that an AI failing to predict the truth (because it predicts as humans would) comes from the AI having built an internal model of how humans understand things, and predicting based on that understanding. So if we assume that an AI is able to build such an internal model, why wouldn't we train an AI to predict what a (benevolent) human would say, given a certain amount of information and a certain capacity to process it? Doing so, it could develop an understanding of why humans predict badly, and then understand that, given a huge amount of information and a huge capacity to process it, the true answer is the right one.

A concrete training procedure could use the fact that even among humans, there’s a lot of variance in :

  • what they know (i.e., the amount of information they have)
  • their capacity to process information (in the case of the vault, for instance, the capacity to infer what happened from partial information/partial images, given a limited ability to process them, say no more than x images per second)

So we could use the variance in humans' capacity to understand what happened to make the AI understand that benevolent humans predict badly whether the diamond has been stolen only because they lack information or the capacity to process it. There are a lot of fields and questions where the majority of humans are wrong and only a small subset is right, because there is a big gap either in the information they have or in their ability to process it.

Once we had trained that AI on humans, we would do inference while specifying the AI's true capabilities, to ensure that it tells us the truth. The remaining question is whether such an out-of-sample prediction would work. My guess is that if we also included examples with the human Bayes net, to add more variance to the training dataset, it would probably reduce the chances of failure.

Finally, the concrete problem of specifying the information humans have access to is not trivial, but I think it's feasible.

I don’t understand why it wouldn’t work, so I’d be happy to have an answer to better understand the problem!

EDIT: Here's an update after a discussion about my proposal. After having read Learning the Prior, I think that the key novelty of my proposal, if any, is to give the model input information about the reasoning capacity of the person/entity that predicts an outcome. So here are a few relevant features we could give it: 

  • In the case of the vault, we could give an estimate of the number of images a human is able to see and understand per second.
  • IQ (or equivalent) when the task involves reasoning
  • Accuracy on a few benchmark datasets of the AI assistant that's helping the human think (the human's Bayes net)
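To make the training procedure above concrete, here is a toy sketch. Everything in it is a made-up illustration, not part of the original proposal: observers come with a capability feature, low-capability observers just repeat what the (tampered) camera shows, and the trained reporter can be queried at the predictor's own, high capability level:

```python
# Toy sketch only: names, thresholds, and the capability encoding are all
# hypothetical, chosen just to make the idea concrete.
TRUTH = False        # assumed ground truth: the diamond was stolen
CAMERA_SHOWS = True  # but the tampered camera still shows a diamond

def observer_answer(capability: float) -> bool:
    # Assumption: only sufficiently capable observers see through the
    # tampering; the rest just report what the camera shows.
    return TRUTH if capability >= 0.7 else CAMERA_SHOWS

# Training data: (capability feature, answer) pairs from many observers.
data = [(c / 10, observer_answer(c / 10)) for c in range(11) for _ in range(50)]

def answer_at_capability(min_capability: float) -> bool:
    """Majority answer among observers at least this capable -- a stand-in
    for querying a trained model at a given capability level."""
    votes = [ans for cap, ans in data if cap >= min_capability]
    return sum(votes) > len(votes) / 2
```

Under these assumptions, querying at low capability (`answer_at_capability(0.0)`) returns the camera's wrong answer, while querying at high capability (`answer_at_capability(0.8)`) recovers the truth, which is the behavior the proposal hopes generalizes.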


That said, I feel like the main problem is knowing whether such a model would do well out of distribution (i.e., on problems no human is able to solve). Using the approach I suggested, we should be able to exploit the great variation in capability among humans and algorithms to increase the chances that our algorithm does well when it's much more capable. 

In particular, I thought of two approaches that could enable us to go far beyond human capacity. We could:

  • leverage the fact that we are able to verify solutions to some problems we can't solve. There might be algorithms very good at solving problems that no human can solve, whose results we can still verify (and thus label). For some mathematical problems, for instance, an algorithm's reasoning might be beyond any human's understanding, yet we could still verify its solution. If we gave the predictor's Bayes net the characteristics of an algorithm much better than us at solving some problem, together with the labels of its answers, that would be a way to leverage narrowly superhuman algorithms to generate more variance in the training dataset and decrease the chance of failure at inference time. I find this approach interesting because it would expose the predictor's Bayes net to algorithms that are both superhuman and have a lot of information (more than any human), which is exactly the situation the predictor would face at inference time.
  • use more information than a (narrowly) smarter entity has, in order to verify its predictions. In the case of the vault, for instance, there could be situations where no human could predict from partial information whether the diamond was stolen, but we could then just use better cameras ourselves to verify the predictions of a good algorithm accurately. That way, we would also be able to communicate to the predictor's Bayes net what it is like to be incredibly smart (more than any human) and to make good predictions in that regime. Hiding information from extremely accurate narrow algorithms would let us keep verifying their predictions while exposing the predictor's Bayes net to very smart entities in multiple domains. The problem with this approach, though, is that it doesn't give the predictor examples of both "maximal information" and "maximal intelligence", because it relies on always hiding part of the information from the algorithm so that we can still verify its claims.
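The first approach above relies on a general asymmetry: for many problems, verifying a candidate solution is far cheaper than producing one. As a toy stand-in (my example, not from the original discussion), integer factorization lets us label the outputs of a solver we couldn't match ourselves:

```python
def verify_factorization(n: int, factors: list[int]) -> bool:
    """Cheap verification of a solution we may be unable to produce:
    multiply the proposed factors back together and compare with n."""
    if not factors or any(f <= 1 for f in factors):
        return False
    product = 1
    for f in factors:
        product *= f
    return product == n
```

For instance, `verify_factorization(8051, [83, 97])` accepts a correct factorization while a wrong proposal like `[81, 99]` is rejected, so even a solver far smarter than us at factoring can have its outputs labeled for a training set.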


Despite all that, I don't know whether I'd expect the algorithm to remain truthful asymptotically. But this could greatly widen the distribution on which it's truthful.

Comment by simeon_c (WayZ) on How to turn money into AI safety? · 2021-08-26T15:27:13.170Z · LW · GW

I think that "There are many talented people who want to work on AI alignment, but are doing something else instead." is likely to be true. I met at least 2 talented people who tried to get into AI Safety but who weren't able to because open positions / internships were too scarce. One of them at least tried hard (i.e applied for many positions and couldn't find one (scarcity), despite the fact that he was one of the top french students in ML). If there was money / positions, I think that there are chances that he would work on AI alignment independently.
Connor Leahy in one of his podcasts mentions something similar aswell.

That's the impression I have.