Hi habryka, I don't really know how best to respond to such a comment. First, I would like to say thank you for your well-wishes, assuming you did not mean them sarcastically. Maybe I have lost the plot, and if so, I do appreciate help in recovering it. Secondly, I feel confused as to why you would say such things in general.
Just last month, my coauthors and I released a 100+ page explanation/treatise on AI extinction risk that gives a detailed account of where AGI risk comes from and how it works. It was received warmly by LW and the general public alike, and it continues to be updated and actively publicised.
In parallel, our sister org ControlAI, a non-profit policy advocacy org focused solely on extinction risk prevention that I work with frequently, has published A Narrow Path, a similarly extensive writeup on principles of regulation to address xrisk from ASI, which ControlAI and I have pushed and discussed extensively with policymakers of multiple countries, and there are other regulation-promoting projects ongoing.
I have been on CNN, BBC, Fox News and other major news sources warning in no uncertain terms about the risks. There are literally dozens of hours of podcast material, including from just last month, where I explain in excruciating depth the existential risk posed by AGI systems, where it comes from, and how it differs from other forms of AI risk. If you think all my previous material has "lost the plot", then, well, I guess in your eyes I never had it, and there is not much I can do.
This post is a technical agenda that is not framed in the usual LW ideological ontology, and it has not been optimized to appeal to that audience. Rather, it aims to identify an angle that is tractable, generalizes the problem without losing its core, and leads to solutions that address the hard core, which is Complexity. In the limit, if we had beautifully simple, legible designs for ASIs that we fully understood and could predict, technical xrisk (though not governance) would be effectively solved. If you disagree with this, I would have greatly enjoyed your engagement on which object-level points you think are wrong, and it might have helped me write a better roadmap.
But it seems to me that you have not even tried to engage with the content of this post at all, and have instead merely asserted that it is a "random rant against AI-generated art" and "name-calling." I see no effort beyond surface-level pattern matching, nor any curiosity about how it might fit with my previous writings and thinking that have been shared and discussed.
Do you truly think that's the best effort at engaging in good faith you can make?
If so, I don't know what I can say that would help. I hope we can both find the plot again, since neither of us seem to see it in the other person.
Morality is multifaceted and multilevel. If you have a naive form of morality that is just "I do whatever I think is the right thing to do", you are not coordinating or being moral, you are just selfish.
Coordination is not inherently always good. You can coordinate with one group to more effectively do evil against another. But scalable Good is always built on coordination. If you want to live in a lawful, stable, scalable, just civilization, you will need to coordinate with your civilization and neighbors and make compromises.
As a citizen of a modern country, you are bound by the social contract. Part of the social contract is "individuals are not allowed to use violence against other individuals, except in certain circumstances like self defense." [1] Now you might argue that this is a bad contract or whatever, but it is the contract we play by (at least in the countries I have lived in), and I think unilaterally reneging on that contract is immoral. Unilaterally saying "I will expose all of my neighbors to risk of death from AGI because I think I'm a good person" is very different from "we all voted and the majority decided building AGI is a risk worth taking."
Now, could it be that you in some exceptional circumstances need to do something immoral to prevent some even greater tragedy? Sure, it can happen. Murder is bad, but self defense can make it on net ok. But just because it's self defense doesn't make murder moral, it just means there was an exception in this case. War is bad, but sometimes countries need to go to war. That doesn't mean war isn't bad.
Civilization is all about commitments, and honoring them. If you can't honor your commitments to your civilization, even when you disagree with them sometimes, you are not civilized and are flagrantly advertising your defection. If everyone does this, we lose civilization.
Morality is actually hard, and scalable morality/civilization is much, much harder. If an outcome you dislike happened because of some kind of consensus, this has moral implications. If someone put up a shitty statue that you hate in the town square because he's an asshole, that's very different morally from "everyone in the village voted, and they like the statue and you don't, so suck it up." If you think "many other people want X and I want not X" has no moral implications whatsoever, your "morality" is just selfishness.[2]
Hi, as I was tagged here, I will respond to a few points. There are a bunch of smaller points only hinted at that I won't address. In general, I strongly disagree with the overall conclusion of this post.
There are two main points I would like to address in particular:
1 More information is not more Gooder
There seems to be a deep underlying confusion here that in some sense more information is inherently good, or will inherently result in good things winning out. This is very much the opposite of what I generally claim about memetics. Saying that all information is good is like saying all organic molecules or cells are equally good. No! Adding more biosludge and toxic algal blooms to your rose garden won't make it better!
Social media is the exact living proof of this. People genuinely thought social media would bring everyone together, resolve conflicts, and create a globally unified culture of peace and democracy, that autocracy and bigotry couldn't possibly thrive if only people had enough information. I consider this hypothesis thoroughly invalidated. "Increasing memetic evolutionary pressure" is not a good thing! (all things equal)
Increasing the evolutionary pressure on the flu virus doesn't make the world better, and viruses mutate a lot faster than nice fluffy mammals. Most mutations in fluffy mammals kill them; mutations in viruses help them far more. Value is fragile. It is asymmetrically easier to destroy than to create.
Raw evolution selects for fitness/reproduction, not Goodness. You are just feeding the Great Replicator.
For an accessible intro to some of this, I recommend the book "Nexus" by Yuval Harari. (not that I endorse everything in that book, but the first half is great)
2 "Pivotal Act" style theories of change
You talk about theories of change of the form "we safety people will keep everything secret and create an aligned AI, ship it to big labs and save the world before they destroy it (or directly use the AI to stop them)". I don't endorse, and in fact strongly condemn, such theories of change.
Not because of the hiding-information part, though, but because of the "we will not coordinate with others and will use violence unilaterally" part! Such theories of change are fundamentally immoral for the same reasons labs building AGI is immoral. We have a norm in our civilization that we, as private citizens, do not threaten to harm or greatly upend the lives of our fellow civilians without either their consent or societal/governmental/democratic authority.
The not sharing information part is fine! Not all information is good! For example, Canadian researchers a while back figured out how to reconstruct an extinct form of smallpox, and then published how to do it. Is it a good thing for the world to have that information out there? I don't think so. Should we open source the blueprints of the F-35 fighter jet? I don't think so, I think it's good that I don't have those blueprints!
Information is not inherently good! Not sharing information that would make the world worse is virtuous. Now, you might be wrong about the effects of sharing the information you have, sure, but claiming there is no tradeoff or the possibility that sharing might actually, genuinely, be bad, is just ignoring why coordination is hard.
3 Conclusion
If you ever find yourself thinking something of the shape "we must simply unreservedly increase [conceptually simple variable X], with no tradeoffs", you're wrong. Doesn't matter how clever you think X is, you're wrong. Any real-life, non-fake, complex thing is made of towers upon towers of tradeoffs. If you think there are no tradeoffs in whatever system you are looking at, you don't understand the system.
Memes are not our friends. Conspiracy theories and lies spread faster than complex, nuanced truth. The printing press didn't bring the scientific revolution; it brought the witch burnings and the Thirty Years' War. The scientific revolution came from the Royal Society and its nuanced, patient, complex norms of critical inquiry. Yes, spreading your scientific papers was also important, but it was necessary, not sufficient, for a good outcome.
More mutation/evolution, all things equal, means more cancer, not more health and beauty. Health and beauty can come from cancerous mutation and selection, but it's not a pretty process, and it requires a lot of bloody, bloody trial and error (and a good selection function). It is the kind of inefficient and morally abominable process I would prefer us not to rely on.
With that being said, I think it's good that you wrote these things down and are thinking about them; please don't take what I'm saying as any kind of personal disparagement. I wish more people wrote down their ideas and tried to think things through! I think there are indeed a lot of valuable ideas in this direction, around better norms, tools, processes and memetic growth, but they're just really quite non-trivial! You're on your way to thinking critically about morality, coordination and epistemology, which is great! That's where I think the real solutions are!
Nice set of concepts, I might use these in my thinking, thanks!
I don't understand what point you are trying to make, to be honest. There are certain problems that humans/I care about that we/I want NNs to solve, and some optimizers (e.g. Adam) solve those problems better or more tractably than others (e.g. SGD or second order methods). You can claim that the "set of problems humans care about" is "arbitrary", to which I would reply "sure?"
Similarly, I want "good" "philosophy" to be "better" at "solving" "problems I care about." If you want to use other words for this, my answer is again "sure?" I think this is a good use of the word "philosophy" that gets better at what people actually want out of it, but I'm not gonna die on this hill because of an abstract semantic disagreement.
"good" always refers to idiosyncratic opinions, I don't really take moral realism particularly seriously. I think there is "good" philosophy in the same way there are "good" optimization algorithms for neural networks, while also I assume there is no one optimizer that "solves" all neural network problems.
I strongly disagree and do not think that will be how AGI will look, AGI isn't magic. But this is a crux and I might be wrong of course.
I can't rehash my entire views on coordination and policy here, I'm afraid, but in general I believe we are currently on a double exponential timeline (though I wouldn't model it quite like you do, the conclusions are similar enough), and I think some simple-to-understand and straightforwardly implementable policy (in particular, compute caps) would at least move us to a single exponential timeline.
I'm not sure we can get policy that can stop the single exponential (which is software improvements), but there are some ways, and at least we will then have additional time to work on compounding solutions.
Sure, it's not a full solution, it just buys us some time, but I think it would be a non-trivial amount, and let's not let the perfect be the enemy of the good and whatnot.
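To make the double-vs-single exponential intuition concrete, here is a minimal toy sketch (my own illustrative construction, not the actual model referenced in the discussion): capability compounds at a rate proportional to available compute, compute itself compounds exponentially, and a hard compute cap freezes one of the two compounding factors. All numbers (growth rates, the cap level) are arbitrary assumptions chosen only for illustration.

```python
def trajectory(steps, compute=1.0, compute_growth=1.5, cap=None):
    """Toy capability trajectory: growth rate scales with available compute.

    With uncapped compute, both factors compound, giving double-exponential
    growth; a hard cap freezes compute, leaving a single exponential.
    """
    capability = 1.0
    path = []
    for _ in range(steps):
        capability *= 1.0 + 0.1 * compute   # growth rate proportional to compute
        if cap is None or compute * compute_growth <= cap:
            compute *= compute_growth        # compute keeps compounding until capped
        path.append(capability)
    return path

uncapped = trajectory(20)
capped = trajectory(20, cap=10.0)  # hypothetical FLOP-style ceiling

# The cap does not stop growth (software progress continues), it slows it.
assert capped[-1] < uncapped[-1]
```

The point of the sketch is only the qualitative shape: capping one compounding factor buys time, it does not halt progress driven by the remaining factor.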
I see regulation as the most likely (and most accessible) avenue that can buy us significant time. The obvious move, fmpov, is to put compute caps in place: make it illegal to do training runs above a certain FLOP level. Other possibilities are strict liability for model developers (developers, not just deployers or users, held criminally liable for any damage caused by their models), global moratoria, "CERN for AI" and similar. Generally, I endorse the proposals here.
None of these are easy, of course, there is a reason my p(doom) is high.
But what happens if AI deception then gets solved relatively quickly (or someone comes up with a proposed solution that looks good enough to decision makers)? And this is another way that working on alignment could be harmful from my perspective...
Of course if a solution merely looks good, that will indeed be really bad, but that's the challenge of crafting and enforcing sensible regulation.
I'm not sure I understand why it would be bad if it actually is a solution. If it is, great: p(doom) drops, because now we are much closer to making aligned systems that can help us grow the economy, do science, stabilize society, etc. Though of course this moves us into a "misuse risk" paradigm, which is also extremely dangerous.
In my view, this is just how things are, there are no good timelines that don't route through a dangerous misuse period that we have to somehow coordinate well enough to survive. p(doom) might be lower than before, but not by that much, in my view, alas.
I think this is not an unreasonable position, yes. I expect the best way to achieve this would be to make global coordination and epistemology better/more coherent...which is bottlenecked by us running out of time, hence why I think the pragmatic strategic choice is to try to buy us more time.
One of the ways I can see a "slow takeoff/alignment by default" world still going bad is that in the run-up to takeoff, pseudo-AGIs are used to hypercharge memetic warfare/mutation load to the degree that basically every living human is functionally insane, and then even an aligned AGI can't (and wouldn't want to) "undo" that.
Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?
The fact you ask this question is interesting to me, because in my view the opposite question is the more natural one: what kind of questions can you make progress on without constant grounding and dialogue with reality? Such grounding is the default of how we humans build knowledge and solve hard new questions. The places where we do best and get led astray the least are exactly those areas where we can get as much feedback from reality, in as tight loops, as possible. So if we are trying to tackle ever more lofty problems, it becomes ever more important to get exactly that feedback wherever we can! From my point of view, this is the default of successful human epistemology, and the exception should be viewed with suspicion.
And for what it's worth, acting in the real world, building a company, raising money, debating people live, building technology, making friends (and enemies), absolutely helped me become far, far less confused, and far more capable of tackling confusing problems! Actually testing my epistemology and rationality against reality, and failing (a lot), has been far more helpful for deconfusing everything from practical decision making skills to my own values than reading/thinking could have ever been in the same time span. There is value in reading and thinking, of course, but I was in a severe "thinking overhang", and I needed to act in the world to keep learning and improving. I think most people (especially on LW) are in an "action underhang."
"Why do people do things?" is an empirical question, it's a thing that exists in external reality, and you need to interact with it to learn more about it. And if you want to tackle even higher level problems, you need to have even more refined feedback. When a physicist wants to understand the fundamentals of reality, they need to set up insane crazy particle accelerators and space telescopes and supercomputers and what not to squeeze bits of evidence out of reality and actually ground whatever theoretical musings they may have been thinking about. So if you want to understand the fundamentals of philosophy and the human condition, by default I expect you are going to need to do the equivalent kind of "squeezing bits out of reality", by doing hard things such as creating institutions, building novel technology, persuading people, etc. "Building a company" is just one common example of a task that forces you to interact a lot with reality to be good.
Fundamentally, I believe that good philosophy should make you stronger and allow you to make the world better, otherwise, why are you bothering? If you actually "solve metaphilosophy", I think the way this should end up looking is that you can now do crazy things. You can figure out new forms of science crazy fast, you can persuade billionaires to support you, you can build monumental organizations that last for generations. Or, in reverse, I expect that if you develop methods to do such impressive feats, you will necessarily have to learn deep truths about reality and the human condition, and acquire the skills you will need to tackle a task as heroic as "solving metaphilosophy."
Everyone dying isn't the worst thing that could happen. I think from a selfish perspective, I'm personally a bit more scared of surviving into a dystopia powered by ASI that is aligned in some narrow technical sense. Less sure from an altruistic/impartial perspective, but it seems at least plausible that building an aligned AI without making sure that the future human-AI civilization is "safe" is not a good thing to do.
I think this grounds out into object level disagreements about how we expect the future to go, probably. I think s-risks are extremely unlikely at the moment, and when I look at how best to avoid them, most such timelines don't go through "figure out something like metaphilosophy", but more likely through "just apply bog standard decent humanist deontological values and it's good enough." A lot of the s-risk in my view comes from the penchant for maximizing "good" that utilitarianism tends to promote, if we instead aim for "good enough" (which is what most people tend to instinctively favor), that cuts off most of the s-risk (though not all).
To get to the really good timelines, that route through "solve metaphilosophy", there are mandatory previous nodes such as "don't go extinct in 5 years." Buying ourselves more time is powerful optionality, not just for concrete technical work, but also for improving philosophy, human epistemology/rationality, etc.
I don't think I see a short path to communicating the parts of my model that would be most persuasive to you here (if you're up for a call or irl discussion sometime lmk), but in short I think of policy, coordination, civilizational epistemology, institution building and metaphilosophy as closely linked and tractable problems, if only it wasn't the case that there was a small handful of AI labs (largely supported/initiated by EA/LW-types) that are deadset on burning the commons as fast as humanly possible. If we had a few more years/decades, I think we could actually make tangible and compounding progress on these problems.
I would say that better philosophy/arguments around questions like this is a bottleneck. One reason for my interest in metaphilosophy that I didn't mention in the OP is that studying it seems least likely to cause harm or make things worse, compared to any other AI related topics I can work on. (I started thinking this as early as 2012.) Given how much harm people have done in the name of good, maybe we should all take "first do no harm" much more seriously?
I actually respect this reasoning. I disagree strategically, but I think this is a very morally defensible position to hold, unlike the mental acrobatics necessary to work at the x-risk factories because you want to be "in the room".
Which also represents an opportunity...
It does! If I was you, and I wanted to push forward work like this, the first thing I would do is build a company/institution! It will both test your mettle against reality and allow you to build a compounding force.
Is it actually that weird? Do you have any stories of trying to talk about it with someone and having that backfire on you?
Yup, absolutely. If you take even a microstep outside of the EA/rat-sphere, these kinds of topics quickly become utterly alien to anyone. Try explaining to a politician worried about job loss, or a middle-aged housewife worried about her future pension, or a young high school dropout unable to afford housing, that actually we should be worried about whether we are doing metaphilosophy correctly, to ensure that future immortal superintelligences reason correctly about acausal alien gods from math-space, so that those gods don't cause them to torture trillions of simulated souls! This is exaggerated for comedic effect, but this is really what even relatively intro-level LW philosophy by default often sounds like to many people!
As the saying goes, "Grub first, then ethics." (though I would go further and say that people's instinctive rejection of what I would less charitably call "galaxy brain thinking" is actually often well calibrated)
As someone that does think about a lot of the things you care about at least some of the time (and does care pretty deeply), I can speak for myself why I don't talk about these things too much:
Epistemic problems:
- Mostly, the concept of "metaphilosophy" is so hopelessly broad that you kinda reach it by definition by thinking about any problem hard enough. This isn't a good thing: when you have a category so large it contains everything (not saying this applies to you, but it applies to many other people I have met who talked about metaphilosophy), it usually means you are confused.
- Relatedly, philosophy is incredibly ungrounded and epistemologically fraught. It is extremely hard to think about these topics in ways that actually eventually cash out into something tangible, rather than nerdsniping young smart people forever (or until they run out of funding).
- Further on that, it is my belief that good philosophy should make you stronger, and this means that fmpov a lot of the work that would be most impactful for making progress on metaphilosophy does not look like (academic) philosophy, and looks more like "build effective institutions and learn interactively why this is hard" and "get better at many scientific/engineering disciplines and build working epistemology to learn faster". Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused. I might be totally wrong, but I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would.
- It is not clear to me that there even is an actual problem to solve here. Similar to e.g. consciousness, it's not clear to me that people who use the word "metaphilosophy" are actually pointing to anything coherent in the territory at all, or even if they are, that it is a unique thing. It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it, and there is no "right way" to do philosophy, similar to how there are no "right preferences". I know the other view, ofc, and it's still worth engaging with in case there is something deep and universal to be found (the same way we found that there are actually deep equivalencies and "correct" ways to think about e.g. computation).
Practical problems:
- I have short timelines and think we will be dead if we don't make very rapid progress on extremely urgent practical problems like government regulation and AI safety. Metaphilosophy falls into the unfortunate bucket of "important, but not (as) urgent" in my view.
- There are no good institutions, norms, groups, funding etc to do this kind of work.
- It's weird. I happen to have a very deep interest in the topic, but it costs you weirdness points to push an idea like this when you could instead be advocating more efficiently for more pragmatic work.
- It was interesting to read about your successive jumps up the meta hierarchy, because I had a similar path, but then I "jumped back down" when I realized that most of the higher levels are kinda just abstract, confusing nonsense, and that even really "philosophically concerned" communities like EA routinely fail basic morality such as "don't work at organizations accelerating existential risk". We are by no means currently bottlenecked by not having reflectively consistent theories of anthropic selection or whatever. I would like to get to a world where we have bottlenecks like that, but we are so, so far away from a world where that kind of stuff is why the world goes bad that it's hard to justify more than some late-night/weekend thought on the topic, in between a more directly bottleneck-focused approach.
All that being said, I still am glad some people like you exist, and if I could make your work go faster, I would love to do so. I wish I could live in a world where I could justify working with you on these problems full time, but I don't think I can convince myself this is actually the most impactful thing I could be doing at this moment.
Yep, you see the problem! It's tempting to just think of an AI as "just the model", and study that in isolation, but that just won't be good enough longterm.
Looks good to me, thank you Loppukilpailija!
Thanks!
As I have said many, many times before, Conjecture is not a deep shift in my beliefs about open sourcing. It is not, and has never been, the position of EleutherAI (at least while I was head) that everything should be released in all scenarios, but rather that some specific things (such as LLMs of the size and strength we release) should be released in some specific situations for some specific reasons. EleutherAI would not, and has not, released models or capabilities that would push the capabilities frontier (and while I am no longer in charge, I strongly expect that legacy to continue), and there are a number of things we discovered that we decided to delay or not release at all, precisely for such infohazard reasons. Conjecture, of course, is even stricter, and has opsec that wouldn't be possible in a volunteer-driven open source community.
Additionally, Carper is not part of EleutherAI and should be considered genealogically descendant but independent of EAI.
Thanks for this! These are great questions! We have been collecting questions from the community and plan to write a follow up post addressing them in the next couple of weeks.
I initially liked this post a lot, then saw a lot of pushback in the comments, mostly of the (very valid!) form of "we actually build reliable things out of unreliable things, particularly with computers, all the time". I think this is a fair criticism of the post (and of the choice of examples/metaphors therein), but I think it may be missing (one of) the core message(s) the post is trying to deliver.
I wanna give an interpretation/steelman of what I think John is trying to convey here (which I don't know whether he would endorse or not):
"There are important assumptions that need to be made for the usual kind of systems security design to work (e.g. uncorrelation of failures). Some of these assumptions will (likely) not apply with AGI. Therefor, extrapolating this kind of thinking to this domain is Bad™️." ("Epistemological vigilance is critical")
So maybe rather than saying "trying to build robust things out of brittle things is a bad idea", it's more like "we can build robust things out of certain brittle things, e.g. computers, but Godzilla is not a computer, and so you should only extrapolate from computers to Godzilla if you're really, really sure you know what you're doing."
I think this is something better discussed in private. Could you DM me? Thanks!
This is a genuinely difficult and interesting question that I want to provide a good answer for, but that might take me some time to write up, I'll get back to you at a later date.
Yes, we do expect this to be the case. Unfortunately, I think explaining in detail why we think this may be infohazardous. Or at least, I am sufficiently unsure about how infohazardous it is that I would first like to think about it for longer and run it through our internal infohazard review before sharing more. Sorry!
Answered here.
Redwood is doing great research, and we are fairly aligned with their approach. In particular, we agree that hands-on experience building alignment approaches could have high impact, even if AGI ends up having an architecture unlike modern neural networks (which we don’t believe will be the case). While Conjecture and Redwood both have a strong focus on prosaic alignment with modern ML models, our research agenda has higher variance, in that we additionally focus on conceptual and meta-level research. We’re also training our own (large) models, but (we believe) Redwood are just using pretrained, publicly available models. We do this for three reasons:
- Having total control over the models we use can give us more insights into the phenomena we study, such as training models at a range of sizes to study scaling properties of alignment techniques.
- Some properties we want to study may only appear in close-to-SOTA models - most of which are private.
- We are trying to make products, and close-to-SOTA models help us do that better. Though as we note in our post, we plan to avoid work that pushes the capabilities frontier.
We’re also for-profit, while Redwood is a nonprofit, and we’re located in London! Not everyone lives out in the Bay :)
For the record, having any person or organization in this position would be a tremendous win. Interpretable aligned AGI?! We are talking about a top 0.1% scenario here! Like, the difference between egoistical Connor vs altruistic Connor with an aligned AGI in his hands is much, much smaller than the difference between Connor with an aligned AGI and anyone (any organization, any scenario) with a misaligned AGI.
But let’s assume this.
Unfortunately, there is no actual functioning reliable mechanism by which humans can guarantee their alignment to each other. If there were something I could do that would irreversibly bind me to my commitment to the best interests of mankind in a publicly verifiable way, I would do it in a heartbeat. But there isn't, and most attempts at such are security theater.
What I can do is point to my history of acting in ways that, I hope, show my consistent commitment to doing what is best for the longterm future (even if of course some people with different models of what is “best for the longterm future” will have legitimate disagreements with my choices of past actions), and pledge to remain in control of Conjecture and shape its goals and actions appropriately.
On a meta-level, I think the best guarantee I can give is simply that not acting in humanity’s best interest is, in my model, Stupid. And my personal guiding philosophy in life is “Don’t Be Stupid”. Human values are complex and fragile, and while many humans disagree about many details of how they think the world should be, there are many core values that we all share, and not fighting with everything we’ve got to protect these values (or dying with dignity in the process) is Stupid.
Probably. It is likely that we will publish a lot of our interpretability work and tools, but we can't commit to that because, unlike some others, we think it's almost guaranteed that some interpretability work will lead to very infohazardous outcomes (for example, revealing obvious ways in which architectures could be trained more efficiently), and as such we need to consider each result on a case-by-case basis. However, if we deem them safe, we would definitely like to share as many of our tools and insights as possible.
We would love to collaborate with anyone (from academia or elsewhere) wherever it makes sense to do so, but we honestly just do not care very much about formal academic publication or citation metrics or whatever. If we see opportunities to collaborate with academia that we think will lead to interesting alignment work getting done, excellent!
Our current plan is to work on foundational infrastructure and models for Conjecture’s first few months, after which we will spin up prototypes of various products that can work with a SaaS model. After this, we plan to try them out and productify the most popular/useful ones.
More than profitability, our investors are looking for progress. Because of the current pace of progress, it would not be smart from their point of view to settle on a main product right now. That’s why we are mostly interested in creating a pipeline that lets us build and test out products flexibly.
Ideally, we would like Conjecture to scale quickly. Alignment-wise, in 5 years' time, we want to have the ability to take a billion dollars and turn it into many efficient, capable, aligned teams of 3-10 people working on parallel alignment research bets, and be able to do this reliably and repeatedly. We expect to be far more constrained by talent than anything else on that front, and are working hard on developing and scaling pipelines to hopefully alleviate such bottlenecks.
For the second question, we don't expect it to be a competing force (as in, we have people who could be working on alignment working on product instead). See point two in this comment.
This is why we will focus on SaaS products on top of our internal APIs that can be built by teams that are largely independent from the ML engineering. As such, this will not compete much with our alignment-relevant ML work. This is basically our thesis as a startup: We expect it to be EV+, as this earns much more money than we would have had otherwise.
To point 1: While we greatly appreciate what OpenPhil, LTFF and others do (and hope to work with them in the future!), we found that the hurdles required and strings attached were far greater than the laissez-faire silicon valley VC we encountered, and seemed less scalable in the long run. Also, FTX FF did not exist back when we were starting out.
While EA funds as they currently exist are great at handing out small-to-medium-sized grants, the ~8-digit investment we were looking for to get started ASAP was not something these kinds of orgs were generally interested in giving out (which seems to be changing lately!), especially to slightly unusual research directions and unproven teams. If our timelines were longer and the VC money had more strings attached (as some of us had expected before seeing it for ourselves!), we may well have gone another route. But the truth of the current state of the market is that if you want to scale to a billion dollars as fast as possible with the most founder control, this is the path we think is most likely to succeed.
To point 2: This is why we will focus on SaaS products on top of our internal APIs that can be built by teams that are largely independent from the ML engineering. As such, this will not compete much with our alignment-relevant ML work. This is basically our thesis as a startup: We expect it to be EV+, as this earns much more money than we would have had otherwise.
Notice this is a contingent truth, not an absolute one. If tomorrow, OpenPhil and FTX contracted us with 200M/year to do alignment work, this would of course change our strategy.
To point 3: We don’t think this has to be true. (Un)fortunately, given the current pace of capability progress, we expect keeping up with the pace to be more than enough for building new products. Competition on AI capabilities is extremely steep and not in our interest. Instead, we believe that (even) the (current) capabilities are so crazy that there is an unlimited potential for products, and we plan to compete instead on building a reliable pipeline to build and test new product ideas.
Calling it competition is actually a misnomer from our point of view. We believe there is ample space for many more companies to follow this strategy, still not have to compete, and turn a massive profit. This is how crazy capabilities and their progress are.
The founders have a supermajority of voting shares and full board control and intend to hold on to both for as long as possible (preferably indefinitely). We have been very upfront with our investors that we do not want to ever give up control of the company (even if it were hypothetically to go public, which is not something we are currently planning to do), and will act accordingly.
For the second part, see the answer here.
To address the opening quote - the copy on our website is overzealous, and we will be changing it shortly. We are an AGI company in the sense that we take AGI seriously, but it is not our goal to accelerate progress towards it. Thanks for highlighting that.
We don’t have a concrete proposal for how to reliably signal that we’re committed to avoiding AGI race dynamics beyond the obvious right now. There is unfortunately no obvious or easy mechanism that we are aware of to accomplish this, but we are certainly open to discussion with any interested parties about how best to do so. Conversations like this are one approach, and we also hope that our alignment research speaks for itself in terms of our commitment to AI safety.
If anyone has any more trust-inducing methods than us simply making a public statement and reliably acting consistently with our stated values (where observable), we’d love to hear about them!
To respond to the last question - Conjecture has been “in the making” for close to a year now and has not been a secret, we have discussed it in various iterations with many alignment researchers, EAs and funding orgs. A lot of initial reactions were quite positive, in particular towards our mechanistic interpretability work, and just general excitement for more people working on alignment. There have of course been concerns around organizational alignment, for-profit status, our research directions and the founders’ history with EleutherAI, which we all have tried our best to address.
Ultimately, we think whether or not the community approves of a project is a useful signal for whether it is a good idea, but not the whole story. We have our own idiosyncratic inside views that make us think our research directions are undervalued, so of course, from our perspective, other people will be less excited than they should be about what we intend to work on. We think more approaches and bets are necessary, so if we only worked on the most consensus-choice projects we wouldn't be doing anything new or undervalued. That being said, we don't think any of the directions or approaches we're tackling have been considered particularly bad or dangerous by large or prominent parts of the community, which is a signal we would take seriously.
We (the founders) have a research agenda distinct enough from most existing groups that simply joining them would mean incurring some compromises on that front. Also, joining existing research orgs is tough, especially if we want to continue along our own lines of research and have significant influence on their direction. We can't just walk in and say "here are our new frames for GPT, can we have a team to work on this asap?".
You're right that SOTA models are hard to develop. That being said, developing our own models is independently useful in many ways: it enables us to maintain controlled conditions for experiments and study things like scaling properties of alignment techniques or how models change throughout training, as well as being useful for any future products. We have a lot of experience in LLM development and training from EleutherAI, and expect it not to take up an inordinate amount of developer hours.
We are all in favor of high bandwidth communication between orgs. We would love to work in any way we can to set these channels up with the other organizations, and are already working on reaching out to many people and orgs in the field (meet us at EAG if you can!).
In general, all the safety orgs that we have spoken with are interested in this, and that’s why we expect/hope this kind of initiative to be possible soon.
See the reply to Michaël for answers as to what kind of products we will develop (TLDR we don’t know yet).
As for the conceptual research side, we do not do conceptual research with product in mind, but we expect useful corollaries to fall out by themselves for sufficiently good research. We think the best way of doing fundamental research like this is to just follow the most interesting, useful looking directions guided by the “research taste” of good researchers (with regular feedback from the rest of the team, of course). I for one at least genuinely expect product to be “easy”, in the sense that AI is advancing absurdly fast and the economic opportunities are falling from the sky like candy, so I don’t expect us to need to frantically dedicate our research to finding worthwhile fruit to pick.
The incubator has absolutely nothing to do with our for profit work, and is truly meant to be a useful space for independent researchers to develop their own directions that will hopefully be maximally beneficial to the alignment community. We will not put any requirements or restrictions on what the independent researchers work on, as long as it is useful and interesting to the alignment community.
We currently have a (temporary) office in the Southwark area, and are open to visitors. We’ll be moving to a larger office soon, and we hope to become a hub for AGI Safety in Europe.
And yes! Most of our staff will be attending EAG London. See you there?
See a longer answer here.
TL;DR: For the record, EleutherAI never actually had a policy of always releasing everything to begin with, and has always tried to consider each publication's pros and cons. But this is still a bit of a change from EleutherAI, mostly because we think it's good to be more intentional about what should or should not be published, even if one does end up publishing many things. EleutherAI is unaffected and will continue working open source. Conjecture will not be publishing ML models by default, but may do so on a case-by-case basis.
Our decision to open-source and release the weights of large language models was not a haphazard one, but was something we thought very carefully about. You can read my short post here on our reasoning behind releasing some of our models. The short version is that we think that the danger of large language models comes from the knowledge that they’re possible, and that scaling laws are true. We think that by giving researchers access to the weights of LLMs, we will aid interpretability and alignment research more than we will negatively impact timelines. At Conjecture, we aren’t against publishing, but by making non-disclosure the default, we force ourselves to consider the long-term impact of each piece of research and have a better ability to decide not to publicize something rather than having to do retroactive damage control.
EAI has always been a community-driven organization that people tend to contribute to in their spare time, around their jobs. I, for example, have had a day job of one sort or another for most of EAI's existence. So from this angle, nothing has changed aside from the fact that my job is more demanding now.
Sid and I still contribute to EAI on the meta level (moderation, organization, deciding on projects to pursue), but do admittedly have less time to dedicate to it these days. Thankfully, Eleuther is not just us - we have a bunch of projects going on at any one time, and progress for EAI doesn’t seem to be slowing down.
We are still open to the idea of releasing larger models with EAI, and funding may happen, but it’s no longer our priority to pursue that, and the technical lead of that project (Sid) has much less time to dedicate to it.
Conjecture staff will occasionally contribute to EAI projects, when we think it’s appropriate.
Answered here.
We strongly encourage in person work - we find it beneficial to be able to talk over or debate research proposals in person at any time, it’s great for the technical team to be able to pair program or rubber duck if they’re hitting a wall, and all being located in the same city has a big impact on team building.
That being said, we don’t mandate it. Some current staff want to spend a few months a year with their families abroad, and others aren’t able to move to London at all. While we preferentially accept applicants who can work in person, we’re flexible, and if you’re interested but can’t make it to London, it’s definitely still worth reaching out.
Currently, there is only one board position, which I hold. I also have a triple vote as insurance if we decide to expand the board. We don't plan to give up board control.
Thanks - we plan to visit the Bay soon with the team, we’ll send you a message!
We aren't committed to any specific product or direction just yet (we think there is a lot of low-hanging fruit we could decide to pursue). Luckily we have the independence to be able to initially spend a significant amount of time focusing on foundational infrastructure and research. Our product(s) could end up as some kind of API with useful models, interpretability tools or services, some kind of end-to-end SaaS product, or something else entirely. We don't intend to push the capabilities frontier, and don't think this would be necessary to be profitable.
TL;DR: For the record, EleutherAI never actually had a policy of always releasing everything to begin with, and has always tried to consider each publication's pros and cons. But this is still a bit of a change from EleutherAI, mostly because we think it's good to be more intentional about what should or should not be published, even if one does end up publishing many things. EleutherAI is unaffected and will continue working open source. Conjecture will not be publishing ML models by default, but may do so on a case-by-case basis.
Longer version:
First of all, Conjecture and EleutherAI are separate entities. The policies of one do not affect the other. EleutherAI will continue as it has.
To explain a bit of what motivated this policy: We ran into some difficulties when handling infohazards at EleutherAI. By the very nature of a public open source community, infohazard handling is tricky to say the least. I’d like to say on the record that I think EAI actually did an astoundingly good job not pushing every cool research or project discovery we encountered, for what it is. However, there are still obvious limitations to how well you can contain information spread in an environment that open.
I think the goal of a good infohazard policy should not be to make it as hard as possible to publish information or talk to people about your ideas in order to limit the possibility of secrets leaking, but rather to make any spreading of information more intentional. You can't undo the spreading of information; it's a one-way street. As such, the "by-default" component is what I think is important to allow actual control over what gets out and what does not. By having good norms around not immediately sharing everything you're working on or thinking about widely, you have more time to deliberate and consider whether keeping it private is the best course of action. And if not, then you can still publish.
That’s the direction we’re taking things with Conjecture. Concretely, we are working on writing a well thought out infohazard policy internally, and plan to get the feedback of alignment researchers outside of Conjecture on whether each piece of work should or should not be published.
We have the same plan with respect to our models, which we by default will not release. However, we may choose to do so on a case by case basis and with feedback from external alignment researchers. While this is different from EleutherAI, I’d note that EAI does not, and has never, advocated for literally publishing anything and everything all the time as fast as possible. EAI is a very decentralized organization, and many people associated with the name work on pretty different projects, but in general the projects EAI chooses to do are informed by what we considered net good to be working on publicly (e.g. EAI would not release a SOTA-surpassing, or unprecedentedly large model). This is a nuanced point about EAI policy that tends to get lost in outside communication.
We recognize that Conjecture’s line of work is infohazardous. We think it’s almost guaranteed that when working on serious prosaic alignment you will stumble across capabilities increasing ideas (one could argue one of the main constraints on many current models' usefulness/power is precisely their lack of alignment, so incremental progress could easily remove bottlenecks), and we want to have the capacity to handle these kinds of situations as gracefully as possible.
Thanks for your question and giving us the chance to explain!
I really liked this post, though I somewhat disagree with some of the conclusions. I think that in fact aligning an artificial digital intelligence will be much, much easier than working on aligning humans. To point towards why I believe this, think about how many "tech" companies (Uber, crypto, etc.) derive their value primarily from circumventing regulation (read: unfriendly egregore rent-seeking). By "wiping the slate clean" you can suddenly accomplish much more than working in a field where the enemy already controls the terrain.
If you try to tackle "human alignment", you will be faced with the coordinated resistance of all the unfriendly demons that human memetic evolution has to offer. If you start from scratch with a new kind of intelligence, a system that doesn't have to adhere to the existing hostile terrain (doesn't have to have the same memetic weaknesses as humans that are so optimized against, doesn't have to go to school, grow up in a toxic media environment etc etc), you can, maybe, just maybe, build something that circumvents this problem entirely.
That's my biggest hope with alignment (which I am, unfortunately, not very optimistic about, but I am even more pessimistic about anything involving humans coordinating at scale), that instead of trying to pull really hard on the rope against the pantheon of unfriendly demons that run our society, we can pull the rope sideways, hard.
Of course, that "sideways" might land us in a pile of paperclips, if we don't solve some very hard technical problems....
This was an excellent post, thanks for writing it!
But, I think you unfairly dismiss the obvious solution to this madness, and I completely understand why, because it's not at all intuitive where the problem in the setup of infinite ethics is. It's in your choice of proof system and interpretation of mathematics! (Don't use non-constructive proof systems!)
This is a bit of an esoteric point and I've been planning to write a post or even sequence about this for a while, so I won't be able to lay out the full arguments in one comment, but let me try to convey the gist (apologies to any mathematicians reading this and spotting stupid mistakes I made):
Joe, I don’t like funky science or funky decision theory. And fair enough. But like a good Bayesian, you’ve got non-zero credence on them both (otherwise, you rule out ever getting evidence for them), and especially on the funky science one. And as I’ll discuss below, non-zero credence is enough.
This is where things go wrong. The actual credence of seeing a hypercomputer is zero, because a computationally bounded observer can never observe such an object in a way that differentiates it from a finite approximation. As such, you should indeed have a zero percent probability of ever moving into a state in which you have performed such a verification; it is a logical impossibility. Think about what it would mean for you, a computationally bounded approximate Bayesian, to come into a state of belief that you are in possession of a hypercomputer (and not a finite approximation of a hypercomputer, which is just a normal computer; remember, arbitrarily large numbers are still infinitely far away from infinity!). What evidence would you have to observe to justify this belief? You would need to observe literally infinite bits, and your credence in observing infinite bits should be zero, because you are computationally bounded! If you yourself are not a hypercomputer, you can never move into the state of believing a hypercomputer exists.
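The argument can be made concrete with a toy sketch (my own illustrative construction, nothing from the original post): any finite interaction transcript with a purported halting oracle can be replayed exactly by an ordinary finite lookup table, so no finite amount of evidence ever separates "hypercomputer" from "finite approximation".

```python
# Toy illustration: a finite lookup table reproduces any finite
# transcript of oracle queries and answers, so a bounded observer can
# never distinguish a "real" hypercomputer from this finite fake.

def finite_imitation(transcript):
    """Build an ordinary (finite) function that replays a finite transcript."""
    table = dict(transcript)
    return lambda query: table[query]

# Pretend these answers came from a "halting oracle" we interrogated:
observed = [("program_1 halts?", True), ("program_2 halts?", False)]
fake_oracle = finite_imitation(observed)

# The finite fake agrees with the "oracle" on everything we observed.
assert all(fake_oracle(q) == a for q, a in observed)
```

However long the transcript grows, it stays finite, and the same construction goes through, which is the point of the "arbitrarily large is still infinitely far from infinite" remark.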
This is somewhat analogous to how Solomonoff inductors cannot model a universe containing themselves. Solomonoff inductors are "one step up in the halting hierarchy" from us and cannot model universes that contain "super-infinite objects" like themselves. Similarly, we cannot model universes that contain "merely infinite" objects (and, by extension, any super-infinite objects either); our Bayesian reasoning does not allow it!
I think the core of the problem is that, unfortunately, modern mathematics implicitly accepts classical logic as its basis of formalization, which is a problem because the Law of Excluded Middle (LEM) is an implicit halting oracle. The LEM says that every logical statement is either true or false. This makes intuitive sense, but is wrong: if you think of logical statements as programs whose truth value we want to evaluate by executing a proof search, there are in fact three "truth values": true, false, and uncomputable! This is a necessity because any axiom system worth its salt is Turing complete (this is essentially what Gödel showed in his incompleteness theorems; he used Gödel numbering because Turing machines did not yet exist to formalize the same idea), and therefore has programs that don't halt. Intuitionistic logic (the logic we tend to use to formalize type theory and computer science) doesn't have this implicit halting oracle, and in my humble opinion should be used for the formalization of mathematics, on pain of trading infinite universes for an avocado sandwich and a big lizard if we use classical logic.
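To make the three-valued picture concrete, here is a toy bounded proof search (my own illustrative construction; the axiom system is made up): theorems are the even numbers, generated from the axiom 0 by the rule n → n + 2. A bounded prover can answer "proved", but when its budget runs out it can only answer "unknown"; promoting that "unknown" to a definite "false" is exactly what would require a halting oracle.

```python
# Toy proof search. Axiom: 0 is a theorem. Inference rule: if n is a
# theorem, so is n + 2. A computationally bounded prover has three
# possible answers, not two: "proved", or "unknown" when the budget runs
# out. It cannot tell "unprovable" apart from "provable but expensive".

def bounded_prove(target: int, budget: int) -> str:
    theorem = 0
    for _ in range(budget):
        if theorem == target:
            return "proved"
        theorem += 2
    return "unknown"  # not "refuted": deciding that needs a halting oracle

assert bounded_prove(10, budget=100) == "proved"
assert bounded_prove(7, budget=100) == "unknown"      # genuinely unprovable
assert bounded_prove(10**6, budget=100) == "unknown"  # provable, budget too small
```

The last two calls return the same answer for very different reasons, which is the sense in which the LEM's clean true/false dichotomy smuggles in a halting oracle.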
“My own take, though, is that resting the viability of your ethics on something like ‘infinities aren’t a thing’ is a dicey game indeed, especially given that modern cosmology says that our actual concrete universe is very plausibly infinite”
Note that us using constructivist/intuitionistic logic does not mean that "infinities aren't a thing", it's a bit more subtle than that (and something I have admittedly not fully deconfused for myself yet). But basically, the kind of "infinities" that cosmologists talk about are (in my ontology) very different from the "super-infinities" that you get in the limit of hypercomputation. Intuitively, it's important to differentiate "inductive infinities" ("you need arbitrarily many steps to complete this computation") and "real infinities" ("the solution only pops out after infinity steps have been complete" i.e. a halting oracle).
The difference makes the most sense from the perspective of computational complexity theory. The universe is a "program" of complexity class PTIME/BQP (BQP is basically just the quantum version of PTIME), which means that you can evaluate the "next state" of the universe with at most PTIME/BQP computation. Importantly, this means that even if the universe is inflationary and "infinite", you could evaluate the state of any part of it in (arbitrarily large) finite time. There are no "effects that emerge only at infinity". The (evaluation of a given arbitrary state of the) universe halts. This is very different from the kinds of computations a hypercomputer is capable of (and less paradoxical). Which is why I found the following very amusing:
“baby-universes/wormholes/hyper-computers etc appear much more credible, at least, than ‘consciousness = cheesy-bread.’”
Quite the opposite! Or rather, one of those three things is not like the others. Baby-universes are in P/BQP, wormholes are in PSPACE (assuming by wormholes you mean closed timelike curves, which is afaik the default interpretation), and hypercomputers are halting-complete, which is ludicrously, insanely not even remotely like the other two. So in that regard, yes, I think consciousness being equal to cheesy-bread is more likely than finding a hypercomputer!
To be clear, when I talk about "non-constructive logic is Bad™" I don't mean that the actual literal symbolic mathematics is somehow malign (of course); it's the interpretation we assign to it. We think we're reasoning about infinite objects, but we're really reasoning about computable, weaker versions of those objects, and these are not the same thing. If one is maximally careful with one's interpretations, this is (theoretically) not a problem, but the difference of interpretation is so subtle that it is very difficult to disentangle in our mere human minds. I think this is at the heart of the problems with infinite ethics: because understanding what the correct mathematical interpretations are is so damn subtle and confusing, we find ourselves in bizarre scenarios that seem contradictory and insane, because we accidentally and naively extrapolate interpretations to objects they don't belong to.
I didn't do the best of jobs formally arguing for my point, and I'm honestly still 20% confused about this all (at least), but I hope I at least gave some interesting intuitions about why the problem might be in our philosophy of mathematics, not our philosophy of ethics.
P.S. I'm sure you've heard of it before, but on the off chance you haven't, I cannot recommend this wonderful paper by Scott Aaronson highly enough for a crash course in many of these kinds of topics relevant to philosophers.
I haven't read Critch in-depth, so I can't guarantee I'm pointing towards the same concept he is. Consider this a bit of an impromptu intuition dump, this might be trivial. No claims on originality of any of these thoughts and epistemic status "¯\_(ツ)_/¯"
The way I currently think about it is that multi-multi is the "full hard problem", and single-single is a particularly "easy" (still not easy) special case.
In a way, we're making some simplifying assumptions in the single-single case: that we have one (pseudo-Cartesian) "agent" that has some kind of definite (or at least boundedly complicated) values that can be expressed. This means we kind of have "just" the usual problems of a) expressing/extracting/understanding the values, insofar as that is possible (outer alignment) and b) making sure the agent actually fulfills those values (inner alignment).
Multi principals then relaxes this assumption into saying we don't have a "single" function, but multiple, which introduces another "necessary ingredient": Some kind of social choice theory "synthesis function", that can take in all the individual functions and spit out a "super utility function" that represents some morally acceptable amalgamation of the other functions (whatever that means). The single case is a simpler special case in that the synthesis function is the equivalent of the identity function, but that no longer works if you have multiple inputs.
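As a toy sketch of what such a "synthesis function" might look like (my own minimal formalization; the names and the utilitarian-sum rule are illustrative assumptions, and choosing the aggregation rule is exactly the hard social-choice problem being gestured at):

```python
# A synthesis function maps a list of principals' utility functions to a
# single aggregate utility function. The aggregation rule used here is a
# plain utilitarian sum -- just one of many possible (and morally loaded)
# social-choice options.

from typing import Callable, List

Utility = Callable[[str], float]  # maps an outcome to a utility

def synthesize(utilities: List[Utility]) -> Utility:
    def aggregate(outcome: str) -> float:
        return sum(u(outcome) for u in utilities)
    return aggregate

# Two hypothetical principals with conflicting preferences:
alice: Utility = lambda o: {"parks": 2.0, "roads": 1.0}[o]
bob: Utility   = lambda o: {"parks": 0.0, "roads": 3.0}[o]

# Single-principal special case: synthesis acts like the identity.
assert synthesize([alice])("parks") == alice("parks")

# Multi-principal case: a genuinely new "super utility function".
joint = synthesize([alice, bob])
assert joint("roads") == 4.0  # 1.0 + 3.0
assert joint("parks") == 2.0
```

Swapping the sum for, say, a minimum or a weighted median gives a different moral theory, which is why the single case (identity) hides the degree of freedom that makes multi harder.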
In a very simplistic sense, multi is "harder" because we are introducing an additional "degree of freedom". So you might argue we have outer alignment, inner alignment and "even-more-outerer alignment" or "multi-outer alignment" (which would be the synthesis problem), and you probably have to make hard (potentially irreconcilable) moral choices for at least the latter (probably for all).
In multi-multi, if the agents serve (or have different levels of alignment towards) different subsets of principals, this would then add the additional difficulty of game theory between the different agents and how they should coordinate. We can call that the "multi-inner alignment problem" or something, the question of how to get the amalgamation of competing agents to be "inner aligned" and not blow everything up and getting stuck in defect-defect spirals or whatever. (This reminds me a lot of what CLR works on)
Tbh, I am not sure if single-multi would be harder than or different from single-single just "applied multiple times". Maybe if the agents have different ideas of what the principal wants they could compete; that seems like a failure of outer alignment, though maybe it would be better cast as a kind of failure of "multi-inner alignment".
So in summary I think solutions (in so far as such a thing even exists in an objective fashion, which it may or may not) to the multi-multi problem are a superset of solutions to multi-single, single-multi and single-single. Vaguely, outer alignment = normativity/value learning, inner alignment = principal agent problem, multi-outer alignment = social choice, multi-inner alignment = game theory, and you need to solve all four to solve multi-multi. If you make certain simplifying assumptions which correspond to introducing "singles", you can ignore one or more of these (i.e. a single agent doesn't need game theory, a single principal doesn't need social choice).
Or something. Maybe the metaphor is too much of a stretch and I'm seeing spurious patterns.
I am so excited about this research, good luck! I think it's almost impossible this won't turn up at least some interesting partial results, even if the strong versions of the hypothesis don't work out (my guess would be you run into some kind of incomputability or incoherence results in finding an algorithm that works for every environment).
This is one of the research directions that make me the most optimistic that alignment might really be tractable!
This is a great intuition pump, thanks! It makes me appreciate just how, in a sense, weird it is that abstractions work at all. It seems like the universe could just not be constructed this way (though one could then argue that probably intelligence couldn't exist in such chaotic universes, which is in itself interesting). This makes me wonder if there is a set of "natural abstractions" that are a property of the universe itself, not of whatever learning algorithm is used to pick up on them. Seems highly relevant to value learning and the like.
Great writeup, thanks!
To add to whether or not kludge and heuristics are part of the theory: I've asked the Numenta people in a few AMAs about their work, and they've made clear they are working solely on the neocortex (and the thalamus), but the neocortex isn't the only thing in the brain. It seems clear that the kludge we know from the brain is still present, just maybe not in the neocortex. Limbic or other areas could implement kludge-style shortcuts which could bias what the more uniform neocortex learns or outputs. Given my current state of knowledge of neuroscience, the most likely interpretation of this kind of research is that the neocortex is a kind of large unsupervised world model that is connected to all kinds of other hardcoded, RL, or other systems, which all in concert produce human behavior. It might be similar to Schmidhuber's RNNAI idea, where an RL agent learns to use an unsupervised "blob" of compute to achieve its goals. Something like this is probably happening in the brain since, at least as far as Numenta's theories go, there is no reinforcement learning going on in the neocortex, which seems to contradict how humans work overall.