Thanks! I think your discussion of the new Meaning Alignment Institute publication (the Substack post and the paper) in the "Aligning a Smarter Than Human Intelligence is Difficult" section is very useful.
I wonder if it makes sense to republish it as a separate post, so that more people see it...
People are certainly looking at the alternatives.
For example, DSPy ("demonstrate-search-predict") looks very promising: https://github.com/stanfordnlp/dspy
(Their README contains Section 6 "FAQ", which compares a variety of existing alternatives, including LangChain.)
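To give a sense of the flavor, here is a minimal sketch of the DSPy style of declaring tasks as signatures rather than hand-written prompts. This is my own illustration, not from their README; the exact class names and configuration calls differ between DSPy releases (the model name below is also just an example), so treat the details as assumptions to check against their documentation.

```python
import dspy

# Assumed configuration for an OpenAI-backed LM; model name is illustrative.
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

# Declare the task as a signature; DSPy handles the actual prompting strategy.
qa = dspy.ChainOfThought("question -> answer")

response = qa(question="What does DSPy optimize instead of raw prompt strings?")
print(response.answer)
```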
Emmett Shear continues his argument that trying to control AI is doomed
I think that a recent tweet thread by Michael Nielsen and the quoted one by Emmett Shear represent genuine progress towards making AI existential safety more tractable.
Michael Nielsen observes, in particular:
As far as I can see, alignment isn't a property of an AI system. It's a property of the entire world, and if you are trying to discuss it as a system property you will inevitably end up making bad mistakes
Since AI existential safety is a property of the whole ecosystem (and is, really, not too drastically different from World existential safety), this should be the starting point, rather than stand-alone properties of any particular AI system.
Emmett Shear writes:
Hopefully you’ve validated whatever your approach is, but only one of these is stable long term: care. Because care can be made stable under reflection, people are careful (not a coincidence, haha) when it comes to decisions that might impact those they care about.
And Zvi responds
Technically I would say: Powerful entities generally caring about X tends not to be a stable equilibrium, even if it is stable ‘on reflection’ within a given entity. It will only hold if caring more about X provides a competitive advantage against other similarly powerful entities, or if there can never be a variation in X-caring levels between such entities that arises other than through reflection, and also reflection never causes reductions in X-caring despite this being competitively advantageous. Also note that variation in what else you care about to what extent is effectively variation in X-caring.
Or more bluntly: The ones that don’t care, or care less, outcompete the ones that care.
Even the best case scenarios here, when they play out the ways we would hope, do not seem all that hopeful.
That all, of course, sets aside the question of whether we could get this ‘caring’ thing to operationally work in the first place. That seems very hard.
Let's now consider this in light of what Michael Nielsen is saying.
I am going to only consider the case where we have plenty of powerful entities with long-term goals and long-term existence which care about their long-term goals and long-term existence. This seems to be the case Zvi is considering here, and it is the case we understand best, because we also live in a reality with plenty of powerful entities (ourselves, some organizations, etc.) with long-term goals and long-term existence. So this is an incomplete consideration: it only includes the scenarios where powerful entities with long-term goals and long-term existence retain a good fraction of the overall available power.
So what do we really need? What are the properties we want the World to have? We need a good deal of conservation and non-destruction, and we need the interests of the weaker members of the overall ecosystem, not just those of the currently smartest or most powerful, to be adequately taken into account.
Here is how we might be able to have a trajectory where these properties are stable, despite all drastic changes of the self-modifying and self-improving ecosystem.
An arbitrary ASI entity (just like an unaugmented human) cannot fully predict the future. In particular, it does not know where it might eventually end up in terms of relative smartness or relative power (relative to the most powerful ASI entities or to the ASI ecosystem as a whole). So if any given entity wants to be long-term safe, it is strongly interested in the ASI society having general principles and practices of protecting its members on various levels of smartness and power. If only the smartest and most powerful are protected, then no entity is long-term safe on the individual level.
This might be enough to produce an effective counterweight to unrestricted competition (just like human societies have mechanisms against unrestricted competition). Basically, smarter-than-human entities on all levels of power are likely to be interested in the overall society having general principles and practices of protecting its members on various levels of smartness and power, and that's why they'll care enough for the overall society to continue to self-regulate and to enforce these principles.
This is not yet the solution, but I think this is pointing in the right direction...
Thanks, this is very interesting.
I wonder if this approach is extendable to learning to predict the next word from a corpus of texts...
The first layer might perhaps still be an embedding from words to vectors, but what should one do then? What would be a possible minimum viable dataset?
Perhaps, in the spirit of the PoC in the paper, one might consider binary sequences of 0s and 1s, and have only two words, 0 and 1, and ask what it would take to have a good predictor of the next 0 or 1 given a long sequence of those as a context. This might be a good starting point, and then one might consider different instances of that problem (different examples of (sets of) sequences of 0s and 1s to learn from).
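As a concrete toy version of this idea, here is my own sketch (not from the paper): the sequence generator and the simple n-gram predictor below are arbitrary illustrative choices, just to show what "minimum viable" next-bit prediction could look like.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sequence(n=10_000):
    # Example generator: a noisy period-3 pattern; any rule could be substituted.
    base = np.tile([0, 1, 1], n // 3 + 1)[:n]
    noise = (rng.random(n) < 0.05).astype(int)
    return base ^ noise

def fit_ngram(seq, k=5):
    # Count next-bit frequencies for each length-k context.
    counts = {}
    for i in range(len(seq) - k):
        ctx = tuple(seq[i:i + k])
        counts.setdefault(ctx, [0, 0])[seq[i + k]] += 1
    return counts

def accuracy(counts, seq, k=5):
    correct, total = 0, 0
    for i in range(len(seq) - k):
        ctx = tuple(seq[i:i + k])
        c = counts.get(ctx)
        if c is None:
            continue  # unseen context: skip (or fall back to a prior)
        correct += int(np.argmax(c) == seq[i + k])
        total += 1
    return correct / max(total, 1)

seq = make_sequence()
train, test = seq[:8000], seq[8000:]
counts = fit_ngram(train)
print("held-out next-bit accuracy:", accuracy(counts, test))
```

One could then ask what a layered "semantic features" model would have to do to beat this kind of trivial baseline on harder generating rules.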
This looks interesting, thanks!
This post could benefit from an extended summary.
In lieu of such a summary, in addition to the abstract
This paper introduces semantic features as a candidate conceptual framework for building inherently interpretable neural networks. A proof of concept model for informative subproblem of MNIST consists of 4 such layers with the total of 5K learnable parameters. The model is well-motivated, inherently interpretable, requires little hyperparameter tuning and achieves human-level adversarial test accuracy - with no form of adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at this https URL
I'll quote a paragraph from Section 1.2, "The core idea"
This paper introduces semantic features as a general idea for sharing weights inside a neural network layer. [...] The concept is similar to that of "inverse rendering" in Capsule Networks where features have many possible states and the best-matching state has to be found. Identifying different states by the subsequent layers gives rise to controlled dimensionality reduction. Thus semantic features aim to capture the core characteristic of any semantic entity - having many possible states but being at exactly one state at a time. This is in fact a pretty strong regularization. As shown in this paper, choosing appropriate layers of semantic features for the [Minimum Viable Dataset] results in what can be considered as a white box neural network.
Is there a simple way to run against a given Kaggle competition after that particular competition is over?
These are reasonable benchmarks, but is there a way to make them less of a moving target, so that the ability to run against a given competition extends into the future?
When I ponder all this I usually try to focus on the key difficulty of AI existential safety. ASIs and ecosystems of ASIs normally tend to self-modify and self-improve rapidly. Arbitrary values and goals are unlikely to be preserved through radical self-modifications.
So one question is: which values and goals might be natural? Which values and goals might many members of the ASI ecosystem, including entities much smarter and much more powerful than unaugmented humans, be naturally inclined to preserve through drastic self-modifications, and which of those might also have corollaries that are good from the viewpoint of various human values and interests?
Basically, values and goals which are likely to be preserved through drastic self-modifications are probably non-anthropocentric, but they might have sufficiently anthropocentric corollaries, so that even the interests of unaugmented humans are protected (and also so that humans who would like to self-augment or to merge with AI systems can safely do so).
So what do we really need? We need a good deal of conservation and non-destruction, and we need the interests of the weaker members of the overall ecosystem, not just those of the currently smartest or most powerful, to be adequately taken into account.
What might the road towards adoption of those values (of preservation and protection of weaker entities and their interests) and incorporation of those kinds of goals be, so that these values and goals are preserved as the ASI ecosystem and many of its members self-modify drastically? One really needs the situation where many entities on varying levels of smartness and power care a lot about those values and goals.
An arbitrary ASI entity (just like an unaugmented human) cannot fully predict the future. In particular, it does not know where it might eventually end up in terms of relative smartness or relative power (relative to the most powerful ASI entities or to the ASI ecosystem as a whole). So if any given entity wants to be long-term safe, it is strongly interested in the ASI society having general principles and practices of protecting its members on various levels of smartness and power. If only the smartest and most powerful are protected, then no entity is long-term safe on the individual level.
This might be a reasonable starting point towards having values and goals which are likely to be preserved through drastic self-modifications and self-improvements and which are likely to imply good future for unaugmented humans and for the augmented humans as well.
Perhaps we can nudge things towards making this kind of future more likely. (The alternatives are bleak: unrestricted drastic competition and, perhaps, even warfare between supersmart entities, which would probably end badly not just for us, but for the ASI ecosystem as well)...
The worlds where artificial superintelligence (ASI) is coming very soon with only roughly current levels of compute, and where ASI by default goes catastrophically badly, are not worlds I believe we can afford to save.
For those kinds of worlds, we probably can't afford to save them by heavy-handed measures: not only is the price too high (given that it would have to be paid in all the other worlds too), but it would also not be possible to prevent evasion of those measures if it is already relatively easy to create ASI.
But this does not mean that more lightweight measures are hopeless. For example, looking at the thought process Ilya shared with us last year, what he has been pondering are various potentially feasible, relatively lightweight measures to change this default of ASI going catastrophically badly into more acceptable outcomes: Ilya Sutskever's thoughts on AI safety (July 2023): a transcript with my comments.
More people should do more thinking of this kind. AI existential safety is a pre-paradigmatic field of study. A lot of potentially good approaches and possible non-standard angles have not been thought of at all. A lot of potentially good approaches and possible non-standard angles have been touched upon very lightly and need to be explored further. The broad consensus in the "AI existential safety" community seems to be that we currently don't have good approaches we can comfortably rely upon for future ASI systems. This should encourage more people to look for completely novel approaches (and in order to cover the worlds where timelines are short, some of those approaches have to be simple rather than ultra-complicated, otherwise they would not be ready by the time we need them).
In particular, it is certainly true that
A smarter thing that is more capable and competitive and efficient and persuasive being kept under indefinite control by a dumber thing that is less capable and competitive and efficient and persuasive is a rather unnatural and unlikely outcome. It should be treated as such until proven otherwise.
So it makes sense to spend more time brainstorming approaches which would not require "a smarter thing being kept under indefinite control by a dumber thing". What might a relatively lightweight intervention be which would avoid ASI going catastrophically badly without trying to rely on this unrealistic "control route"? I think this is underexplored. We tend to impose too many assumptions, e.g. the assumption that ASI has to figure out a good approximation to our Coherent Extrapolated Volition and to care about it in order for things to go well. People might want to start from scratch instead. What if we just have the goal of good chances of a good trajectory conditional on ASI, and we don't assume that any particular feature of earlier approaches to AI existential safety is a must-have; only this goal is a must-have? This gives one a wider space of possibilities to brainstorm: what other properties, besides caring about a good approximation to our Coherent Extrapolated Volition, might be sufficient for the trajectory of the world with ASI to likely be good from our point of view?
I don't see a non-paywalled version of this paper, but here are two closely related preprints by the same authors:
"Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features", https://arxiv.org/abs/2212.13881
"Mechanism of feature learning in convolutional neural networks", https://arxiv.org/abs/2309.00570
Paul Christiano: "Catastrophic Misalignment of Large Language Models"
Talk recording: https://www.youtube.com/watch?v=FyTb81SS_cs (no transcript or chat replay yet)
Talk page: https://pli.princeton.edu/events/2024/catastrophic-misalignment-large-language-models
Sam Altman watch: He invests $180 million into Retro Bio to fight aging. I have no idea if they can execute, but this kind of investment is great and bodes many good things. Kudos.
Note that this looks like a relatively old investment; it's not another $180 million, it looks like the $180 million you might have heard about a while ago... (Gwern also thinks so.)
Jack Clark: Salary for tech staff of EU AI Office (develop and assess evaluations for gen models like big LLMs) is… 3,877 – 4,387 Euros a month, aka $4242 – $4800 USD.
Most tech co internships $8k+ a month.
He seems to be comparing an EU AI Office job with American tech salaries.
I think people tend to underestimate how much lower tech salaries tend to be in the EU compared to the US.
(Of course, things do strongly depend on the location as well, both within the EU and within the US.)
Like I said, if we try to apply forceful measures we might delay it for some time, at the price of people aging and dying from old age and illness to the tune of dozens of millions per year for as long as the delay continues. We might think the risks are so high that this price is worth it, and even that the price of everyone alive today eventually dying of old age is worth it, although some of us might disagree and say that taking the risk of foom is better than the guaranteed eventual death from old age or other illness; there is room for disagreement about which of these risks it is preferable to choose.
But if we are talking about avoiding foom indefinitely, we should start with asking ourselves, how easy or how difficult is it to achieve. How long before a small group of people equipped with home computers can create foom?
And the results of this analysis are not pretty. Computers are the ultimate self-modifying devices: they can produce the code that programs them. Methods for producing computer code much better than we do today are not all that difficult; they are just in a collective cognitive blindspot, like backpropagation was for a long time, like ReLU activations were for decades, like residual connections in neural nets were for an unreasonably long time. But this state of affairs, with those enhanced methods of producing new computer code remaining relatively unknown, will not last forever.
And just like backpropagation, ReLU, or residual connections, these methods are not all that difficult; it's not as if they would remain unknown just because a single "genius" who discovered them refrained from sharing them. People keep rediscovering and rediscovering those methods; they are not that tricky (backpropagation was independently discovered more than 10 times between 1970 and 1986, before people stopped ignoring it and started to use it).
It's just that the memetic fitness of those methods is currently low, just like the memetic fitness of backpropagation, ReLU, and residual connections used to be low in the strangest possible ways. But this will not last: the understanding of how to do competent self-improvement in small-scale software on ordinary laptops will gradually form and proliferate. At the end of the day, we'll end up having to either phase out universal computers (at least those as powerful as our home computers today) or find ways to control them very tightly, so that people are no longer free to program their own computers as they see fit.
Perhaps humans will choose to do that, I don't know. Nor do I know whether they would succeed in a Butlerian jihad of this kind, or whether some of the side effects of trying to get rid of computers would become X-risks themselves. In any case, in order to avoid foom indefinitely, people will have to take drastic, radical measures which would make everyone miserable, would kill a lot of people, and might create other existential risks.
I think it's safer if the most competent leading group tries to go ahead, and that our eventual chances of survival are higher along this path than along the likely alternatives.
I do think that the risks on the alternative paths are very high too: the great powers are continuing to inch towards major nuclear confrontation; we are enabling more and more people to create diverse super-covid-like artificial pandemics with 30% mortality or more; things are pretty bad in terms of the major risks this civilization is facing. Instead of asking "what's your P(doom)?" we should be asking people, "what's your P(doom) conditional on foom, and what's your P(doom) conditional on no foom happening?" My P(doom) is not small, but is it higher conditional on foom than conditional on no foom? I don't know...
I do expect a foom (via AGI or via other route), and my timelines are much shorter than 5 years.
But algorithms for AI are improving faster than hardware (people seem to quote a doubling in compute efficiency approximately every 8 months), so if one simply bans training runs above fixed compute thresholds, one trades a bit of extra time before a particular achievement for an increase in the number of active players achieving it a bit later (basically, this delays the best-equipped companies a bit and democratizes the race, which is not necessarily better).
We can make bans progressively tighter, so we can buy some time, but as the algorithms progress further, it is quite possible that we will at some point face the choice between banning computers altogether and facing a foom. So eventually we are likely going to face huge risks anyway.
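A toy calculation (my own, assuming the roughly-8-month efficiency doubling quoted above; the threshold number is purely illustrative) shows how quickly a fixed compute cap erodes in capability terms:

```python
# Toy illustration: if algorithmic efficiency doubles every ~8 months, a fixed
# training-compute cap corresponds to more and more capability over time.
CAP_FLOPS = 1e26          # assumed fixed threshold (illustrative number)
DOUBLING_MONTHS = 8       # assumed efficiency doubling time

def capability_equivalent_compute(months_elapsed: float) -> float:
    """Compute that a capped run is 'worth' after algorithmic progress."""
    return CAP_FLOPS * 2 ** (months_elapsed / DOUBLING_MONTHS)

for years in range(5):
    print(years, f"{capability_equivalent_compute(12 * years):.1e}")
# After ~4 years the same capped run is worth ~64x the original threshold,
# so the cap mainly delays the best-equipped labs while more actors catch up.
```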
I do think it's time to focus not on "aligning" or "controlling" self-modifying ecosystems of self-modifying super-intelligences, but on figuring out how to increase the chances that a possible foom goes well for us instead of killing us. I believe that thinking only in terms of "aligning" or "controlling" limits the space of possible approaches to AI existential safety, and that approaches not based on notions of "alignment" and "control" might be more fruitful.
And, yes, I think our chances are better if the most thoughtful of practitioners achieve that first. For example, Ilya Sutskever's thinking on the subject has been very promising (which is why I tend to favor OpenAI if he continues to lead the AI safety effort there, but I would be much more skeptical of them otherwise).
Militaries are certainly doing that, I agree.
However, I am not sure they are creative enough and not-control-freaks enough to try to build seriously self-modifying systems. They also don't mind spending tons of money and allocating large teams, so they might not be aiming for artificial AI researchers all that much. And they are afraid to lose control (they know how to control people, but artificial self-modifying systems are something else).
Whereas a team in a garage is creative, short on resources, and quite interested in creating a team of artificial co-workers to help them (success at that automatically leads to a serious recursive self-improvement situation), and might not hesitate to try other recursive self-improvement schemas (we are seeing more and more descriptions of novel recursive self-improvement schemas in recent publications). So they might end up with a foom even before they build more conventional artificial AI researchers (a sufficiently powerful self-referential metalearning schema might result in that; the typical experience is that all those recursive self-improvement schemas saturate disappointingly early, so teams will push harder at them, trying to prevent premature saturation, and someone might succeed too well).
Basically, having "true AGI" means being able to create competent artificial AI researchers, which are sufficient for very serious recursive self-improvement capabilities, but one might also obtain drastic recursive self-improvement capabilities way before achieving anything like "true AGI". "True AGI" is sufficient to start a far reaching recursive self-improvement, but there is no reason to think that "true AGI" is necessary for that (being more persistent at hacking the currently crippled self-improvement schemas and at studying ways to improve them might be enough).
Both of these points precisely reflect our current circumstances.
No, there is plenty of room between the current circumstances and the bottom. We might be back to Eliezer's "an unknown random team creates a fooming AI in a garage" old threat model, if we curtail the current high-end too much.
Just like there is plenty of room between legal pharmacy and black market for pain relievers (even when the name of the drug is the same).
It's very easy to make things worse.
It may not even be possible to accidentally make these two things worse with regulation.
It's probably possible. But regulation is often good, and we do need more regulation for AI.
In this post we are not talking about regulation, we are talking about prohibition, which is a different story.
a technical remark:
(2) Covered Company.
(A) In General. The term "covered company" means an entity that operates, directly or indirectly (including through a parent company, subsidiary, or affiliate), a website, desktop application, mobile application, or augmented or immersive technology application that meets all of the following criteria:
Permits user account creation: Allows users to create an account or profile to generate, share, and view text, images, videos, real-time communications, or similar content.
Has a large user base: Has more than 1,000,000 monthly active users for at least two out of the three months preceding a relevant determination by the President under subsection (3)(B).
Enables user-generated content: Enables one or more users to generate or distribute content that can be viewed by other users of the platform.
Allows viewing of user-generated content: Enables one or more users to view content generated by other users of the platform.
(B) EXCLUSION.—The term "covered company" does not include an entity that operates a website, desktop application, mobile application, or augmented or immersive technology application whose primary purpose is to allow users to post product reviews, business reviews, or travel information and reviews.
So it needs to be some form of social media, of some kind: You need to have a million users, account creation and viewing and creation of user-generated content.
Note that Russian or Chinese equivalents of Gmail also seem to match. A large user base (the law does not even state that this has to be a US user base), and users certainly do generate and view content.
(And yes, a foreign adversary control over e-mail servers used by people living here might easily be a natsec problem, even if there are no viral influence aspects.)
So, deliberately or accidentally, the true technical reach of the bill seems to be somewhat larger than just social media...
There are some world-destroying things that we have to ban, for now; for everything else, there's Mastercard libertarian techno-optimism.
This seems to suggest that gradually banning and phasing out almost all computers is not on the table (which is probably a good thing, as I am "not sure" we would want to live in a computer-less society). I can imagine something as radical as gradually banning and phasing out almost all computers actually working (or not, depending on the structure of global society, and also noting that the sacrifice would be pretty bad).
But if these kinds of radical measures are not on the table, I think the main effects of a Prohibition will be similar to the effects of our drug war: the availability is lower, the quality is often pretty bad, and the potency is occasionally very high; there might indeed be an additional lag before this kind of extra-high potency emerges.
So, translating to advanced AI: "the quality is often pretty bad" translates to all kinds of safety measures often being non-existent, "the potency is occasionally very high" translates to completely unregulated and uncontrolled spikes of capability (possibly including "true foom"), "there might indeed be an additional lag before this kind of extra high potency emerges" translates to indeed buying some time in exchange for the consequences down the road.
I am not sure that the eventual P(doom) is lower, if we do try to go down this road, and I would not be surprised if this would make the eventual P(doom) higher.
Discussion: https://news.ycombinator.com/item?id=39673265
WaybackMachine link: https://web.archive.org/web/20240312002413/https://www.newyorker.com/magazine/2024/03/18/among-the-ai-doomsayers
Time to AGI, in weeks
I like it better than the usual habit of giving this estimate in years.
Especially with the mode of https://www.metaculus.com/questions/5121/date-of-artificial-general-intelligence/ being near the border between 2026 and 2027.
The link no longer works, but here are currently working links for this paper:
https://arxiv.org/abs/1711.09883
https://github.com/google-deepmind/ai-safety-gridworlds
(And the original link is presumably replaced by this one: https://deepmind.google/discover/blog/specifying-ai-safety-problems-in-simple-environments/)
Yes, it's a great topic. The aspect which seems to be missing from "AI capabilities can be significantly improved without expensive retraining", https://arxiv.org/abs/2312.07413 is that post-training is a particularly fertile ground for rapid turnaround self-modification and recursive self-improvement, as post-training tends to be rather lightweight and usually does not include a delay of training a novel large model.
Some recent capability works in that direction include, for example:
- "Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation", https://arxiv.org/abs/2310.02304
- "Language Agents as Optimizable Graphs", https://arxiv.org/abs/2402.16823
People who are specifically concerned with rapid foom risks might want to focus on this aspect of the situation. These self-improvement methods currently saturate in a reasonably safe zone, but they are getting stronger both due to novel research, and due to improvements of the underlying LLMs they tend to rely upon.
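For readers unfamiliar with the STOP setup, here is a highly simplified sketch of the loop (my own paraphrase of the idea, not the authors' code; `llm_propose_rewrite` is a hypothetical stand-in for an LLM call, and `utility` for whatever downstream benchmark is being optimized):

```python
def improve(program_source: str, utility, llm_propose_rewrite, n_candidates: int = 4) -> str:
    """Return the best-scoring rewrite of program_source (or the original)."""
    candidates = [program_source]
    candidates += [llm_propose_rewrite(program_source) for _ in range(n_candidates)]
    return max(candidates, key=utility)

def self_improvement_loop(improver_source: str, utility, llm_propose_rewrite, rounds: int = 3) -> str:
    # The key step: the improver is applied to its own source code each round,
    # so better scaffolding is used to search for even better scaffolding.
    for _ in range(rounds):
        improver_source = improve(improver_source, utility, llm_propose_rewrite)
    return improver_source
```

The point is that nothing in this loop retrains the underlying model; the turnaround time is set by how fast one can run and score candidate scaffoldings.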
Yes, the ULMFiT paper is one of the first papers using the notion of "pretraining" (it might be the one which actually introduces this terminology).
Then it appears in other famous 2018 papers:
Improving Language Understanding by Generative Pre-Training (Radford et al., June 2018)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
There are plenty of systems like this, and people will build more. But they don't do enough. So this will not preclude development of other kinds of systems...
Moreover, as soon as one has a system like this, it is trivial to write a wrapper making multiple calls to the system and having a memory.
And when one has many specialized systems like this, it is not difficult to write a custom system which takes turns calling many of them, having a memory, and having various additional properties.
As soon as one has a capability, it is usually not too difficult to build on top of that capability...
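To make the "wrapper with memory" point concrete, here is a toy sketch (my own illustration; `base_system` stands in for whatever specialized system is being wrapped):

```python
class MemoryWrapper:
    """Toy wrapper that feeds a running memory of past calls back into a base system."""

    def __init__(self, base_system):
        self.base = base_system
        self.memory = []

    def __call__(self, query: str) -> str:
        context = "\n".join(self.memory[-10:])       # last few interactions as context
        result = self.base(f"{context}\n{query}" if context else query)
        self.memory.append(f"Q: {query}\nA: {result}")
        return result

# Usage with a stand-in base system:
base = lambda prompt: f"processed {len(prompt)} chars"
wrapped = MemoryWrapper(base)
print(wrapped("first question"), "|", wrapped("follow-up"))
```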
I would feel better if I knew Ilya was back working at Superalignment.
Same here...
I wonder what is known about that.
It seems that as of mid-January OpenAI and Ilya were still discussing what is going to happen, and I have not seen any further information since then. There are tons of Manifold Ilya-related markets, and they also reflect high uncertainty and no additional information.
Actually, we do have a bit of new information today: Ilya is listed as one of the authors of OpenAI's new blog post commenting on Elon's lawsuit.
But yes, it would be really nice to find out more...
March 8 update, from the call with reporters: When Altman was asked about Sutskever’s status on the Zoom call with reporters, he said there were no updates to share. “I love Ilya... I hope we work together for the rest of our careers, my career, whatever,” Altman said. “Nothing to announce today.”
So, it seems this is still where it has been in mid-January (up in the air, with eventual resolution being uncertain).
On one hand, I never believed the standard "we are almost certainly in a simulation" argument. But since I first learned about Simulation Hypothesis, I thought it made sense to keep an open mind about it, and to keep my priors on this flexible and at a healthy distance from 0 and 1.
Janus' Simulator Theory did make me update my priors towards Simulation Hypothesis being more plausible, while still keeping my priors on this flexible and at a healthy distance from 0 and 1.
And the reason was that if any inference performed by a sufficiently advanced generative model was somewhat similar to a simulation, this eliminated the need for a specific motivation like a desire to create an ancestral simulation.
Instead, advanced entities would run all kinds of generative models for all kinds of reasons, and creating simulations of various flavors would be a side effect of many of those runs, and suddenly there are way more simulations out there than one would have from a desire to specifically create this or that specific simulation with particular properties...
Eliezer Yudkowsky: I’ve given up (actually never endorsed in the first place) the term “AI safety”; “AI alignment” is the name of the field worth saving. (Though if I can, I’ll refer to it as “AI notkilleveryoneism” instead, since “alignment” is also coopted to mean systems that scold users.)
I just use "AI existential safety".
It has exactly the same number of letters as "AI notkilleveryoneism" (counting the single space in "AI notkilleveryoneism" and 2 spaces in "AI existential safety" as letters).
Although I see that my resume still lists the following among my main objectives: "To contribute to AI safety research". I think I'll go ahead and insert "existential" there in order to disambiguate...
It's great, thanks!
Yes, there is also an earlier Microsoft Research paper which is, in some sense, more modest, but points more directly towards recursive self-improvement via the scaffolding+LLM combination generating better scaffolding for the same LLM, and then repeating this operation several times with better and better scaffolding.
One particularly interesting piece of data there is Figure 4 on page 6 which shows the dependency on the quality of the underlying LLM (the process actually does not work and leads to degradation with GPT-3.5, and the same process successfully self-improves for a few iterations (but then saturates) with GPT-4).
So one might ask how big this self-improvement might be with a better underlying LLM.
Orthogonally to the quality of the underlying LLM, I think it is not too difficult to improve methods for scaffolding generation quite a bit (there are various ways to make it much better than in these papers, even with the current LLMs). So one indeed wonders how soon this becomes a major contributor to the take-off speed...
If, to give it its own motivation, an ASI is built from the start as a human hybrid, we better all hope they pick the right human for the job!
Right.
Basically, however one slices it, I think that the idea that superintelligent entities will subordinate their interests, values, and goals to those of unmodified humans is completely unrealistic (and trying to force it is probably quite unethical, in addition to being unrealistic).
So what we need is for superintelligent entities to adequately take interests of "lesser beings" into account.
So we actually need them to have much stronger ethics compared to typical human ethics (our track record of taking the interests of "lesser beings" into account is really bad; if superintelligent entities end up having ethics as defective as typical human ethics, things will not go well for us).
How does it work? There is a technical report. Mostly it seems like OpenAI did standard OpenAI things, meaning they fed in tons of data, used lots of compute, and pressed the scaling button super hard. The innovations they are willing to talk about seem to be things like ‘do not crop the videos into a standard size.’
That does not mean there are not important other innovations. I presume that there are. They simply are not talking about the other improvements.
Actually, the key, most important thing they do disclose is that this is a Diffusion Transformer.
This architecture was introduced by William Peebles and Saining Xie in "Scalable Diffusion Models with Transformers", https://arxiv.org/abs/2212.09748
The first author is now a co-lead of Sora: https://www.wpeebles.com/ and https://twitter.com/billpeeb.
My take on Sora has been that its performance really seems to validate the central claim of their 2022 paper that Transformer-based diffusion models should work better than diffusion models based on older neural nets. This might have various implications beyond video generation. Intuitively, Transformers + Diffusion does feel like an attractive combination. The success of Sora might motivate people to try to use this combination more widely.
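To make "Transformer-based diffusion" concrete, here is a minimal sketch of the kind of block the DiT paper describes (my own simplification in PyTorch, not the authors' or OpenAI's code; the dimensions and the adaLN parameterization are illustrative):

```python
# Core DiT idea: run a standard Transformer over a sequence of latent patches,
# injecting the diffusion timestep via adaptive layer norm (adaLN) modulation.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Timestep embedding -> scale/shift/gate parameters for both sublayers.
        self.adaLN = nn.Linear(dim, 6 * dim)

    def forward(self, x, t_emb):
        s1, b1, g1, s2, b2, g2 = self.adaLN(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x

# Usage: x is a batch of patchified noisy latents, t_emb an embedding of the timestep.
x = torch.randn(2, 64, 256)        # (batch, patches, dim)
t_emb = torch.randn(2, 256)
print(DiTBlock()(x, t_emb).shape)  # torch.Size([2, 64, 256])
```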
I'd like to mention three aspects. The first two point to a somewhat optimistic direction, while the third one is very much in the air.
ASI(s) would probably explore and adopt some kind of ethics
Assuming that it is not a singleton (and also taking into account that a singleton also has an internal "society of mind"), ASIs would need to deal with various potential conflicting interests and viewpoints, and would also face existential risks of their own (very powerful entities can easily destroy their reality, themselves, and everything in the vicinity, if they are not careful).
It seems that some kind of ethics is necessary to handle complicated situations like this, so it is likely that ASIs will explore ethical issues (or they would need to figure out a replacement for ethics).
The question is whether what we do before ASI arrival can make things better (or worse) in this sense (I wrote a somewhat longer exploration of that last year: Exploring non-anthropocentric aspects of AI existential safety)
ASI(s) would probably have access to direct human and animal experiences
Here I am disagreeing with
- An ASI likely won’t have a human body and direct experiences of pain and pleasure and emotions - it won’t be able to “try things on” to verify if its reasoning on ethics is “correct”
The reason is that some ASIs are likely to be curious enough to explore hybrid consciousness with biological entities via brain-computer interfaces and such, and, as a result, would have the ability to directly experience the inner world of biological entities.
The question here is whether we should try to accelerate this path from our side (I tend to think that this can be done relatively fast via high-end non-invasive BCI, but risks associated with this path are pretty high).
There is still a gap
The previous two aspects do point in a somewhat optimistic direction (ASIs are likely to develop ethics or some equivalent, and they are likely to know how we feel inside, and we might also be able to assist these developments and probably should).
But this is still not enough for us. What would it take for this ethics to adequately take interests of humans into account? That's a rather long and involved topic, and I've seen various proposals, but I don't think we know (it's not like our present society is sufficiently taking interests of humans into account; we would really like the future to do better than that).
The easiest, most rapid way is probably via non-invasive BCI, but the risk management is, of course, non-trivial...
My answer has two parts.
On one hand, I prefer an agnostic position (that is, there is a multiverse of possible realities, some might be deterministic and some might be non-deterministic, and I don't really know which of the realities I am in (or even whether one can sometimes "migrate" between realities, so that these implicit hidden fundamental properties of a reality being deterministic or non-deterministic underneath might change through subjective time)). So if there is some demotivating aspect here that I really struggle with myself, it is how difficult it is to make real, tangible scientific and philosophical progress towards better understanding the world I am in.
On the other hand, when I really imagine the world being deterministic, I remind myself that my current state of elation/desperation/boredom/excitement/demotivation/being highly driven, myself currently writing this comment, and so on, is predetermined (under this assumption). The resulting reflection has interesting effects and changes my inner state quite a bit... It feels almost like a meditation-induced insight sometimes...
In some sense, my insufficient ability to believe in determinism is what interferes with my ability to have those nice meditation-like insights which change my inner state so much...
A visual match can quite easily show you that one match is stronger than another, but it doesn't tell you how good the match in question actually is. There are ways to measure whether something is a good match with numbers.
Yes, an independent reproduction would also evaluate whether their methodology is actually good in this sense (I can imagine all kinds of hidden methodological pitfalls).
I did not mean to give the impression that I had made up my mind about the outcome of this potential further exploration. I have made up my mind that it's worth further exploration, but I would not predict the results. Unfortunately, it is not all that easy to arrange. We do know that a neutral prior is important here, rather than having someone heavily leaning towards one side do the work, because there is always room for pushing results in one direction or another; for example, I've spent too much "quality time" with this paper to be considered a fully neutral person, although I would certainly make an effort to avoid bias if I were to do this work. Then one might be unsure how safe it would be to publish on this, even today, and so on.
The idea that a country would just spend that much research capital and do that without it leaving any trace in their research publications and other public communication seems farfetched.
They would report to the government (if the order to reproduce things comes from the government). The government would decide what to make public and what to keep for more restricted use. It's very natural (especially if the subject is potentially "dual-use", or, at least, is considered relevant to national defense).
Why are you using vague terms like "very high" instead of being specific about the numbers you consider to be "very high"?
It's a visual illustration, no? Visually this looks rather strong (Figures 1 and 2 on pages 16-17).
I don't think they had the virus. They had what was published, not the materials.
And I do think that China might have enough research manpower to just routinely reproduce all published findings from this kind of experiment if they want to (I don't know whether they actually do that; I would not be too surprised if they do it routinely, though; this depends on what their actual policies are; I have no means to investigate that).
(So, I would presume they would have reproduced it earlier, and were using the results of that reproduction for various things as needed without thinking much, and one of those subsequent things leaked.)
My impression is different. If what they say does reproduce, then to me it would look likely that the pandemic was the result of research that was partially reproducing this particular published 2008 work.
In any case, this would need to be publicly discussed, not hidden under covers. (I know some details of how the decisions whether to accept this for publication were made in some of the cases, and it has never been about the quality of the paper. When it was discussed explicitly, it was always about possible repercussions for the journal in question and for its editors; and when it was not discussed, it was always a technicality in the style of "in this particular case it looks like the venue profile is not a fit for this text", even when the venue had published on similar topics.)
To summarize the claim:
- this is the best match among all sequences ever submitted before the pandemic,
- and the critical region match is very high,
- and the critical region match is not high for any other sequences for which the overall match is approaching this one.
So, the claim is not just about this being the best match among all sequences ever submitted before the pandemic, it's quite a bit more than that.
It is rather obvious that Covid is not derived directly from this virus (the difference is too big), but it does look to me, and to many other readers of this preprint, although not to everyone, that the people who synthesized the original Covid were likely looking at the published 2008 sequence while doing their work.
All that is conditional on the claim actually reproducing (which is why it is very annoying that the database admins have made it more difficult to try to reproduce it, whether deliberately or accidentally).
There is enough here to discuss publicly, I think (and yes, enough room to disagree about the interpretation of the findings).
we need a breakthrough (or two)
The needed breakthroughs (on the scale of discovery of Transformers) are probably published already, but have been neglected and half-forgotten.
We have plenty of examples: backpropagation was discovered and rediscovered by many people and groups between 1970 and 1986, and was mostly ignored till the late 1980s. ReLU was known for decades, and its good properties were published in Nature in 2000; it was still ignored till approximately 2011. LSTMs made the need for something like residual connections obvious in 1997, yet the field waited till 2015 to apply this to very deep feedforward nets (highway nets and ResNets). And so on...
So there should be plenty of promising things hidden in the published literature and not widely known.
So it might be that all that is needed to surface those breakthroughs still buried in various lightly cited papers is a modestly competent automated AI researcher: one which can understand published papers, generate moderately competent ML code, comb the literature for promising ideas, and synthesize and run experiments based on various combinations of those ideas automatically. Can one implement a system like this based on GPT-4, as an intelligent wrapper of GPT-4? It's not clear, but overall we don't seem to be very far from being able to do something like this (perhaps we do need to wait for the next generation of LLMs, but the system able to do this does not have to be a superintelligence or even an AGI itself; it only needs limited, moderate competence to have a good chance of unearthing the required breakthroughs).
I've seen a preprint which is very close to being a smoking gun, although its claim needs to be independently reproduced. The preprint says that when the authors filtered the standard sequence database by date of submission (keeping only sequences which were submitted before the start of the pandemic, as opposed to merely being marked as discovered before the start of the pandemic), there is a clear match with a particular 2008 PNAS paper and the sequence published in connection with that 2008 paper.
If this reproduces, it would make it very likely that it was a lab leak, and moreover that the research involved was reproducing the synthetic virus published in 2008 (and would not have been possible without that 2008 publication). For obvious reasons, the paper has not been welcomed at all by established outlets, since it makes both the US and Chinese scientific establishments and their collective practices look really bad. Nevertheless, here it is: https://www.researchgate.net/publication/353031350_The_possible_laboratory_origins_of_SARS-CoV-2_the_likelihood_of_a_subsequent_deadlier_COVID_pandemic_and_necessity_to_introduce_blockchain_practices_for_verifying_and_tracking_scientific_data
To make sure those claims are correct, someone would need to independently reproduce its findings. It's not all that straightforward, because the people running the sequence database in question have since turned off the ability to filter by date of submission (which might have been done for related or unrelated reasons; I would not know for sure). However, I think one can still download the whole thing, run the needed searches locally, and let people know whether the claims do reproduce, although it would require some effort.
The 2008 paper which is supposedly involved is this one: https://www.pnas.org/doi/10.1073/pnas.0808116105
I think this is a good place for this link, thanks!
This is great, thanks!
I think the link to your initial post on Neuronpedia might be useful to create better context: Neuronpedia - AI Safety Game
Thanks for including the link in your edit.
One factor which is important to consider is how likely a goal or a value is to persist through self-improvements (those self-improvements might end up being quite radical, and also fairly rapid).
An arbitrary goal or value is unlikely to persist (this is why the "classical formulation of the alignment problem" is so difficult; the difficulties come from many directions, but the most intractable one is how to make it so that the desired properties are preserved during radical self-modifications). That's the main obstacle to asking AIs to research and implement this on their own as they get smarter and smarter. The question is always: "why would AIs keep caring about this?"
But there might be "natural properties" ("natural" values and goals) which AIs might want to preserve for their own reasons (because they might be interested in the world around them not being utterly destroyed, because they might be interested in existing in a reasonably comfortable and safe society, and so on). With such "natural properties" it might be easier to delegate to AIs the task of researching, implementing, and maintaining those properties, because AIs might have intrinsic reasons to keep caring even through drastic changes.
And then, of course, the question is: can one formulate such "natural properties" that a reasonable level of AI safety for humans would be a corollary to those "natural properties"?
But this is why "alignment" might be less than optimal terminology (because it tends to focus our attention on human-oriented properties and values which are unlikely to be invariant with respect to recursive self-improvements on their own, although they can be corollaries of properties which might be feasible to keep invariant).
Of course, there can be different approaches to finding those "natural properties" and making sure they hold through self-improvements; the paper I linked is just one of many of such possible approaches.
It's not clear if this ends up working as intended, but there are proposals to that effect.
For example, "Safety without alignment", https://arxiv.org/abs/2303.00752 proposes to explore a path which is closely related to what you are suggesting.
(It would be helpful to have a link to Tim Urban's article.)
This one might actually be doable without super-powerful AIs. Current progress in non-invasive brain-computer interfaces is rather impressive...
I do think people should organize and make it go faster, but this should be achievable with the current level of AI.
(Unlike practical immortality and eventually becoming God-like, which both do seem to require super-intelligence and which are what a lot of people really want: being able to personally know, understand, and experience all that is worth experiencing in the current world, and more beyond. This does require the power of super-intelligence.)
Altman says he isn’t sure what Ilya Sutskever’s exact employment status is. I feel like if I was CEO of OpenAI then I would know this.
This seems to be the author of that Axios article editorializing. In reality, here is what seems to have been said:
https://twitter.com/tsarnick/status/1747807508981514716 - 24 seconds of Sam talking (the summary is that Sam says he is hopeful that they'll find a way to keep working together; to my ear, he sounds like he is really, very emphatically, hoping for success in this sense, but he can't be sure of it yet, so they were still discussing this situation as of last week)
(Via a discussion at https://manifold.markets/Ernie/what-will-ilya-sursever-be-doing-on)
But I don't know to what extent productive studies in philosophy at the top level of competence in philosophy are at all compatible with safety concerns. It's not an accident that people using base models show nice progress in joint human-AI philosophical brainstorms, whereas people using tamed models seem to be saying that those models are not creative enough, and that those models don't think in sufficiently non-standard ways.
It might be a fundamental problem which might not have anything to do with human-AI differences. For example, Nietzsche is an important radical philosopher, and we need biological or artificial philosophers performing not just on that level but on a higher level than that, if we want them to properly address fundamental problems; yet Nietzsche is not "safe" in any way, shape, or form.
Thanks, that's very informative.
Humans have a history of making philosophical progress. We lack similar empirical evidence for AIs.
Hybrid philosophical discourse done by human-AI collaborations can be very good. For example, I feel that Janus has been doing very strong work in this sense with base models (so, not with RLHF'd, Constitutional, or otherwise "lesioned" and "mode-collapsed" models we tend to mostly use these days).
But, indeed, this does not tell us much about what AIs would do on their own.
Imagine that one day we or our descendants build or become superintelligent super-competent philosophers who after exhaustively investigating moral philosophy for millions of years, decide that some moral theory or utility function is definitely right.
But what is the reason to think that we or our descendants would have a better chance of finding this kind of "definitely right" moral theory or utility function than other AIs or their descendants?
In some sense, the point of OP is that the difference between "us" and "not-us" here might be more nebulous than we usually believe, and that a more equal treatment is called for.
Otherwise, one might also argue (in a symmetric fashion) that we would destroy moral option value by preventing other entities who might have a better chance of building or becoming "superintelligent super-competent philosophers" from having a shot at that...
There seem to be some mild technical problems.
This is the AI #47 post, and the previous one in this series is AI #45. AI #46 does not seem to exist here or on Substack.
Also, looking at this and other recent Zvi posts: they look like linkposts on GreaterWrong, with links to thezvi.wordpress.com, but they don't look like linkposts on LessWrong. It's a bit unusual in this sense...
My general sense is that there's more confidence in the plasma physics community that CFS will succeed than that Helion will succeed.
That is, indeed, an important indicator.
Otherwise, tokamaks being an old design works as an argument in the opposite direction for me (more or less along the following lines: tokamak design has been known for ages, and they still have not succeeded with it; perhaps an alternative and less tried design would have better chances, since at the very least it does not have the accumulated history of multi-decade-long delays associated with it).
(I guess, my assumption is that the mainstream plasma community has been failing us for a long time, feeding us more promises than actual progress for decade after decade, and that I would rather bet on something from the "left field" at this point, at least in terms of the chances to achieve commercial viability relatively soon, as opposed to the ability to attract funding or boost headcounts.)
Basically, yes, one thing we are comparing is their (Helion's and CFS's) respective 2024 and 2025 promises regarding Q>1. But more importantly from my viewpoint, Helion's promise to actually ship electricity to customers in 2028 does seem overoptimistic, though perhaps not outrageously so; whereas with tokamaks, what's our forecast for when they would have a chance to actually ship electricity to customers?