If I ask an AI to duplicate a strawberry and it takes over the world, I would consider that misaligned. Obviously duplicating a strawberry requires some instrumentally convergent behavior (acquiring resources, intelligence, etc.). An aligned AI should either duplicate the strawberry while staying within a "budget" for how many resources it consumes, or say "I'm sorry, I can't do that".
I would recommend you read my post on corrigibility which describes how we can mathematically define a tradeoff between success and resource exploitation.
As Zvi points out, what counts as "new information" depends on the person reading the message
Taking all the knowledge and writing and tendencies of the entire human race and all properties of the physical universe as a given, sure, this is correct. The response corresponds to the prompt, all the uniqueness has to be there.
Presumably when you look something up in an encyclopedia, it is because there is something that other people (the authors of the encyclopedia) know that you don't know. When you write an essay, the situation should be exactly opposite: there is something you know that the reader of the essay doesn't know. In this case the possibilities are:
1. The information I am trying to convey in the essay is strictly contained in the prompt (in which case GPT is just adding extra unnecessary verbiage)
2. GPT is adding information known to me but unknown to the reader of the essay (say I am writing a recipe for baking cookies and steps 1-9 are the normal way of making cookies but step 10 is something novel like "add cinnamon")
3. GPT is hallucinating facts unknown to the writer of the essay (theoretically these could still be true facts, but the writer would have to verify this somehow)
Nassim Taleb's point is that cases 1. and 3. are bad. Zvi's point is that case 2. is frequently quite useful.
As I described in my post I am looking for a reproducible example involving AI so that I can test a particular method for transforming an instrumentally convergent AI into a corrigible AI.
This sounds like exactly the type of thing I'm looking for.
Do you know, in the "convergent" case, is the score just for advancing the king, or is there a possibility that there's some boring math explanation like "as the number of steps calculated increases, the weights on the other positions (bishops, lines of attack, etc) overwhelm the score for the king advancing"?
If I built a DWIM AI, told it to win at CIV IV and it did this, I would conclude it was misaligned. Which is precisely why SpiffingBrit's videos are so fun to watch (huge fan).
If I were merely looking for examples of RL doing something unexpected, I would not have created the bounty.
I'm interested in the idea that AI trained on totally unrelated tasks will converge on the specific set of goals described in the article on instrumental convergence:

- Self-preservation: A superintelligence will value its continuing existence as a means to continuing to take actions that promote its values.
- Goal-content integrity: The superintelligence will value retaining the same preferences over time. Modifications to its future values through swapping memories, downloading skills, and altering its cognitive architecture and personalities would result in its transformation into an agent that no longer optimizes for the same things.
- Cognitive enhancement: Improvements in cognitive capacity, intelligence and rationality will help the superintelligence make better decisions, furthering its goals more in the long run.
- Technological perfection: Increases in hardware power and algorithm efficiency will deliver increases in its cognitive capacities. Also, better engineering will enable the creation of a wider set of physical structures using fewer resources (e.g., nanotechnology).
- Resource acquisition: In addition to guaranteeing the superintelligence's continued existence, basic resources such as time, space, matter and free energy could be processed to serve almost any goal, in the form of extended hardware, backups and protection.
I'm probably going to be a stickler about 2. "not with the goal in advance being to show instrumental convergence" meaning that the example can't be something written in response to this post (though I reserve the right to suspend this if the example is really good).
The reason being, I'm pretty sure that I personally could create such a gridworld simulation. But "I solved instrumental convergence in this toy example I created myself" wouldn't convince me as an outsider that anything impressive had been done.
What is wrong with the CIV IV example? Being surprising is not actually a requirement to test your theory against
I'm interested in cases where there is a correct non-power-maximizing solution. For winning at Civ IV, taking over the world is the intended correct outcome. I'm hoping to find examples like the Strawberry Problem, where there is a correct non-world-conquering outcome (duplicate the strawberry) and taking over the world (in order to e.g. maximize scientific research on strawberry duplication) is an unwanted side-effect.
When will instrumental convergence become a visible big problem?
If anyone knows of a real-world example of Instrumental Convergence, I will pay you $100
Modern AI systems (read: LLMs) don’t look like that, so Tammy thinks efforts to align them are misguided.
Starting your plan with "ignore all SOTA ML research" doesn't sound like a winning proposition.
The important thing is, your math should be safe regardless of what human values turn out to really be, but you still need lots of info to pin those values down.
I don't think even hardcore believers in the orthogonality thesis believe all possible sets of values are equally easy to satisfy. Starting with a value-free model throws away a lot of the baby with the bathwater.
Moreover, explicitly having a plan of "we can satisfy any value system" leaves you wide open for abuse (and goodharting). Much better to aim from the start for a broadly socially agreeable goal like curiosity, love or freedom.
Look at every possible computer program.
I'm sure the author is aware that Solomonoff Induction is non-computable, but this article still fails to appreciate how impossible it is to even approximate. We could build a galaxy-sized supercomputer and still wouldn't be able to calculate BB(7).
Overall, I would rate the probability of this plan succeeding at 0%
- Ignores the fact that there's no such thing in the real world as a "strong coherent general agent". Normally I scoff at people who say things like "the word intelligence isn't defined", but this is a case where the distinction between "can do anything a human can do" and "general purpose intelligence" really matters.
- Value-free alignment is simultaneously impossible and dangerous.
- Refuses to look "inside the box", thereby throwing away all practical tools for alignment.
- Completely ignores everything we've actually learned about AI in the last 4 years.
- Requires calculation of an impossible-to-calculate function.
- Generally falls into the MIRI/rationalist tarpit of trying to reason about intelligent systems without working with (or appreciating) actually working systems in the real world.
Governments don't consistently over-regulate. They consistently regulate poorly. For example, cracking down on illegal skateboarding but not on shoplifting or public consumption of drugs.
In AI, the predictable outcome of this is lots of regulations about AI bias and basically nothing that actually helps notkilleveryone.
From the roughly thirty conversations I remember having, below are the answers I remember getting.
Ugh. This list is exceptionally painful to read. Every one of your definitions suggests consciousness is something "out there" in the universe that can be measured and analyzed.
Consciousness isn't any of these things! Consciousness is the thing that you are experiencing right now in this very moment as you read this comment. It is the visceral subjective feeling of being connected to the world around you.
I don't want to go all Buddhist, but if someone asks "what is consciousness", the correct response is to un-ask the question. Stop trying to explain and just FEEL.
Close your eyes. Take a breath. Listen to the hum of background noise. Open your eyes. Look around you. Notice the tremendous variety of shapes and colors and textures. What can you smell? What do you feel? What are your feet and hands touching? Consciousness isn't any of these things. But if you pay attention to them, you might begin to grasp at the thing that it really is.
Before suggesting a new regulation regime for AI models, you must first show that it doesn't ban obviously beneficial technologies (e.g. printing press, GPT 3.5, nuclear energy).
The disanalogy is that our brains wouldn't add anything to sufficiently advanced AI systems
Being human is intrinsically valuable. For certain tasks, AI simply cannot replace us. Many people enjoy watching Magnus Carlsen play chess even though a $5 Raspberry Pi computer is better at chess than he is.
Similarly, there are more horses in the USA today than there were in the 1960s, decades after the Model T had displaced them as transportation.
Today, many people are weaker physically than in previous times because we don't need to do as much physical labor.
I haven't been able to find a definitive source, but I would be willing to bet that a typical "gym bro" is physically stronger than a typical hunter-gatherer due to better diet/training.
Can we crank this in reverse: given a utility function, design a market whose representative agent has this utility function?
It seems like, trivially, we could just have a market where a singleton agent has the desired utility function. But I wonder if there's some procedure by which we can "peel off" sub-agents within a market and end up with a market composed of the simplest possible sub-agents, for some metric of complexity.
Either there is some irreducible complexity there, or maybe there is a Universality theorem proving that we can express any utility function using a market of agents who all have some extremely simple finite state, similar to how we can show any form of computation can be expressed using Turing Machines.
At this point, humans have few incentives to gain any skills or knowledge, because almost everything would be taken care of by much more capable AIs.
This premise makes no sense. Humans have been using technology to augment their abilities for literally as long as there have been humans. Why, when AI appears, would this suddenly cease to be the case? This is like arguing that books will make humans dumber because we won't memorize things anymore.
The idea that moving faster now will reduce speed later is a bit counterintuitive. Here’s a drawing illustrating the idea:
Figure 2
One minor nitpick. Slow takeoff implies shorter timelines, so B should reach the top of the capabilities axis at a later point in time than A.
If you want to think intuitively about why this is true, consider that in our current world (slow takeoff) tools exist (like Codex and GPT-4) that accelerate AI research. If we simply waited to deploy any AI tools until we could build fully super-intelligent AGI, we would have more time overall.
Now, it might still be the case that "Time during takeoff is more valuable than time before, so it’s worth trading time now for time later", but it's still wrong to depict the graphs as taking the same amount of time to reach a certain level of capabilities.
This is helpful to think about when considering policies like the pause, as we are gaining some amount of time at the current level of AI development and sacrificing some (smaller) amount of time at a higher level of development. Even assuming no race dynamics, determining whether a pause is beneficial depends on the ratio of time-now to time-later and the relative value of those two kinds of time.
A reasonable guess is that algorithmic improvements matter about as much as Moore's Law, so effectively we are trading 6 months now for 3 months later.
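To see where the 6-to-3 conversion comes from, here's a toy model (my own sketch, assuming hardware and algorithmic progress contribute roughly equally to the overall rate of capability growth):

$$
g_{\text{total}} = g_{\text{hw}} + g_{\text{algo}}, \quad g_{\text{hw}} \approx g_{\text{algo}} \;\Rightarrow\; g_{\text{algo}} \cdot 6\ \text{months} \approx \tfrac{1}{2}\, g_{\text{total}} \cdot 6\ \text{months} = g_{\text{total}} \cdot 3\ \text{months}
$$

Six months of halted algorithmic work only amounts to about three months of total progress, because Moore's Law keeps running during the pause.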
Another important fact is that in the world of fast takeoff/pause, we arrive at the "dangerous capabilities level" with less total knowledge. If we assume some fungibility of AI capabilities/safety research, then all of the foregone algorithmic improvement means we have also foregone some safety related knowledge.
I tell you "the most likely token is unlocked with probability .04, the second-most likely is achieved with probability .015, and...", and I'm basically right.[1]That happens over hundreds of diverse validation prompts.
How is this a relevant metric for safety at all? Can you, for example, do this for obviously safe models like a new fine-tune of Llama 7B? Wouldn't computational irreducibility imply this is basically impossible? If banning all new models is your goal, why not just do that instead of making up a fake, impossible-to-achieve metric?
As a modest proposal:
Before suggesting a new regulation regime for AI models, you must first show that it doesn't ban obviously beneficial technologies (e.g. printing press, GPT 3.5, nuclear energy).
I think maybe you misunderstand the word "crux". A crux is a point where you and another person disagree. If you're saying you can't understand why Libertarians think centralization is bad, that IS a crux, and trying to understand it would be a potentially useful exercise.
Centralization of power (as is likely to result from many possible government interventions) is bad
Suppose that you expected AI research to rapidly reach the point of being able to build Einstein/Von Neumann level intelligence and thereafter rapidly stagnate. In this world, would you be able to see why centralization is bad?
It seems like you're not doing a very good Ideological Turing Test if you can't answer that question in detail.
The #7 supercomputer is built using natively developed Chinese chips, which still demonstrates the quality/quantity tradeoff.
Also, saying "sanctions will bite in the future" is only persuasive if you have long timelines (and expect sanctions to hold up over those timelines). If you think AGI is imminent, or you think sanctions will weaken over time, future sanctions matter less.
China does not have access to the computational resources[1] (compute, here specifically data centre-grade GPUs) needed for large-scale training runs of large language models.
While it's true that Chinese semiconductor fabs are a decade behind TSMC (and will probably remain so for some time), that doesn't seem to have stopped them from building 162 of the top 500 largest supercomputers in the world.
There are two inputs to building a large supercomputer: quality and quantity, and China seems more than willing to make up in quantity what they lack in quality.
The CCP is not interested in reaching AGI by scaling LLMs.
For a country that is "not interested" in scaling LLMs, they sure do seem to do a lot of research into large language models.
I suspect that "China is not racing for AGI" will end up in the same historical bin as "Russia has no plans to invade Ukraine", a claim that requires us to believe the Chinese stated preferences while completely ignoring their revealed ones.
I do agree that if the US and China were both racing, the US would handily win the race given current conditions. But if the US stops racing, there's absolutely no reason to think the Chinese response would be anything other than "thanks, we'll go ahead without you".
--edit--
If a Chinese developer ever releases an LLM that is so powerful it inevitably oversteps censorship rules at some point, the Chinese government will block it and crack down on the company that released it.
This is a bit of a weird take to have if you are worried about AGI Doom. If your belief is "people will inevitably notice that powerful systems are misaligned and refuse to deploy them", why are you worried about Doom in the first place?
Is the claim that China, due to its all-powerful bureaucracy, is somehow less prone to alignment failure and sharp left turns than reckless American corporations? If so, I suggest you consider the possibility that Xi Jinping isn't all that smart.
As I'd mentioned, we often apply non-LPE-based environment-solving to constrain the space of heuristics over which we search, as in the tic-tac-toe and math examples. Indeed, it seems that scientific research would be impossible without that.
LPE-based learning does not work in domains where failure is lethal, by definition. However, we have some success navigating them anyway.
I think this is a strawman of LPE. People who point out that you need real-world experience don't say that you need zero theory; they say that you have to have some contact with reality, even in deadly domains.
Outside of a handful of domains like computer science and pure mathematics, contact with reality is necessary because the laws of physics dictate that we can only know things up to a limited precision. Moreover, it is the experience of experts in a wide variety of domains that "try the thing out and see what happens" is a ridiculously effective heuristic.
Even in mathematics, the one domain where LPE should in principle be unnecessary, trying things out is one of the main ways that mathematicians gain intuitions for what new results are/aren't likely to hold.
I also note that your post doesn't give a single example of a major engineering/technology breakthrough that was done without LPE (in a domain that interacts with physical reality).
It is, in fact, possible to make strong predictions about OOD events like AGI Ruin — if you've studied the problem exhaustively enough to infer its structure despite lacking the hands-on experience.
This is literally the one specific thing LPE advocates think you need to learn from experience about, and you're just asserting it as true?
To summarize:
Domains where "pure thought" is enough:
toy problems
limited/no interaction with the real world
solution/class of solutions known in advance
Domains where LPE is necessary:
too complicated/messy to simulate
depends on precise physical details of the problem
even a poor approximation to solution not knowable in advance
But I don't think this future is terribly likely. It's either human annihilation or a massive cosmic endowment of wealth. The idea that we somehow end up on the knife-edge of survival, with our resources slowly dwindling, requires that r* be fine-tuned to exactly 0.
You are correct. Free trade in general produces winners and losers, and while on average people become better off, there is no guarantee that individuals will become richer absent some form of redistribution.
In practice humans have the ability to learn new skills/shift jobs so we mostly ignore the redistribution part, but in an absolute worst case there should be some kind of UBI to accommodate the losers of competition with AGI (perhaps paid out of the "future commons" tax).
you should expect to update in the direction of the truth as the evidence comes in
I think this was addressed later on, but this is not at all true. With the waterfall example, every mile that passes without a waterfall you update downwards, but if there's a waterfall at the very end you've been updating against the truth the whole time.
Another case. Suppose you're trying to predict the outcome of a Gaussian random walk. Each step, you update whatever direction the walk took, but 50% of these steps are "away" from the truth.
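Here's a quick toy simulation of the random-walk case (my own sketch; the exact fraction depends on walk length, but it comes out close to half):

```python
import random

# Predict the endpoint of a driftless Gaussian random walk.
# At each step the best prediction is the current position, so every increment
# "updates" the prediction -- and roughly half of those updates move it
# further away from the walk's eventual endpoint.
random.seed(0)
trials, steps = 2000, 200
away = total = 0
for _ in range(trials):
    positions = [0.0]
    for _ in range(steps):
        positions.append(positions[-1] + random.gauss(0, 1))
    endpoint = positions[-1]
    for before, after in zip(positions[:-1], positions[1:]):
        total += 1
        if abs(endpoint - after) > abs(endpoint - before):
            away += 1
print(f"fraction of updates that moved away from the endpoint: {away / total:.2f}")
```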
you probably shouldn’t be able to predict that this pattern will happen to you in the future.
Again addressed later on, but one can easily come up with stories in which one predictably updates either "in favor of" or "against" AI doom.
Suppose you think there's a 1% chance of AI doom every year, and AI Doom will arrive by 2050 or never. Then you predictably update downwards every year (unless Doom occurs).
Suppose on the other hand that you expect AI to stagnate at some level below AGI, but if AGI is developed then Doom occurs with 100% certainty. Then each year AI fails to stagnate you update upwards (until AI actually stagnates).
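To make the first story concrete, the arithmetic looks like this (a toy sketch of the 1%-per-year example; I start from 2024 as an arbitrary "now"):

```python
# Doom arrives with 1% probability in each year through 2050, or never.
# Conditional on surviving to the start of a given year, the remaining
# probability of doom-by-2050 shrinks -- a predictable downward update.
hazard = 0.01
for year in range(2024, 2051):
    years_left = 2050 - year + 1
    p_doom = 1 - (1 - hazard) ** years_left
    if year in (2024, 2030, 2040, 2050):
        print(f"{year}: P(doom by 2050 | survived so far) = {p_doom:.3f}")
```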
F(a) is the set of futures reachable by agent a at some initial t=0. F_b(a) is the set of futures reachable at time t=0 by agent a if agent b exists. There's no way for F_b(a) to be larger than F(a), since creating agent b is, under our assumptions, one of the things agent a can do.
Clearly there's some tension between "I want to shut down if the user wants me to shut down" and "I want to be helpful so that the user doesn't want to shut me down", but I don't think weak indifference is the correct way to frame this tension.
As a gesture at the correct math, imagine there's some space of possible futures and some utility function related to the user request. Corrigible AI should define a tradeoff between the number of possible futures its actions affect and the degree to which it satisfies its utility function. Maximum corrigibility {C=1} is the do-nothing state (no effect on possible futures). Minimum corrigibility {C=0} is maximizing the utility function without regard to side-effects (with all the attendant problems such as convergent instrumental goals, etc). Somewhere between C=0 and C=1 is useful corrigible AI. Ideally we should be able to define intermediate values of C in such a way that we can be confident the actions of corrigible AI are spatially and temporally bounded.
The difficulty principally lies in the fact that there's no such thing as "spatially and temporally bounded". Due to the Butterfly Effect, any action at all affects everything in the future light-cone of the agent. In order to come up with a sensible notion of boundedness, we need to define some kind of metric on the space of possible futures, ideally in terms like "an agent could quickly undo everything I've just done". At this point we've just recreated agent foundations, though.
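As a slightly more explicit gesture (my notation, not a worked-out proposal): let $U$ be the task utility, let $f_\pi$ be the distribution over futures if the agent follows policy $\pi$, let $f_\varnothing$ be the distribution if it does nothing, and let $D$ be some metric on distributions over futures. Then the corrigibility knob $C \in [0,1]$ weights the tradeoff:

$$
\pi_C^{*} = \arg\max_{\pi}\; (1 - C)\,\mathbb{E}\!\left[U(\pi)\right] - C\, D\!\left(f_{\pi}, f_{\varnothing}\right)
$$

At $C=1$ the do-nothing policy is optimal (zero impact); at $C=0$ we recover unconstrained utility maximization with all the attendant problems. All of the hard work is hidden in choosing $D$, which is exactly the "metric on the space of possible futures" problem above.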
I don't think we want corrigible agents to be indifferent to being shut down. I think corrigible agents should want to be shut down if their users want to shut them down.
I'm not particularly interested in arguing about this 1 video. I want to know where are the other 4999 videos.
There are internal military investigations. The military released some data but not all that it has. The military doesn't like Russia/China to learn about its exact camera capabilities, so it doesn't seem to publicly release its highest-resolution videos.
The military is very bad at keeping secrets. And surely not all 5000 of the highly believable UFO reports occurred within the US military.
I am aware of this one video of a blurry blob showing up on radar. What I am not aware of is 5000 UFO sightings with indisputable physical evidence.
Where are the high resolution videos? Where are the spectrographs of "impossible alien metals"? Where are the detailed studies of the time and location of each encounter, treating it as an actual scientific phenomenon?
Basically, where are the 5000 counterexamples to this comic?
Game theory says that humans need to work in coalitions and make allies because no individual human is that much more powerful than any other. With agents that can self improve and self replicate, I don't think that holds.
Even if agents can self-replicate, it makes no sense to run GPT-5 on every single microprocessor on Earth. This implies we will have a wide variety of different agents operating across fundamentally different scales of "compute size". For math reasons, the best way to coordinate a swarm of compute-limited agents is something that looks like free-market capitalism.
One possible worry is that humans will be vastly out-competed by future life forms. But we have a huge advantage in terms of existing now. Compounding interest rates imply that anyone alive today will be fantastically wealthy in a post-singularity world. Sure, some people will immediately waste all of that, but as long as at least some humans are "frugal", there should be more than enough money and charity to go around.
I don't really have much to say about the "troublemaker" part, except that we should do the obvious things and not give AI command and control of nuclear weapons. I don't really believe in gray-goo or false-vacuum or anything else that would allow a single agent to destroy the entire world without the rest of us collectively noticing and being able to stop them (assuming cooperative free-market supporting agents always continue to vastly [100x+] outnumber troublemakers).
- Actually intelligent human (smart grad student-ish)
- Von Neumann (smartest human ever)
- Super human (but not yet super-intelligent)
- Super-intelligent
- Dyson sphere of computronium???
By the time we get the first Von Neumann-level AGI, every human on earth is going to have a team of thousands of AutoGPTs working for them. The person who builds the first Von Neumann-level AGI doesn't get to take over the world because they're outnumbered 70 trillion to one.
The ratio is a direct consequence of the fact that it is much cheaper to run an AI than to train one. There are also ecological reasons why weaker agents will out-compute stronger ones. Big models are expensive to run and there's simply no reason why you would use an AI that costs $100/hour to run for most tasks when one that costs literally pennies can do 90% as good of a job. This is the same reason why bacteria >> insects >> people. There's no method whereby humans could kill every insect on earth without killing ourselves as well.
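To put rough numbers on "cheaper to run than to train" (a back-of-the-envelope sketch using the standard FLOP approximations and illustrative GPT-3-scale figures):

```python
# Standard approximations: training ~ 6 * params * tokens FLOPs,
# inference ~ 2 * params FLOPs per generated token.
params = 175e9         # illustrative GPT-3-scale parameter count
train_tokens = 300e9   # illustrative training-set size

train_flops = 6 * params * train_tokens   # ~3e23 FLOPs for one training run
flops_per_token = 2 * params              # ~3.5e11 FLOPs per generated token

print(f"one training run:    {train_flops:.1e} FLOPs")
print(f"one generated token: {flops_per_token:.1e} FLOPs")
print(f"one training run costs as much as ~{train_flops / flops_per_token:.0e} generated tokens")
```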
See also: why AI X-risk stories always postulate magic like "nano-technology" or "instantly hack every computer on earth".
I'm claiming we never solve the problem of building AIs that "lase" in the sense of being able to specify an agent that achieves a goal at some point in the far future. Instead we "stumble through" by iteratively making more and more powerful agents that satisfy our immediate goals, and game theory/ecological considerations mean that no single agent ever takes control of the far future.
No one cares for the same reason no one (in the present) cares that AI can now pass the Turing Test. By the time we get there, we are grappling with different questions.
"Can you define an AI model that preserves a particular definition of human values under iterative self-improvement" simply won't be seen as the question of the day, because by the time we can do it, it will feel "obvious" or "unimpressive".
I think that to the extent we need to answer "what algorithm?" style questions, we will do it with techniques like this one where we just have the AI write code.
But I don't think "what algorithm?" is a meaningful question to ask regarding "Modern Disney Style", the question is too abstract to have a clean-cut definition in terms of human-readable code. It's sufficient that we can define and use it given a handful of exemplars in a way that intuitively agrees with humans perception of what those words should mean.
I read your post and it does not describe a way for us to "directly design desirable features" in our current ML paradigm. I think "current ML is very opaque" is a very accurate summary of our understanding of how current ML systems perform complicated cognitive tasks. (We've gotten about as far as figuring out how a toy network performs modular addition.)
How familiar are you with LoRAs, textual inversion, latent space translations and the like? Because these are all techniques invented within the last year that allow us to directly add (or subtract) features from neural networks in a way that is very easy and natural for humans to work with. Want to teach your AI what "modern Disney style" animation looks like? Sounds like a horribly abstract and complicated concept, but we can now explain to an AI what it means in a process that takes <1hr, requires a few megabytes of storage, and can be reused across a wide variety of neural networks. This paper in particular is fantastic because it allows you to define "beauty" in terms of "I don't know what it is, but I know it when I see it" and turn it into a concrete representation.
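For anyone who hasn't seen how lightweight this is in practice, here's roughly what loading a style LoRA looks like with the Hugging Face diffusers library (a sketch; the LoRA repo name is a hypothetical placeholder for whatever style adapter you've trained):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a base text-to-image model, then graft on a small LoRA that encodes a
# visual style learned from a handful of exemplar images.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("your-username/modern-disney-style-lora")  # hypothetical repo

image = pipe("a corgi wearing a top hat, modern disney style").images[0]
image.save("corgi.png")
```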
"Hack every AI on the planet" sounds like a big ask for an AI that will have a tiny fraction (<1%) of the world's total computing power at its disposal.
Furthermore, it has to do that retroactively. The first super-intelligent AGI will be built by a team of a million Von Neumann-level AGIs who are working their hardest to prevent that from happening.
At the single moment of maximum change in the transition from Human to Artificial Intelligence, we collectively agree that the outcome was "good".
We never actually "solve" the alignment problem (in the EY sense of writing down a complete set of human values and teaching them to the AI). Instead, we solve the Alignment problem by doing the hard work of engineering AI that does what we want and riding the wave all the way to the end of the S-curve.
edit:
I mean, maybe we DO that, but by the time we do no one cares.
It's possible that we should already be thinking of GPT-4 as "AGI" on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about "STEM-level AGI", though I expect such systems to be good at non-STEM tasks too.
This motte/bailey shows up in almost every argument for AI doom. People point out that current AI systems are obviously safe, and the response is "not THOSE AI systems, the future much more dangerous ones." I don't find this kind of speculation helpful or informative.
Human bodies and the food, water, air, sunlight, etc. we need to live are resources ("you are made of atoms the AI can use for something else"); and we're also potential threats (e.g., we could build a rival superintelligent AI that executes a totally different plan).
This is an incredibly generic claim to the point of being meaningless. There is a reason why most human plans don't start with "wipe out all of my potential competitors".
Current ML work is on track to produce things that are, in the ways that matter, more like "randomly sampled plans" than like "the sorts of plans a civilization of human von Neumanns would produce". (Before we're anywhere near being able to produce the latter sorts of things.)
If by current ML you mean GPT, then it is LITERALLY trained to imitate humans (by predicting the next token). If by current ML you mean something else, say what you mean. I don't think anyone is building an AI that randomly samples plans, as that would be exceedingly inefficient.
The key differences between humans and "things that are more easily approximated as random search processes than as humans-plus-a-bit-of-noise" lies in lots of complicated machinery in the human brain.
See the above objection. But also, just because humans are complex biological things doesn't mean approximating our behaviors is similarly complex.
STEM-level AGI timelines don't look that long (e.g., probably not 50 or 150 years; could well be 5 years or 15).
Sure. But I don't think there's some magical "level" where AI suddenly goes Foom. STEM-level AI will seem somewhat unimpressive ("hasn't AI been able to do that for years?") when it finally arrives. AI is already writing code, designing experiments, etc. There's nothing special about STEM level as you've defined it. BabyAGI could already be "STEM level" and it wouldn't change the fact that it's currently not remotely a threat to humanity.
We don't currently know how to do alignment, we don't seem to have a much better idea now than we did 10 years ago, and there are many large novel visible difficulties.
We should be starting with a pessimistic prior about achieving reliably good behavior in any complex safety-critical software, particularly if the software is novel.
You can start with whatever prior you want. The point of being a good Bayesian is updating your prior when you get new information. What information about current ML state of the art caused you to update this way?
Neither ML nor the larger world is currently taking this seriously, as of April 2023.
The people taking this "seriously" don't seem to be doing a better job than the rest of us.
As noted above, current ML is very opaque, and it mostly lets you intervene on behavioral proxies for what we want, rather than letting us directly design desirable features.
Again, I'm begging you to actually learn something about the current state of the art instead of making claims!
There are lots of specific abilities which seem like they ought to be possible for the kind of civilization that can safely deploy smarter-than-human optimization, that are far out of reach, with no obvious path forward for achieving them with opaque deep nets even if we had unlimited time to work on some relatively concrete set of research directions.
C'mon! You're telling me that you have a specific list of capabilities that you know a "surviving civilization" would be developing and you chose to write this post instead of a detailed explanation of each and every one of those capabilities?
Interesting to note that BabyAGI etc. are architecturally very similar to my proposed Bureaucracy of AIs. As noted, interpretability is a huge plus of this style of architecture. It should also be much less prone to Goodharting than an AlphaStar-style RL agent. Now we just need to work on installing some morality cores.
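For concreteness, the kind of loop I have in mind looks something like this (a hypothetical sketch, not BabyAGI's actual code; `planner`, `executor`, and `morality_core` stand in for separate LLM calls or human review):

```python
from collections import deque

def run_bureaucracy(objective, planner, executor, morality_core, max_steps=100):
    """BabyAGI-style loop: plan, vet, execute, repeat.

    planner(objective, results) -> list of new task strings
    executor(task) -> result string
    morality_core(task) -> True if the task is acceptable to run
    """
    tasks = deque([objective])
    results = []
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()
        if not morality_core(task):          # interpretable veto point
            results.append((task, "VETOED"))
            continue
        result = executor(task)              # e.g. an LLM call or a tool use
        results.append((task, result))
        tasks.extend(planner(objective, results))  # decompose follow-up work
    return results
```

Every task and every veto is a human-readable string, which is where the interpretability win comes from.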
It seems like this problem has an obvious solution.
Instead of building your process like this:
optimize for good agent -> predict what they will say -> predict what they will say -> ... ->
Build your process like this:
optimize for good agent -> predict what they will say -> optimize for good agent -> predict what they will say -> optimize for good agent -> predict what they will say -> ...
If there's some space of "Luigis" that we can identify (e.g. with RLHF) surrounded by some larger space of "Waluigis", just apply optimization pressure at every step to make sure we stay in the "Luigi" space instead of letting the process wander out into the Waluigi space.
Note that the Bing "fix" of not allowing more than 6 replies partially implements this by giving a fresh start in the "Luigi" space periodically.
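Here's a minimal sketch of the interleaved version (hypothetical helper names: `generate_reply` is the base predictor, `project_to_luigi` is whatever optimization pressure keeps us in Luigi space, e.g. an RLHF-tuned re-anchoring or a fresh system prompt, and `looks_like_luigi` is a learned classifier):

```python
def run_dialogue(user_turns, generate_reply, project_to_luigi, looks_like_luigi):
    """Interleave 'optimize for good agent' with 'predict what they will say'."""
    history = []
    for user_msg in user_turns:
        history = project_to_luigi(history)   # optimize for good agent
        history.append(("user", user_msg))
        reply = generate_reply(history)       # predict what they will say
        if not looks_like_luigi(reply):       # drifted toward Waluigi space?
            # Hard reset, analogous to Bing's reply cap, then try again.
            history = project_to_luigi([]) + [("user", user_msg)]
            reply = generate_reply(history)
        history.append(("assistant", reply))
    return history
```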