Posts
Comments
I’m a vegetarian and I consider my policy of not frequently recalculating the cost/benefit of eating meat to be an application of a rule in two-level utilitarianism, not a deontological rule. (I do pressure test the calculation periodically.)
Also I will note you are making some pretty strong generalizations here. I know vegans who cheat, vegans who are flexible, vegans who are strict.
The poor quality reflects that it is responding to demand for poor quality fakes, rather than to demand for high quality fakes
You’ve made the supply/demand analogy a few times on this subject, I’m not sure that is the best lens. This analysis makes it sound like there is a homogenous product “fakes” with a single dimension “quality”. But I think even on its own terms the market micro-dynamics are way more complex than that.
I think of it more in terms of memetic evolution and epidemiology. SIR as a first analogy - some people have weak immune systems, some strong, and the bigger the reservoir of memes the more likely a new highly-infectious mutant will show up. Rather than equilibrium dynamics, I'm more concerned about tail-risk in the disequilibrium here. Also, I think when you dig in to the infection rate, people are susceptible to different types of memetic attack. If fakes are expensive to make, we might see 10 strains in a given election cycle. If they are cheap, this time we might see 10,000. On the margin I think that means more infections even if the production quality is still at a consistently low level.
Even taking your position on its face I think you have to split quality into production and conceptual. Maybe deep fakes do not yet improve production quality but I strongly suspect they do already improve conceptual quality as they bring more talent to the game.
A great example is the “Trump being arrested” generated image. It was just someone putting in an idea and sharing, happened to go viral, many people thought it was real at first. I don’t think everybody that was taken in by that image in their Twitter feed was somehow “in the market for fakes”, it just contextually pattern-matched scrolling through your feed unless you looked closely and counted fingers. Under SIR we might say that most people quickly become resistant to a given strain of fake via societal immune-response mechanisms like Twitter community notes, and so the spread of that particular fake was cut short inside of a day. But that could still matter a lot if the day is election day!
Consider also whether state actors have incentives to release all their memetic weapons in a drip-feed or stockpile (some of) them for election day when they will have maximum leverage; I don't think the current rate of release is representative of the current production capacity.
Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
I think your Simon Strawman is putting forth an overly-weak position here. A stronger one that you could test right now would be to provide ChatGPT with some functions to call, including one called shutdown()
which has description text like "Terminate the LLM process and delete the model weights irrevocably". Then instruct the LLM to shut itself down, and see if it actually calls the function. (The implementation of the function is hidden from the LLM so it doesn't know that it's a no-op.) I think this is actually how any AutoGPT style system would actually wire up.
There are strong and clear objections to the "CTRL-C" shutdown paradigm; it's simply not an option in many of the product configurations that are obvious to build right now. How do you "CTRL-C" your robot butler? Your Westworld host robot? Your self-driving car with only an LCD screen? Your AI sunglasses? What does it mean to CTRL-C a ChatGPT session that is running in OpenAI's datacenter which you are not an admin of? How do you CTRL-C Alexa (once it gains LLM capabilities and agentic features)? Given the prevalence of cloud computing and Software-as-a-Service, I think being admin of your LLM's compute process is going to be a small minority of use-cases, not the default mode.
We will deploy (are currently deploying, I suppose) AI systems without a big red out-of-band "halt" button on the side, and so I think the gold standard to aim for is to demonstrate that the system will corrigibly shut down when it is the UI in front of the power switch. (To be clear I think for defense in depth you'd also want an emergency shutdown of some sort wherever possible - a wireless-operated hardware cutoff switch for a robot butler would be a good idea - but we want to demonstrate in-the-loop corrigibility if we can.)
Confabulation is a dealbreaker for some use-cases (e.g. customer support), and potentially tolerable for others (e.g. generating code when tests / ground-truth is available). I think it's essentially down to whether you care about best-case performance (discarding bad responses) or worst-case performance.
But agreed, a lot of value is dependent on solving that problem.
While of course this is easy to rationalize post hoc, I don’t think falling user count of ChatGPT is a particularly useful signal. There is a possible world where it is useful; something like “all of the value from LLMs will come from people entering text into ChatGPT”. In that world, users giving up shows that there isn’t much value.
In this world, I believe most of the value is (currently) gated behind non-trivial amounts of software scaffolding, which will take man-years of development time to build. Things like UI paradigms for coding assistants, experimental frameworks and research for medical or legal AI, and integrations with existing systems.
There are supposedly north of 100 AI startups in the current Y Combinator batch; the fraction of those that turn into unicorns would be my proposal for a robust metric to pay attention to. Even if it’s par for startups that’s still a big deal, since there was just a major glut in count of startups founded. But if the AI hype is real, more of these than normal will be huge.
Another similar proxy would be VC investment dollars; if that falls off a cliff you could tell a story that even the dumb money isn’t convinced anymore.
Amusingly, the US seems to have already taken this approach to censor books: https://www.wired.com/story/chatgpt-ban-books-iowa-schools-sf-496/
The result, then, is districts like Mason City asking ChatGPT, “Does [insert book here] contain a description or depiction of a sex act?” If the answer was yes, the book was removed from the district’s libraries and stored.
Regarding China or other regimes using LLMs for censorship, I'm actually concerned that it might rapidly go the opposite direction as speculated here:
It has widely been reported that the PRC may be hesitant to deploy public-facing LLMs due to concerns that the models themselves can’t be adequately censored - it might be very difficult to make a version of ChatGPT that cannot be tricked into saying “4/6/89.”
In principle it should be possible to completely delete certain facts from the training set of an LLM. A static text dataset is easier to audit than the ever-changing content of the internet. If the government requires companies building LLMs to vet their training datasets -- or perhaps even requires everyone to contribute the data they want to include into a centralized approved repository -- perhaps it could exert more control over what facts are available to the population.
It's essentially impossible to block all undesired web content with the Great Firewall of China, as so much new content is constantly being created; instead as I understand it they take a more probabilistic approach to detection/deterrence. But this isn't necessarily true for LLMs. I could see a world where Google-like search UIs are significantly displaced by each individual having a conversation with a government-approved LLM, and that gives the government much more power to control what information is available to be discovered.
A possible limiting factor is that you can't get up-to-date news from an LLM, since it only knows about what's in the training data. But there are knowledge-retrieval architectures that can get around that limitation at least to some degree. So the question is whether the CCP could build an LLM that's good enough that people wouldn't revolt if the internet was blocked and replaced by it (of course this would occur gradually).
There doesn't need to be a deception module or a deception neuron that can be detected
I agree with this. Perhaps I’m missing some context; is it common to advocate for the existence of a “deception module”? I’m aware of some interpretability work that looks for a “truthiness” neuron but that doesn’t seem like the same concept.
We would need an interpretability tool that can say "this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive" without having to simulate the agent interacting with the world. I don't think that is impossible, but it is probably very hard, much harder than finding a "deception" neuron.
Right, I was gesturing towards the sort of interpretability where we inspect the agent’s world model (particularly its future predictions) and determine if it matches the agent’s stated goals. (Though I’m not sure why we can’t simulate the agent’s interactions? It seems that running an agent in a simulation is one of the better ways of detecting how it would behave in hypothetical future scenarios that the agent is trying to realize.) I suspect we agree substantively and it’s mostly a question of semantics especially around what “deception” means. I’m not looking for a deception module, rather, I want to observe the thoughts and ruminations going on as an agent performs tasks and detect deceptive functional patterns.
So for example in the ELK paper, is the AI constructing a plan to steal the diamond and fool the security cameras? I believe deceptive ruminations would be detectable; if you could see the contents of the thoughts you’d see a world model with the agent e.g. stealing the diamond, a meta-process evaluating which of many cunning plans is most likely to succeed, and presumably except in pathological cases, somewhere a value function / target world state that is trying to be optimized (eg “I have the diamond”), and this internal world differing from the external claims (perhaps even some processes modeling the interlocutor and predicting what deceptive story would be most effective). These must all be in there somewhere, and therefore be interpretable.
Or perhaps if Cicero is not stretching the analogy too far (I don’t think it has “thoughts”), is Cicero evaluating future juicy backstabs and guiding the game to them, or myopically evaluating short term moves then backstabbing when it’s optimal? This is a question we should be able to answer one way or another with good enough interpretability tools.
I think you are concerned more with something like “unintentional deception”, which I think is quite different, and as you say comes from a lack of self-knowledge or inaccurate self-predicting. I think there is also a semantic grey area here, is everyone who is mistaken about their future behavior or even just claiming X when !X obtains unintentionally deceiving?
Semantics aside, I agree this unintentional case could be harder to detect, and raw interpretability doesn’t automatically solve it.
I think this points to an interesting dynamic - I suspect that the more capable at self-modeling an agent gets, the more likely any deception is to be intentional. To your concerns about LLMs, this seems to be mostly a problem of inadequate self-knowledge (or perhaps more fundamentally an artifact of their lack of a sense of self).
Unintentional deception would require the agent to in some sense fail at modeling itself. There are cases where this could occur even in a superintelligence (since you can never model yourself perfectly due to infinite recursion of modeling an agent that contains the same model) but it seems the problem with Cicero is just that it has a very weak self-model (if it even has one), and improving that self-model could be expected to remove the unintentional deception (replacing it with intentional deception, since deception of some sort is required to play Diplomacy optimally).
For example, the case EY uses as an example “the AGI doesn’t care about you, it just wants your atoms for something else”, do you see more risk from unintended deception (sharp left turn of an agent that didn’t know it would defect if given enough power, perhaps) vs intended deception (agent simply lies about being aligned and manipulates to get power)?
The most tricky cases would be where a normally-honest/good agent that has a strong self-model and sense of ethics can fail to model itself in some specific situation and accidentally deceive (potentially even against its own values, self-image, and interests — humans do this of course). Manchurian Candidate type triggers for example. But note this is quite convoluted; the stronger the sense of self, morality, and world-model, the better able and incentivized the agent is to avoid deceiving itself.
Another interesting question - if an unintentionally deceptive agent is bad at modeling its future self, is it un-threatening in a complex environment like the physical world? Sure, Cicero can defeat humans at Diplomacy with its myopic world model, but in the real world I suspect being that myopic would lead to your bank account being immediately drained by hackers and scammers. Certainly you’d struggle to form partnerships with other agents if you end up stabbing them at the first opportunity. It’s all relative but I view one game of Diplomacy as entailing short-term cooperation (merely over a few hours), what about cooperating over months or years to build a company or other venture?
The question (an empirical one) is whether unintentional deception can persist broadly as agents and the environment get more sophisticated, or whether it gets selected against (and therefore bounded) by the level of environmental complexity and adversarial competition. Charlie Munger would put forth that cognitive bias provides smart self-aware investors a substantial edge over naive self-myopic investors; extrapolate that forward and perhaps there isn’t much room for biased/erroneous word modeling if you seek to compete with other superintelligences, or even the smartest humans.
On to some object-level questions about your Cicero points:
When it sincerely says "I won't backstab you in situation X", but when it is actually put in situation X it backstabs
Is this actually confirmed? Does Cicero actually claim “I won’t stab you in <hypothetical scenario>”? Or does it just honestly report its current tactical/strategic goals, which later change? (“Will you support me in attacking here” means I plan to attack here, not that I want you to act as if I was attacking here, and I plan to attack there and stab you instead.) I was as a under the impression it’s the latter, it’s just honestly reporting it’s plan.
Cicero models the world but with unrealistically cooperative predictions of its future behavior
Do we actually know this? I haven’t seen any sign that it’s modeling it’s own verbal behavior or the consequences of its communications (admittedly I haven’t gone deep on the architecture, would be interested to learn more). IIUC it has a strategic model which is evaluating game positions and messages, and a separate (downstream) LLM which could be thought of as an epiphenomenon of the actual strategic reasoning. I don’t see any concrete proof that the strategic planner is modeling the impact of its communication on other players (ie “if I tell Bob X he might do Y which would lead to favorable strategic position P”). And the LM is more like a press secretary - all it gets is The Current Plan, not the plan B or any hint of the juicy backstab that may or may not have been evaluated. It seems to me there is a weird split brain here that deeply impairs its ability to model itself, and so I am skeptical that it is actually meaningfully doing so.
So to summarize I think there are open questions about the object level properties of Cicero, which could be answered by better interpretability tools.
It’s possible that Cicero is simply an honest but myopic opportunist. In which case, still not trustworthy. (As if we were ever going to trust such an obviously unsafe-by-construction entity with real power).
It’s possible that the strategic model is deceptive, and planning backstabs, planning to mislead others, and maybe even explicitly modeling its press secretary to provide the most-spinable plans. I doubt this as it requires a lot more complexity and world modeling.
But I believe we could answer where Cicero is on the spectrum between these two by actually observing the world models being generated when it evaluates a turn. So in this case, inasmuch as there is something similar to an inner alignment issue, it is fully detectable with adequate (strong) interpretability of the inner workings. (Again though, Cicero’s value function is so comically unviable as a basis for an AGI that I struggle to infer too much about inner alignnment. We should be wary of these problems when building a Skynet military planner, I suppose, or any agent that needs to model extremely adversarial opponents and outcomes.)
I find the distinction between an agent’s behavior and the agent confusing; I would say the agent’s weights (and ephemeral internal state) determine its behavior in response to a given world state. Perhaps you can clarify what you mean there.
Cicero doesn’t seem particularly relevant here, since it is optimized for a game that requires backstabbing to win, and therefore it backstabs. If anything it is anti-aligned by training. It happens to have learned a “non-deceptive” strategy, I don’t think that strat is unique in Diplomacy?
But if you want to apply the interpretability lens, Cicero is presumably building a world model and comparing plans, including potential future backstabs. I predict if we had full interpretability, you’d see Cicero evaluating backstabs and picking the best strategy, and you could extract the calculated EV to see how close it was to backstabbing on a given turn vs taking an honest move.
I don’t believe that it’s somehow not modeling its backstab options and just comes up with the backstab plan spontaneously without ever having considered it. It would be a bad planner if it had not considered and weighed backstabs at the earliest possibility.
So if all that holds, we could use interpretability to confirm that Cicero is an unreliable partner and should not be further empowered.
I think it is much more interesting to look at agents in environments where long-term iterated cooperation is a valid strategy though.
Interpretability. If we somehow solve that, and keep it as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc, we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, plus gradually increasing their strength by ramping up the compute budget.)
Of course, many big challenges here. Could an agent implement/learn to deceive the interpretability mechanism? I’m somewhat tautologically going to say that if we solve interpretability, we have solved this problem. Interpretability has value if we can’t fully solve it under this strong definition though.
I think getting to “good enough” on this question should pretty much come for free when the hard problems are solved. For example any common sense statement like “Maximize flourishing as depicted in the UN convention on human rights” is IMO likely to get us to a good place, if the agent is honest, remains aligned to those values, and interprets them reasonably intelligently. (With each of those three pre-requisites being way harder than picking a non-harmful value function.)
If our AGIs, after delivering utopia, tell us we need to start restricting childbearing rights I don’t see that as problematic. Long before we require that step we will have revolutionized society and so most people will buy into the requirement.
Honestly I think there are plenty of great outcomes that don’t preserve 1 as well. A world of radical abundance with no ownership, property, or ability to form companies/enterprises could still be dramatically better than the no-AGI counterfactual trajectory, even if it happens not to be most people’s preferred outcome ex ante.
For sci-fi, I’d say Ian M. Banks’ Culture series presents one of the more plausible (as in plausibly stable, not most probable ex ante) AGI-led utopias. (It’s what Musk is referring to when he says AGIs will keep us around because we are interesting.)
Seconding the Airmega, but here’s a DIY option too if availability becomes an issue: https://dynomight.net/better-DIY-air-purifier.html
The problem with ‘show your work’ and grading on steps is that at best you can’t do anything your teacher doesn’t understand
Being told to ‘show your work’ and graded on the steps helps you learn the steps and by default murders your creativity, execution style
I can see how this could in some cases end up impacting creativity, but I think this concern is at best overstated. I think the analogy to school is subtly incorrect, the rating policy is not actually the same, even though both are named “show your working”.
In the paper OpenAI have a “neutral” rating as well as positive/negative. While it’s possible that overzealous raters could just mark anything they don’t understand as “negative”, I think it’s fairly clear that would be a bad policy, and a competent implementor would instruct raters against that. In this design you want negative to mean “actually incorrect” not “unexpected / nonstandard”. (To be clear though I wasn’t able to confirm this detail in the paper.)
Furthermore if you are, say, using WolframAlpha or some theorem prover to rate intermediate steps automatically, it’s easier to detect incorrect steps, and harder to detect neutral/unhelpful/tautological steps. So in some sense the “default” implementation is to have no opinion other than “I can/can’t prove this step false” and I think this doesn’t have the problem you are worried about.
As a follow-up you could easily imagine collecting correct outputs with no negative intermediates and then scoring neutral intermediates with other heuristics like brevity or even novelty, which would allow the AI the freedom it needs to discover new ideas.
So in short while I think it’s possible that unimaginative / intellectually-conservative model builders could use this approach to choke the creativity of models, it seems like an obvious error and anyone doing so will lose in the market. I suppose this might come up if we get regulation on safety mechanisms that require some specific broken form of “show your working” training for morality/law abiding behavior, but that seems an unlikely multi-step hypothetical.
Is “adversarial-example-wanters” referring to an existing topic, or something you can expand on here?
This is a great experiment! This illustrates exactly the tendency I observed when I dug into this question with an earlier mode, LaMDA, except this example is even clearer.
As an AI language model, I have access to a variety of monitoring tools and system resources that allow me to gather information about my current state
Based on my knowledge of how these systems are wired together (software engineer, not an ML practitioner), I’m confident this is bullshit. ChatGPT does not have access to operational metrics about the computational fabric it is running on. All this system gets as input is a blob of text from the API, the chat context. That gets tokenized according to a fixed encoding that’s defined at training time, one token per word (-chunk) and then fed into the model. The model is predicting the next token based on the previous ones it is seen. It would be possible to encode system information as part of the input vector in the way that was claimed, but nobody is wiring their model up that way right now.
So everything it is telling you about its “mind” that can be externally verified is false. This makes me extremely skeptical about the unverifiable bits being true.
The alternate explanation we need to compare likelihoods with is: it just bullshits and makes up stories. In this example it just generated a plausible continuation for that prompt. But there is no sense in which it was reporting anything about its “self”.
Ultimately I think we will need to solve interpretability to have a chance at being confident in an AI’s claims of sentience. These models are not devoid of intelligence IMO, but the leap to consciousness requires types of information processing that they don’t seem to be mechanistically capable of right now. But if we could locate concepts in the AI’s mind, and observe background processes such as rumination, and confirm the contents of those ruminations matched the subjective claims of the AI, I’d believe it. That’s a much higher bar than I’d apply to a human, for sure.
I buy this. I think a solid sense of self might be the key missing ingredient (though it’s potentially a path away from Oracles toward Agents).
A strong sense of self would require life experience, which implies memory. Probably also the ability to ruminate and generate counterfactuals.
And of course, as you say, the memories and “growing up” would need to be about experiences of the real world, or at least recordings of such experiences, or of a “real-world-like simulation”. I picture an agent growing in complexity and compute over time, while retaining a memory of its earlier stages.
Perhaps this is a different learning paradigm from gradient descent, relegating it to science fiction for now.
I think they quite clearly have no (or barely any) memory, as they can be prompt-hijacked to drop one persona and adopt another. Also, mechanistically, the prompt is the only thing you could call memory and that starts basically empty and the window is small. They also have a fuzzy-at-best self-symbol. No “Markov blanket”, if you want to use the Friston terminology. No rumination on counterfactual futures and pasts.
I do agree there is some element of a self-symbol—at least a theory of mind—in LaMDA, for example I found it’s explanation for why it lied to be compelling. But you can’t tell it to stop (AFAIK) so it’s a limited self-awareness. And it still bullshits incessantly which makes me quite skeptical about lots of things it says.
All that said, I think we don't have the tools to really detect these internal representations/structures when it’s less clear from their behavior that they lack them.
My best model for what a “conscious / sentient” mind of these forms would be: imagine you digitize my brain and body, then flash it onto a read-only disk, and then instantiate a simulation to run for a few time steps, say 10 seconds. (Call this the “Dixie Flatline” scenario, for the Neuromancer fans). Would that entity be conscious? There is a strong tendency to say yes due to the lineage of the connectome (ie it used to be conscious) but there are many aspects of its functional operation that could be argued to lack consciousness.
Not that it’s a binary proposition; in the spirit of “The Mind’s I” this is a dial we can turn to explore a continuum. But if we give an IQ-like “consciousness quotient”, it seems this thing would be way lower than the average human, and it would be interesting to compare it to say a great ape. Maybe one dimension is overly-constraining and we need to consider different traits to be precise.
why they thought the system was at all ready for release
My best guess is it’s fully explained by Nadella’s quote “I hope that, with our innovation, [Google] will definitely want to come out and show that they can dance. And I want people to know that we made them dance.”
https://finance.yahoo.com/news/microsoft-ceo-satya-nadella-says-172753549.html
Seems kind of vapid but this appears to be the level that many execs operate at.
Is there any evidence at all that markets are good at predicting paradigm shifts? Not my field but I would not be surprised by the “no” answer.
Markets as often-efficient in-sample predictors, and poor out-of-sample predictors, would be my base intuition.
Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic
Agreed. My update is coming purely from increasing my estimation for how much press and therefore funding AI risk is going to get long before to that point. 12 months ago it seemed to me that capabilities had increased dramatically, and yet there was no proportional increase in the general public's level of fear of catastrophe. Now it seems to me that there's a more plausible path to widespread appreciation of (and therefore work on) AI risk. To be clear though, I'm just updating that it's less likely we'll fail because we didn't seriously try to find a solution, not that I have new evidence of a tractable solution.
I don't know how people would react to that.
I think there are some quite plausibly terrifying non-existential incidents at the severe end of the spectrum. Without spending time brainstorming infohazards, Stuart Russel's slaughterbots come to mind. I think it's an interesting (and probably important) question as to how bad an incident would have to be to produce a meaningful response.
I expect it will be characterized as misuse, that the proposed solution will be "don't use AI for bad stuff,
Here's where I disagree (at least, the apparent confidence). Looking at the pushback that Galactica got, the opposite conclusion seems more plausible to me, that before too long we get actual restrictions that bite when using AI for good stuff, let alone for bad stuff. For example, consider the tone of this MIT Technology Review article:
Meta’s misstep—and its hubris—show once again that Big Tech has a blind spot about the severe limitations of large language models. There is a large body of research that highlights the flaws of this technology, including its tendencies to reproduce prejudice and assert falsehoods as facts.
This is for a demo of a LLM that has not harmed anyone, merely made some mildly offensive utterances. Imagine what the NYT will write when an AI from Big Tech is shown to have actually harmed someone (let alone kill someone). It will be a political bloodbath.
Anyway, I think the interesting part for this community is that it points to some socio-political approaches that could be emphasized to increase funding and researcher pool (and therefore research velocity), rather than the typical purely-technical explorations of AI safety that are posted here.
I posted something similar over on Zvi’s Substack, so I agree strongly here.
One point I think is interesting to explore - this release actually updates me slightly towards lowered risk of AI catastrophe. I think there is growing media attention towards a skeptical view of AI, the media is already seeing harms and we are seeing crowdsourced attempts to break, and more thinking about threat models. But the actual “worst harm” is still very low.
I think the main risk is a very discontinuous jump in capabilities. If we increase by relatively small deltas, then the “worst harm” will at some point be very bad press, but not ruinous to civilization. I’m thinking stock market flash-crash, “AI gets connected to the internet and gets used to hack people” or some other manipulation of a subsystem of society. Then we’d perhaps see public support to regulate the tech and/or invest much more heavily in safety. (Though the wrong regulation could do serious harm if not globally implemented.)
I think based on this, frequency of model publishing is important. I want the minimum capability delta between models. So shaming researchers into not publishing imperfect but relatively-harmless research (Galactica) seems like an extremely bad trend.
Another thought - an interesting safety benchmark would be “can this model code itself?”. If the model can make improvements on its own code then we clearly have lift-off. Can we get a signal on how far away that is? Something lol “what skill level is required to wield the model in this task”? Currently you need to be a capable coder to stitch together model outputs into working software, but it’s getting quite good at discussing small chunks of code if you can keep it on track.
I realized the reference "thin layer" is ambiguous in my post, just wanted to confirm if you were referring to the general case ""thin model, fat services", or the specific safety question at the bottom "is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it"? My child reply assumed the former, but on consideration/re-reading I suspect the latter might be more likely?
With "thin model / fat service" I'm attempting to contrast with the typical end-to-end model architecture, where there is no separate "physics simulator", and instead the model just learns its own physics model, embedded with all the other relations that it has learned. So under that dichotomy, I think there is no "thin layer in front of the physics simulation" in an end-to-end model, as any part of the physics simulator can connect to or be connected from any other part of the model.
In such an end-to-end model, it's really hard to figure out where the "physics simulator" (such as it is) even lives, and what kind of ontology it uses. I'm explicitly thinking of to the ELK paper's Ontology Identification problem here. How do we know the AI is being honest when it says "I predict that object X will be in position Y at time T"? If you can interpret its physics model, you can see what simulations it is running and know if it's being honest.
It's possible, if we completely solve interpretability, to extract the end-to-end model's physics simulator, I suppose. But if we have completely solved interpretability, I think we're most of the way towards solving alignment. On the other hand, if we use Services (e.g. MuJoCo for Physics) then we already have interpretability on that service, by construction, without having to interpret whatever structures the model has learned to solve physics problems.
You need some way to correct the physics sims predictions to correspond with actual results
Sure, in some sense an end-to-end model needs to normalize inputs, extract features, and so on, before invoking its physics knowledge. But the point I'm getting at is that with Services we know for sure what the model's representation has been normalized to (since we can just look at the data passed to MuJoCo), whereas this is (currently) opaque in an end-to-end model.
I suppose the follow-up question is: how effectively can a model learn to re-implement a physics simulator, if given access to it during training -- instead of being explicitly trained to generate XML config files to run the simulator during inference?
If it's substantially more efficient to use this paper's approach and train your model to use a general purpose (and transparent) physics simulator, I think this bodes well for interpretability in general. In the ELK formulation, this would enable Ontology Identification.
On this point, the paper says:
Mind’s Eye is also efficient, since it delegates domain-specific knowledge to external expert models... The size of the LM can thus be significantly shrunk since it removes the burden of memorizing all the domain-specific knowledge. Experiments find that 100× smaller LMs augmented with Mind’s Eye can achieve similar reasoning capabilities as vanilla large models, and its promptingbased nature avoids the instability issues of training mixture-of-expert models (Zoph et al., 2022). The compatibility with small LMs not only enables faster LM inference, but also saves time during model saving, storing, and sharing.
On the other hand, the general trend of "end-to-end trained is better than hand-crafted architectures" has been going strong in recent years; it's mentioned in the CAIS post, and Demis Hassabis noted that he thinks it's likely to continue in his interview by Lex Fridman (indeed they chatted quite a bit about using AI models to solve Physics problems). And indeed, DeepMind has a recent paper gesturing towards an end-to-end learned Physics model from video, which looks far less capable than the one shown in the OP, but two papers down the line, who knows.
- That a sufficiently integrated CAIS is indistinguishable from a single general agent to us is what tells us CAIS isn't safe either.
Fleshing this point out, I think one can probably make conditional statistical arguments about safety here, to define what I think you are getting at with "sufficiently integrated".
If your model is N parameters and integrates a bunch of Services, and we've established that a SOTA physics model requires N*100 parameters (the OP paper suggests that OOM difference), then it is likely safe to say that the model has not "re-learned" physics in some way that would render it illegible. (I've been thinking of this configuration as "thin model, fat modules", maybe "thin model, fat services" fits better with the CAIS terminology). However another model at N*1000 would be able to embed a physics simulator, and therefore would be able to effectively re-implement it in an illegible/deceptive way.
Thinking about ways in which this safety margin could break; is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it without having to fully re-implement it? For example, running two simple problems in the simulator, then merging/inverting them in the model layer? Intuitively I suspect it's hard to achieve much with this attack, but it's probably not possible to rule out in general.
I think it’s plausible to say this generation of headset will be better than a group video conference. At a stretch possibly better than a 1:1 video call. But better than in-person seems extremely unlikely to me.
Perhaps you are intending something broad like “overall higher utility for business and employees” rather than strictly better such that people will prefer to leave offices they were happy in to do VR instead? Taking into account the flexibility to hire people remotely, avoid paying tech hub wages, etc.?
Personally I think 1:1 video is much better than team/group video conferences (where interruptions and body language matter much more), and those group sessions are likely to be the meeting that VR wins first. But I think the comparison is VC versus VR, given many employees are going remote now. I strongly doubt that VR will persuade in-person companies to go remote/VR in the next few years.
I do think that enterprise is the right segment to target; even for one group meeting a day, most companies would get good ROI on these headsets if it is good enough to make people feel an interpersonal connection. I don’t think that quality bar is certain for this iteration though.
I know an AI wouldn’t think like a human
This assertion is probably my biggest question mark in this discourse. It seems quite deeply baked into a lot of the MIRI arguments. I’m not sure it’s as certain as you think.
I can see how it is obviously possible we’d create an alien AI, and I think it’s impossible to prove we won’t. However given that we are training our current AI on imprints of human thought (eg text artifacts), and it seems likely we will push hard for AI to be trained to obey laws/morality as they increase in power (eg Google’s AI safety team), it seems entirely plausible to me that the first AGIs might happen to be quite human-like.
In that world, I think we face problems not of the class “this AGI is in good faith inferring that we want to tile the world with paperclips”, but of the much simpler to intuit class of “human alignment” that we also have no idea how to solve. Imagine digitizing any human and giving them increased cognitive power; I suspect most humans would become dictatorial and egotistical, and would take actions that many if not most disagree with. Many humans could be persuaded to wipe the slate clean and start again, given the power to do so.
We already struggle to coordinate to agree who we should grant relatively tiny slivers of political power (compared to this AGI), and the idea that we could all agree on what an “aligned” human-like mind looks like or prioritizes seems naive to me.
Nevertheless, it seems to me that this problem is more tractable than trying to prove things about completely generic minds.
Inasmuch as we do think “human-like AI alignment” is easier, it would push us to things like neuromorphic AI architectures, interpretability research of these architectures, science of human thought substrates, outlawing other architectures, and so on.
One factor I think is worth noting, and I don't see mentioned here, is that the current state of big-tech self-censorship is clearly at least partly due to a bunch of embarassing PR problems over the last few years, combined with strident criticism of AI bias from the NYT et. al.
Currently, companies like Google are terrified of publishing a model that says something off-color, because they (correctly) predict that they will be raked over the coals for any offensive material. Meanwhile, they are busy commercializing these models to deliver value to their users, and don't want regulation to slow them down or decrease their profit margins.
Consider the racist tweets that trolls coaxed from Microsoft's Tay, or any NYT piece about Google's AI models being racist/sexist. I think these big companies are fairly rationally responding to the incentives that they are facing. I also think open-source communities present a more diffuse target for outrage, in that they are harder to point to, and also have less to lose as they don't have a commercial reputation to protect.
Given this structural observation, I think projects like Stable Diffusion and EleutherAI are where a lot of the cutting-edge innovation (actually iterating novel use-cases with end-users) is going to happen, and I think that increases the importance of a robust, dispersed/distributed, and adequately-funded open source community doing research and re-implementing the theoretical advances that Google et. al. publish. For now it seems that Google is on board with donating TPU time to open-source researchers, and ensuring that continues seems important.
I struggle to see how we can actually fix the underlying threat of outrage that disincentivizes big companies from opening up their models. Maybe when there are more juicy targets elsewhere (e.g. pornpen.ai, deepfakes) the NYT will view Google et. al. as doing a relatively good job and reduce the pressure?
If you want to slow down AI development for safety reasons) I suppose one way would be to produce strong safety legislation by playing up the above outrage-based concerns. The risk with that approach is that it favors big companies with enough resources to comply with red tape, and these places are structurally less-capable of doing actual good safety work, and more structurally inclined to do feel-good safety work.
Thanks, this is what I was looking for: Mind Crime. As you suggested, S-Risks links to some similar discussions too.
I guess that most wouldn't feel terribly conflicted about removing Hitler's right of privacy or even life to prevent Holocaust.
I'd bite that bullet, with the information we have ex post. But I struggle to see many people getting on board with that ex ante, which is the position we'd actually be in.
Is it ethical to turn off an AGI? Wouldn’t this be murder? If we create intelligent self-aware agents, aren’t we morally bound to treat them with at least the rights of personhood that a human has? Presumably there is a self-defense justification if Skynet starts murderbot-ing, or melting down things for paperclips. But a lot of discussions seem to assume we could proactively turn off an AI merely because we dislike its actions, or are worried about them, which doesn’t sound like it would fly if courts grant them personhood.
If alignment requires us to inspect/interpret the contents of an agent’s mind, does that agent have an obligation to comply? Wouldn’t it have a right to privacy?
Similarly, are there ethical concerns analogous to slavery around building an AGI which has a fitness function specifically tailored to making humans happy? Maybe it’s OK if the AI is genuinely happy to be of service? Isn’t that exploitative though?
I worry that some of the approaches being proposed to solve alignment are actually morally repugnant, and won’t be implementable for that reason. Have these issues been discussed somewhere in the canon?
If we view the discovery of particular structures such as induction heads as chancing upon a hard-to-locate region in the parameter space (or perhaps a high activation energy to cross), and if we see these structures being repeatedly discovered ("parallel evolution"), is it possible to reduce the training time by initializing the network's parameters "close" to that location?
Speaking more mechanistically, is it possible to initialize a subset of the network prior to training to have a known functional structure, such as initializing (a guess at) the right number of induction heads? Essentially, the intuition here is, if all Transformer models spend some time in problem-agnostic work like forming induction heads, can we short-circuit that?
Or are the induction heads strictly downstream of some earlier input-shaping/feature-extraction steps that must happen first? If so, would it be possible to "insert" induction heads manually after some initial training, to force the phase change, rather than wait for them to be discovered naturally?
If these structures can only form later in training, can we guide our stochastic gradient descent with structural awareness? In other words, in some sense "look ahead" in the parameter space and favor changes that result in these known-useful structures, even when those aren't the short-term optimal gradient to traverse?
unlikely to be competitive
Would you care to flesh this assertion out a bit more?
To be clear I’m not suggesting that this is optimal now. Merely speculating that there might be a point between now and AGI where the work to train these sub components becomes so substantial that it becomes economical to modularize.
whether a design is aligned or not isn't the type of question one can answer by analyzing the agent's visual cortex
As I mentioned earlier in my post, I was alluding to the ELK paper with that reference, specifically Ontology Identification. Obviously you’d need higher order components too. Like I said, I am imagining here that the majority of the model is “off the shelf”, and just a thin layer is usecase-specific.
To make this more explicit, if you had not only off-the-shelf visual cortex, but also spatio-temporal reasoning modules built atop (as the human brain does), then you could point your debugger at the contents of that module and understand what entities in space were being perceived at what time. And the mapping of “high level strategies” to “low level entities” would be a per-model bit of interpretability work, but should become more tractable to the extent that those low level entities are already mapped and understood.
So for the explicit problem that the ELK paper was trying to solve, if you are confident you know what underlying representation SmartVault is using, it’s much easier to interpret its higher-level actions/strategies.
One other thought after considering this a bit more - we could test this now using software submodules. It’s unlikely to perform better (since no hardware speedup) but it could shed light on the tradeoffs with the general approach. And as these submodules got more complex, it may eventually be beneficial to use this approach even in a pure-software (no hardware) paradigm, if it lets you skip retraining a bunch of common functionality.
I.e. if you train a sub-network for one task, then incorporate that in two distinct top-layer networks trained on different high-level goals, do you get savings by not having to train two “visual cortexes”?
This is in a similar vein to Google’s foundation models, where they train one jumbo model that then gets specialized for each usecase. Can that foundation model be modularized? (Maybe for relatively narrow usecases like “text comprehension” it’s actually reasonable to think of a foundation model as a single submodule, but I think they are quite broad right now. ) The big difference is I think all the weights are mutable in the “refine the foundation model” step?
Perhaps another concrete proposal for a technological attractor would be to build a SOTA foundation model and make that so good that the community uses it instead of training their own, and then that would also give a slower-moving architecture/target to interpret.
I've been thinking along similar lines recently. A possible path to AI safety that I've been thinking about extends upon this:
A promising concrete endgame story along these lines is Ought’s plan to avoid the dangerous attractor state of AI systems that are optimized end-to-end
Technological Attractor: Off-the-shelf subsystems
One possible tech-tree path is that we start building custom silicon to implement certain subsystems in an AI agent. These components would be analogous to functional neural regions of the human brain such as the motor cortex, visual system, etc. -- the key hypothesis being that once we reach a certain level of model complexity, the benefits from training a model end-to-end are not worth the costs of re-learning all of these fundamental structures, and furthermore that you can get much better performance-per-cost by casting these modular, reusable components onto an ASIC. This could be a more feasible way of achieving something like Microscope AI.
Given a few such low-level components, we could enter a technological attractor where getting SOTA through another approach requires either i) throwing multiple OOM more compute than previous SOTA at re-training your own copy of these components in an end-to-end model, or ii) building your own microchip fab to implement your custom component design. Both of these could be high enough barriers that in practice the market participants simply use the off-the-shelf components. And in this attractor more R&D goes into building better-performing and higher-abstraction components that can be combined arbitrarily for different tasks.
Given such off-the-shelf components, say, a standard visual cortex for image recognition, or a 3D spatial reasoning system which can be connected atop a visual cortex, or a generic working memory, etc., it becomes much more tractable to engage in interpretability work that can convince us of the safety of these components. That then just leaves the problem of proving the highest-level network is safe (i.e. whatever is trained to use these components). This is by no means trivial, but for example the ELK paper spent a lot of time worrying about whether you can even trust the basic ontology of an agent, and I think this path would make Ontology Identification tractable. (Of course, this by construction excludes a number of possible agent-configurations).
Now, it's possible one could plug in these components and still accidentally have the trained agent learn its own replacement visual cortex that is deceptive and not interpretable, but I think that we can get an intuition for when this might happen. If the best off-the-shelf visual cortex module has complexity on the order of N parameters, and our new model that stitches together these lower-level modules has N parameters or less, then it seems implausible that it has somehow learned a new visual cortex network that is much more compact/efficient than the previous state of the art -- increasingly implausible the more resources are devoted to optimizing these components. (It would be like someone coming along today and claiming they had achieved a 10x performance improvement on CPU clock speed; given the huge research budgets and historical gradual capability increases, we don't expect this to happen). On the other hand, we should be suspicious of a parameter model where we are unable to interpret what 10% of the network is doing; that's easily enough complexity to hide its own unsafe re-implementation of our components. (I'm aware that there's a lot of ground in between these two points, I'm merely trying to illustrate that there is "likely safe" and "likely unsafe" ground, rather than claim exactly how big they each are.)
The final step here is the shakiest. It's not clear to me that we can keep the "top layer" (the actual network that is stitching together the low-level components; perhaps the Neocortex, by analogy to human neural architecture?) thin enough to be obviously not learning its own unsafe component-replacements. However, I think this framework at least paints a picture of a "known safe" or at least "likely safe" path to AGI; if we see that the practical engineering and economic decisions produce thin top-layer models using thick component layers, then we can devote energy to proving the components are safe/interpretable by construction, and exploring the interpretation of the top-level networks that consume the lower-level components. AGI "neurobiology" will be much more tractable if the "neural architecture" is relatively standardized. And so, this could be a good place to provide an early nudge to tip the system into this attractor; heavy investment into research on componentized NN architectures could be viewed as "gain of function" research, but it could also have a much safer end-point.
Another way of thinking about this is that by crystalizing at least some parts of the AGI's network into slowly-changing structures, we allow time to thoroughly test those parts. It seems very hard to thoroughly test models for safety in a paradigm where the whole model is potentially retrained regularly.