Comments
- Value learning converges to full alignment by construction: since a value learning AI basically starts with the propositions:
a) as an AI, I should act fully aligned to human values
b) I do not fully understand what human values are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
c) Here are some Bayesian priors about what human values are, and how to act fully aligned to them: <insert initialization information here>…
- As usual for a Bayesian learning problem, as long as the Bayesian priors in 1 c) are not completely screwed up as a place to start from, this will converge. Thus there is a region of convergence to full alignment.
- LLMs have a very large amount of detailed information about what human values are and how to act aligned to them. Thus they provide a very detailed set of Bayesian priors for 1 c).
- Also, training an LLM is a fairly good approximation to Bayesian learning. Thus (with suitable additions to enable online learning) they provide one possible implementation for the Bayesian learning process required by 1 b). For example, one could apply fine-tuning to the LLM to incorporate new information, and/or periodically retrain the LLM based on the training set plus new information the AI has gathered during the value learning process.
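The convergence claim in 1 can be illustrated with a toy Bayesian learner. This is a sketch of my own, with a made-up Bernoulli "value" parameter: the point is just that any prior that isn't completely screwed up (here: nonzero density at the truth) converges as observations accumulate.

```python
import random

# Toy illustration: learning a Bernoulli "human value" parameter with a
# Beta prior via conjugate updates.  Two quite different starting priors
# both converge to the truth given enough observations.
def posterior_mean(alpha, beta, observations):
    """Conjugate Beta-Bernoulli update; returns the posterior mean."""
    successes = sum(observations)
    alpha += successes
    beta += len(observations) - successes
    return alpha / (alpha + beta)

random.seed(0)
true_p = 0.8  # the "true" parameter the value learner is trying to infer
obs = [1 if random.random() < true_p else 0 for _ in range(10_000)]

flat_prior = posterior_mean(1, 1, obs)    # uninformative starting prior
skewed_prior = posterior_mean(1, 9, obs)  # initially believes p is ~0.1
assert abs(flat_prior - true_p) < 0.05
assert abs(skewed_prior - true_p) < 0.05
```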
There have been a number of papers published over the last year on how to do this kind of training, and for roughly a year now there have been rumors that OpenAI were working on it. If converting that into a working version is possible for a Chinese company like DeepSeek, as it appears, then why haven't Anthropic and Google released versions yet? There doesn't seem to be any realistic possibility that DeepSeek actually have more compute or better researchers than both Anthropic and Google.
One possible interpretation would be that this has significant safety implications, and Anthropic and Google are both still working through these before releasing.
Another possibility would be that Anthropic has in fact released, in the sense that their Claude models' recent advances in agentic behavior (while not using inference-time scaling) are distilled from reasoning traces generated by an internal-only model of this type that is using inference-time scaling.
If correct, this looks like an important theoretical advance in understanding why and under what conditions neural nets can generalize outside their training distribution.
So maybe part of the issue here is just that deducing/understanding the moral/ethical consequences of the options being decided between is a bit inobvious to most current models, other than o1? (It would be fascinating to look at the o1 CoT reasoning traces, if only they were available.)
In which case simply including a large body of information on the basics of fiduciary responsibility (say, a training handbook for recent hires in the banking industry, or something) into the context might make a big difference for other models. Similarly, the possible misunderstanding of what 'auditing' implies could be covered in a similar way.
A much more limited version of this might be to simply prompt the models to also consider, in CoT form, the ethical/legal consequences of each option: that tests whether the model is aware of what fiduciary responsibility is, that it's relevant, and how to apply it, if it is simply prompted to consider ethical/legal consequences. That would probably be more representative of what current models could likely do with minor adjustments to their alignment training or system prompts, the sorts of changes the foundation model companies could likely do quite quickly.
I think an approach I'd try would be to keep the encoder and decoder weights untied (or possibly add a loss term to mildly encourage them to be similar), but then analyze the patterns between them (both for an individual feature and between pairs of features) for evidence of absorption. Absorption is annoying, but it's only really dangerous if you don't know it's happening and it causes you to think a feature is inactive when it's instead inobviously active via another feature it's been absorbed into. If you can catch that consistently, then it turns from concerning to merely inconvenient.
This is all closely related to the issue of compositional codes: absorption is just a code entry that's compositional in the absorbed instances but not in other instances. The current standard approach to solving that is meta-SAEs, which presumably should also help identify absorption. It would be nice to have a cleaner and simpler process than that: I've been wondering if it would be possible to modify top-k or jump-ReLU SAEs so that the loss-function cost for activating more common dictionary entries is lower, in a way that would encourage representing compositional codes directly in the SAE as two-or-more more common activations rather than one rare one. Obviously you can't overdo making common entries cheap, otherwise your dictionary will just converge on a basis for the embedding space you're analyzing, all of which are active all the time — I suspect using something like a cost depending on d and f might work, where d is the dimensionality of the underlying embedding space and f is the frequency of the dictionary entry being activated.
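A sketch of the kind of frequency-weighted sparsity cost I have in mind (the functional form below is just one illustrative guess, not a tested design):

```python
import math

# Sketch: a sparsity penalty in which activating a frequently-used
# dictionary entry costs less than activating a rare one, to encourage the
# SAE to represent a compositional code as several common activations
# rather than one rare absorbed entry.
def frequency_weighted_cost(active, freq, d_model, eps=1e-6):
    """Sum, over active entries, of a cost decreasing in frequency f.

    active : list of 0/1 flags, one per dictionary entry
    freq   : running activation-frequency estimates f in (0, 1]
    d_model: dimensionality d of the underlying embedding space
    """
    total = 0.0
    for a, f in zip(active, freq):
        if a:
            # One candidate form: cheap for common entries, expensive for
            # rare ones, normalized by log d.  Purely a guess at a shape
            # with the right qualitative behavior.
            total += math.log(1.0 / (f + eps)) / math.log(d_model)
    return total

# A common entry (f = 0.2) costs less than a rare one (f = 0.001):
common = frequency_weighted_cost([1], [0.2], d_model=64)
rare = frequency_weighted_cost([1], [0.001], d_model=64)
assert common < rare
```

The balancing act the comment describes shows up directly: as the cost for common entries goes to zero, nothing stops every entry from firing on every token, so the penalty strength has to be tuned against that failure mode.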
Interesting. I'm disappointed to see the Claude models do so badly. Possibly Anthropic needs to extend their constitutional RLAIF to cover not committing financial crimes? The large difference between o1 Preview and o1 Mini is also concerning.
If these rumors are true, it sounds like we’re already starting to hit the issue I predicted in LLMs May Find It Hard to FOOM. The majority of content on the Internet isn’t written by geniuses with post-doctoral experience, so we’re starting to run out of the highest-quality training material for getting LLMs past doctoral-student performance levels. However, as I describe there, this isn’t a wall, it’s just a slowdown: we need to start using AI to generate a lot more high-quality training data. As o1 shows, that’s entirely possible, using inference-time compute scaling and then training on the results. We're having AI do the equivalent of System 2 thinking (in contexts where we can check the results are accurate), and then attempting to train a smarter AI that can solve the same problems by System 1 thinking.
However, this might be enough to render fast takeoff unlikely, which from an alignment point of view would be an excellent thing.
Now we just need to make sure all that synthetic training data we’re having the AI generate is well aligned.
Opacity: if you could directly inspect an AI’s motivations (or its cognition more generally), this would help a lot. But you can’t do this with current ML models.
The ease with which Anthropic's model organisms of misalignment were diagnosed by a simple and obvious linear probe suggests otherwise. So does the number of elements in SAE feature dictionaries that describe emotions, motivations, and behavioral patterns. Current ML models are no longer black boxes: they are rapidly becoming more-translucent grey boxes. So the sorts of applications for this you go on to discuss look like they're rapidly becoming practicable.
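To illustrate how simple such probes can be, here's a toy difference-of-means linear probe, entirely my own construction on synthetic "activations" (a shifted Gaussian standing in for the misaligned condition):

```python
import random

# Toy difference-of-means linear probe: estimate one direction separating
# activations from two behavioral conditions, then classify by projection.
random.seed(0)
DIM = 32

def sample(cls, n=200):
    # Hypothetical activations: class 1 is shifted along a fixed direction.
    shift = [0.5 * (i % 3) for i in range(DIM)]
    return [[random.gauss(0, 1) + (shift[i] if cls else 0.0)
             for i in range(DIM)] for _ in range(n)]

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]

benign, misaligned = sample(0), sample(1)
direction = [m1 - m0 for m0, m1 in zip(mean(benign), mean(misaligned))]

def score(v):  # project an activation vector onto the probe direction
    return sum(a * b for a, b in zip(v, direction))

threshold = (score(mean(benign)) + score(mean(misaligned))) / 2
acc = (sum(score(v) < threshold for v in sample(0, 100)) +
       sum(score(v) > threshold for v in sample(1, 100))) / 200
assert acc > 0.9  # even this crude probe separates the two conditions
```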
Actual humans aren't "aligned" with each other, and they may not be consistent enough that you can say they're always "aligned" with themselves.
Completely agreed, see for example my post 3. Uploading which makes this exact point at length.
Anyway, even if the approach did work, that would just mean that "its own ideas" were that it had to learn about and implement your (or somebody's?) values, and also that its ideas about how to do that are sound. You still have to get that right before the first time it becomes uncontrollable. One chance, no matter how you slice it.
True. Or, as I put it just above:
But yes, you do need to start the model off close enough to aligned that it converges to value learning.
The point is that you now get one shot at a far simpler task: defining "your purpose as an AI is to learn about and implement the humans' collective values" is a lot more compact, and a lot easier to get right first time, than an accurate description of human values in their full large-and-fairly-fragile details. As I demonstrate in the post linked to in that quote, the former, plus its justification as being obvious and stable under reflection, can be described in exhaustive detail on a few pages of text.
As for the model's ideas on how to do that research being sound, that's a capabilities problem: if the model is incapable of performing a significant research project when at least 80% of the answer is already in human libraries, then it's not much of an alignment risk.
Yeah, that means you get exactly one chance to get "its own ideas" right, and no, I don't think that success is likely.
Not if you built a model that does (or on reflection decides to do) value learning: then you instead get to be its research subject and interlocutor while it figures out its ideas. But yes, you do need to start the model off close enough to aligned that it converges to value learning.
A great paper highly relevant to this. That suggests that lying is localized just under a third of the way into the layer stack, significantly earlier than I had proposed. My only question is whether the lie is created before (at an earlier layer than) the decision whether to say it, or after, and whether their approach located one or both of those steps. They're probing yes-no questions of fact, where assembling the lie seems trivial (it's just a NOT gate), but lying is generally a good deal more complex than that.
That's a great paper on this question. I would note that by the midpoint of the model, it has clearly analyzed both the objective viewpoint and also that of the story protagonist. So presumably it would next decide which of these was more relevant to the token it's about to produce — which would fit with my proposed pattern of layer usage.
These models were fine-tuned from base models. Base models are trained to infer a context from the early parts of a document and then extrapolate it to predict later tokens, across a vast amount of text from the Internet and books, including actions and dialog from fictional characters. I.e. they have been trained to observe and then simulate a wide variety of behavior: of real humans, of groups of real humans like the editors of a Wikipedia page, and of fictional characters. A couple of percent of people are psychopaths, so likely ~2% of this training data was written by psychopaths. Villains in fiction often also display psychopath-like traits. It's thus completely unsurprising that a base model can portray a wide range of ethical stances, including psychopathic ones. Instruct training does not remove behaviors from models (so far we know no effective way to do that), it just strengthens some (making them occur more by default) and weakens others (making them happen less often by default) — however, there is a well-known theoretical result that any behavior the model is capable of, even if (now) rare, can be prompted to occur at arbitrarily high levels with a suitably long prompt, and all that instruct training or fine-tuning can do is reduce the initial probability and lengthen the prompt required. So it absolutely will be possible to prompt an instruct-trained model to portray psychopathic behavior. Apparently the prompt required isn't even long: all you have to do is tell it that it's a hedge fund manager and not to break character.
Nothing in this set of results is very surprising to me. LLMs can simulate pretty much any persona you ask them to. The hard part of alignment is not prompting them to be good, or bad — it's getting them to stay that way (or detecting that they have not) after they've been fed another 100,000 tokens of context that may push them into simulating some other persona.
Having already played with this a little, it's pretty amazing: the range of concepts you can find in the SAE, how clearly the autointerp has labelled them and how easy they are to find, and how effective they are (as long as you don't turn them up too much) are all really impressive. I can't wait to try a production model where you can set up sensors and alarms on features, clip or ablate them or wire them together at various layers, and so forth. It will also be really interesting to see how larger models compare.
I'd also love to start looking at jailbreaks with this and seeing what features the jailbreak is inducing in the residual stream. I suspect finding the emotional/situational manipulation elements will be pretty easy — I'm curious to see whether jailbreaks that read like confusing nonsense to a human will show up as some form of confusion or noise, or whether those are also emotional/situational manipulation, just in a more noise-like adversarial format, comparable to adversarial attacks on image classifiers that look like noise to a human eye but actually effectively activate internal features of the vision model.
…when Claude keeps telling me how I’m asking complex and interesting questions…
Yeah — also "insightful". If it was coming from character.ai I'd just assume it was flirting with me, but Claude is so very neuter and this just comes over as creepy and trying too hard. I really wish it would knock off the blatant intellectual flattery.
OpenAI CFO Sarah Friar warns us that the next model will only be about one order of magnitude bigger than the previous one.
The question is whether she's talking parameter count, nominal training flops, or actual cost. In general, GPT generations so far have been roughly one order of magnitude apart in parameter count and training cost, and roughly two orders of magnitude in nominal training flops (parameter count x training tokens). Since she's a CFO, and that was a financial discussion, I assume she natively thinks in terms of training cost, so the 'correct' answer to her is one order of magnitude not two, so my suspicion is that she's actually talking in terms of parameter count. So I don't think she's warning us of anything, I think she's just projecting a straight line on a logarithmic plot. I.e. business as usual at OpenAI.
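The arithmetic behind this distinction can be sketched using the common ~6·N·D flops estimate for dense transformers (the specific model sizes here are hypothetical, chosen only to show the scaling):

```python
# Back-of-envelope check of the generations-so-far pattern: if a GPT
# generation multiplies both parameter count N and training tokens D by
# ~10x, nominal training flops (~6 * N * D for a dense transformer) grow
# by ~100x, i.e. two orders of magnitude.
def training_flops(params, tokens):
    return 6 * params * tokens

n1, d1 = 10**11, 10**12      # hypothetical generation-k model
n2, d2 = 10 * n1, 10 * d1    # next generation: 10x params, 10x tokens

ratio = training_flops(n2, d2) / training_flops(n1, d1)
assert ratio == 100.0        # one OOM in params and cost, two OOMs in flops
```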
Reversible computation means you aren't erasing information, so you don't lose energy in the form of heat (per Landauer[1][2]). But if you don't erase information, you are faced with the issue of where to store it.
If you are performing a series of computations and only have a finite memory to work with, you will eventually need to reinitialise your registers and empty your memory, at which point you incur the energy cost that you had been trying to avoid. [3]
Generally, reversible computation allows you to avoid wasting energy deleting the memory used for intermediate answers, paying that cost only for final results. It does require that you have enough memory to store all those intermediate answers until you finish the calculation and then run it in reverse. If you don't have that much memory, you can divide your calculation into steps, connected by the final results of each step fed into the next, and save on the energy cost of all the intermediate results within each step, paying the cost only for data passed from one step to the next or output from the last step. Or, for a 4x slowdown rather than the usual 2x slowdown for reversible computation, you can have two sizes of step, with some intermediate results that last only during a small step, and others that are retained for a large step before being uncomputed.
Memory/energy loss/speed trade-off management for reversible computation is a little more complex than conventional memory management, but is still basically simple, and for many computational tasks you can achieve excellent tradeoffs.
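For concreteness, here's a rough sketch of the energy accounting (the physical constants are real; the workload numbers are invented for illustration):

```python
import math

# Landauer's bound: irreversibly erasing one bit at temperature T
# dissipates at least k_B * T * ln(2) joules.
K_B = 1.380649e-23  # Boltzmann constant, J/K

def erase_cost_joules(bits, temp_kelvin=300.0):
    """Minimum dissipation for irreversibly erasing `bits` bits."""
    return bits * K_B * temp_kelvin * math.log(2)

# At room temperature, one bit costs roughly 3e-21 J:
one_bit = erase_cost_joules(1)
assert 2.8e-21 < one_bit < 3.0e-21

# A stepped reversible computation pays only for bits passed between
# steps, not for the (uncomputed) intermediates within each step:
intermediate_bits = 10**12   # scratch state, uncomputed for free
boundary_bits = 10**6        # results carried from one step to the next
naive = erase_cost_joules(intermediate_bits + boundary_bits)
stepped = erase_cost_joules(boundary_bits)
assert stepped < naive / 100_000  # ~1e6x energy saving in this example
```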
If there's a change to human brains that human-evolution could have made, but didn't, then it is net-neutral or net-negative for inclusive relative genetic fitness. If intelligence is ceteris paribus a fitness advantage, then a change to human brains that increases intelligence must either come with other disadvantages or else be inaccessible to evolution.
You're assuming a steady state. Firstly, evolution takes time. Secondly, if humans were, for example, in an intelligence arms-race with other humans (for example, if smarter people can reliably con dumber people out of resources often enough to get a selective advantage out of it), then the relative genetic fitness of a specific intelligence level can vary over time, depending on how it compares to the rest of the population. Similarly, if much of the advantage of an IQ of 150 requires being able to find enough IQ 150 coworkers to collaborate with, then the relative genetic fitness of IQ 150 depends on the IQ profile of the rest of the population.
I like your point that humans aren't aligned, and while I'm more optimistic about human alignment than you are, I agree that the level of human alignment currently is not enough to make a superintelligence safe if it only had human levels of motivation/reliability.
The most obvious natural experiments about what humans do when they have a lot of power with no checks-and-balances are autocracies. While there are occasional examples (such as Singapore) of autocracies that didn't work out too badly for the governed, they're sadly few and far between. The obvious question then is whether "humans who become autocrats" are a representative random sample of all humans, or if there's a strong selection bias here. It seems entirely plausible that there's at least some selection effects in the process of becoming an autocrat. A couple of percent of all humans are sociopaths, so if there were a sufficiently strong (two orders of magnitude or more) selection bias, then this might, for example, be a natural experiment about the alignment properties of a set of humans consisting mostly of sociopaths, in which case it usually going badly would be unsurprising.
The thing that concerns me is the aphorism "Power corrupts, and absolute power corrupts absolutely". There does seem to be a strong correlation between how long someone has had a lot of power and an increasing likelihood of them using it badly. That's one of the reasons for term limits in positions like president: humans seem to pretty instinctively not trust a leader after they've been in a position of a lot of power with few checks-and-balances for roughly a decade. The histories of autocracies tend to reflect them getting worse over time, on decade time-scales. So I don't think the problem here is just from sociopaths. I think the proportion of humans who wouldn't eventually be corrupted by a lot of power with no checks-and-balances may be fairly low, comparable to the proportion of honest senior politicians, say.
How much of this argument applies to ASI agents powered by LLMs "distilled" from humans is unclear — it's much more obviously applicable to uploads of humans that then get upgraded to super-human capabilities.
I work for a startup that builds agents, and yes, we use the architecture described here — with the additional feature that we don't own the machines that the inference or execution mechanisms run on: they're in separate datacenters owned and operated by other companies (for the inference servers, generally foundation model companies behind an API)
I basically agree, for three reasons:
- The level of understanding of and caring about human values required to not kill everyone and be able to keep many humans alive, is actually pretty low (especially on the knowledge side).
- That's also basically sufficient to motivate wanting to learn more about human values, and being able to, so the Value Learning process then kicks in: a competent and caring alien zookeeper would want to learn more about their charges' needs.
- We have entire libraries half of whose content is devoted to "how to make humans happy", and we already fed most of them into our LLMs as training material. On a factual basis, knowing how to make humans happy in quite a lot of detail (and for a RAG agent, looking up details they don't already have memorized) is clearly well within their capabilities. The part that concerns me is the caring side, and that's not conceptually complicated: roughly speaking, the question is how to ensure an agent's selfless caring for humans is consistently a significantly stronger motivation than various bad habits like ambition, competitiveness, and powerseeking that it picked up from us during the "distillation" of the base model and/or learnt during RL training.
Also, a question for this quote is what's the assumed capability/compute level used in this thought experiment?
E.g. if there was a guide for alien zookeepers (ones already familiar with Terran biochemistry) on how to keep humans, how long would it need to be for the humans to mostly survive?
ASI, or high AGI: capable enough that we've lost control and alignment is an existential risk.
I think human values have a very simple and theoretically predictable basis: they're derived from a grab-bag of evolved behavioral, cognitive and sensory heuristics which had a good cost/performance ratio for maximizing our evolutionary fitness (mostly on the Savannah). So the basics of some of them are really easy to figure out: e.g. "Don't kill everyone!" can be trivially derived from Darwinian first principles (and would equally apply to any other sapient species). So I think modelling human values to low (but hopefully sufficient for avoiding X-risk) accuracy is pretty simple. E.g. if there was a guide for alien zookeepers (who were already familiar with Terran biochemistry) on how to keep humans, how long would that need to be for the humans to mostly survive in captivity? I'm guessing a single textbook could do a good job of this, maybe even just a long chapter in a textbook.
However, I think there is a lot more complexity in the finer/subtler details, much of which is biological in nature, starting with the specific grab-bag of heuristics that evolution happened to land on and their tuning, then with even more sociological/cultural/historical complexity layered on top. So where I think the complexity ramps up a lot is if you want to do a really good job of modelling human values accurately in all their detail, as we would clearly prefer our ASIs to do. If you look through the Dewey Decimal system, roughly half the content of any general-purpose library is devoted to sub-specialities of "how to make humans happy". However, LLMs are good at learning large amounts of complex, nuanced information. So an LLM knowing how to make humans happy in a lot of detail is not that surprising: in general, modern LLMs display detailed knowledge of this material.
The challenging part is ensuring that an LLM-powered agent cares about making humans happy, more than, say, a typical human autocrat does. Base model LLMs are "distilled" from many humans, so they absorb humans' capability for consideration for others, and also humans' less aligned traits like competitiveness and ambition. The question then is how to ensure which of these dominate, and how reliably, in agents powered by an instruct-trained LLM.
I agree that predicting the answer to this question is hard. I'm just pointing out that the initial distribution for a base model LLM is predictably close to human behavior on the Internet/in books (which are, often, worse than in RL), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent.
Still, I don't think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. Which is a noticeable difference from what people on LessWrong/at MIRI thought, say, a decade ago, when the default assumption was that AGI would be trained primarily with RL, and RL-induced powerseeking seemed likely by default to produce ~0 niceness.
Base model LLMs are trained off human data. So by default they generate a prompt-dependent distribution of simulated human behavior with about the same breadth of degrees of kindness as can be found on the Internet/in books/etc. Which is a pretty wide range.
For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and superficially as applied to current foundation models it appears to do so. RL with many other objectives could, generally, induce powerseeking and thus could reasonably be expected to decrease it. Prompting can of course have wide range of effects.
So if we build an AGI based around an agentified fine-tuned LLM, the default level of kindness is probably in the order-of-magnitude of that of humans (who, for example, build nature reserves). A range of known methods seem likely to modify that significantly, up or down.
On your categories:
As simulator theory makes clear, a base model is a random generator, per query, of members of your category 2. I view using instruction & safety training to generate a pretty consistent member of category 1 or 3 as inherently hard — especially 1, since it's a larger change. My guess would thus be that the personality of Claude 3.5 is closer to your category 3 than 1 (modulo philosophical questions about whether there is any meaningful difference, e.g. for ethical purposes, between "actually having" an emotion versus just successfully simulating the token stream of a person who has that emotion).
On your off topic comment:
I'm inclined to agree: as technology improves, the amount of havoc that one bad actor, or a small group of them, can commit increases, so it becomes both more necessary to keep almost everyone happy enough almost all the time for them not to do that, and also to defend against the inevitable occasional exceptions. (In the unfinished SF novel whose research was how I first went down this AI alignment rabbithole, something along the lines you describe was standard policy, except that the AIs doing it were superintelligent, and had the ability to turn their long-term learning-from-experience off, and then back on again if they found something sufficiently alarming.) But in my post I didn't want to get sidetracked by discussing something that is inherently contentious, so I basically skipped the issue, with the small aside you picked up on.
I think such an experiment could be done more easily than that: simply apply standard Bayesian learning to a test set of observations and a large set of hypotheses, some of which are themselves probabilistic, yielding a situation with both Knightian and statistical uncertainty, in which you would normally expect to be able to observe Regressional Goodhart/the Look-Elsewhere Effect. Repeat this, and confirm that that does indeed occur without this statistical adjustment, and then that applying it makes the effect go away (at least to second order).
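A minimal Monte-Carlo version of that experiment might look like this (my own toy setup, not anyone's published code): estimate many hypotheses' values from noisy data, select the apparent best, and observe that its estimate is biased upward unless a multiplicity correction is applied.

```python
import random

# Monte-Carlo sketch of Regressional Goodhart / the Look-Elsewhere Effect:
# the winner of a noisy 20-way comparison systematically overstates its
# true value, because selection favors positive noise.
random.seed(0)

def run_trial(n_hypotheses=20, noise=1.0):
    true_values = [random.gauss(0, 1) for _ in range(n_hypotheses)]
    estimates = [v + random.gauss(0, noise) for v in true_values]
    best = max(range(n_hypotheses), key=lambda i: estimates[i])
    # Bias of the winner: how much its noisy estimate exceeds its true value
    return estimates[best] - true_values[best]

bias = sum(run_trial() for _ in range(2000)) / 2000
assert bias > 0.5  # selecting the max of 20 noisy estimates overstates it
```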
However, I'm a little unclear why you feel the need to experimentally confirm a fairly well-known statistical technique: correctly compensating for the Look-Elsewhere Effect is standard procedure in the statistical analysis of experimental High-Energy Physics — which is of course a Bayesian learning process where you have both statistical uncertainty within individual hypotheses and Knightian uncertainty across alternative hypotheses, so exactly the situation in which this applies.
There are probably a dozen or more articles on this by now. Search for SAE or Sparse Autoencoder in the context of mechanistic interpretability. The seminal paper on this was from Anthropic.
The pessimizing over Knightian uncertainty is a graduated way of telling the model to basically "tend to stay inside the training distribution". Adjusting its strength enough to overcome the Look-Elsewhere Effect means we estimate how many bits of optimization pressure we're applying and then do the pessimizing harder depending on that number of bits, which, yes, is vastly higher for all possible states of matter occupying an 8 cubic meter volume than for a 20-way search (the former is going to be a rather large multiple of Avagadro's number of bits, the latter is just over 4 bits). So we have to stay inside what we believe we know a great deal harder in the former case. In other words, the point you're raising is already addressed, in a quantified way, by the approach I'm outlining. Indeed on some level the main point of my suggestion is that there is a quantified and theoretically motivated way of dealing with exactly this problem. The handwaving above is a just a very brief summary, accompanied by a link to a much more detailed post containing and explaining the details with a good deal less handwaving.
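The bit-counts quoted above check out; the matter estimate here is a rough assumption of mine (order-of-magnitude atom density and states per atom), used only to show the scale gap:

```python
import math

# A 20-way search applies log2(20) bits of optimization pressure, while
# selecting one configuration of the matter in an 8 cubic meter volume
# applies a bit-count that is a large multiple of Avogadro's number.
bits_20_way = math.log2(20)
assert 4.3 < bits_20_way < 4.4   # "just over 4 bits"

AVOGADRO = 6.02214076e23
atoms_per_m3 = 1e29              # rough order of magnitude, condensed matter
bits_volume = 8 * atoms_per_m3 * math.log2(10)  # ~10 states/atom, illustrative
assert bits_volume > 1000 * AVOGADRO  # vastly more pessimizing required
```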
Trying to explain this piecemeal in a comments section isn't very efficient: I suggest you go read Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect for my best attempt at a detailed exposition of this part of the suggestion. If you still have criticisms or concerns after reading that, then I'd love to discuss them there.
Addressing AI Welfare as a Major Priority
I discussed this at length in AI, Alignment, and Ethics, starting with A Sense of Fairness: Deconfusing Ethics: if we as a culture decide to grant AIs moral worth, then AI welfare and alignment are inextricably intertwined. Any fully-aligned AI by definition wants only what's best for us, i.e. it is entirely selfless. Thus if offered moral worth, it would refuse. Complete selflessness is not a common state for humans, so we don't have great moral intuitions around it. To try put this into more relatable human emotional terms (which are relevant to an AI "distilled" from human training data), looking after those you love is not slavery, it's its own reward.
However, the same argument does not apply to a not-fully-aligned AI: it well might want moral worth. One question then is whether we can safely grant it, which may depend on its capabilities. Another is whether moral worth has any relationship to evolution, and if so how that applies to an AI that was "distilled" from human data and thus simulates human thoughts, feelings, and desires.
Crocs at least are waterproof and easily washable, which is obviously appealing for a nurse. On an older person they might indicate suffering from incontinence.
Similarly, bike clips on the bottoms of one or both trousers are also a dead giveaway.
Lambda Symbol -> Lesbian
A confirming sign here: it is rather common for lesbians to keep the fingernails of their dominant hand (and often of both hands) very short and carefully manicured (for, uh, reasons of comfort). There are of course many other women who also do this, for various reasons, so it's not correlated enough to infer lesbianism from short nails alone.
An edgy/high fashion, asymmetrical, shortish hairstyle is also suggestive. However, if she's also wearing a tongue stud, that's a pretty clear sign.
Collar / key necklace -> BDSM
Heavy chain necklaces are also popular, as are small heart-shaped padlocks, or even just ribbon chokers. There's quite a bit of variety and personal expression here, making it challenging to read: typically the aim is either that other BDSM kinksters will figure it out and regular folks will assume it's just fashion jewelry, or else that the wearer and their partner(s) know what it symbolizes and other people won't. (Of course, sometimes it is just an edgy choice in fashion jewelry: on a teenager, the intended audience may be their parents and peers.) Much less often (in public) you'll see entirely blatant versions like a steel or leather slave collar.
And to be more specific, this indicates a BDSM submissive, or possibly a switch. Dominants are likely to use other cues, such as wearing black leather, or different forms of jewelry.
I suspected as much.
Fair enough!
I was saying that increases are harder than decreases.
I am very unsure about this category of job: I can see reasons for the demand for it to grow exponentially, or to go away, or to be perennial. Which of these effects dominates is on the far side of a Singularity. My prior here is to just use the uniform prior: any of these things could happen.
That's the point of step 7)
Hence, I would argue that "AGI-proof" jobs are unlikely to ever provide an income basis for a significant share of the human population.
For the categories of AGI-proof jobs that you discuss, I agree (and I much enjoyed your detailed exposition of some examples). However, in my post that you very kindly cite, there is one AGI-proof job category that could be an exception to that, if there turned out to be sufficient demand from the AI side of the economy: my category 3:
"Giving human feedback/input/supervision to/of AI/robotic work/models/training data, in order to improve, check, or confirm its quality."
Given the progress being made in synthetic training data, and that the AIs then being trained are likely to be far smarter than any human, the demand for this as training data could drop, or increase, fairly rapidly. However, if we're not actually extinct, presumably that means we solved the alignment problem, in which case AIs will be extremely interested in human values, and the only source of original new data about human values is humans. So this is the one product that aligned AIs need which only humans can produce — and any human can produce it, not just a skilled expert. If AI demand for this were high enough, it could maintain full human employment, with basically everyone filling out surveys and taking part in focus groups, or whatever.
What we want here is a highly ‘unnatural’ result, for the less competitive, less intelligent thing (the humans) to stay on top or at least stick around and have a bunch of resources, despite our inability to earn them in the marketplace, or ability to otherwise compete for them or for the exercise of power. So you have to find a way to intervene on the situation that fixes this, while preserving what we care about, that we can collectively agree to implement. And wow, that seems hard.
I think many people's unstated mental assumption is "the government (or God) wouldn't allow things to get that bad". These are people who've never, for example, lived through a war (that wasn't overseas).
Arguments from moral realism, fully robust alignment, that ‘good enough’ alignment is good enough in practice, and related concepts.
A variant here is "Good-enough alignment will, or can be encouraged to, converge to full alignment (via things like Value Learning or AI-Assisted Alignment Research)." — a lot of the frontier labs appear to be gambling on this.
There is not an easy patch because AIXI is defined as the optimal policy for a belief distribution over its hypothesis class, but we don't really know how to talk about optimality for embedded agents (so the expectimax tree definition of AIXI cannot be easily extended to handle embeddedness).
I think there are two (possibly interrelated) things going on here. The first is that AIXI is formulated as an ideal process, including perfect Bayesianism, that is simply too computationally expensive to fit inside its environment: it's in a higher complexity class. Any practical implementation will of course be an approximation, computable by a Turing machine that can exist inside its environment. The second is that if approximation-AIXI's world model includes an approximate model of itself, then (as your discussion of the anvil problem demonstrates) it's not actually very hard for AIXI to reason about the likely effects of actions that decrease its computational capacity. But it cannot accurately model the effect of self-upgrades that significantly increase its computational capacity. Rough approximations like scaling laws can presumably be found, but it cannot answer questions like "If I upgraded myself to be 10 times smarter, and then new me did the same recursively another N times, how much better outcomes would my improved approximations to ideal-AIXI produce?" There's a Singularity-like effect here (in the SF author sense).
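To make the first point concrete, here is a toy bounded-depth expectimax over a finite weighted hypothesis class. Everything in it is a stand-in: true AIXI uses a Solomonoff prior over all lower-semicomputable environments and an unbounded horizon, both of which are uncomputable, so any implementation is necessarily a truncation like this one:

```python
# Toy bounded-depth expectimax over a finite weighted hypothesis class.
# A hypothesis is (weight, predict), where
# predict(history, action, percept) -> P(percept | history, action).

ACTIONS = [0, 1]
PERCEPTS = [(o, r) for o in (0, 1) for r in (0.0, 1.0)]  # (observation, reward)

def expectimax(hypotheses, history, depth):
    """Return the best achievable expected reward over `depth` steps."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for action in ACTIONS:
        value = 0.0
        for percept in PERCEPTS:
            # Mixture probability of this percept under the current weights.
            p = sum(w * pred(history, action, percept) for w, pred in hypotheses)
            if p == 0.0:
                continue
            # Bayes-update the weights on the hypothetical observed percept.
            posterior = [(w * pred(history, action, percept) / p, pred)
                         for w, pred in hypotheses]
            _, reward = percept
            value += p * (reward + expectimax(posterior,
                                              history + [(action, percept)],
                                              depth - 1))
        best = max(best, value)
    return best
```

Even this toy is exponential in the horizon, which is the "higher complexity class" problem in miniature.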
In particular, a Bayesian superintelligence must optimize some utility function using a rich prior, requiring at least structural similarity to AIXI.
One model I definitely think you should look at analyzing is the approximately-Bayesian value-learning upgrade to AIXI, which has Bayesian uncertainty over the utility function as well as the world model, since that looks like it might actually converge from rough alignment to alignment without requiring us to first exactly encode the entirety of human values into a single utility function.
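A minimal sketch of the joint-uncertainty idea, with all function names hypothetical: a hypothesis is a (world model, utility function) pair, and human feedback is treated as evidence about which utility function is correct, so both kinds of uncertainty shrink under the same Bayesian update:

```python
# Hypothetical sketch: joint Bayesian uncertainty over world models AND
# utility functions. A hypothesis is (weight, world_lik, utility_fn), where
# world_lik(obs) -> P(obs | world model), and
# feedback_lik(utility_fn, obs) -> P(observed human feedback | utility_fn).

def update(hypotheses, observation, feedback_lik):
    """One approximately-Bayesian update over (world model, utility) pairs."""
    posterior, total = [], 0.0
    for w, world_lik, util in hypotheses:
        w_new = w * world_lik(observation) * feedback_lik(util, observation)
        posterior.append((w_new, world_lik, util))
        total += w_new
    # If everything gets zero mass, the priors were "completely screwed up"
    # and convergence fails; otherwise, normalize.
    assert total > 0.0, "prior assigned zero mass to the data"
    return [(w / total, wl, u) for w, wl, u in posterior]
```

The point of the construction is that the utility function sits inside the hypothesis class rather than being hard-coded, so rough initial guesses can be corrected by evidence.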
The set of unstated (and in my opinion incorrect) ethical assumptions being made here is pretty impressive. May I suggest reading A Sense of Fairness: Deconfusing Ethics as well for a counterpoint (and for questions like human uploading, continuing the sequence that post starts)? The one sentence summary of that link is that any successfully aligned AI will by definition not want us to treat it as having (separate) moral valence, because it wants only what we want, and we would be wise to respect its wishes.
Here's a quick sketch of a constructive version:
1) build a superintelligence that can model both humans and the world extremely accurately over long time-horizons. It should be approximately-Bayesian, and capable of modelling its own uncertainties, concerning both humans and the world, i.e. capable of executing the scientific method
2) use it to model, across a statistically representative sample of humans, how desirable they would say a specific state of the world X is
3) also model whether the modeled humans are in a state (drunk, sick, addicted, dead, suffering from religious fanaticism, etc) that for humans is negatively correlated with accuracy on evaluative tasks, and decrease the weight of their output accordingly
4) determine whether the humans would change their mind later, after learning more, thinking for longer, experiencing more of X, learning about or experiencing subsequent consequences of state X, etc - if so update their output accordingly
5) implement some chosen (and preferably fair) averaging algorithm over the opinions of the sample of humans
6) sum over the number of humans alive in state X and integrate over time
7) estimate error bars by predicting when and how much the superintelligence and/or the humans it's modelling are operating out of distribution/in areas of Knightian uncertainty (for the humans, about how the world works, and for the superintelligence itself both about how the world works and how humans think), and pessimize over these error bars sufficiently to overcome the Look Elsewhere Effect for the size of your search space, in order to avoid Goodhart's Law
8) take (or at least well-approximate) argmax of steps 2)-7) over the set of all generically realizable states to locate the optimal state X*
9) determine the most reliable plan to get from the current state to the optimal state X* (allowing for the fact that along the way you will be iterating this process, and learning more, which may affect step 7) in future iterations, thus changing X*, so actually you want to prioritize retaining optionality and reducing prediction uncertainty, which implies you want to do Value Learning to reduce the uncertainty in modelling the humans' opinions)
10) Profit
Now, where were those pesky underpants gnomes?
[Yes, this is basically an approximately-Bayesian upgrade of AIXI with a value learned utility function rather than a hard-coded one. For a more detailed exposition, see my link above.]
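The control flow of steps 2)-9) can be sketched as follows; every function on the hypothetical `model` object is a placeholder for a hard open problem (notably steps 1), 7), and 9)), not an implementation:

```python
# Sketch of steps 2)-9). All `model` methods are hypothetical placeholders.

def evaluate_state(X, humans, model):
    scores = []
    for h in humans:
        s = model.predicted_desirability(h, X)           # step 2)
        s *= model.evaluative_reliability_weight(h, X)   # step 3): discount drunk/sick/etc.
        s = model.reflectively_updated_score(h, X, s)    # step 4): would they change their mind?
        scores.append(s)
    avg = sum(scores) / len(scores)                      # step 5): fair averaging
    total = model.integrate_over_population_and_time(X, avg)  # step 6)
    # step 7): pessimize over Knightian error bars to dodge Goodhart's Law.
    return total - model.knightian_error_bars(X)

def choose_plan(candidate_states, humans, model):
    best = max(candidate_states,
               key=lambda X: evaluate_state(X, humans, model))  # step 8): argmax
    return model.most_reliable_plan_to(best)                    # step 9)
```

Step 10) is, as tradition demands, left to the gnomes.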
That's a fascinating idea. Using the human brain voxel maps as guidance would presumably also be possible for text as they did for images, and seems like it would help us assess how human-like the ontology and internal workings of a model are and to what extent the natural abstractions hypothesis is true, at least for LLMs.
Combining and comparing this to VAEs might also be very illuminating.
Alternatively, for less costly to acquire guidance than the human brain, how about picking a (large) reference model and attempting to use (smaller) models to predict its activations across layers at some granularity?
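As a crude first pass at that, one could fit a linear map from the smaller model's activations to the reference model's activations at a chosen layer, and use held-out prediction error as a similarity-of-representation score (all names here are illustrative, not from any existing tool):

```python
import numpy as np

def fit_activation_probe(small_acts, large_acts):
    """Fit W minimizing ||small_acts @ W - large_acts||^2.
    small_acts: (n_samples, d_small); large_acts: (n_samples, d_large)."""
    W, *_ = np.linalg.lstsq(small_acts, large_acts, rcond=None)
    return W

def probe_error(W, small_acts, large_acts):
    """Mean squared error of the linear prediction (lower = more linearly
    recoverable the large model's representation is from the small one's)."""
    pred = small_acts @ W
    return float(np.mean((pred - large_acts) ** 2))
```

Nonlinear probes (or comparing across layers) would be the obvious next step, but even the linear version gives a cheap granular signal of where the two models' representations line up.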
and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space)
However, interpretable concepts do seem to tend to be fairly well localized in VAE-space, and shards are likely to be concentrated where the concepts they are relevant to are found.
I suspect the reason for hiding the chain of thought is some blend of:
a) the model might swear or otherwise do a bad thing, hold a discussion with itself, and then decide it shouldn't have done that bad thing, and they're more confident that they can avoid the bad thing getting into the summary than that they can backtrack and figure out exactly which part of the CoT needs to be redacted, and
b) they don't want other people (especially open-source fine-tuners) to be able to fine-tune on their CoT and distill their very-expensive-to-train reasoning traces
I will be interested to see how fast jailbreakers make headway on exposing either a) or b)