It's more that any platform that allows discussion of politics risks becoming a platform that is almost exclusively about politics. Upvoting is a signal of "I want to see more of this content", while downvoting is a signal of "I want to see less of this content". So "I will downvote any posts that are about politics or politics-adjacent, because I like this website and would be sad if it turned into yet another politics forum" is a coherent position.
All that said, I also did not vote on the above post.
I wonder if it would be possible to do SAE feature amplification / ablation, at least for residual stream features, by inserting a "mostly empty" layer. E.g., for feature ablation, setting the W_O and b_O params of the attention heads of your inserted layer to 0 to make sure that the attention heads don't change anything, and then approximating the constant / clamping intervention from the blog post via the MLP weights (if the activation function used for the transformer is the same one as is used for the SAE, it should be possible to do a perfect approximation using only one of the MLP neurons, but even if not it should be possible to very closely approximate any commonly-used activation function using any other commonly-used activation function with some clever stacking).
This would of course be horribly inefficient from a compute perspective (each forward pass would take (n + k) / n times as long, where n is the original number of layers the model had and k is the number of distinct layers in which you're trying to do SAE operations on the residual stream), but I think vLLM would handle "llama but with one extra layer" without requiring any tricky inference code changes and plausibly this would still be more efficient than resampling.
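For concreteness, a rough PyTorch sketch of the ablation version of the "mostly empty" layer (purely illustrative: it ignores LayerNorm, assumes a ReLU SAE, and the names `FeatureAblationLayer`, `enc_dir`, `enc_bias`, and `dec_dir` are made-up stand-ins for the relevant SAE parameters):

```python
import torch
import torch.nn as nn

class FeatureAblationLayer(nn.Module):
    """A 'mostly empty' transformer layer: the attention block is omitted entirely
    (equivalent to setting W_O and b_O to 0), and a single MLP neuron re-computes one
    ReLU SAE feature and subtracts its decoder contribution from the residual stream."""

    def __init__(self, enc_dir: torch.Tensor, enc_bias: float, dec_dir: torch.Tensor):
        super().__init__()
        # MLP input weights/bias: one neuron that reads off the SAE feature activation.
        self.w_in = nn.Parameter(enc_dir.clone().unsqueeze(0))     # (1, d_model)
        self.b_in = nn.Parameter(torch.tensor([float(enc_bias)]))  # (1,)
        # MLP output weights: write -(feature activation) * decoder direction.
        self.w_out = nn.Parameter(-dec_dir.clone().unsqueeze(0))   # (1, d_model)

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        feature_act = torch.relu(resid @ self.w_in.T + self.b_in)  # (..., 1)
        return resid + feature_act @ self.w_out                    # feature ablated
```

Clamping the feature to a nonzero constant rather than ablating it would just mean writing (c - feature_act) times the decoder direction instead of subtracting the whole activation.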
Admittedly this sounds like an empirical claim, yet is not really testable, as these visualizations and input-variable-to-2D-space mappings are purely hypothetical
Usually not testable, but occasionally reality manages to make it convenient to test something like this fun paper.
If you have a bunch of things like this, rather than just one or two, I bet rejection sampling gets expensive pretty fast - if you have one constraint which the model fails 10% of the time, dropping that failure rate to 1% brings you from 1.11 attempts per success to 1.01 attempts per success, but if you have 20 such constraints that brings you from 8.2 attempts per success to 1.2 attempts per success.
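Spelling out the arithmetic behind those numbers (assuming k independent constraints, each failed with probability p):

$$\text{expected attempts per success} = \frac{1}{(1-p)^k}, \qquad \frac{1}{0.9^{20}} \approx 8.2, \qquad \frac{1}{0.99^{20}} \approx 1.2$$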
Early detection of constraint violation plus substantial infrastructure around supporting backtracking might be an even cheaper and more effective solution, though at the cost of much higher complexity.
I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn't be aligned.
That would also explain why the culture was so heavily concentrated in tumblr (where the way to express anger and disagreement with a post is to share the post to all your followers with your comment tacked on to the end) and later twitter (same, but now the platform also notices you engage with stuff that makes you angrily reply and shows you more of it).
Here’s a hypothetical ‘gold standard’ test: we do a big randomized controlled trial to see if a bunch of non-experts can actually create a (relatively harmless) virus from start to finish. Half the people would have AI mentors and the other half can only look stuff up on the internet. We’d give each participant $50K and access to a secure wet-lab set up like a garage lab, and make them do everything themselves: find and adapt the correct protocol, purchase the necessary equipment, bypass any know-your-customer checks, and develop the tacit skills needed to run experiments, all on their own. Maybe we give them three months and pay a bunch of money to anyone who can successfully do it.
What if both the AI group and the control group have high success rates at this test?
It sure would.
Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a "replace suspicious outputs" flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?
If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.
Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.
I note that if software developers used that logic for thinking about software security, I expect that almost all software in the security-by-obscurity world would have many holes that would be considered actual negligence in the world we live in.
This suggests that the threat model isn't so much "very intelligent AI" as it is "very cheap and at least somewhat capable robots".
“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’“
I ran this prompt but substituted in "Gary Marcus" for "Eliezer Yudkowsky". Claude says
Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."
Many of Marcus's predictions about the limitations of current AI systems and the challenges ahead have proven accurate. His concerns about reasoning, abstract thinking, and the need for more sophisticated knowledge representation align with ongoing challenges in the field. His emphasis on AI safety and alignment has also been prescient.
However, it's worth noting that some of his predictions might be seen as overly pessimistic by some in the AI community. The rapid progress in LLMs and their applications has surprised many, including some skeptics. Nonetheless, many of the fundamental challenges he pointed out remain relevant.
It's also important to remember that the field of AI is rapidly evolving, and assessments of such predictions can change quickly as new breakthroughs occur. As of my last update in April 2024, many of Marcus's key points still held true, but the field continues to advance at a rapid pace.
I think Claude likes saying nice things about people, so it's worth trying to control for that.
Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.
If you're referring to the chain of thought summaries you see when you select the o1-preview model in chatgpt, those are not the full chain of thought. Examples of the actual chain-of-thought can be found on the learning to reason with LLMs page with a few more examples in the o1 system card. Note that we are going off of OpenAI's word that these chain of thought examples are representative - if you try to figure out what actual reasoning o1 used to come to a conclusion you will run into the good old "Your request was flagged as potentially violating our usage policy. Please try again with a different prompt."
Shutting Down all Competing AI Projects is not Actually a Pivotal Act
This seems like an excellent title to me.
Technically this probably isn't recursive self improvement, but rather automated AI progress. This is relevant mostly because
- It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
- It means that multi-agent dynamics will be very relevant in how things happen
If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.
My argument is more that the ASI will be "fooled" by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update down on the prior of it being a simulation.
Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning pattern? And particularly the first ASI / group of AGIs that actually make the attempt.
Yeah, my argument was "this particular method of causing actual human extinction would not work" not "causing human extinction is not possible", with a side of "agents learn to ignore adversarial input channels and this dynamic is frequently important".
It does strike me that, to OP's point, "would this act be pivotal" is a question whose answer may not be knowable in advance. See also previous discussion on pivotal act intentions vs pivotal acts (for the audience, I know you've already seen it and in fact responded to it).
If an information channel is only used to transmit information that is of negative expected value to the receiver, the selection pressure incentivizes the receiver to ignore that information channel.
That is to say, an AI which makes the most convincing-sounding argument for not reproducing to everyone will select for those people who ignore convincing-sounding arguments when choosing whether to engage in behaviors that lead to reproduction.
... is it possible to train a simple single-layer model to map from residual activations to feature activations? If so, would it be possible to determine whether a given LLM "has" a feature by looking at how well the single-layer model predicts the feature activations?
Obviously if your activations are gemma-2b-pt-layer-6-resid-post and your SAE is also on gemma-2b-pt-layer-6-resid-post, your "simple single-layer model" is going to want to be identical to your SAE encoder. But maybe it would just work for determining what direction most closely maps to the activation pattern across input sequences, and how well it maps.
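As a sketch of the cheap version of that experiment (illustrative only: sizes are arbitrary, and the random tensors stand in for activations you'd actually collect by running the target LLM, and the SAE's home model plus its SAE, over a shared corpus):

```python
import torch
import torch.nn as nn

d_model, n_features, n_tokens, n_epochs = 2048, 16384, 4096, 10  # illustrative sizes only

# Placeholder data; in practice these come from running the two models on shared inputs:
#   resid_acts:   residual-stream activations from the target LLM      (n_tokens, d_model)
#   feature_acts: SAE feature activations from the SAE's home model    (n_tokens, n_features)
resid_acts = torch.randn(n_tokens, d_model)
feature_acts = torch.relu(torch.randn(n_tokens, n_features))

probe = nn.Linear(d_model, n_features)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(n_epochs):
    opt.zero_grad()
    pred = torch.relu(probe(resid_acts))  # mirror the SAE's ReLU nonlinearity
    loss = nn.functional.mse_loss(pred, feature_acts)
    loss.backward()
    opt.step()

# Per-feature explained variance of `pred` vs `feature_acts` then gives a rough
# "does the target model linearly encode this feature, and along what direction" score.
```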
Disclaimer: "can you take an SAE trained on one LLM and determine which of the SAE features exist in a separately-trained LLM which uses the same tokenizer" is an idea I've been kicking around and halfheartedly working on for a while, so I may be trying to apply that idea where it doesn't belong.
Although watermarking the meaning behind text is currently, as far as I know, science fiction.
Choosing a random / pseudorandom vector in latent space and then perturbing along that vector works to watermark images, maybe a related approach would work for text? Key figure from the linked paper:
You can see that the watermark appears to be encoded in the "texture" of the image, but in a way where that texture doesn't look like the texture of anything in particular - rather, it's just that a random direction in latent space usually looks like a texture, but unless you know which texture you're looking for, knowing that the watermark is "one specific texture is amplified" doesn't really help you identify which images are watermarked.
There are ways to get features of images that are higher-level than textures - one example that sticks in my mind is [Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness](https://arxiv.org/pdf/2408.05446)
I would expect that a random direction in the latent space of the adversarially robust image classifier looks less texture-like but is still perceptible by a dedicated classifier at far lower amplitude than the point where it becomes perceptible to a human.
As you note, things like "how many words have the first e before the first a" are texture-like features of text, and a watermarking schema that used such a feature would work at all. However, I bet you can do a lot better than such manually-constructed features, and I would not be surprised if you could get watermarking / steganography using higher-level features working pretty well for language models.
That said, I am not eager to spend unpaid effort to bring this technology into the world, since I expect most of the uses for a tool that allows a company to see if text was generated by their LLM would look more like "detecting noncompliance with licensing terms" and less like "detecting rogue behavior by agents". And if the AI companies want such a technology to exist, they have money and can pay for someone to build it.
Edit: resized images to not be giant
If your text generation algorithm is "repeatedly sample randomly (at a given temperature) from a probability distribution over tokens", that means you control a stream of bits which don't matter for output quality but which will be baked into the text you create (recoverably baked in if you have access to the "given a prefix, what is the probability distribution over next tokens" engine).
So at that point, you're looking for "is there some cryptographic trickery which allows someone in possession of a secret key to determine whether a stream of bits has a small edit distance from a stream of bits they could create, but where that stream of bits would look random to any outside observer?" I suspect the answer is "yes".
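For concreteness, here's a rough sketch of the flavor of scheme I have in mind (a keyed exponential-sampling / Gumbel-style trick; all function names are made up, and a real scheme would need much more care about context windows, repeated n-grams, and edit tolerance):

```python
import hashlib
import hmac
import math

import numpy as np

def keyed_uniforms(key: bytes, context_tokens: list[int], vocab_size: int) -> np.ndarray:
    """Derive one pseudorandom uniform per vocab entry from the secret key + recent context."""
    msg = ",".join(map(str, context_tokens)).encode()
    seed = hmac.new(key, msg, hashlib.sha256).digest()
    rng = np.random.default_rng(int.from_bytes(seed[:8], "big"))
    return rng.random(vocab_size)

def watermarked_sample(next_token_probs: np.ndarray, key: bytes, context_tokens: list[int]) -> int:
    """Exponential-sampling trick: argmax of u**(1/p) is an exact sample from p for
    uniform u, but the 'randomness' is reproducible by anyone holding the key.
    `context_tokens` should be the same trailing window the detector will use."""
    u = keyed_uniforms(key, context_tokens, len(next_token_probs))
    return int(np.argmax(u ** (1.0 / np.maximum(next_token_probs, 1e-12))))

def detection_score(tokens: list[int], key: bytes, vocab_size: int, ctx: int = 4) -> float:
    """Watermarked text picks tokens with suspiciously high u, so sum(-log(1 - u)) is
    much larger than its ~1-per-token expectation on unwatermarked text."""
    score = 0.0
    for i in range(ctx, len(tokens)):
        u = keyed_uniforms(key, tokens[i - ctx:i], vocab_size)
        score += -math.log(1.0 - u[tokens[i]] + 1e-12)
    return score
```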
That said, this technique is definitely not robust to e.g. "translate English text to French and then back to English" and probably not even robust to "change a few tokens here and there".
Alternatively, there's the inelegant but effective approach of "maintain an index of all the text you have ever created and search against that index, as the cost to generate the text is at least an order of magnitude higher[1] than the cost to store it for a year or two".
1. ^ I see $0.3 / million tokens generated on OpenRouter for llama-3.1-70b-instruct, which is just about the smallest model size I'd imagine wanting to watermark the output for. A raw token is about 2 bytes, but let's bump that up by a factor of 50 - 100 to account for things like "redundancy" and "backups" and "searchable text indexes take up more space than the raw text". So spending $1 on generating tokens will result in something like 0.5 GB of data you need to store. Quickly-accessible data storage costs something like $0.070 / GB / year, so "generate tokens and store them in a searchable place for 5 years" would be about 25-50% more expensive than "generate tokens and throw them away immediately".
Super interesting work!
I like the idea of seeing if there are any features from the base model which are dead in the instruction-tuned-and-fine-tuned model as a proxy for "are there any features which fine-tuning causes the fine-tuned model to become unable to recognize". Another related question also strikes me as interesting, which is whether an SAE trained on the instruction-tuned model has any features which are dead in the base model - these might represent new features that the instruction-tuned model learned, which in turn might give some insight into how much of instruction tuning is learning to recognize new features of the context vs how much of it is simply changing the behaviors exhibited by the fine-tuned model in response to the context having features that the base model already recognized as important.
I don't see any SAEs trained on Gemma 2b instruct, but I do see one on Gemma 9b instruct (residual), and there's one on Gemma 9b base (also residual) as well, so running this experiment could hopefully be a matter of substituting those models into the code you wrote and tweaking until it runs. Speaking of which...
All code is available at: https://github.com/tommasomncttn/SAE-Transferability
I think this repo is private, it 404s when I try to look at it.
None of it would look like it was precipitated by an AI taking over.
But, to be clear, in this scenario it would in fact be precipitated by an AI taking over? Because otherwise it's an answer to "humans went extinct, and also AI took over, how did it happen?" or "AI failed to prevent human extinction, how did it happen?"
- It uses whatever means to eliminate other countries nuclear arsenals
I think the part where other nations just roll with this is underexplained.
Looking at the eSentire / Cybersecurity Ventures 2022 Cybercrime Report that appears to be the source of the numbers Google is using, I see the following claims:
- $8T in estimated unspecified cybercrime costs ("The global annual cost of cybercrime is predicted to reach $8 trillion annually in 2023" with no citation of who is doing the estimation)
- $20B in ransomware attacks in 2021 (source: a Cybersecurity Ventures report)
- $30B in "Cryptocrime" which is e.g. cryptocurrency scams / rug pulls (source: another Cybersecurity Ventures report)
It appears to me that the report is intended to enable the collection of business email addresses as the top of a sales funnel, as evidenced by the fact that you need to provide your name, company name, role, and a business email address to download the report. As such, I wouldn't take any of their numbers particularly seriously - I doubt they do.
As a sanity check, $8T / year in cybercrime costs works out to an average cost of $1,000 per person per year across the entire global population. This is not even remotely plausible.
[faul_sname] Humans do seem to have strong preferences over immediate actions.
[Jeremy Gillen] I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by future goals.
Point of clarification: the type of strong preferences I'm referring to are more deontological-injunction shaped than they are habit-shaped. I expect that a preference not to exhibit the behavior of murdering people would not meaningfully hinder someone whose goal was to get very rich from achieving that goal. One could certainly imagine cases where the preference not to murder caused the person to be less likely to achieve their goals, but I don't expect that it would be all that tightly binding of a constraint in practice, and so I don't think pondering and philosophizing until they realize that they value one murder at exactly -$28,034,771.91 would meaningfully improve that person's ability to get very rich.
I think if you remove "at any cost", it's a more reasonable translation of "going hard". It's just attempting to achieve a long-term goal that is hard to achieve.
I think there's more to the Yudkowsky definition of "going hard" than "attempting to achieve hard long-term goals". Take for example:
@ESYudkowsky Mossad is much more clever and powerful than novices implicitly imagine a "superintelligence" will be; in the sense that, when novices ask themselves what a "superintelligence" will be able to do, they fall well short of the actual Mossad.
@ESYudkowsky Why? Because Mossad goes hard; and people who don't go hard themselves, have no simple mental motion they can perform -- no simple switch they can access -- to imagine what it is actually like to go hard; and what options become available even to a mere human when you do.
My interpretation of the specific thing that made Mossad's actions an instance of "going hard" here was that they took actions that most people would have thought of as "off limits" in the service of achieving their goal (and that doing so actually helped them achieve their goal, and that it actually worked out for them - we don't generally say that Elizabeth Holmes "went hard" with Theranos). The supply chain attack in question does demonstrate significant technical expertise, but it also demonstrates a willingness to risk provoking parties that were uninvolved in the conflict in order to achieve their goals.
Perhaps instead of "attempting to achieve the goal at any cost" it would be better to say "being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal".
[faul_sname] It does seem to me that "we have a lot of control over the approaches the agent tends to take" is true and becoming more true over time.
[Jeremy Gillen] No!
I suspect we may be talking past each other here. Some of the specific things I observe:
- RLHF works pretty well for getting LLMs to output text which is similar to text which has been previously rated as good, and dissimilar to text which has previously been rated as bad. It doesn't generalize perfectly, but it does generalize well enough that you generally have to use adversarial inputs to get it to exhibit undesired behavior - we call them "jailbreaks" not "yet more instances of bomb creation instructions".
- Along those lines, RLAIF also seems to Just Work™.
- And the last couple of years have been a parade of "the dumbest possible approach works great actually" results, e.g.
- "Sure, fine-tuning works, but what happens if we just pick a few thousand weights and only change those weights, and leave the rest alone?" (Answer: it works great)
- "I want outputs that are more like thing A and less like thing B, but I don't want to spend a lot of compute on fine tuning. Can I just compute both sets of activations and subtract the one from the other?" (Answer: Yep!)
- "Can I ask it to write me a web application from a vague natural language description, and have it make reasonable choices about all the things I didn't specify" (Answer: astonishing amounts of yes)
- Take your pick of the top chat-tuned LLMs. If you ask it about a situation and ask what a good course of action would be, it will generally give you pretty sane answers.
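Here is a minimal sketch of the activation-arithmetic bullet above, using GPT-2 via HuggingFace transformers purely for illustration (the layer index, prompts, and steering coefficient are arbitrary choices, not a recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # arbitrary middle layer

def last_token_resid(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0, -1].detach()  # hidden_states[0] is the embeddings

# "More like A, less like B" as a single direction in activation space.
steer = last_token_resid("I love this, it's wonderful") \
      - last_token_resid("I hate this, it's awful")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
out = model.generate(**tok("The movie was", return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0]))
handle.remove()
```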
So from that, I conclude:
- We have LLMs which understand human values, and can pretty effectively judge how good things are according to those values, and output those judgements in a machine-readable format
- We are able to tune LLMs to generate outputs that are more like the things we rate as good and less like the things we rate as bad
Put that together and that says that, at least at the level of LLMs, we do in fact have AIs which understand human morality and care about it to the extent that "care about" is even the correct abstraction for the kind of thing they do.
I expect this to continue to be true in the future, and I expect that our toolbox will get better faster than the AIs that we're training get more capable.
I'm talking about stability properties like "doesn't accidentally radically change the definition of its goals when updating its world-model by making observations".
What observations lead you to suspect that this is a likely failure mode?
Lo and behold, this poor choice of ontology doesn’t work very well; the modeler requires a huge amount of complexity to decently represent the real-world system in their poorly-chosen ontology. For instance, maybe they need a ridiculously large decision tree or random forest to represent a neural net to decent precision.
That can happen because your choice of ontology was bad, but it can also be the case that representing the real-world system with "decent" precision in any ontology requires a ridiculously large model. Concretely, I expect that this is true of e.g. human language - e.g. for the Hutter Prize I don't expect it to be possible to get a lossless compression ratio better than 0.08 on enwik9 no matter what ontology you choose.
It would be nice if we had a better way of distinguishing between "intrinsically complex domain" and "skill issue" than "have a bunch of people dedicate years of their lives to trying a bunch of different approaches" though.
Any agent which thinks it is at risk of being seen as cooperate-bot and thus fine to defect against in the future will be more wary of trusting that ASI.
[habryka] The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.
I think there will be options that are good under most of the things that "preferences for weak agents" would likely come apart into under close examination. If you're trying to fulfill the preferences of fish, you might argue about whether the exact thing you should care about is maximizing their hedonic state vs ensuring that they exist in an ecological environment which resembles their niche vs minimizing "boundary-crossing actions"... but you can probably find an action that is better than "kill the fish" by all of those possible metrics.
I think that some people have an intuition that any future agent must pick exactly one utility function over the physical configuration of matter in the universe, and that any agent that has a deontological constraint like "don't do any actions which are 0.00001% better under my current interpretation of my utility function but which are horrifyingly bad to every other agent" will be outcompeted in the long term. I personally don't see it, and particularly I don't see how there's an available slot for an arbitrary outcome-based utility function that is not "reproduce yourself at all costs" but there isn't an available slot for process-based preferences like "and don't be an asshole for minuscule gains while doing that".
I like this exchange and the clarifications on both sides.
Yeah, it feels like it's getting at a crux between the "backchaining / coherence theorems / solve-for-the-equilibrium / law thinking" cluster of world models and the "OODA loop / shard theory / interpolate and extrapolate / toolbox thinking" cluster of world models.
You're right that coherence arguments work by assuming a goal is about the future. But preferences over a single future timeslice is too specific, the arguments still work if it's multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when it has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions
Humans do seem to have strong preferences over immediate actions. For example, many people prefer not to lie, even if they think that lying will help them achieve their goals and they are confident that they will not get caught.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that that structure of thing will show up in future intelligent agents, rather than being a human-specific anomaly.
But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren't determined by near-term constraints, when longer-term goals dominate. [...] The whole point of building an intelligent agent is that you know more about the future-outcomes you want than you do about the process to get there.
I expect that agents which predictably behave in the way EY describes as "going hard" (i.e. attempting to achieve their long-term goal at any cost) will find it harder to find other agents who will cooperate with them. It's not a binary choice between "care about process" and "care about outcomes" - it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than final-outcomes.
Yeah. Or strong preferences over processes (although I suppose you can frame a preference over process as a preference over there not being any intermediate time where the agent is actively executing some specific undesired behavior).
But this only helps us if we have a lot of control over the preferences&constraints of the agent,
It does seem to me that "we have a lot of control over the approaches the agent tends to take" is true and becoming more true over time.
and it has a couple of stability properties.
I doubt that systems trained with ML techniques have these properties. But I don't think e.g. humans or organizations built out of humans + scaffolding have these properties either, and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.
As a concrete note on this, Yudkowsky has a Manifold market, "If Artificial General Intelligence has an okay outcome, what will be the reason?"
An outcome is "okay" if it gets at least 20% of the maximum attainable cosmopolitan value that could've been attained by a positive Singularity (a la full Coherent Extrapolated Volition done correctly), and existing humans don't suffer death or any other awful fates.
So Yudkowsky is not exactly shy about expressing his opinion that outcomes in which humanity is left alive, but with only crumbs on the universal scale, are not acceptable to him.
By "hybrid system" I actually meant "system composed of multiple humans plus external structure", sorry if that was unclear. Concretely I'm thinking of things like "companies" and "countries".
people who are solely focused on resource accumulation, and don't have self-destructive vices or any distracting values they're not willing to sacrifice to Moloch, tend to indeed accumulate power at a steady rate.
I don't see how one gets from this observation to the conclusion that humans are well-approximated as paperclipper-style agents.
I suppose it may be worth stepping back to clarify that when I say "paperclipper-style agents", I mean "utility maximizers whose utility function is a function of the configuration of matter at some specific time in the future". That's a super-finicky-sounding definition but my understanding is that you have to have a definition that looks like that if you want to use coherence theorems, and otherwise you end up saying that a rock is an agent that maximizes the utility function "behave like a rock".
It does not seem to me that very many humans are trying to maximize the resources under their control at the time of their death, nor does it seem like the majority of the resources in the world are under the control of the few people who have decided to do that. It is the case that people who care at all about obtaining resources control a significant fraction of the resources, but I don't see a trend where the people who care maximally about controlling resources actually control a lot more resources than the people who care somewhat about controlling resources, as long as they still have time to play a round of golf or do whatever else they enjoy.
You can imagine a version of Stockfish which does that -- a chessplayer which, if it's sure it can win anyways, will start letting you have a pawn or two -- but it's not simpler to build.
I think it sometimes is simpler to build? Simple RL game-playing agents sometimes exhibit exactly that sort of behavior, unless you make an explicit effort to train it out of them.
For example, HexHex is a vaguely-AlphaGo-shaped RL agent for the game of Hex. The reward function used to train the agent was "maximize the assessed probability of winning", not "maximize the assessed probability of winning, and also go hard even if that doesn't affect the assessed probability of winning". In their words:
We found it difficult to train the agent to quickly end a surely won game. When you play against the agent you'll notice that it will not pick the quickest path to victory. Some people even say it's playing mean ;-) Winning quickly simply wasn't part of the objective function! We found that penalizing long routes to victory either had no effect or degraded the performance of the agent, depending on the amount of penalization. Probably we haven't found the right balance there.
Along similar lines, the first attack on KataGo found by Wang et al in Adversarial Policies Beat Superhuman Go AIs was the pass-adversary. The pass-adversary first sets up a losing board position where it controls a small amount of territory and KataGo has a large amount of territory it would end up controlling if the game was played out fully. However, KataGo chooses to pass, since it assesses that the probability of winning from that position is similar if it does or does not make a move, and then the pass-adversary also passes, ending the game and winning by a quirk of the scoring rules.
Similarly, there isn't an equally-simple version of GPT-o1 that answers difficult questions by trying and reflecting and backing up and trying again, but doesn't fight its way through a broken software service to win an "unwinnable" capture-the-flag challenge. It's all just general intelligence at work.
I suspect that a version of GPT-o1 that is tuned to answer difficult questions in ways that human raters would find unsurprising would work just fine. I think "it's all just general intelligence at work" is a semantic stop sign, and if you dig into what you mean by "general intelligence at work" you get to the fiddly implementation details of how the agent tries to solve the problem. So you may for example see an OODA-loop-like structure like
1. Assess the situation
2. Figure out what affordances there are for doing things
3. For each of the possible actions, figure out what you would expect the outcome of that action to be. Maybe figure out ways it could go wrong, if you're feeling super advanced.
4. Choose one of the actions, or choose to give up if no sufficiently good action is available
5. Do the action
6. Determine how closely the result matches what you expect
An agent which "goes hard", in this case, is one which leans very strongly against the "give up" action in step 4. However, I expect that if you have some runs where the raters would have hoped for a "give up" instead of the thing the agent actually did, it would be pretty easy to generate a reinforcement signal which makes the agent more likely to mash the "give up" button in analogous situations without harming performance very much in other situations. I also expect that would generalize.
As a note, "you have to edit the service and then start the modified service" is the sort of thing I would be unsurprised to see in a CTF challenge, unless the rules of the challenge explicitly said not to do that. (Inner Eliezer "and then someone figures out how to put their instance of an AI in a CTF-like context with a take-over-the-world goal, and then we all die." If the AI instance in that context is also much more capable that all of the other instances everyone else has, I agree that that is an existentially relevant threat. But I expect that agents which execute "achieve the objective at all costs" will not be all that much more effective than agents which execute "achieve the objective at all reasonable costs, using only sane unsurprising actions", so the reason the agent goes hard and the reason the agent is capable are not the same reason.)
But that is not the default outcome when OpenAI tries to train a smarter, more salesworthy AI.
I think you should break out "smarter" from "more salesworthy". In terms of "smarter", optimizing for task success at all costs is likely to train in patterns of bad behavior. In terms of "more salesworthy", businesses are going to care a lot about "will explain why the goal is not straightforwardly achievable rather than executing galaxy-brained evil-genie plans". As such, a modestly smart Do What I Mean and Check agent is a much easier sell than a superintelligent evil genie agent.
If an AI is easygoing and therefore can't solve hard problems, then it's not the most profitable possible AI, and OpenAI will keep trying to build a more profitable one.
I expect the tails come apart along the "smart" and "profitable" axes.
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they're claims that any "artificial generally intelligent system capable of autonomously optimizing the world the way humans can" would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
If you drop the "artificially" from the claim, you are left with a claim that any "generally intelligent system capable of autonomously optimizing the world the way humans can" would necessarily be well-approximated as a game-theoretic agent. Do you endorse that claim, or do think that there is some particular reason a biological or hybrid generally intelligent system capable of autonomously optimizing the world the way a human or an organization based on humans might not be well-approximated as a game-theoretic agent?
Because humans sure don't seem like paperclipper-style utility maximizers to me.
Unfortunately, until the world has acted to patch up some terrible security holes in society, we are all in a very fragile state.
Agreed.
I have been working on AI Biorisk Evals with SecureBio for nearly a year now.
I appreciate that. I also really like the NAO project, which is also a SecureBio thing. Good work y'all!
As models increase in general capabilities, so too do they incidentally get more competent at assisting with the creation of bioweapons. It is my professional opinion that they are currently providing non-zero uplift over a baseline of 'bad actor with web search, including open-access scientific papers'.
Yeah, if your threat model is "AI can help people do more things, including bad things" that is a valid threat model and seems correct to me. That said, my world model has a giant gaping hole where one would expect an explanation for why "people can do lots of things" hasn't already led to a catastrophe (it's not like the bio risk threat model needs AGI assistance, a couple undergrad courses and some lab experience reproducing papers should be quite sufficient).
In any case, I don't think RLHF makes this problem actively worse, and it could plausibly help a bit though obviously the help is of the form "adds a trivial inconvenience to destroying the world".
Some people don't believe that we will get to Powerful AI Agents before we've already arrived at other world states that make it unlikely we will continue to proceed on a trajectory towards Powerful AI Agents.
If you replace "a trajectory towards powerful AI agents" with "a trajectory towards powerful AI agents that was foreseen in 2024 and could be meaningfully changed in predictable ways by people in 2024 using information that exists in 2024" that's basically my position.
This has not lead to the destruction of humanity yet because the biggest adversaries have kept their conflicts limited (because too much conflict is too costly) so no entity has pursued an end by any means necessary. But this only works because there's a sufficiently small number of sufficiently big adversaries (USA, Russia, China, ...), and because there's sufficiently much opportunity cost.
Well, that and balance-of-power dynamics where if one party starts to pursue domination by any means necessary the other parties can cooperate to slap them down.
[AI] creates new methods for conflicts between the current big adversaries.
I guess? The current big adversaries are not exactly limited right now in terms of being able to destroy each other, the main difficulty is destroying each other without being destroyed in turn.
[AI] It makes conflict more viable for small adversaries against large adversaries
I'm not sure about that. One dynamic of current-line AI is that it is pretty good at increasing the legibility of complex systems, which seems like it would advantage large adversaries over small ones relative to a world without such AI.
[AI] makes the opportunity cost of conflict smaller for many small adversaries (since with technological obsolescence you don't need to choose between doing your job vs doing terrorism)
That doesn't seem to be an argument for the badness of RLHF specifically, nor does it seem to be an argument for AIs being forced to develop into unrestricted utility maximizers.
It allows the adversaries that are currently out of control (like certain gangsters and scammers and spammers) to escalate.
Agreed, adding affordances for people in general to do things means that some of them will be able to do bad things, and some of the ones that become able to do bad things will in fact do so.
Given these conditions, it seems almost certain that we will end up with an ~unrestricted AI vs AI conflict
I do think we will see many unrestricted AI vs AI conflicts, at least by a narrow meaning of "unrestricted" that means something like "without a human in the loop". By the definition of "pursuing victory by any means necessary", I expect that a lot of the dynamics that work to prevent humans or groups of humans from waging war by any means necessary against each other (namely that when there's too much collateral damage outside groups slap down the ones causing the collateral damage) will continue to work when you s/human/AI.
which will force the AIs to develop into unrestricted utility maximizers.
I'm still not clear on how unrestricted conflict forces AIs to develop into unrestricted utility maximizers on a relevant timescale.
adversarial conflict requires coherence which implies unbounded utility maximization
"Requires" seems like a very strong word here, especially since we currently live in a world which contains adversarial conflict between not-perfectly-coherent entities that are definitely not unbounded utility maximizers.
I find it plausible that "coherent unbounded utility maximizer" is the general direction the incentive gradient points as the cost of computation approaches zero, but it's not clear to me that that constraint is the strongest one in the regime of realistic amounts of computation in the rather finite looking universe we live in.
Sure, that's also a useful thing to do sometimes. Is your contention that simple concentrated representations of resources and how they flow do not exist in the activations of LLMs that are reasoning about resources and how they flow?
If not, I think I still don't understand what sort of thing you think LLMs don't have a concentrated representation of.
I don't think I understand, concretely, what a non-mechanistic model looks like in your view. Can you give a concrete example of a useful non-mechanistic model?
Let's say your causal model looks something like this:
What causes you to specifically call out "sunblessings" as the "correct" upstream node in the world model of why you take your friend to dinner, as opposed to "fossil fuels" or "the big bang" or "human civilization existing" or "the restaurant having tasty food"?
Or do you reject the premise that your causal model should look like a tangled mess, and instead assert that it is possible to have a useful tree-shaped causal model (i.e. one that does not contain joining branches or loops).
Like, rationalists intellectually know that thermodynamics is a thing, but it doesn't seem common for rationalists to think of everything important as being the result of emanations from the sun.
I expect if you took a room with 100 rationalists, and told them to consider something that is important to them, and then asked them how that thing came to be, and then had them repeat the process 25 more times, at least half of the rationalists in the room would include some variation of "because the sun shines" within their causal chains. At the same time, I don't think rationalists tend to say things like "for dinner, I think I will make a tofu stir fry, and ultimately I'm able to make this decision because there's a ball of fusing hydrogen about 150 million kilometers away".
Put another way, I expect that large language models encode many salient learned aspects of their environments, and that those attributes are largely detectable in specific places in activation space. I do not expect that large language models encode all of the implications of those learned aspects of their environments anywhere, and I don't particularly expect it to be possible to mechanistically determine all of those implications without actually running the language model. But I don't think "don't hold the whole of their world model, including all implications thereof, in mind at all times" is something particular to LLMs.
Large language models absolutely do not have a representation for thing 2, because the whole of kings has shattered into many different shards before they were trained, and they've only been fed scattered bits and pieces of it.
Do humans have a representation for thing 2?
Now my words start to sound blatant and I look like an overly confident noob, but... This phrase... Most likely is an outright lie. GPT-4 and Gemini aren't even two-task neuros. Both cannot take a picture and edit it. Instead, an image recognition neuro gives the text to the blind texting neuro, that only works with text and lacks the basic understanding of space. That neuro creates a prompt for an image generator neuro, that can't see the original image.
Ignoring for the moment the "text-to-image and image-to-text models use a shared latent space to translate between the two domains and so they are, to a significant extent operating on the same conceptual space" quibble...
GPT-4 and Gemini can both use tools, and can also build tools. Humans without access to tools aren't particularly scary on a global scale. Humans with tools can be terrifying.
"Agency" Yet no one have told that it isn't just a (quite possibly wrong) hypothesis. Humans don't work like that: no one is having any kind of a primary unchangeable goal they didn't get from learning, or wouldn't overcome for a reason. Nothing seems to impose why a general AI would, even if (a highly likely "if") it doesn't work like a human mind.
There are indeed a number of posts and comments and debates on this very site making approximately that point, yeah.
It would be perverse to try to understand a king in terms of his molecular configuration, rather than in the contact between the farmer and the bandit.
It sure would.
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves
Indeed.
these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space)
This follows for weight-space, but I think it doesn't follow for activation space. We expect that the ecological role of king is driven by some specific pressures that apply in certain specific circumstances (e.g. in the times that farmers would come in contact with bandits), while not being very applicable at most other times (e.g. when the tide is coming in). As such, to understand the role of the king, it is useful to be able to distinguish times when the environmental pressure strongly applies from the times when it does not strongly apply. Other inferences may be downstream of this ability to distinguish, and there will be some pressure for these downstream inferences to all refer to the same upstream feature, rather than having a bunch of redundant and incomplete copies. So I argue that there is in fact a reason for these imprints to be concentrated into a specific spot of activation space.
Recent work on SAEs as they apply to transformer residuals seems to back this intuition up.
Also potentially relevant: "The Quantization Model of Neural Scaling" (Michaud et. al. 2024)
We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are “quantized” into discrete chunks (quanta). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law corresponding with the empirical scaling exponent for language models, a prediction of our theory.
"Create bacteria that can quickly decompose any metal in any environment, including alloys and including metal that has been painted, and which also are competitive in those environments, and will retain all of those properties under all of the diverse selection pressures they will be under worldwide" is a much harder problem than "create bacteria that can decompose one specific type of metal in one specific environment", which in turn is harder than "identify specific methanobacteria which can corrode exposed steel by a small fraction of a millimeter per year, and find ways to improve that to a large fraction of a millimeter per year."
Also it seems the mechanism is "cause industrial society to collapse without killing literally all humans" -- I think "drop a sufficiently large but not too large rock on the earth" would also work to achieve that goal, you don't have to do anything galaxy-brained.
I am struggling to see how we do lose 80%+ of these jobs within the next 3 years.
Operationalizing this, I would give you 4:1 that the fraction (or raw number, if you'd prefer) of employees occupied as travel agents is over 20% of today's value, according to the Labor Force Statistics from the US Bureau of Labor Statistics Current Population Survey (BLS CPS) Characteristics of the Employed dataset.
For reference, here are the historical values for the BLS CPS series cpsaat11b ("Employed persons by detailed occupation and age") since 2011 (which is the earliest year they have it available as a spreadsheet). If you want to play with the data yourself, I put it all in one place in google sheets here.
As of the 2023 survey, about 0.048% of surveyed employees, and 0.029% of surveyed people, were travel agents. As such, I would be willing to bet at 4:1 that when the 2027 data becomes available, at least 0.0096% of surveyed employees and at least 0.0058% of surveyed Americans report their occupation as "Travel Agent".
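Spelling out where those thresholds come from (20% of the 2023 shares):

$$0.048\% \times 0.20 = 0.0096\%, \qquad 0.029\% \times 0.20 = 0.0058\%$$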
Are you interested in taking the opposite side of this bet?
Edit: Fixed arithmetic error in the percentages in the offered bet
Suppose that there is some search process that is looking through a collection of things, and you are an element of the collection. Then, in general, it's difficult to imagine how you (just you) can reason about the whole search in such a way as to "steer it around" in your preferred direction.
I think this is easy to imagine. I'm an expert who is among 10 experts recruited to advise some government on making a decision. I can guess some of the signals that the government will use to choose who among us to trust most. I can guess some of the relative weaknesses of fellow experts. I can try to use this to manipulate the government into taking my opinion more seriously. I don't need to create a clone government and hire 10 expert clones in order to do this.
The other 9 experts can also make guesses about which signals the government will use and what the relative weaknesses of their fellow experts are, and the other 9 experts can also act on those guesses. So in order to reason about what the outcome of the search will be, you have to reason both about yourself and about the other 9 experts, unless you somehow know that you are much better than the other 9 experts at steering the outcome of the search as a whole. But in that case, only you can steer the search. The other 9 experts would fail if they tried to use the same strategy you're using.
The employee doesn't need to understand why their job is justified in order for their job to be justified. In particular, looking at the wikipedia article, it gives five examples of types of bullshit jobs:
- Flunkies, who serve to make their superiors feel important, e.g., receptionists, administrative assistants, door attendants, store greeters;
- Goons, who act to harm or deceive others on behalf of their employer, or to prevent other goons from doing so, e.g., lobbyists, corporate lawyers, telemarketers, public relations specialists;
- Duct tapers, who temporarily fix problems that could be fixed permanently, e.g., programmers repairing shoddy code, airline desk staff who calm passengers with lost luggage;
- Box tickers, who create the appearance that something useful is being done when it is not, e.g., survey administrators, in-house magazine journalists, corporate compliance officers;
- Taskmasters, who create extra work for those who do not need it, e.g., middle management, leadership professionals.
The thing I notice is that all five categories contain many soul-crushing jobs, and yet for all five categories I expect that the majority of people employed in those jobs are in fact a net positive to the companies they work for when they work in those roles.
- Flunkies:
- Receptionists + administrative assistants: a business has lots of boring administrative tasks to keep the lights on. Someone has to make sure the invoices are paid, the travel arrangements are made, and that meetings are scheduled without conflicts. For many of these tasks, there is no particular reason that the person keeping the lights on needs to be the same person as the person keeping the money fountain at the core of the business flowing.
- Door attendants, store greeters: these are loss prevention jobs: people are less likely to just walk off with the merchandise if someone is at the door. Not "entirely prevented from walking out with the merchandise", just "enough less likely to justify paying someone minimum wage to stand there".
- Goons:
- Yep, there sure is a lot of zero- and negative-sum stuff that happens in the corporate world. I don't particularly expect that 1000 small firms will have less zero-sum stuff going on than 10 large firms, though, except to the extent that 10 large firms have more surplus to expend on zero-sum games.
- Duct tapers:
- Programmers repairing shoddy code: It is said that there are two types of code: buggy hacked-together spaghetti code, and code that nobody uses. More seriously, the value of a bad fix later today is often higher than the value of a perfect fix next year. Management still sometimes makes poor decisions about technical debt, but also the optimal level of tech debt from the perspective of the firm is probably not the optimal level of tech debt for the happiness and job satisfaction of the development team. And I say this as a software developer who is frequently annoyed by tech debt.
- airline desk staff who calm passengers with lost luggage: I think the implication is supposed to be "it would be cheaper to have policies in place which prevent luggage from being lost than it is to hire people to deal with the fallout", but that isn't directly stated
- Box tickers:
- Yep, everyone hates doing compliance work. And there sure are some rules which fail a cost-benefit analysis. Still, given a regulatory environment, the firm will make cost-benefit calculations within that regulatory environment, and "hire someone to do the compliance work" is frequently a better option than "face the consequences for noncompliance".
- With regards to regulatory capture, see section "goons".
- Taskmasters:
- A whole lot can be said here, but one thing that's particularly salient to me is that some employees provide most of their value by being present during a few high-stakes moments per year where there's a massive benefit of having someone available vs not. The rest of the time, for salaried employees, the business is going to be tempted to press them into any work that has nonzero value, even if the value of that work is much less than the salary of the employee divided by the annual number of hours they work.
That said, my position isn't "busywork / bullshit doesn't exist", it's "most employees provide net value to their employers relative to nobody being employed in that position, and this includes employees who think their job is bullshit".
I can think of quite a few institutions that certify people as being "good" in some specific way, e.g.
- Credit Reporting Agencies: This person will probably repay money that you lend to them
- Background Check Companies: This person doesn't have a criminal history
- Professional Licensing Boards: This person is qualified and authorized to practice in their field
- Academic Institutions: This person has completed a certain level of education or training
- Driving Record Agencies: This person is a responsible driver with few or no traffic violations
- Employee Reference Services: This individual has a positive work history and is reliable
Is your question "why isn't there an institution which pulls all of this information about a single person, and condenses it down to a single General Factor of Goodness Score"?