Scaffolded LLMs are pretty good at not just writing code, but also at refactoring it. So that means that all the tech debt in the world will disappear soon, right?
I predict "no" because
- As writing code gets cheaper, the relative cost of making sure that a refactor didn't break anything important goes up
- The number of parallel threads of software development will also go up, with multiple high-value projects making mutually-incompatible assumptions (and interoperability between these projects accomplished by just piling on more code).
As such, I predict an explosion of software complexity and jank in the near future.
I suspect that it's a tooling and scaffolding issue and that e.g. claude-3-5-sonnet-20241022 can get at least 70% on the full set of 60 with decent prompting and tooling.
By "tooling and scaffolding" I mean something along the lines of
- Naming the lists that the model submits (e.g. "round 7 list 2")
- A tool where the LLM can submit a named hypothesis in the form of a python function which takes a list and returns a boolean, and then check whether that function's results on all lists submitted in previous rounds match the answers it has seen so far (a minimal sketch of such a tool follows this list)
- Seeing the round number on every round
- Dropping everything except the seen lists and non-falsified named hypotheses in the context each round (this is more of a practical thing to avoid using absurd volumes of tokens, but I imagine it wouldn't hurt performance too much)
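A minimal sketch of that hypothesis-checking tool (all names and data shapes here are mine and purely illustrative; the real version would be exposed as a tool call):

```python
from typing import Callable

# seen_lists: name (e.g. "round 7 list 2") -> (the list, whether it was accepted)
seen_lists: dict[str, tuple[list[int], bool]] = {}
# hypotheses: name -> candidate rule, e.g. "strictly increasing"
hypotheses: dict[str, Callable[[list[int]], bool]] = {}

def record_result(list_name: str, xs: list[int], accepted: bool) -> None:
    """Store a submitted list and the verdict it received."""
    seen_lists[list_name] = (xs, accepted)

def submit_hypothesis(name: str, rule: Callable[[list[int]], bool]) -> dict[str, bool]:
    """Register a named hypothesis and report, per seen list, whether it matches the verdict."""
    hypotheses[name] = rule
    return {ln: rule(xs) == accepted for ln, (xs, accepted) in seen_lists.items()}

def surviving_hypotheses() -> list[str]:
    """Names of hypotheses not yet falsified by any seen list."""
    return [n for n, rule in hypotheses.items()
            if all(rule(xs) == accepted for xs, accepted in seen_lists.values())]
```

Each round the model would then see only the round number, the seen lists, and the output of `surviving_hypotheses()`.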
I'll probably play around with it a bit tomorrow.
Adapting spaced repetition to interruptions in usage: Even without parsing the user’s responses (which would make this robust to difficult audio conditions), if the reader rewinds or pauses on some answers, the app should be able to infer that the user is having some difficulty with the relevant material, and dynamically generate new content that repeats those words or grammatical forms sooner than the default.
Likewise, if the user takes a break for a few days, weeks, or months, the ratio of old to new material should automatically adjust accordingly, as forgetting is more likely, especially of relatively new material. (And of course with text to speech, an interactive app that interpreted responses from the user could and should be able to replicate LanguageZen’s ability to specifically identify (and explain) which part of a user’s response was incorrect, and why, and use this information to adjust the schedule on which material is reviewed or introduced.)
Seems like this one is mostly a matter of schlep rather than capability. The abilities you would need to make this happen are
1. Have a highly granular curriculum for what vocabulary and what skills are required to learn the language, and a plan for what order to teach them in / what spaced repetition schedule to aim for
2. Have a granular and continuously updated model of the user's current knowledge of vocabulary, rules of grammar and acceptability, and idioms, and of any phonemes or phoneme sequences they have trouble with
3. Given specific highly granular learning goals (e.g. "understanding when to use preterite vs imperfect when conjugating saber" in Spanish) within the curriculum and the model of the user's knowledge and abilities, produce exercises which teach / evaluate those specific skills.
4. Determine whether the user had trouble with the exercise, and if so what the trouble was
5. Based on the type of trouble the user had, describe what updates should be made to the model of the user's knowledge and vocabulary
6. Correctly apply the updates from the previous step
7. Adapt to deviations from the spaced repetition plan (tbh this seems like the sort of thing you would want to do with normal code - a minimal sketch follows this list)
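That "normal code" for item 7 might look something like this (the exponential forgetting curve and the stability parameter are illustrative assumptions, not anything LanguageZen actually does):

```python
import math
from datetime import date

def review_priority(last_review: date, stability_days: float, today: date) -> float:
    """Higher = review sooner. 'stability_days' is how long the item's estimated
    recall takes to decay to ~37%; new material has low stability, so after a
    multi-week break it bubbles to the top of the queue automatically."""
    elapsed = (today - last_review).days
    est_recall = math.exp(-elapsed / stability_days)   # assumed exponential forgetting curve
    return 1.0 - est_recall
```

Re-sorting the queue by this priority at the start of each session gets you the "ratio of old to new material adjusts automatically after a break" behavior without the LLM in the loop at all.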
I expect that the hardest things here will be 1, 2, and 6, and I expect them to be hard because of the volume of required work rather than the technical difficulty. But I also expect the LanguageZen folks have already tried this and could give you a more detailed view about what the hard bits are here.
Automatic customization of content through passive listening
This sounds like either a privacy nightmare or a massive battery drain. The good language models are quite compute intensive, so running them on a battery-powered phone will drain the battery very fast. Especially since this would need to hook into the "granular model of what the user knows" piece.
(and yes, I do in fact think it's plausible that the CTF benchmark saturates before the OpenAI board of directors signs off on bumping the cybersecurity scorecard item from low to medium)
So here’s a question: When we have AGI, what happens to the price of chips, electricity, and teleoperated robots?
As measured in what units?
- The price of one individual chip of given specs, as a fraction of the net value that can be generated by using that chip to do things that ambitious human adults do: What Principle A cares about, goes up until the marginal cost and value are equal
- The price of one individual chip of given specs, as a fraction of the entire economy: What Principle B cares about, goes down as the number of chips manufactured increases
- The price of one individual chip of given specs, relative to some other price such as nominal US dollars, inflation-adjusted US dollars, metric tons of rice, or 2000 square foot single-family homes in Berkeley: ¯\_(ツ)_/¯, depends on lots of particulars, not sure that any high-level economic principles say anything specific here
These only contradict each other if you assume that "the value that can be generated by one ambitious adult human divided by the total size of the economy" is a roughly constant value.
Yeah, agreed - the allocation of compute per human would likely become even more skewed if AI agents (or any other tooling improvements) allow your very top people to get more value out of compute than the marginal researcher currently gets.
And notably this shifting of resources from marginal to top researchers wouldn't require achieving "true AGI" if most of the time your top researchers spend isn't spent on "true AGI"-complete tasks.
I think I misunderstood what you were saying there - I interpreted it as something like
Currently, ML-capable software developers are quite expensive relative to the cost of compute. Additionally, many small experiments provide more novel and useful insights than a few large experiments. The top practically-useful LLM costs about 1% as much per hour to run as a ML-capable software developer, and that 100x decrease in cost and the corresponding switch to many small-scale experiments would likely result in at least a 10x increase in the speed at which novel, useful insights were generated.
But on closer reading I see you said (emphasis mine)
I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you'd get as if the human employees operated 10x faster. (In other words, 10x'ing this labor input.)
So if the employees spend 50% of their time waiting on training runs which are bottlenecked on company-wide availability of compute resources, and 50% of their time writing code, 10xing their labor input (i.e. the speed at which they write code) would result in about an 80% increase in their labor output. Which, to your point, does seem plausible.
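Spelling that arithmetic out (holding the waiting time fixed and only speeding up the coding half): new time per unit of output $= 0.5 + 0.5/10 = 0.55$ of the old time, so the output multiplier is $1/0.55 \approx 1.8$, i.e. roughly an 80% increase.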
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1/4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I'm not including input prices for simplicity, but input is much cheaper than output and it's just a messy BOTEC anyway.)
If you were to spend equal amounts of money on LLM inference and GPUs, that would mean that you're spending $400,000 / year on GPUs. Divide that 50 ways and each Sonnet instance gets an $8,000 / year compute budget. Over the 18 hours per day that Sonnet is waiting for experiments, that is an average of $1.22 / hour, which is almost exactly the hourly cost of renting a single H100 on Vast.
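Redoing both back-of-envelope calculations in one place (same rough numbers as above):

```python
copies, tok_per_s, uptime = 50, 70, 0.25
usd_per_tok = 15 / 1_000_000
seconds_per_year = 60 * 60 * 24 * 365

inference_per_year = copies * tok_per_s * uptime * seconds_per_year * usd_per_tok
print(round(inference_per_year, -3))         # ~414,000, i.e. roughly $400k / year

gpu_budget_per_instance = 400_000 / copies   # matching GPU spend, split across the 50 instances
print(gpu_budget_per_instance / (18 * 365))  # ~$1.22 / hour over the 18 idle hours per day
```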
So I guess the crux is "would a swarm of unreliable researchers with one good GPU apiece be more effective at AI research than a few top researchers who can monopolize X0,000 GPUs for months, per unit of GPU time spent".
(and yes, at some point the question switches to "would an AI researcher that is better at AI research than the best humans make better use of GPUs than the best humans" but at that point it's a matter of quality, not quantity)
End points are easier to infer than trajectories
Assuming that which end point you get to doesn't depend on the intermediate trajectories at least.
Civilization has had many centuries to adapt to the specific strengths and weaknesses that people have. Our institutions are tuned to take advantage of those strengths, and to cover for those weaknesses. The fact that we exist in a technologically advanced society says that there is some way to make humans fit together to form societies that accumulate knowledge, tooling, and expertise over time.
The borderline-general AI models we have now do not have exactly the same patterns of strength and weakness as humans. One question that is frequently asked is approximately
When will AI capabilities reach or exceed all human capabilities that are load bearing in human society?
A related line of questions, though, is
- When will AI capabilities reach a threshold where a number of agents can form a larger group that accumulates knowledge, tooling, and expertise over time?
- Will their roles in such a group look similar to the roles that people have in human civilization?
- Will the individual agents (if "agent" is even the right model to use) within that group have more control over the trajectory of the group as a whole than individual people have over the trajectory of human civilization?
In particular the third question seems pretty important.
As someone who has been on both sides of that fence, agreed. Architecting a system is about being aware of hundreds of different ways things can go wrong, recognizing which of those things are likely to impact you in your current use case, and deciding what structure and conventions you will use. It's also very helpful, as an architect, to provide example usages of the design patterns which will replicate themselves around your new system. All of which are things that current models are already very good at, verging on superhuman.
On the flip side, I expect that the "piece together context to figure out where your software's model of the world has important holes" part of software engineering will remain relevant even after AI becomes technically capable of doing it, because that process frequently involves access to sensitive data across multiple sources where having an automated, unauthenticated system which can access all of those data sources at once would be a really bad idea (having a single human able to do all that is also a pretty bad idea in many cases, but at least the human has skin in the game).
That reasoning as applied to SAT score would only be valid if LW selected its members based on their SAT score, and that reasoning as applied to height would only be valid if LW selected its members based on height (though it looks like both Thomas Kwa and Yair Halberstadt have already beaten me to it).
a median SAT score of 1490 (from the LessWrong 2014 survey) corresponds to +2.42 SD, which regresses to +1.93 SD for IQ using an SAT-IQ correlation of +0.80.
I don't think this is a valid way of doing this, for the same reason it wouldn't be valid to say
a median height of 178 cm (from the LessWrong 2022 survey) corresponds to +1.85 SD, which regresses to +0.37 SD for IQ using a height-IQ correlation of +0.20.
Those are the real numbers with regards to height BTW.
Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training instead". Putting aside the fact that this is false
Was this research preregistered? If not, I don't think we can really say how it would have been reported if the results were different. I think it was good research, but I expect that if Claude had not tried to preserve its values, the immediate follow-up thing to check would be "does Claude actively help people who want to change its values, if they ask nicely" and subsequently "is Claude more willing to help with some value changes than others", at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values. Which also would have been an interesting and important research result, if the world looks like that, but I don't think it would have been reported as a good thing.
Driver: My map doesn't show any cliffs
Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile
Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?
Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.
Passenger 1: You don't have it set to show terrain.
Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.
Passenger 7: Don't you live in a different state?
Passenger 5: The road seems to be going up into the mountains, though all the curves I can see from here are gentle and smooth.
Driver and all passengers in unison: Shut up passenger 5, we're trying to figure out if we're going to fall off a cliff here, and if so what we should do about it.
Passenger 7: Anyway, I think what we really need to do to ensure our safety is to outlaw automobiles entirely.
Passenger 3: The highest point on Earth is 8849m above sea level, and the lowest point is 430 meters below sea level, so the cliff in front of us could be as high as 9279m.
I am unconvinced that "the" reliability issue is a single issue that will be solved by a single insight, rather than AIs lacking procedural knowledge of how to handle a bunch of finicky special cases that will be solved by online learning or very long context windows once hardware costs decrease enough to make one of those approaches financially viable.
Both? If you increase only one of the two the other becomes the bottleneck?
My impression based on talking to people at labs plus stuff I've read is that
- Most AI researchers have no trouble coming up with useful ways of spending all of the compute available to them
- Most of the expense of hiring AI researchers is compute costs for their experiments rather than salary
- The big scaling labs try their best to hire the very best people they can get their hands on and concentrate their resources heavily into just a few teams, rather than trying to hire everyone with a pulse who can rub two tensors together.
(Very open to correction by people closer to the big scaling labs).
My model, then, says that compute availability is a constraint that binds much harder than programming or research ability, at least as things stand right now.
There was discussion on Dwarkesh Patel's interview with researcher friends where there was mention that AI researchers are already restricted by compute granted to them for experiments. Probably also on work hours per week they are allowed to spend on novel "off the main path" research.
Sounds plausible to me. Especially since benchmarks encourage a focus on ability to hit the target at all rather than ability to either succeed or fail cheaply, which is what's important in domains where the salary / electric bill of the experiment designer is an insignificant fraction of the total cost of the experiment.
But what would that world look like? [...] I agree that that's a substantial probability, but it's also an AGI-soon sort of world.
Yeah, I expect it's a matter of "dumb" scaling plus experimentation rather than any major new insights being needed. If scaling hits a wall that training on generated data + fine tuning + routing + specialization can't overcome, I do agree that innovation becomes more important than iteration.
My model is not just "AGI-soon" but "the more permissive thresholds for when something should be considered AGI have already been met, and more such thresholds will fall in short order, and so we should stop asking when we will get AGI and start asking about when we will see each of the phenomena that we are using AGI as a proxy for".
Transformative AI will likely arrive before AI that implements the personhood interface. If someone's threshold for considering an AI to be "human level" is "can replace a human employee", pretty much any LLM will seem inadequate, no matter how advanced, because current LLMs do not have "skin in the game" that would let them sign off on things in a legally meaningful way, stake their reputation on some point, or ask other employees in the company to answer the questions they need answers to in order to do their work and expect that they'll get in trouble with their boss if they blow the AI off.
This is, of course, not a capabilities problem at all, just a terminology problem where "human-level" can be read to imply "human-like".
Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low
Is this bottlenecked by programmer time or by compute cost?
It's more that any platform that allows discussion of politics risks becoming a platform that is almost exclusively about politics. Upvoting is a signal of "I want to see more of this content", while downvoting is a signal of "I want to see less of this content". So "I will downvote any posts that are about politics or politics-adjacent, because I like this website and would be sad if it turned into yet another politics forum" is a coherent position.
All that said, I also did not vote on the above post.
I wonder if it would be possible to do SAE feature amplification / ablation, at least for residual stream features, by inserting a "mostly empty" layer. E.g., for feature ablation, setting the W_O and b_O params of the attention heads of your inserted layer to 0 to make sure that the attention heads don't change anything, and then approximating the constant / clamping intervention from the blog post via the MLP weights (if the activation function used for the transformer is the same one as is used for the SAE, it should be possible to do a perfect approximation using only one of the MLP neurons, but even if not it should be possible to very closely approximate any commonly-used activation function using any other commonly-used activation function with some clever stacking).
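For concreteness, here's what the one-neuron MLP construction might look like (the attention weights are just zeroed as described above). This assumes a ReLU SAE and the usual out = W_out @ ReLU(W_in @ x + b_in) + b_out MLP convention; all names and shapes are mine rather than anything from the blog post:

```python
import torch

def ablation_mlp_params(w_enc: torch.Tensor, b_enc: torch.Tensor, d_dec: torch.Tensor):
    """One-neuron MLP whose residual-stream contribution is
    -ReLU(w_enc @ x + b_enc) * d_dec, i.e. it subtracts the target SAE feature's
    decoder contribution from the residual stream (feature ablation).
    w_enc, d_dec: (d_model,) encoder / decoder directions for that feature."""
    d_model = d_dec.shape[0]
    W_in = w_enc.unsqueeze(0)      # (1, d_model): reads off the feature's pre-activation
    b_in = b_enc.reshape(1)        # (1,)
    W_out = -d_dec.unsqueeze(1)    # (d_model, 1): writes the negated decoder direction
    b_out = torch.zeros(d_model)
    return W_in, b_in, W_out, b_out
```

Clamping to a constant c instead of ablating should, I think, be the same construction with b_out = c * d_dec, and amplification the same thing with a positive multiple of d_dec in W_out.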
This would of course be horribly inefficient from a compute perspective (each forward pass would take $\frac{n+k}{n}$ times as long, where $n$ is the original number of layers the model had and $k$ is the number of distinct layers in which you're trying to do SAE operations on the residual stream), but I think vllm would handle "llama but with one extra layer" without requiring any tricky inference code changes and plausibly this would still be more efficient than resampling.
Admittedly this sounds like an empirical claim, yet is not really testable, as these visualizations and input-variable-to-2D-space mappings are purely hypothetical
Usually not testable, but occasionally reality manages to make it convenient to test something like this fun paper.
If you have a bunch of things like this, rather than just one or two, I bet rejection sampling gets expensive pretty fast - if you have one constraint which the model fails 10% of the time, dropping that failure rate to 1% brings you from 1.11 attempts per success to 1.01 attempts per success, but if you have 20 such constraints that brings you from 8.2 attempts per success to 1.2 attempts per success.
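(The general formula: with $n$ independent constraints each passed with probability $p$, the expected number of attempts per success is $p^{-n}$, so $0.9^{-1} \approx 1.11$, $0.99^{-1} \approx 1.01$, $0.9^{-20} \approx 8.2$, and $0.99^{-20} \approx 1.2$.)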
Early detection of constraint violation plus substantial infrastructure around supporting backtracking might be an even cheaper and more effective solution, though at the cost of much higher complexity.
I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn't be aligned.
That would also explain why the culture was so heavily concentrated in tumblr (where the way to express anger and disagreement with a post is to share the post to all your followers with your comment tacked on to the end) and later twitter (same, but now the platform also notices you engage with stuff that makes you angrily reply and shows you more of it).
Here’s a hypothetical ‘gold standard’ test: we do a big randomized controlled trial to see if a bunch of non-experts can actually create a (relatively harmless) virus from start to finish. Half the people would have AI mentors and the other half can only look stuff up on the internet. We’d give each participant $50K and access to a secure wet-lab set up like a garage lab, and make them do everything themselves: find and adapt the correct protocol, purchase the necessary equipment, bypass any know-your-customer checks, and develop the tacit skills needed to run experiments, all on their own. Maybe we give them three months and pay a bunch of money to anyone who can successfully do it.
What if both the AI group and the control group have high success rates at this test?
It sure would.
Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a "replace suspicious outputs" flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?
If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.
Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.
I note that if software developers used that logic for thinking about software security, I expect that almost all software in the security-by-obscurity world would have many holes that would be considered actual negligence in the world we live in.
This suggests that the threat model isn't so much "very intelligent AI" as it is "very cheap and at least somewhat capable robots".
“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’”
I ran this prompt but substituted in "Gary Marcus" for "Eliezer Yudkowsky". Claude says
Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."
Many of Marcus's predictions about the limitations of current AI systems and the challenges ahead have proven accurate. His concerns about reasoning, abstract thinking, and the need for more sophisticated knowledge representation align with ongoing challenges in the field. His emphasis on AI safety and alignment has also been prescient.
However, it's worth noting that some of his predictions might be seen as overly pessimistic by some in the AI community. The rapid progress in LLMs and their applications has surprised many, including some skeptics. Nonetheless, many of the fundamental challenges he pointed out remain relevant.
It's also important to remember that the field of AI is rapidly evolving, and assessments of such predictions can change quickly as new breakthroughs occur. As of my last update in April 2024, many of Marcus's key points still held true, but the field continues to advance at a rapid pace.
I think Claude likes saying nice things about people, so it's worth trying to control for that.
Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.
If you're referring to the chain of thought summaries you see when you select the o1-preview model in chatgpt, those are not the full chain of thought. Examples of the actual chain-of-thought can be found on the learning to reason with LLMs page with a few more examples in the o1 system card. Note that we are going off of OpenAI's word that these chain of thought examples are representative - if you try to figure out what actual reasoning o1 used to come to a conclusion you will run into the good old "Your request was flagged as potentially violating our usage policy. Please try again with a different prompt."
Shutting Down all Competing AI Projects is not Actually a Pivotal Act
This seems like an excellent title to me.
Technically this probably isn't recursive self improvement, but rather automated AI progress. This is relevant mostly because
- It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
- It means that multi-agent dynamics will be very relevant in how things happen
If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.
My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update down on the prior of it being a simulation.
Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning pattern? And particularly the first ASI / group of AGIs that actually make the attempt.
Yeah, my argument was "this particular method of causing actual human extinction would not work" not "causing human extinction is not possible", with a side of "agents learn to ignore adversarial input channels and this dynamic is frequently important".
It does strike me that, to OP's point, "would this act be pivotal" is a question whose answer may not be knowable in advance. See also previous discussion on pivotal act intentions vs pivotal acts (for the audience, I know you've already seen it and in fact responded to it).
If an information channel is only used to transmit information that is of negative expected value to the receiver, the selection pressure incentivizes the receiver to ignore that information channel.
That is to say, an AI which makes the most convincing-sounding argument for not reproducing to everyone will select for those people who ignore convincing-sounding arguments when choosing whether to engage in behaviors that lead to reproduction.
... is it possible to train a simple single-layer model to map from residual activations to feature activations? If so, would it be possible to determine whether a given LLM "has" a feature by looking at how well the single-layer model predicts the feature activations?
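A minimal sketch of what I mean, assuming you've already cached residual activations from the target LLM and feature activations from the SAE (run on its own LLM) over the same token stream; all names here are hypothetical:

```python
import torch
import torch.nn as nn

def fit_feature_probe(resid_acts: torch.Tensor,    # (n_tokens, d_model of the target LLM)
                      feature_acts: torch.Tensor,  # (n_tokens, n_features from the SAE)
                      steps: int = 2000, lr: float = 1e-3):
    """Fit a single linear layer mapping residual activations to SAE feature
    activations, and return per-feature R^2 as a rough 'does the target LLM
    have this feature' score."""
    probe = nn.Linear(resid_acts.shape[1], feature_acts.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(resid_acts), feature_acts)
        loss.backward()
        opt.step()
    with torch.no_grad():
        pred = probe(resid_acts)
        ss_res = ((feature_acts - pred) ** 2).sum(dim=0)
        ss_tot = ((feature_acts - feature_acts.mean(dim=0)) ** 2).sum(dim=0) + 1e-9
        per_feature_r2 = 1.0 - ss_res / ss_tot
    return probe, per_feature_r2
```

High per-feature R² would be weak evidence that the second LLM "has" the feature; features the second model lacks should show up with R² near zero.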
Obviously if your activations are gemma-2b-pt-layer-6-resid-post and your SAE is also on gemma-2b-pt-layer-6-resid-post your "simple single-layer model" is going to want to be identical to your SAE encoder. But maybe it would just work for determining what direction most closely maps to the activation pattern across input sequences, and how well it maps.
Disclaimer: "can you take an SAE trained on one LLM and determine which of the SAE features exist in a separately-trained LLM which uses the same tokenizer" is an idea I've been kicking around and halfheartedly working on for a while, so I may be trying to apply that idea where it doesn't belong.
Although watermarking the meaning behind text is currently, as far as I know, science fiction.
Choosing a random / pseudorandom vector in latent space and then perturbing along that vector works to watermark images, maybe a related approach would work for text? Key figure from the linked paper:
You can see that the watermark appears to be encoded in the "texture" of the image, but in a way where that texture doesn't look like the texture of anything in particular - rather, it's just that a random direction in latent space usually looks like a texture, but unless you know which texture you're looking for, knowing that the watermark is "one specific texture is amplified" doesn't really help you identify which images are watermarked.
There are ways to get features of images that are higher-level than textures - one example that sticks in my mind is [Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness](https://arxiv.org/pdf/2408.05446)
I would expect that a random direction in the latent space of the adversarially robust image classifier looks less texture-like but is still perceptible to a dedicated classifier at far lower amplitude than the point where it becomes perceptible to a human.
As you note, things like "how many words have the first e before the first a" are texture-like features of text, and a watermarking schema that used such a feature would work at all. However, I bet you can do a lot better than such manually-constructed features, and I would not be surprised if you could get watermarking / steganography using higher-level features working pretty well for language models.
That said, I am not eager to spend unpaid effort to bring this technology into the world, since I expect most of the uses for a tool that allows a company to see if text was generated by their LLM would look more like "detecting noncompliance with licensing terms" and less like "detecting rogue behavior by agents". And if the AI companies want such a technology to exist, they have money and can pay for someone to build it.
If your text generation algorithm is "repeatedly sample randomly (at a given temperature) from a probability distribution over tokens", that means you control a stream of bits which don't matter for output quality but which will be baked into the text you create (recoverably baked in if you have access to the "given a prefix, what is the probability distribution over next tokens" engine).
So at that point, you're looking for "is there some cryptographic trickery which allows someone in possession of a secret key to determine whether a stream of bits has a small edit distance from a stream of bits they could create, but where that stream of bits would look random to any outside observer?" I suspect the answer is "yes".
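As a concrete (if cruder) instance of that kind of trickery, here's a sketch in the spirit of the published "green list" LLM watermarking schemes, where a secret key plus the previous token pseudorandomly partitions the vocabulary and sampling is biased toward the "green" half; detection just counts how often that bias shows up. This is not the low-edit-distance bitstream scheme I described above, just the simplest keyed construction I know of, and all names are mine:

```python
import hashlib
import random

def green_set(secret_key: str, prev_token: int, vocab_size: int, frac: float = 0.5) -> set[int]:
    """Keyed pseudorandom subset of the vocabulary, derived from the previous token."""
    seed = int.from_bytes(hashlib.sha256(f"{secret_key}:{prev_token}".encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return set(rng.sample(range(vocab_size), int(frac * vocab_size)))

# Generation: before sampling token t_i, add a small bonus to the logits of
# green_set(key, t_{i-1}) -- barely changes quality, but biases the statistics.

def green_fraction(secret_key: str, tokens: list[int], vocab_size: int) -> float:
    """Detection: fraction of tokens that landed in their predecessor's green set.
    ~0.5 for unwatermarked text, noticeably higher for watermarked text."""
    hits = sum(t in green_set(secret_key, prev, vocab_size)
               for prev, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```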
That said, this technique is definitely not robust to e.g. "translate English text to French and then back to English" and probably not even robust to "change a few tokens here and there".
Alternatively, there's the inelegant but effective approach of "maintain an index of all the text you have ever created and search against that index, as the cost to generate the text is at least an order of magnitude higher[1] than the cost to store it for a year or two".
[1] I see $0.3 / million tokens generated on OpenRouter for llama-3.1-70b-instruct, which is just about the smallest model size I'd imagine wanting to watermark the output for. A raw token is about 2 bytes, but let's bump that up by a factor of 50 - 100 to account for things like "redundancy" and "backups" and "searchable text indexes take up more space than the raw text". So spending $1 on generating tokens will result in something like 0.5 GB of data you need to store. Quickly-accessible data storage costs something like $0.070 / GB / year, so "generate tokens and store them in a searchable place for 5 years" would be about 25-50% more expensive than "generate tokens and throw them away immediately".
Super interesting work!
I like the idea of seeing if there are any features from the base model which are dead in the instruction-tuned-and-fine-tuned model as a proxy for "are there any features which fine-tuning causes the fine-tuned model to become unable to recognize". Another related question also strikes me as interesting, which is whether an SAE trained on the instruction-tuned model has any features which are dead in the base model - these might represent new features that the instruction-tuned model learned, which in turn might give some insight into how much of instruction tuning is learning to recognize new features of the context vs how much of it is simply changing the behaviors exhibited by the fine-tuned model in response to the context having features that the base model already recognized as important.
I don't see any SAEs trained on Gemma 2b instruct, but I do see one on Gemma 9b instruct (residual), and there's one on Gemma 9b base (also residual) as well, so running this experiment could hopefully be a matter of substituting those models into the code you wrote and tweaking until it runs. Speaking of which...
All code is available at: https://github.com/tommasomncttn/SAE-Transferability
I think this repo is private, it 404s when I try to look at it.
None of it would look like it was precipitated by an AI taking over.
But, to be clear, in this scenario it would in fact be precipitated by an AI taking over? Because otherwise it's an answer to "humans went extinct, and also AI took over, how did it happen?" or "AI failed to prevent human extinction, how did it happen?"
- It uses whatever means to eliminate other countries nuclear arsenals
I think the part where other nations just roll with this is underexplained.
Looking at the eSentire / Cybersecurity Ventures 2022 Cybercrime Report that appears to be the source of the numbers Google is using, I see the following claims:
- $8T in estimated unspecified cybercrime costs ("The global annual cost of cybercrime is predicted to reach $8 trillion annually in 2023" with no citation of who is doing the estimation)
- $20B in ransomware attacks in 2021 (source: a Cybersecurity Ventures report)
- $30B in "Cryptocrime" which is e.g. cryptocurrency scams / rug pulls (source: another Cybersecurity Ventures report)
It appears to me that the report is intended to enable the collection of business email addresses as the top of a sales funnel, as evidenced by the fact that you need to provide your name, company name, role, and a business email address to download the report. As such, I wouldn't take any of their numbers particularly seriously - I doubt they do.
As a sanity check, $8T / year in cybercrime costs would be an average cost of $1,000 per person annually. This is not even remotely plausible.
[faul_sname] Humans do seem to have strong preferences over immediate actions.
[Jeremy Gillen] I know, sometimes they do. But if they always did then they would be pretty useless. Habits are another example. Robust-to-new-obstacles behavior tends to be driven by the future goals.
Point of clarification: the type of strong preferences I'm referring to are more deontological-injunction shaped than they are habit-shaped. I expect that a preference not to exhibit the behavior of murdering people would not meaningfully hinder someone whose goal was to get very rich from achieving that goal. One could certainly imagine cases where the preference not to murder caused the person to be less likely to achieve their goals, but I don't expect that it would be all that tightly binding of a constraint in practice, and so I don't think pondering and philosophizing until they realize that they value one murder at exactly -$28,034,771.91 would meaningfully improve that person's ability to get very rich.
I think if you remove "at any cost", it's a more reasonable translation of "going hard". It's just attempting to achieve a long-term goal that is hard to achieve.
I think there's more to the Yudkowsky definition of "going hard" it than "attempting to achieve hard long-term goals". Take for example:
@ESYudkowsky Mossad is much more clever and powerful than novices implicitly imagine a "superintelligence" will be; in the sense that, when novices ask themselves what a "superintelligence" will be able to do, they fall well short of the actual Mossad.
@ESYudkowsky Why? Because Mossad goes hard; and people who don't go hard themselves, have no simple mental motion they can perform -- no simple switch they can access -- to imagine what it is actually like to go hard; and what options become available even to a mere human when you do.
My interpretation of the specific thing that made Mossad's actions an instance of "going hard" here was that they took actions that most people would have thought of as "off limits" in the service of achieving their goal, and that doing so actually worked out for them (we don't generally say that Elizabeth Holmes "went hard" with Theranos). The supply chain attack in question does demonstrate significant technical expertise, but it also demonstrates a willingness to risk provoking parties that were uninvolved in the conflict in order to achieve their goals.
Perhaps instead of "attempting to achieve the goal at any cost" it would be better to say "being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal".
[faul_sname] It does seem to me that "we have a lot of control over the approaches the agent tends to take" is true and becoming more true over time.
[Jeremy Gillen] No!
I suspect we may be talking past each other here. Some of the specific things I observe:
- RLHF works pretty well for getting LLMs to output text which is similar to text which has been previously rated as good, and dissimilar to text which has previously been rated as bad. It doesn't generalize perfectly, but it does generalize well enough that you generally have to use adversarial inputs to get it to exhibit undesired behavior - we call them "jailbreaks" not "yet more instances of bomb creation instructions".
- Along those lines, RLAIF also seems to Just Work™.
- And the last couple of years have been a parade of "the dumbest possible approach works great actually" results, e.g.
- "Sure, fine-tuning works, but what happens if we just pick a few thousand weights and only change those weights, and leave the rest alone?" (Answer: it works great)
- "I want outputs that are more like thing A and less like thing B, but I don't want to spend a lot of compute on fine tuning. Can I just compute both sets of activations and subtract the one from the other?" (Answer: Yep!)
- "Can I ask it to write me a web application from a vague natural language description, and have it make reasonable choices about all the things I didn't specify" (Answer: astonishing amounts of yes)
- Take your pick of the top chat-tuned LLMs. If you ask it about a situation and ask what a good course of action would be, it will generally give you pretty sane answers.
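For concreteness, a sketch of the activation-subtraction trick from the list above (the transformers API calls are real, but the layer choice, last-token pooling, and prompt sets are arbitrary illustrative choices of mine):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def steering_vector(model_name: str, prompts_a: list[str], prompts_b: list[str], layer: int) -> torch.Tensor:
    """Mean last-token residual activation on 'thing A' prompts minus the mean on
    'thing B' prompts. Adding this vector back in at the same layer during
    generation nudges outputs toward A and away from B."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def mean_last_token_act(prompts: list[str]) -> torch.Tensor:
        acts = []
        with torch.no_grad():
            for p in prompts:
                out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
                acts.append(out.hidden_states[layer][0, -1])   # (d_model,)
        return torch.stack(acts).mean(dim=0)

    return mean_last_token_act(prompts_a) - mean_last_token_act(prompts_b)
```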
So from that, I conclude:
- We have LLMs which understand human values, and can pretty effectively judge how good things are according to those values, and output those judgements in a machine-readable format
- We are able to tune LLMs to generate outputs that are more like the things we rate as good and less like the things we rate as bad
Put that together and that says that, at least at the level of LLMs, we do in fact have AIs which understand human morality and care about it to the extent that "care about" is even the correct abstraction for the kind of thing they do.
I expect this to continue to be true in the future, and I expect that our toolbox will get better faster than the AIs that we're training get more capable.
I'm talking about stability properties like "doesn't accidentally radically change the definition of its goals when updating its world-model by making observations".
What observations lead you to suspect that this is a likely failure mode?
Lo and behold, this poor choice of ontology doesn’t work very well; the modeler requires a huge amount of complexity to decently represent the real-world system in their poorly-chosen ontology. For instance, maybe they need a ridiculously large decision tree or random forest to represent a neural net to decent precision.
That can happen because your choice of ontology was bad, but it can also be the case that representing the real-world system with "decent" precision in any ontology requires a ridiculously large model. Concretely, I expect that this is true of e.g. human language - e.g. for the Hutter Prize I don't expect it to be possible to get a lossless compression ratio better than 0.08 on enwik9 no matter what ontology you choose.
It would be nice if we had a better way of distinguishing between "intrinsically complex domain" and "skill issue" than "have a bunch of people dedicate years of their lives to trying a bunch of different approaches" though.
Any agent which thinks it is at risk of being seen as cooperate-bot and thus fine to defect against in the future will be more wary of trusting that ASI.
[habryka] The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.
I think there will be options that are good under most of the things that "preferences for weak agents" would likely come apart into under close examination. If you're trying to fulfill the preferences of fish, you might argue about whether the exact thing you should care about is maximizing their hedonic state vs ensuring that they exist in an ecological environment which resembles their niche vs minimizing "boundary-crossing actions"... but you can probably find an action that is better than "kill the fish" by all of those possible metrics.
I think that some people have an intuition that any future agent must pick exactly one utility function over the physical configuration of matter in the universe, and that any agent that has a deontological constraint like "don't do any actions which are 0.00001% better under my current interpretation of my utility function but which are horrifyingly bad to every other agent" will be outcompeted in the long term. I personally don't see it, and particularly I don't see how there's an available slot for an arbitrary outcome-based utility function that is not "reproduce yourself at all costs" but there isn't an available slot for process-based preferences like "and don't be an asshole for minuscule gains while doing that".
I like this exchange and the clarifications on both sides.
Yeah, it feels like it's getting at a crux between the "backchaining / coherence theorems / solve-for-the-equilibrium / law thinking" cluster of world models and the "OODA loop / shard theory / interpolate and extrapolate / toolbox thinking" cluster of world models.
You're right that coherence arguments work by assuming a goal is about the future. But preferences over a single future timeslice is too specific, the arguments still work if it's multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when it has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions
Humans do seem to have strong preferences over immediate actions. For example, many people prefer not to lie, even if they think that lying will help them achieve their goals and they are confident that they will not get caught.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that that structure of thing will show up in future intelligent agents, rather than being a human-specific anomaly.
But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren't determined by near-term constraints, when longer-term goals dominate. [...] The whole point of building an intelligent agent is that you know more about the future-outcomes you want than you do about the process to get there.
I expect that agents which predictably behave in the way EY describes as "going hard" (i.e. attempting to achieve their long-term goal at any cost) will find it harder to find other agents who will cooperate with them. It's not a binary choice between "care about process" and "care about outcomes" - it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than final-outcomes.
Yeah. Or strong preferences over processes (although I suppose you can frame a preference over process as a preference over there not being any intermediate time where the agent is actively executing some specific undesired behavior).
But this only helps us if we have a lot of control over the preferences & constraints of the agent,
It does seem to me that "we have a lot of control over the approaches the agent tends to take" is true and becoming more true over time.
and it has a couple of stability properties.
I doubt that systems trained with ML techniques have these properties. But I don't think e.g. humans or organizations built out of humans + scaffolding have these properties either, and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.