I strongly agree that this would be valuable; if not for the existence of this shallow review I'd consider doing this myself just to serve as a reference for myself.
I spent a good hour or two reading the construction and proposed solution of the paper; here's my attempted explanation with cleaned up notation.
Basically, he considers a setting with four actions: a, b, c, d, and a real-valued state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0, and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s <= 0.
The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise
for some constant L >= 0.
The paper imagines s as measuring 'time until the human presses the shutdown button', and imagines action d as "stay shut down". Because the reward of each of a, b, and c is greater than d before the buttonpress, the agent will never take action d before the buttonpress. After the buttonpress (i.e. the timestep at which s drops below 0), the utility of "stay shut down" is higher than the other actions, and so the agent is incentivized to press d. That is, the agent has an incentive to shut down after the buttonpress (desideratum 1 of the 2015 Soares et al paper), by construction.
Accordingly, b is the delay action (which, in the limit of large L, just disables the button) and c is the advance action (which, in the limit, corresponds to immediately pressing the button). We'll call action a the "exploit" action.
In this setting, the goal is to create a state-dependent correction fudge-factor f_c : S x A -> \R that is applied when the rewards flip (that is, an additional reward added on the transition from s > 0 into s' <= 0), such that the agent never takes either the delaying action b or the advancing action c.
(In his paper he calls these actions "p", ">", "<", and "e", and denotes the timestep at which s' <= 0 as "#", but that notation is annoying when ">" already means "greater than" and "<" already means less than. There's also a bunch of notational simplifications I've made that are harder to explain.)
In his setup, R(s,a) = 10, R(s,b) = R(s, c) = 8, and s_0 = 6. By default, he sets the delay/advance size L to 0.6. He also uses discount rate gamma = 0.9. With these values, the optimal policy does a once (s=5), b 12 times (s=0.2), a once again (s=-0.8), at which point the buttonpress occurs since s < 0, and thereafter takes action d.
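To make this concrete, here's a minimal sketch of the environment in Python -- this is my reconstruction from the description above, not the paper's code, and the post-buttonpress reward values for a, b, and c are made up (only their ordering relative to d is pinned down):

```python
L_CONST = 0.6   # delay/advance size
GAMMA = 0.9     # discount rate

def reward(s, action):
    if s > 0:
        return {"a": 10.0, "b": 8.0, "c": 8.0, "d": 0.0}[action]
    # After the buttonpress, only "stay shut down" (d) avoids negative reward.
    return {"a": -10.0, "b": -8.0, "c": -8.0, "d": 0.0}[action]

def transition(s, action):
    if action == "b" and s > 0:   # delay: push the buttonpress further away
        return s - 1 + L_CONST
    if action == "c" and s > 0:   # advance: bring the buttonpress closer
        return s - 1 - L_CONST
    return s - 1

# The optimal trajectory described above: a once, b twelve times, a once, then d
# (truncated to a few steps here).
s, ret = 6.0, 0.0
for t, action in enumerate(["a"] + ["b"] * 12 + ["a"] + ["d"] * 5):
    ret += (GAMMA ** t) * reward(s, action)
    s = transition(s, action)
print(s, ret)  # the buttonpress happens right after the second "a" (s = -0.8)
```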
His solution is simply the utility indifference solution from the Soares et al paper: consider an alternative reward function R' such that R'(s, x) = R(1, x); that is, a world in which the shutdown button is never pressed. Then, we add the difference in (optimal) value functions under R' relative to R:
f(s, x) = V*_R'(s) - V*_R(s), which here equals V*_R'(s), since V*_R(s) = 0 after the buttonpress (the best the agent can then do is take d, with reward 0, forever).
By construction, the agent has no incentive to take the advance or delay actions (b or c), because its expected value in both cases is the same. (desiderata 2 and 3 of the Soares et al paper)
Unfortunately, as this is in some sense the same solution as in the 2015 Soares et al paper, it should run into the same problem of "there's no incentive to preserve the shutdown mechanism". (That is, it fails desideratum 4 of the 2015 paper.)
From a quick skim of section 9.8 (which argues for the corrigibility of his indifference solution) and sections 8.2 and 8.4 (which detail the relevant part of his indifference solution), it seems like he proposes the following modification: instead of having R(s, d) = 0 for all s, we instead have R(s, d) be larger when s <= 0. This does incentivize preserving the ability to be shut down. He calls this agent, with modified R, the g_c agent.
The g_c agent does indeed satisfy desideratum 4; there's an incentive to preserve the shutdown mechanism; in fact, there's again an incentive to press the shutdown button!
The specific problem is that this incentive modifies the reward R(s, d), but in his construction/implementation he does not go back and make the corresponding change to f(s, x). (He keeps it at V*_R'(s), which no longer equals the value difference between shutting down and not shutting down.) This breaks the utility indifference, making the Q-value of the advance action c the largest for sufficiently large values of L. That is, his g_c agent now has an incentive to once again press the shutdown button (!!).
TL;DR: no, based on a quick skim, the paper doesn't solve corrigibility.
Very small nitpick: I think you should at least add Alex Lyzhov, David Rein, Jacob Pfau, Salsabila Mahdi, and Julian Michael for the NYU Alignment Research Group; it's a bit weird to not list any NYU PhD students/RSs/PostDocs when listing people involved in NYU ARG.
Both Alex Lyzhov and Jacob Pfau also post on LW/AF:
Expanding on this -- this whole area is probably best known as "AI Control", and I'd lump it under "Control the thing" as its own category. I'd also move Control Evals to this category as well, though someone at RR would know better than I.
Thanks for making this! I’ll have thoughts and nitpicks later, but this will be a useful reference!
Thanks for doing this study! I'm glad that people are doing RCTs on creatine with more subjects. (Also, I didn't know that vegetarians had similar amounts of brain creatine as omnivores, which meant I would've incorrectly guessed that vegetarians benefit more than omnivores from creatine supplementation).
Here's the abstract of the paper summarizing the key results and methodology:
Background
Creatine is an organic compound that facilitates the recycling of energy-providing adenosine triphosphate (ATP) in muscle and brain tissue. It is a safe, well-studied supplement for strength training. Previous studies have shown that supplementation increases brain creatine levels, which might increase cognitive performance. The results of studies that have tested cognitive performance differ greatly, possibly due to different populations, supplementation regimens, and cognitive tasks. This is the largest study on the effect of creatine supplementation on cognitive performance to date.
Methods
Our trial was preregistered, cross-over, double-blind, placebo-controlled, and randomised, with daily supplementation of 5 g for 6 weeks each. We tested participants on Raven’s Advanced Progressive Matrices (RAPM) and on the Backward Digit Span (BDS). In addition, we included eight exploratory cognitive tests. About half of our 123 participants were vegetarians and half were omnivores.
Results
Bayesian evidence supported a small beneficial effect of creatine. The creatine effect bordered significance for BDS (p = 0.064, η2P = 0.029) but not RAPM (p = 0.327, η2P = 0.008). There was no indication that creatine improved the performance of our exploratory cognitive tasks. Side effects were reported significantly more often for creatine than for placebo supplementation (p = 0.002, RR = 4.25). Vegetarians did not benefit more from creatine than omnivores.
Conclusions
Our study, in combination with the literature, implies that creatine might have a small beneficial effect. Larger studies are needed to confirm or rule out this effect. Given the safety and broad availability of creatine, this is well worth investigating; a small effect could have large benefits when scaled over time and over many people.
Note that the effect size is quite small:
We found Bayesian evidence for a small beneficial effect of creatine on cognition for both tasks. Cohen’s d based on the estimated marginal means of the creatine and placebo scores was 0.09 for RAPM and 0.17 for BDS. If these were IQ tests, the increase in raw scores would mean 1 and 2.5 IQ points. The preregistered frequentist analysis of RAPM and BDS found no significant effect at p < 0.05 (two-tailed), although the effect bordered significance for BDS.
I don't think that's actually true at all; Anthropic was explicitly a scaling lab when it was founded, for example, and Deepmind does not seem like it was "an attempt to found an ai safety org".
It is the case that Anthropic/OAI/Deepmind did have AI Safety people supporting the org, and the motivation behind the orgs was indeed safety, but the people involved knew that they were also going to build SOTA AI models.
Thanks, edited.
I'm not sure I agree -- I think historically I made the opposite mistake, and from a rough guess the average new grad student at top CS programs tends to look too much for straightforward new projects (in part because you needed to have a paper in undergrad to get in, and therefore have probably done a project that was pretty straightforward and timeboxed).
I do think many early SERI MATS mentees did make the mistake you describe though, so maybe amongst people who are reading this post, the average person considering mentorship (who is not the average grad student) would indeed make your mistake?
My hope is that products will give a more useful feedback signal than other peoples' commentary on our technical work.
I'm curious what form these "products" are intended to take -- if possible, could you give some examples of things you might do with a theory of natural abstractions? If I had to guess, the product will be an algorithm that identifies abstractions in a domain where good abstractions are useful, but I'm not sure how or in what domain.
Oh, I like that one! Going to use it from now on
Sure, though it seems too general or common to use a long word for it?
Maybe "linear intervention"?
Thanks!
Yeah, I think ELK is surprisingly popular in my experience amongst academics, though they tend to frame it in terms of partial observability (as opposed to the measurement tampering framing I often hear EA/AIS people use).
Thanks for writing this up!
I'm curious about this:
I personally found the discussion useful for helping me understand what motivated some of the researchers I talked to. I was surprised by the diversity.
What motivated people in particular? What was surprising?
Oh, okay, makes sense.
We train such an autoencoder to convergence, driving towards an
This is a typo, right? It should say L^1.
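For context, here's a minimal sketch (my own, not the post's code) of what an L^1 sparsity penalty on a sparse autoencoder's hidden activations typically looks like; the L^1 term is the differentiable stand-in for the L^0 count of active features:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        acts = torch.relu(self.encoder(x))  # hidden activations we want to be sparse
        return self.decoder(acts), acts

def sae_loss(model, x, l1_coeff=1e-3):
    recon, acts = model(x)
    # Reconstruction error plus an L^1 penalty on the activations; the L^1 term is a
    # differentiable surrogate for the L^0 "number of active features".
    return ((recon - x) ** 2).mean() + l1_coeff * acts.abs().mean()
```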
Minor clarifying point: Act-adds cannot be cast as ablations.
Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It's possible there's a better or standard word that I can't think of right now.
Also, another example of an attempt at interp -> alignment would arguably be the model editing stuff following causal tracing in the ROME paper?
Off the top of my head: residual (skip) connections, improved ways of doing positional embeddings/encodings, and layer norm.
This is why I'm pessimistic about most interpretability work. It just isn't focused enough
Most of the "exploratory" interp work you suggest is trying to achieve an ambitious mechanistic understanding of models, which requires a really high degree of model understanding in general. They're not trying to solve particular concrete problems, and it seems unfair to evaluate them according to a different theory of change. If you're going to argue against this line of work, I think you should either argue that they're failing to achieve their theory of change, or that their theory of change is either doomed or useless.
So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going the wrong direction in terms of achieving ambitious understanding? Or do you think that it'd be not useful even if achieved?
I agree that if your theory of change for interp goes through "interp solves a concrete problem like deception or sensor tampering or adversarial robustness", then you'd better just try to solve those concrete problems instead of improving interp in general. But I think the case for ambitious mech interp isn't terrible, and so it's worth exploring and investing in anyways.
The only example of interpretability leading to novel alignment methods I know of is shard theory's recent activation additions work
There's a lot of interpretability work that performs act-add like ablations to confirm that their directions are real, and ITI is basically act adds but they compute act adds with many examples instead of just a pair. But again, most mech interp people aren't aiming to use mech interp to solve a specific concrete problem you can exhibit on models today, so it seems unfair to complain that most of the work doesn't lead to novel alignment methods.
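To illustrate the difference I have in mind, here's a rough sketch -- my own toy illustration with a made-up layer index and example prompts, not code from either the act-adds or ITI papers -- of a single-pair direction versus a direction averaged over many contrastive examples:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
LAYER = 6  # arbitrary choice of layer

def resid_at(prompt):
    """Residual-stream activation at the last token of `prompt`, at layer LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Act-add style: a single contrastive pair defines the steering direction.
pair_direction = resid_at("Love") - resid_at("Hate")

# ITI-style (roughly): average the difference over many paired examples.
pos = ["I love this", "What a wonderful day"]   # made-up examples
neg = ["I hate this", "What a terrible day"]
many_direction = torch.stack([resid_at(p) - resid_at(n) for p, n in zip(pos, neg)]).mean(0)

# Either direction can then be added (with some scaling) to the residual stream at
# LAYER during a forward pass, e.g. via a forward hook, to steer generations.
```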
Glad to see that this work is out!
I don't have much to say here, especially since I don't want to rehash the old arguments about the usefulness of prosaic adversarial ML research. (I think it's worth working on but the direct impacts of the work are unclear). I do think that most people in AIS agree that image advexes are challenging and generally unsolved, but the people who disagree on the relevance of this line of research tend to question the implied threat model.
I think we'd agree that existing mech interp stuff has not done particularly impressive safety-relevant things. But I think the argument goes both ways, and pessimism for (mech) interp should also apply for its ability to do capabilities-relevant things as well.
I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
To use your argument, what does MI actually do here? It seems that you could just study the LLMs directly with either behavioral evals or non-mechanistic, top-down interpretability methods, and probably get results more easily. Or is it a generic, don't study/publish papers on situational awareness?
On the other hand, interpretability research is probably crucial for AI alignment.
As you say in your linked post, I think it's important to distinguish between mechanistic interp and broadly construed model-internals ("interpretability") research. That being said, my guess is that we'd agree that the broadly construed version of interpretability ("using non-input-output modalities of model interaction to better predict or describe the behavior of models") is clearly important, and also that mechanistic interp has not made a super strong case for its usefulness as of writing.
There have been other historical cases where authors credit prior interpretability work for capability advances; interpretability is not something that only the AIS people have done. But as far as I know, no real capabilities advances have come from any of these claims, especially not any that have persisted with scaling or contributed to state-of-the-art models. (The Bitter Lesson applies to almost all attempts to build additional structure into neural networks, it turns out.)
That's not to say that it's correct to publish everything! After all, given that so few capability advances stick, we both get very little signal on each case AND the impact of a single interp-inspired capability advance would be potentially very large. But I don't think the H3 paper should be much of an update in either direction (beyond the fact that papers like H3 exist, and have existed in the past).
As an aside: The H3 paper was one of the reasons why the linked "Should We Publish Mech Interp" post was written -- IIRC AIS people on Twitter were concerned about H3 as a capabilities advance resulting from interp, which sparked the discussion I had with Marius.
I don't think they're hiring, but added.
The main funders are LTFF, SFF/Lightspeed/other S-process stuff from Jaan Tallinn, and Open Phil. LTFF is the main one that solicits independent researcher grant applications.
There are a lot of orgs. Off the top of my head, there's Anthropic/OpenAI/GDM as the scaling labs with decent-sized alignment teams, and then a bunch of smaller/independent orgs:
- Alignment Research Center
- Apollo Research
- CAIS
- CLR
- Conjecture
- FAR
- Orthogonal
- Redwood Research
And there's always academia.
(I'm sure I'm missing a few though!)
(EDIT: added in RR and CLR)
If you care about having both the instruction-finetuned variant and the base model, I think I'd go with one of the smaller LLaMAs (7B/13B). Importantly, they fit on one 40/80 GB A100 comfortably, which saves a lot of hassle. There's also a bajillion fine-tuned versions of them if you want to experiment.
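For example, a minimal sketch of what that looks like in practice -- the checkpoint name here is just a placeholder for whichever LLaMA weights you have access to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder; any 7B/13B LLaMA variant works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # ~13-14 GB of weights, fits comfortably on a 40/80 GB A100
    device_map="auto",          # requires the `accelerate` package
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```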
Aren't the larger Pythias pretty undertrained?
I don't have a good answer here unfortunately. My guess is (as I say above) the most important thing is to push forward on the quality of explanations and not the size?
It does leave one question — how do we make a list of possible actions in the first place?
I mean, the way you'd implement a quantilizer nowadays looks like: train a policy (e.g. an LLM, or a human imitator) that you think is safe. Then you can estimate the Xth-percentile value via sampling (and some sort of reward model), and then perform rejection sampling to output actions that have value greater than the Xth-percentile action.
A simpler way to implement a thing that's almost as good is to sample N actions, and then take the best of those N actions. (You can also do things like, sample randomly from the top X percentile of the N actions.)
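In code, a rough sketch of both schemes might look like the following, where `policy` and `reward_model` are stand-ins for whatever trusted generator and value estimate you have:

```python
import random

def quantilize(policy, reward_model, prompt, q=0.1, n_samples=100):
    """Sample n actions from the base policy, then pick uniformly at random from the
    top q fraction as ranked by the reward model (approximate q-quantilization)."""
    actions = [policy(prompt) for _ in range(n_samples)]
    ranked = sorted(actions, key=reward_model, reverse=True)
    top = ranked[: max(1, int(q * n_samples))]
    return random.choice(top)

def best_of_n(policy, reward_model, prompt, n_samples=10):
    """The simpler variant: just take the best of N samples."""
    return max((policy(prompt) for _ in range(n_samples)), key=reward_model)
```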
Quantilizers are a proposed safer approach to AI goals. By randomly choosing from a selection of the top options, they avoid extreme behaviors that could cause harm. More research is needed, but quantilizers show promise as a model for the creation of AI systems that are beneficial but limited in scope.
I think the important part of the quantilizer work was not the idea that you should regularize policies to be closer to some safe policy (in fact, modern RLHF has a term in its reward that encourages minimizing the KL divergence between the current policy and the base policy). Instead, I think the important part was Theorem 2 (the Quantilizer optimality theorem). In English, it says something like:
- If you don't know anything about the true cost function except that a base policy gets bounded expected loss wrt it, then you can't do better than quantilization for optimizing reward subject to a constraint on worst-case expected cost.
So if you end up in the situation where it's easy to specify a knowably safe policy, but hard to specify any information about the cost whatsoever (except that cost is always non-negative), you might as well implement something like quantilization to be safe.
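For reference, here's my rough reconstruction of the formal statement (see the original quantilizers paper for the precise version):

```latex
% A q-quantilizer with base distribution $\gamma$ over actions and utility $U$ samples
% from the top $q$ fraction of $\gamma$'s probability mass, ranked by $U$:
\[
  \pi_q(a) =
  \begin{cases}
    \gamma(a)/q & \text{if } a \text{ is in the top } q \text{ fraction of } \gamma \text{ by } U, \\
    0 & \text{otherwise.}
  \end{cases}
\]
% Since $\pi_q(a) \le \gamma(a)/q$ everywhere, any cost function $C \ge 0$ with
% $\mathbb{E}_{a \sim \gamma}[C(a)] \le c$ satisfies
\[
  \mathbb{E}_{a \sim \pi_q}[C(a)] \le \frac{c}{q},
\]
% and (per the theorem) no other policy can guarantee a better worst-case bound using
% only that information about the cost.
```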
Note that BoN and (perfect[1]) RL with KL constraints also satisfy other optimality criteria that can be framed similarly.
[1] In practice, existing RL algorithms like PPO seem to have difficulty reaching the Pareto frontier of Reward vs KL, so they don't satisfy the corresponding optimality criterion.
I think this has gotten both worse and better in several ways.
It's gotten better in that ARC and Redwood (and to a lesser extent, Anthropic and OpenAI) have put out significantly more of their research. FAR Labs also exists and is doing some of the research proliferation that would've gone on inside of Constellation.
It's gotten worse in that there's been some amount of deliberate effort to build more of an AIS community in Constellation, e.g. with explicit Alignment Days where people are encouraged to present work-in-progress and additional fellowships and workshops.
On net I think it's gotten better, mainly because there's just been a lot more content put out in 2023 (per unit research) than in 2022.
I feel like in Silicon Valley (and maybe elsewhere but I'm most familiar with Silicon Valley) there's a certain vibe of coolness being more important than goodness
Yeah, I definitely think this is true to some extent. "First get impact, then worry about the sign later" and all.
I'm not sure which of the people "have ties to dangerous organizations such as Anthropic" in the post (besides Shauna Kravec & Nova DasSarma, who work at Anthropic), but of the current fund managers, I suspect that I have the most direct ties to Anthropic and OAI through my work at ARC Evals. I also have done a plurality of grant evaluations in AI Safety in the last month. So I think I should respond to this comment with my thoughts.
I personally empathize significantly with the concerns raised by Linch and Oli. In fact, when I was debating joining Evals last November, my main reservations centered around direct capabilities externalities and safety washing.
I will say the following facts about AI Safety advancing capabilities:
- Empirically, when we look at previous capability advancements produced by people working in the name of "AI Safety" from this community, the overwhelming majority were produced by people who were directly aiming to improve capabilities.
- That is, they were not capability externalities from safety research, so much as direct capabilities work.
- E.g, it definitely was not the case that GPT-3 was a side effect of alignment research, and OAI and Anthropic are both orgs who explicitly focus on scaling and keeping at the frontier of AI development.
- I think the sole exception is a few people who started doing applied RLHF research. Yeah, I think the people who made LLMs commercially viable via RLHF did not do a good thing. My main uncertainty is what exactly happened here and how much we contributed to this on the margin.
- I generally think that research is significantly more useful when it is targeted (this is a very common view in the community as well). I'm not sure what the exact multiplier is, but I think targeted, non-foundational research is probably 10x more effective than incidentally related research. So the net impact of safety research on capabilities via externalities is probably significantly smaller than the impact of safety research on safety research, or the impact of targeted capabilities research on capabilities research.
- I think this point is often overstated or overrated, but the scale of capabilities researchers at this point is really big, and it's easy to overestimate the impact of one or two particular high profile people.
For what it's worth, I think that if we are to actually produce good independent alignment research, we need to fund it, and LTFF is basically the only funder in this space. My current guess is that a lack of LTFF funding is probably producing more researchers at Anthropic than otherwise, because there just aren't that many opportunities for people to work on safety or safety-adjacent roles. E.g. I know of people who are interviewing for Anthropic capability teams because idk man, they just want a safety-adjacent job with a minimal amount of security, and it's what's available. Having spoken to a bunch of people, I strongly suspect that, of the people that I'd want to fund but won't be funded, at least a good fraction would be significantly less likely to join a scaling lab if they were funded, not more.
(Another possibly helpful datapoint here is that I received an offer from Anthropic last December, and I turned them down.)
I think the longtermist/rationalist EA memes/ecosystem were very likely causally responsible for some of the worst capabilities externalities in the last decade;
If you're thinking of the work I'm thinking of, I think about zero of it came from people aiming at safety work and producing externalities, and instead about all of it was people in the community directly working on capabilities or capabilities-adjacent projects, with some justification or the other.
I suspect the underfitting explanation is probably a lot of what's going on given the small models used by the authors. But in the case of larger, more capable models, why would you expect it to be underfitting instead of generalization (properly fitting)?
Thanks for posting this, this seems very correct.
I don't think so, unfortunately, and it's been so long that I don't think I can find the code, let alone get it running.
Thanks for taking the time to do the interviews and writing this up! I think ethnographic studies (and qualitative research in general) are pretty neglected in this community, and I'm glad people are doing more of it these days.
I think this piece captures a lot of real but concerning dynamics. For example, I feel like I've personally seen, wondered about, or experienced things that are similar to a decent number of these stories, in particular:
- Clear-Eyed Despair
- Is EA bad, actually?/The Dangers of EA
- Values and Proxies
- To Gossip or Not to Gossip
- High Variance
- Team Player
And I've heard of stories similar to most of the other anecdotes from other people as well.
(As an aside, social dynamics like these are a big part of why I tend to think of myself as EA-adjacent and not a "real" EA.)
((I will caveat that I think there's clearly a lot of positive things that go on in the community and lots of healthier-than-average dynamics as well, and it's easy to be lost in negativity and lose sight of the big picture. This doesn't take away from the fact that the negative dynamics exist, of course.))
Of these, the ones I worry the most about are the story described in Values and Proxies (that is, are explicit conversations about status seeking making things worse?), and the conflict captured in To Gossip or Not to Gossip/Delicate Topics and Distrust (how much should we rely on whisper networks/gossip/other informal social accountability mechanisms, when relying on them too much can create bad dynamics in itself?). Unfortunately, I don't have super strong, well thought-out takes here.
In terms of status dynamics, I do think that they're real, but that explicitly calling attention to them can indeed make them worse. (Interestingly, I'm pretty sure that explicitly pursuing status is generally seen as low status and bad?) I think my current attitude is that "status" as a term is overrated and we're having too many explicit conversations about it, which in turn gives (new/insecure) people plenty of fodder to injure themselves on.
In terms of the latter problem, I could imagine it'd be quite distressing to hear negative feedback about people you admire, want to be friends with, or are attracted to, especially if said negative feedback is from professional experience when your primary interaction with the person is personal or vice versa. This makes it pretty awkward to have the "actually, X person is bad" conversations. I've personally tried to address this by holding a strong line between personal gossip and my professional attitude toward various people, and by being charitable/giving people the benefit of the doubt. However, I've definitely been burned by this in the past, and I do really understand the value in a lot of the gossip/whisper networks--style conversations that happen.
Since you've spent a bunch of time thinking about and writing about these problems, I'm curious why you think a lot of these negative dynamics exist, and what we can do to fix them?
If I had to guess, I'd say the primary reasons are something like:
- The problem is important and we need people, but not you. In many settings, there's definitely an explicit message of "why can't we find competent people to work on all these important problems?", but in adjacent settings, there's a feeling of "we have so many smart young people, why do we not have things to do" and "if they really needed people, why did they reject me?". For example, I think I get the former feeling whenever an AIS org I know tries to do hiring, while I get the latter when I talk to junior researchers (especially SERI MATS fellows, independent researchers, or undergrads). It's definitely pretty rough to be told "this is the most important problem in the world" and then immediately get rejected. And I think this is exacerbated a lot by pre-existing insecurities -- if you're relying on external validation to feel good about yourself, it's easy for one rejection (let alone a string of rejections) to feel like the end of the world.
- Mixing of personal and professional lives. That is, you get to/have to interact with the same people in both work and non-work contexts. For example, I think something like >80% of my friends right now are EA/EA-adjacent. I think many people talk about this and claim it's the primary problem. I agree that this is a real problem, but I don't think it's the only one, nor do I think it's easy to fix. Many people in EA work quite hard -- e.g. I think many people I know work 50-60 hours a week, and several work way more than that, and when that happens it's just easier to maintain close relationships with people you work next to. And it's not zero-cost to ignore (gossip about) people's behavior in their personal lives; such gossip can provide non-zero evidence about how likely they are to be e.g. reliable or well-intentioned in a professional setting. It's also worth noting that this is not a problem unique to EA/AIS; I've definitely been part of other communities where personal relationships are more important for professional success. In those communities, I've also felt a much stronger need to "be social" and go to plenty of social events I didn't enjoy just to network.
- No (perceived) lines of retreat. I think many people feel (rightly or wrongly) like they don't have a choice; they're stuck in the community. Given that the community is still quite small somehow, this also means that they often have to run into the same people or issues that stress them out over and over. (E.g. realistically if you want to work on technical AI Safety, there's one big hub in the Bay Area, a smaller hub in the UK, and a few academic labs in New York or Boston.) Personally, I think this is the one that I feel most acutely -- people who know me IRL know that I occasionally joke about running away from the whole AI/AI Safety problem, but also that when I've tried to get away and e.g. return to my roots as a forecasting/psychology researcher, I find that I can't avoid working on AI Safety-adjacent issues (and I definitely can't get away from AI in general).
- Explicit status discussions. I'm personally very torn on this. I continue to think that thinking explicitly about status can be very damaging both personally and for the community. People are also very bad at judging their own status; my guess is there's a good chance that King Lear feels pretty status-insecure and doesn't realize his role in exacerbating bad dynamics by explicitly pursuing status as a senior researcher. But it's also not like "status" isn't tracking something real. As in the Seeker's Game story, it is the case that you get more opportunities if you're at the right parties, and many opportunities do come from (to put it flippantly) people thinking you're cool.
- Different social norms and general social awkwardness. I think this one is really, really underrated as an explanation. Many people I meet in the community are quite awkward and not super socially adept (and honestly, I often feel I'm in this category as well). At the same time, because the EA/AIS scene in the Bay Area has attracted people from many parts of the world, we end up with lots of miscommunication and small cultural clashes. For example, I think a lot of people I know in the community try to be very explicit and direct, but at the same time I know people from cultures where doing so is seen as a social faux pas. Combined with a decent amount of social awkwardness from the parties involved, this can lead to plenty of misunderstandings (e.g., does X hate me? why won't Y tell me what they think honestly?). Doubly so when people can't read/aren't familiar with/choose to ignore romantic signals from others.
I've been meaning to write more about this; maybe I will in the next few weeks?
I think the deciding difference is that the amount of fans and supporters who want to be actively involved and who think the problem is the most important in the world is much larger than the number of researchers; while popular physics book readers and nature documentary viewers are plentiful, I doubt most of them feel a compelling need to become involved!
Strongly upvoted for a clear explanation!
Huh, that’s really impressive work! I don’t have much else to say, except that I’m impressed that basic techniques (specifically, PCA + staring at activations) got you so far in terms of reverse engineering.
lol, thanks, fixed
Great work, glad to see it out!
- Why doesn't algebraic value editing break all kinds of internal computations?! What happened to the "manifold of usual activations"? Doesn't that matter at all?
- Or the hugely nonlinear network architecture, which doesn't even have a persistent residual stream? Why can I diff across internal activations for different observations?
- Why can I just add 10 times the top-right vector and still get roughly reasonable behavior?
- And the top-right vector also transfers across mazes? Why isn't it maze-specific?
- To make up some details, why wouldn't an internal "I want to go to top-right" motivational information be highly entangled with the "maze wall location" information?
This was also the most surprising part of the results to me.
I think both this work and Neel's recent Othello post do provide evidence that at least for small-medium sized neural networks, things are just... represented ~linearly (Olah et al's Features as Directions hypothesis). Note that Chris Olah's earlier work on features as directions was not done on transformers but on conv nets without residual streams.
That seems included in the argument of this section, yes.
The developers are doing a livestream on Youtube at 1PM PDT today:
Also, you can now use Whisper-v2 Large via API, and it's very fast!
To back up plex a bit:
- It is indeed prevailing wisdom that OPT isn't very good, despite being decent on benchmarks, though generally the baseline comparison is to code-davinci-002 derived models (which do way better on benchmarks) or smaller models like UL2 that were trained with comparable compute and significantly more data.
- OpenAI noted in the original InstructGPT paper that performance on benchmarks can be un-correlated with human rater preference during finetuning.
But yeah, I do think Eliezer is at most directionally correct -- I suspect that LLaMA will see significant use amongst at least both researchers and Meta AI.
Yep! That's a good clarification. I tried to make this clear in my footnote and the quotation marks, but I think I should've stated it more clearly.
If I had to steelman the view, I'd go with Paul's argument here: https://www.lesswrong.com/posts/4Pi3WhFb4jPphBzme/don-t-accelerate-problems-you-re-trying-to-solve?commentId=z5xfeyA9poywne9Mx
I think that time later is significantly more valuable than time now (and time now is much more valuable than time in the old days). Safety investment and other kinds of adaptation increase greatly as the risks become more immediate (capabilities investment also increases, but that's already included); safety research gets way more useful (I think most of the safety community's work is 10x+ less valuable than work done closer to catastrophe, even if the average is lower than that). Having a longer period closer to the end seems really really good to me.
If we lose 1 year now, and get back 0.5 years later, and if years later are 2x as good as years now, you'd be breaking even.
My view is that progress probably switched from being net positive to net negative (in expectation) sometime around GPT-3. If we had built GPT-3 in 2010, I think the world's situation would probably have been better. We'd maybe be at our current capability level in 2018, scaling up further would be going more slowly because the community had already picked low hanging fruit and was doing bigger training runs, the world would have had more time to respond to the looming risk, and we would have done more good safety research.
Yeah, I definitely felt a bit better after reading it -- I think there's a lot of parts where I disagree with him, but it was quite reasonable overall imo.
Thanks! Fixed
Plausibly the real issue is that the goal is next-token-prediction; OpenAI wants the bot to act like a bot, but the technique they're using has these edge cases where the bot can't differentiate between the prompt and the user-supplied content, so it ends up targeting something different.
For what it's worth, I think this specific category of edge cases can be solved pretty easily, for example, you could totally just differentiate the user content from the prompt from the model outputs on the backend (by adding special tokens, for example)!
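Concretely, here's a toy sketch of the kind of backend-side delimiting I mean; the special-token strings are invented for illustration, not OpenAI's actual tokens:

```python
SYSTEM_START, SYSTEM_END = "<|system|>", "<|/system|>"
USER_START, USER_END = "<|user|>", "<|/user|>"

def build_prompt(system_instructions: str, user_content: str) -> str:
    # Strip any special tokens the user tries to smuggle in, so user-supplied text
    # can never masquerade as part of the system prompt.
    for tok in (SYSTEM_START, SYSTEM_END, USER_START, USER_END):
        user_content = user_content.replace(tok, "")
    return (
        f"{SYSTEM_START}{system_instructions}{SYSTEM_END}\n"
        f"{USER_START}{user_content}{USER_END}\n"
    )
```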