Posts

[AN #158]: Should we be optimistic about generalization? 2021-07-29T17:20:03.409Z
[AN #157]: Measuring misalignment in the technology underlying Copilot 2021-07-23T17:20:03.424Z
[AN #156]: The scaling hypothesis: a plan for building AGI 2021-07-16T17:10:05.809Z
BASALT: A Benchmark for Learning from Human Feedback 2021-07-08T17:40:35.045Z
[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions 2021-07-08T17:20:02.518Z
[AN #154]: What economic growth theory has to say about transformative AI 2021-06-30T17:20:03.292Z
[AN #153]: Experiments that demonstrate failures of objective robustness 2021-06-26T17:10:02.819Z
[AN #152]: How we’ve overestimated few-shot learning capabilities 2021-06-16T17:20:04.454Z
[AN #151]: How sparsity in the final layer makes a neural net debuggable 2021-05-19T17:20:04.453Z
[AN #150]: The subtypes of Cooperative AI research 2021-05-12T17:20:27.267Z
[AN #149]: The newsletter's editorial policy 2021-05-05T17:10:03.189Z
[AN #148]: Analyzing generalization across more axes than just accuracy or loss 2021-04-28T18:30:03.066Z
FAQ: Advice for AI Alignment Researchers 2021-04-26T18:59:52.589Z
[AN #147]: An overview of the interpretability landscape 2021-04-21T17:10:04.433Z
[AN #146]: Plausible stories of how we might fail to avert an existential catastrophe 2021-04-14T17:30:03.535Z
[AN #145]: Our three year anniversary! 2021-04-09T17:48:21.841Z
Alignment Newsletter Three Year Retrospective 2021-04-07T14:39:42.977Z
[AN #144]: How language models can also be finetuned for non-language tasks 2021-04-02T17:20:04.230Z
[AN #143]: How to make embedded agents that reason probabilistically about their environments 2021-03-24T17:20:05.166Z
[AN #142]: The quest to understand a network well enough to reimplement it by hand 2021-03-17T17:10:04.180Z
[AN #141]: The case for practicing alignment work on GPT-3 and other large models 2021-03-10T18:30:04.004Z
[AN #140]: Theoretical models that predict scaling laws 2021-03-04T18:10:08.586Z
[AN #139]: How the simplicity of reality explains the success of neural nets 2021-02-24T18:30:04.038Z
[AN #138]: Why AI governance should find problems rather than just solving them 2021-02-17T18:50:02.962Z
[AN #137]: Quantifying the benefits of pretraining on downstream task performance 2021-02-10T18:10:02.561Z
[AN #136]: How well will GPT-N perform on downstream tasks? 2021-02-03T18:10:03.856Z
[AN #135]: Five properties of goal-directed systems 2021-01-27T18:10:04.648Z
[AN #134]: Underspecification as a cause of fragility to distribution shift 2021-01-21T18:10:06.783Z
[AN #133]: Building machines that can cooperate (with humans, institutions, or other machines) 2021-01-13T18:10:04.932Z
[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate 2021-01-06T18:20:05.694Z
[AN #131]: Formalizing the argument of ignored attributes in a utility function 2020-12-31T18:20:04.835Z
[AN #130]: A new AI x-risk podcast, and reviews of the field 2020-12-24T18:20:05.289Z
[AN #129]: Explaining double descent by measuring bias and variance 2020-12-16T18:10:04.840Z
[AN #128]: Prioritizing research on AI existential safety based on its application to governance demands 2020-12-09T18:20:07.910Z
[AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment 2020-12-02T18:20:05.196Z
[AN #126]: Avoiding wireheading by decoupling action feedback from action effects 2020-11-26T23:20:05.290Z
[AN #125]: Neural network scaling laws across multiple modalities 2020-11-11T18:20:04.504Z
[AN #124]: Provably safe exploration through shielding 2020-11-04T18:20:06.003Z
[AN #123]: Inferring what is valuable in order to align recommender systems 2020-10-28T17:00:06.053Z
[AN #122]: Arguing for AGI-driven existential risk from first principles 2020-10-21T17:10:03.703Z
[AN #121]: Forecasting transformative AI timelines using biological anchors 2020-10-14T17:20:04.918Z
[AN #120]: Tracing the intellectual roots of AI and AI alignment 2020-10-07T17:10:07.013Z
The Alignment Problem: Machine Learning and Human Values 2020-10-06T17:41:21.138Z
[AN #119]: AI safety when agents are shaped by environments, not rewards 2020-09-30T17:10:03.662Z
[AN #118]: Risks, solutions, and prioritization in a world with many AI systems 2020-09-23T18:20:04.779Z
[AN #117]: How neural nets would fare under the TEVV framework 2020-09-16T17:20:14.062Z
[AN #116]: How to make explanations of neurons compositional 2020-09-09T17:20:04.668Z
[AN #115]: AI safety research problems in the AI-GA framework 2020-09-02T17:10:04.434Z
[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents 2020-08-26T17:20:04.960Z
[AN #113]: Checking the ethical intuitions of large language models 2020-08-19T17:10:03.773Z

Comments

Comment by rohinmshah on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-31T08:16:25.470Z · LW · GW

Yep, that's what I mean.

Then I'm confused what you meant by

I'm not sure what you mean by this part—f+ and f− are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in f+.”

Seems like if the different heads do not share weights then "the parameters in f+" is perfectly well-defined?

Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing

Yeah, sorry, by "conditioning" there I meant "assuming that the algorithm correctly chose the right world model in the end"; I wasn't trying to describe a particular step in the algorithm. But in any case I don't think we need to talk about that.

They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent question → answer maps.

Okay, so iiuc you're relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−? My understanding of f+ and f− comes from here:

Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding Q as a logical statement and unembedding its answer in deduced_stmts. Conversely, f− is the “mimicry embedding” which just searches for deductions about what a human would say in response to Q and outputs that—thus, f− just quotes Q, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.

If f+ and f− produce equivalent question → answer maps, doesn't that mean that we've just gotten something that can only respond as well as a human? Wouldn't that be a significant limitation? (E.g. given that I don't know German, if my question to the model is "what does <german phrase> mean", does the model have to respond "I don't know"?)

In addition, since the world model will never produce deduced statements that distinguish between f+ and f−, it seems like the world model could never produce decision-relevant deduced statements that the human wouldn't have realized. This seems both (a) hard to enforce and (b) a huge capability hit.

Comment by rohinmshah on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-30T06:45:06.375Z · LW · GW

I'm not sure what you mean by this part—f+ and f− are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in f+.” I don't think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if f+ and f− were separate models such that they couldn't reuse weights between them, then none of the complexity arguments that I make in the post would go through.

I assumed that when you talked about a model with "different heads" you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don't share any weights, and those separate sequences of layers were the "heads" f+ and f−. (I'm pretty sure that's how the term is normally used in ML.) I might benefit from an example architecture diagram where you label what f+ and f− are.

I did realize that I was misinterpreting part of the math -- the ∀x∈X is quantifying over inputs to the overall neural net, rather than to the parts-which-don't-share-weights. My argument only goes through if you quantify the constraint over all inputs to the parts-which-don't-share-weights. Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don't-share-weights can be generated by some x∈X (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don't-share-weights.

In the defender's argument, θ1 sets all the head-specific parameters for both f+ and f− to enforce that f+ computes the honest answers and f− computes what the human would say

This seems to suggest that f+ and f− are different functions, i.e. there's some input on which they disagree. But then θ2 has to make them agree on all possible x∈X. So is the idea that there are some inputs to f+ and f− that can never be created with any possible x∈X? That seems... strange (though not obviously impossible).

Comment by rohinmshah on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-29T07:55:25.328Z · LW · GW

If memory serves, with BYOL you are using current representations of an input x to predict representations of a related input x′, but the representation of x′ comes from an old version of the encoder. So, as long as you start with a non-collapsed initial encoder, the fact that you are predicting a past encoder which is non-collapsed ensures that the current encoder you learn will also be non-collapsed.

(Mostly my point is that there are specific algorithmic reasons to expect that you don't get the collapsed solutions, it isn't just a tendency of neural nets to avoid collapsed solutions.)
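To make that mechanism concrete, here is a minimal sketch of the stale-target idea (PyTorch-style; the module names, loss, and EMA rate are illustrative choices on my part, not BYOL's exact recipe): the online encoder is trained to predict the output of a slowly updated copy of itself, and gradients never flow through that copy.

```python
import torch
import torch.nn.functional as F

def byol_style_step(online, target, predictor, opt, x1, x2, ema=0.99):
    """One update: predict the (frozen) target encoder's view of x2 from x1."""
    online_pred = predictor(online(x1))      # current encoder + prediction head
    with torch.no_grad():                    # no gradients through the old encoder
        target_repr = target(x2)
    loss = F.mse_loss(F.normalize(online_pred, dim=-1),
                      F.normalize(target_repr, dim=-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    # The target is an exponential moving average of the online encoder, so it
    # lags behind; if it starts out non-collapsed, it anchors the online encoder.
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(ema).add_(p_o, alpha=1 - ema)
    return loss.item()
```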

but now I'm realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.

No worries, I think it's still a relevant example for thinking about "collapsed" solutions.

Comment by rohinmshah on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-29T07:11:46.267Z · LW · GW

That in mind, there are various papers (e.g.) that explore the possibility of "collapsed" solutions like the one you mentioned

I haven't read the paper, but in contrastive learning, aren't these solutions prevented by the negative examples?
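For what it's worth, here is a minimal sketch (InfoNCE-style; the batch size and temperature are arbitrary) of why negatives rule out the collapsed solution: if the encoder maps every input to the same vector, each anchor is indistinguishable from the negatives and the loss is pinned at log(batch size) rather than going to zero.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_anchor, z_positive, temperature=0.1):
    """Each anchor must identify its own positive among everyone else's (the negatives)."""
    z_a = F.normalize(z_anchor, dim=-1)
    z_p = F.normalize(z_positive, dim=-1)
    logits = z_a @ z_p.T / temperature        # (B, B): off-diagonal entries are negatives
    labels = torch.arange(z_a.shape[0])       # the positive for row i is column i
    return F.cross_entropy(logits, labels)

# A collapsed encoder (identical output for every input) cannot beat chance:
collapsed = torch.ones(128, 64)
print(info_nce_loss(collapsed, collapsed))    # ≈ log(128); negatives block this solution
```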

Comment by rohinmshah on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-29T07:06:55.951Z · LW · GW

The only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it's just that some are more/less likely now.

Yeah, sorry, I wasn't clear here -- I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations until both θ1 and θ2 are chosen, and then reasoning about the thing you get afterwards.

Note that there is no relation between θ1 and f+ or θ2 and f−—both sets of parameters contribute equally to both heads. Thus, θ1 can enforce any condition it wants on θ2 by leaving some particular hole in how it computes f+ and f− and forcing θ2 to fill in that hole in such a way to make Mθ1,θ2's computation of the two heads come out equal.

Yes, I think I understand that. (I want to note that since θ1 is chosen randomly, it isn't "choosing" the condition on θ2; rather the wide distribution over θ1 leads to a wide distribution over possible conditions on θ2. But I think that's what you mean.)

That's definitely not what should happen in that case.

I think you misunderstood what I was claiming. Let me try again, without using the phrase "enforcing the constraint", which I think was the problem.

Imagine there was a bijection between model parameters and resulting function. In Stage 1 you sample θ1 randomly. In Stage 2, you sample θ2, such that it fills in the holes in f+ and f− to make f+ and f− compute the same function. By our bijection assumption, the parameters in f+ must be identical to the parameters in f−. Thus, we can conclude the following:

  1. If θ1 contained a parameter from f+ and f− in the same location (e.g. it includes the weight at position (3, 5) in layer 3 in both f+ and f−), then it must have assigned the same value to both of them.
  2. If θ1 contained a parameter from f+ and θ2 contained the corresponding parameter from f−, then θ2 must have set that parameter to the same value as in θ1.
  3. If θ2 contained a parameter from f+ and f− in the same location, then it must have assigned the same value to both of them.

These constraints are necessary and sufficient to satisfy the overall constraint that f+ = f−, and therefore any other parameters in θ1 and θ2 are completely unconstrained and are set according to the original neural net prior.

So it seems to me that (1) any parameters not in f+ or f− are set according to the original neural net prior, and (2) parameters in f+ must be identical to the corresponding parameters in f−, but their values are chosen according to the neural net prior.

This seems equivalent to having a single head f+, sampling its parameters from the original prior, and then copying those parameters into f−.

 

I think you should already be pretty worried by the fact that this seems to give weird results when assuming a bijection between model parameters and resulting functions, but let's analyze it without the bijection assumption too:

Since f+ and f− have to be identical on all inputs, it doesn't matter what input they get, and therefore there is no constraint on the part of the neural net that is generating the inputs. So, we still get (1): any parameters not in f+ or f− are set according to the original neural net prior. (2) is no longer true, but instead of getting that parameters in f+ are equivalent to parameters in f−, we get that the function implemented by f+ is equivalent to the function implemented by f−. Since ultimately the generating process is "sample parameters until f+ = f−", the probability of getting a particular function f is proportional to the square of the probability of generating parameters for that function f (since you have to successfully generate the function twice). So, you are doubling the strength of the neural net prior in the heads, and leaving the strength the same in the world model (i.e. all parts except for the head).
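As a sanity check on the squaring claim, here is a toy simulation (the discrete "functions" and base prior below are made up purely for illustration): sampling two heads independently and rejecting until they agree yields a distribution proportional to the square of the base prior.

```python
import random
from collections import Counter

# Hypothetical base prior over which function a randomly sampled head implements.
base_prior = {"f_A": 0.6, "f_B": 0.3, "f_C": 0.1}

def sample_head():
    return random.choices(list(base_prior), weights=list(base_prior.values()))[0]

def sample_until_heads_agree():
    while True:
        h_plus, h_minus = sample_head(), sample_head()   # two independently sampled heads
        if h_plus == h_minus:                            # constraint: identical on all inputs
            return h_plus

counts = Counter(sample_until_heads_agree() for _ in range(100_000))
norm = sum(p ** 2 for p in base_prior.values())
for f, p in base_prior.items():
    print(f, round(counts[f] / 100_000, 3), "predicted (squared + renormalized):", round(p ** 2 / norm, 3))
```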

Comment by rohinmshah on Refactoring Alignment (attempt #2) · 2021-07-29T06:20:11.752Z · LW · GW

how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.?

That seems great, e.g. I think by far the best thing you can do is to make sure that you finetune using a reward function / labeling process that reflects what you actually want (i.e. what people typically call "outer alignment"). I probably should have mentioned that too, I was taking it as a given but I really shouldn't have.

For inductive biases + environments, I do think controlling those appropriately would be useful and I would view that as an example of (1) in my previous comment.

Comment by rohinmshah on Re-Define Intent Alignment? · 2021-07-28T18:22:28.223Z · LW · GW

This is an assumption about the world -- not all worlds can be usefully described by partial models.

They can't? Why not?

Maybe the "usefully" part is doing a lot of work here -- can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn't seem like any of the technical results in InfraBayes depend on some notion of "usefulness".

(I think it's pretty likely I'm just flat out wrong about something here, given how little I've thought about InfraBayesianism, but if so I'd like to know how I'm wrong.)

Comment by rohinmshah on AXRP Episode 10 - AI’s Future and Impacts with Katja Grace · 2021-07-28T17:56:18.521Z · LW · GW

Planned summary for the Alignment Newsletter:

This podcast goes over various strands of research from [AI Impacts](https://aiimpacts.org/), including lots of work that I either haven’t covered or have covered only briefly in this newsletter:

**AI Impacts’ methodology.** AI Impacts aims to advance the state of knowledge about AI and AI risk by recursively decomposing important high-level questions and claims into subquestions and subclaims, until reaching a question that can be relatively easily answered by gathering data. They generally aim to provide new facts or arguments that people haven’t considered before, rather than arguing about how existing arguments should be interpreted or weighted.

**Timelines.** AI Impacts is perhaps most famous for its [survey of AI experts](https://arxiv.org/abs/1705.08807) on timelines till high-level machine intelligence (HLMI). The author’s main takeaway is that people give very inconsistent answers and there are huge effects based on how you frame the question. For example:

1. If you estimate timelines by asking questions like “when will there be a 50% chance of HLMI”, you’ll get timelines a decade earlier than if you estimate by asking questions like “what is the chance of HLMI in 2030”.

2. If you ask about when AI will outperform humans at all tasks, you get an estimate of ~2061, but if you ask when all occupations will be automated, you get an estimate of ~2136.

3. People whose undergraduate studies were in Asia estimated ~2046, while those in North America estimated ~2090.

The survey also found that the median probability of outcomes approximately as bad as extinction was 5%, which the author found surprisingly high for people working in the field.

**Takeoff speeds.** A common disagreement in the AI alignment community is whether there will be a discontinuous “jump” in capabilities at some point. AI Impacts has three lines of work investigating this topic:

1. Checking how long it typically takes to go from “amateur human” to “expert human”. For example, it took about [3 years](https://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/) for image classification on ImageNet, [38 years](https://aiimpacts.org/time-for-ai-to-cross-the-human-range-in-english-draughts/) on checkers, [21 years](https://aiimpacts.org/time-for-ai-to-cross-the-human-range-in-starcraft/) for StarCraft, [30 years](https://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-go/) for Go, [30 years](https://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-chess/) for chess, and ~3000 years for clock stability (how well you can measure the passage of time).

2. Checking <@how often particular technologies have undergone discontinuities in the past@>(@Discontinuous progress in history: an update@). A (still uncertain) takeaway would be that discontinuities are the kind of thing that legitimately happen sometimes, but they don’t happen so frequently that you should expect them, and you should have a pretty low prior on a discontinuity happening at some specific level of progress.

3. Detailing [arguments](https://aiimpacts.org/likelihood-of-discontinuous-progress-around-the-development-of-agi/) for and against discontinuous progress in AI.

**Arguments for AI risk, and counterarguments.** The author has also spent some time thinking about how strong the arguments for AI risk are, and has focused on a few areas:

1. Will superhuman AI systems actually be able to far outpace humans, such that they could take over the world? In particular, it seems like humans can use non-agentic tools to help keep up.

2. Maybe the AI systems we build won’t have goals, and so the argument from [instrumental subgoals](https://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf) won’t apply.

3. Even if the AI systems do have goals, they may have human-compatible goals (especially since people will be explicitly trying to do this).

4. The AI systems may not destroy everything: for example, they might instead simply trade with humans, and use their own resources to pursue their goals while leaving humans alone.

Comment by rohinmshah on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-28T17:16:23.911Z · LW · GW

Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility of using them to help alignment research, for example by trying to simulate non-agenty systems.

Yeah, I agree that in the future there is a difference. I don't think we know which of these situations we're going to be in (which is maybe what you're arguing). Idk what Gwern predicts.

Comment by rohinmshah on Experimentally evaluating whether honesty generalizes · 2021-07-28T16:49:44.088Z · LW · GW

Planned summary for the Alignment Newsletter:

The highlighted post introduced the notion of optimism about generalization. On this view, if we train an AI agent on question-answer pairs (or comparisons) where we are confident in the correctness of the answers (or comparisons), the resulting agent will continue to answer honestly even on questions where we wouldn’t be confident of the answer.

While we can’t test exactly the situation we care about -- whether a superintelligent AI system would continue to answer questions honestly -- we _can_ test an analogous situation with existing large language models. In particular, let’s consider the domain of unsupervised translation: we’re asking a language model trained on both English and French to answer questions about French text, and we (the overseers) only know English.

We could finetune the model on answers to questions about grammar ("Why would it have been a grammatical error to write Tu Vas?") and literal meanings ("What does Defendre mean here?"). Once it performs well in this setting, we could then evaluate whether the model generalizes to answer questions about tone ("Does the speaker seem angry or sad about the topic they are discussing?"). On the optimism about generalization view, it seems like this should work. It is intentional here that we only finetune on two categories rather than thousands, since that seems more representative of the case we’ll actually face.

There are lots of variants which differ in the type of generalization they are asking for: for example, we could finetune a model on all questions about French text and German text, and then see whether it generalizes to answering questions about Spanish text.

While the experiments as currently suggested probably won’t show good generalization, a variant that could support it would be one in which we train for _plausibility_. In our original example, we finetune on correct answers for grammar and literal meanings, and then we _also_ finetune to have the model give _plausible_ answers to tone (i.e. when asked about tone, instead of saying “en colère means 'angry'”, the model says “the author is angry, as we can see from the use of ‘en colère’”). It seems possible that this combination leads to the model giving actually correct answers about tone, just because “honestly report your best guess” seems like the simplest policy that meets all of these criteria.

Comment by rohinmshah on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-28T08:10:03.358Z · LW · GW

Hmm, I'm not thinking about the complexity part at all right now; I'm just thinking mechanically about what is implied by your equations.

the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.”

I'm not sure exactly what you mean by the parameters specifying some condition. I thought the condition was specified upfront by the designer (though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters). As far as I can tell, the intended condition is "the two heads are identical" in the dataset-less case. Looking directly at the math, the equations you have are:

θ1∼p(θ1)

θ2∼p(θ2 | θ1)⋅I[∀x∈X. ∀q∈Q. Mθ1,θ2|f+(x,q) = Mθ1,θ2|f−(x,q)]

My interpretation is:

  1. Generate θ1 randomly.
  2. Generate θ2 randomly from θ1, subject to the constraint that the two heads output the same value on all possible inputs.

Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters. In which case you could just have generated parameters for the first head, and then copied them over into the second head, rather than go through this complicated setup.

Now, there isn't actually a bijection between model parameters and resulting function. But it seems like the only difference is that you make it more likely that you sample heads which have lots of different implementations in model parameters, i.e. you're doubling the strength of the neural net prior (and that's the only effect). This seems undesirable?
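Concretely, here is my reading of that two-stage prior as a sampling procedure (with a tiny made-up architecture standing in for Mθ1,θ2, so the head structure and dimensions are illustrative only):

```python
import torch

def make_heads(theta1, theta2):
    """Toy stand-in for M_{theta1,theta2}: a shared computation plus two heads,
    where (as in the setup above) both theta1 and theta2 feed into both heads."""
    w_shared, w_plus_1, w_minus_1 = theta1.chunk(3)
    _,        w_plus_2, w_minus_2 = theta2.chunk(3)
    z = lambda x: torch.tanh(w_shared * x)                  # shared "world model"
    f_plus  = lambda x: torch.sign((w_plus_1 + w_plus_2) @ z(x))
    f_minus = lambda x: torch.sign((w_minus_1 + w_minus_2) @ z(x))
    return f_plus, f_minus

def sample_prior(dim=6):
    return torch.randn(dim)                                 # the "neural net prior"

def sample_theta2_given_theta1(theta1, inputs, max_tries=100_000):
    """Stage 2: p(theta2 | theta1) times the indicator that the two heads agree
    on every input, implemented here by rejection sampling."""
    for _ in range(max_tries):
        theta2 = sample_prior(theta1.shape[0])
        f_plus, f_minus = make_heads(theta1, theta2)
        if all(torch.equal(f_plus(x), f_minus(x)) for x in inputs):
            return theta2
    raise RuntimeError("no consistent theta2 found")

inputs = [torch.full((2,), float(v)) for v in range(-3, 4)]
theta1 = sample_prior()                                     # Stage 1
theta2 = sample_theta2_given_theta1(theta1, inputs)         # Stage 2
```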

Comment by rohinmshah on Re-Define Intent Alignment? · 2021-07-28T07:45:25.123Z · LW · GW

My central complaint about existing theoretical work is that it doesn't seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loath to do.

I don't currently see how any of the alignment community's tools address that complaint; for example I don't think the InfraBayes work so far is making an interesting assumption about reality. Perhaps future work will address this though?

Comment by rohinmshah on Refactoring Alignment (attempt #2) · 2021-07-28T07:32:51.028Z · LW · GW

The definition of "objective robustness" I used says "aligns with the base objective" (including off-distribution). But I think this isn't an appropriate representation of your approach. Rather, "objective robustness" has to be defined something like "generalizes acceptably". Then, ideas like adversarial training and checks and balances make sense as a part of the story.

Yeah, strong +1.

Comment by rohinmshah on Teaching ML to answer questions honestly instead of predicting human answers · 2021-07-27T12:29:35.356Z · LW · GW

Planned summary for the Alignment Newsletter:

This post presents an algorithm that aims to solve the second problem from the highlighted post. As a reminder, the second problem is that an AI system that already has to make predictions about humans might learn a policy that is just “say what humans would say”, since that is simpler than learning another translation that maps its knowledge to human language (so that it can answer honestly to the best of its knowledge).

The core idea is to have a “simple” labeling process and a “complex” labeling process, where the complex process can catch errors from the simple process. We’ll also assume we have a simple / complex dataset (labeled by the simple / complex processes respectively), where we are confident that all of the answers in the dataset are correct. This means that the complex dataset has more challenging questions than the simple one. The _simple / complex instrumental policy_ is the policy that predicts what the simple / complex labeling process would do.

As an inefficient first algorithm, we can train our model on answers from the simple process where we are confident in their correctness, and then simply check whether the model performs well on new, harder questions where the complex and simple processes disagree. If the model fails this check, we restart from scratch. Intuitively, when the model eventually passes the check, it has probably learned the intended policy, as that’s the one which would generalize zero-shot to complex questions as well (and in particular, the simple instrumental policy would _not_ generalize in this way).
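Here is a rough, runnable sketch of that inefficient first algorithm (the toy train_from_scratch function is my own stand-in for the real training process, not anything from the post):

```python
import random

def train_from_scratch(simple_dataset):
    """Toy stand-in for training: memorize the confidently-labeled examples and
    guess on everything else. How a real model generalizes is the open question."""
    table = dict(simple_dataset)
    return lambda q: table.get(q, random.choice(["yes", "no"]))

def retrain_until_it_generalizes(simple_dataset, hard_questions, simple_label, complex_label):
    # Restart training until the model agrees with the *complex* process on the
    # hard questions where the simple and complex processes disagree.
    while True:
        model = train_from_scratch(simple_dataset)
        disagreements = [q for q in hard_questions if simple_label(q) != complex_label(q)]
        if all(model(q) == complex_label(q) for q in disagreements):
            return model   # plausibly the intended policy: the simple instrumental
                           # policy would have matched simple_label here instead
```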

Of course, we need a more efficient algorithm than that. Instead of creating an algorithm where the intended policy would work but the instrumental policy wouldn’t, could we instead change the inductive biases so that the intended policy is _likely_ while the instrumental policy is not? Our original worry was that the intended policy has to bake in two implementations of language -- one in its world model, and one when translating answers into human-understandable concepts. So we could instead try to train a model that learns language from the simple instrumental policy, but is also trained on the complex dataset. The hope would be that the intended policy can learn the second implementation of language “for free” from the simple instrumental policy, while still working on the complex dataset. The actual details are quite complex and I’m not going to go into them here.

[This post](https://www.alignmentforum.org/posts/gEw8ig38mCGjia7dj/answering-questions-honestly-instead-of-predicting-human) by Evan Hubinger points out some problems and potential solutions with the approach.

Comment by rohinmshah on A naive alignment strategy and optimism about generalization · 2021-07-27T12:27:29.493Z · LW · GW

Planned summary for the Alignment Newsletter:

We want to build an AI system that answers questions honestly, to the best of its ability. One obvious approach is to have humans generate answers to questions, select the question-answer pairs where we are most confident in the answers, and train an AI system on those question-answer pairs.

(I’ve described this with a supervised learning setup, but we don’t have to do that: we could also [learn](https://deepmind.com/blog/learning-through-human-feedback/) from [comparisons](https://ai-alignment.com/optimizing-with-comparisons-c02b8c0d7877) between answers, and we only provide comparisons where we are confident in the comparison.)

What will the AI system do on questions where we _wouldn’t_ be confident in the answers? For example, questions that are complex, where we may be misled by bad observations, where an adversary is manipulating us, etc.

One possibility is that the AI system learned the **intended policy**, where it answers questions honestly to the best of its ability. However, there is an **instrumental policy** which also gets good performance: it uses a predictive model of the human to say whatever a human would say. (This is “instrumental” in that the model is taking the actions that are instrumental to getting a low loss, even in the test environment.) This will give incorrect answers on complex, misleading, or manipulative questions -- _even if_ the model “knows” that the answer is incorrect.

Intuitively, “answer as well as you can” feels like a much simpler way to give correct answers, and so we might expect to get the intended policy rather than the instrumental policy. This view (which seems common amongst ML researchers) is _optimism about generalization_: we are hoping that the policy generalizes to continue to answer these more complex, misleading, manipulative questions to the best of its ability.

Are there reasons to instead be pessimistic about generalization? There are at least three:

1. If the answers we train on _aren’t_ perfectly correct, the instrumental policy might get a _lower_ training loss than the intended policy (which corrects errors that humans make), and so be more likely to be found by gradient descent.

2. If the AI already needs to make predictions about humans, it may not take much “additional work” to implement the instrumental policy. Conversely, if the AI reasons at a different level of abstraction than humans, it may take a lot of “additional work” to turn correct answers in the AI’s ontology into correct answers in human ontologies.

3. From [a followup post](https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches), the AI system might answer questions by translating its concepts to human concepts or observations, and then deduce the answer from those concepts or observations. This will systematically ignore information that the AI system understands that isn’t represented in the human concepts or observations. (Consider the [example](https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/) of the robot hand that only _looked_ like it was grasping the appropriate object.)

A possible fourth problem: if the AI system did the deduction in its own concepts and only as a final step translated it to human concepts, we might _still_ lose relevant information. This seems not too bad though -- it seems like we should at least be able to <@explain the bad effects of a catastrophic failure@>(@Can there be an indescribable hellworld?@) in human concepts, even if we can’t explain why that failure occurred.

A [followup post](https://www.alignmentforum.org/posts/roZvoF6tRH6xYtHMF/avoiding-the-instrumental-policy-by-hiding-information-about) considers whether we could avoid the instrumental policy by <@preventing it from learning information about humans@>(@Thoughts on Human Models@), but concludes that while it would solve the problems outlined in the post, it seems hard to implement in practice.

Comment by rohinmshah on Teaching ML to answer questions honestly instead of predicting human answers · 2021-07-27T12:25:31.703Z · LW · GW

Some confusions I have:

Why does θ1 need to include part of the world model? Why not instead have θ1 be the parameters of the two heads, and θ2 be the parameters of the rest of the model?

This would mean that you can't initialize θ2 to be equal to θ1, but I don't see why that's necessary in the first place -- in particular it seems like the following generative model should work just fine:

(I’ll be thinking of this setup for the rest of my comment, as it makes more sense to me)

When differentiating the consistency test C we should treat the intended head as fixed rather than differentiating through it. This removes SGD’s incentive to achieve consistency by e.g. making sure the world is simple and so all questions have simple answers.

Hmm, why is this necessary? It seems like the whole point of the consistency test C is to ensure that you have to learn a detailed world model that gets you the right answers. I guess as , that doesn't really help you, but really you shouldn't have  because you shouldn't expect to be able to have .

(Also, shouldn't that be a function of both θ1 and θ2, since it is θ1 and θ2 together that compute answers to questions?)

Comment by rohinmshah on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-27T12:17:19.402Z · LW · GW

except now the condition f+ = f− is checked over all inputs, not just over the dataset (note that we still update on the dataset at the end—it's just our prior which is now independent of it).

Doesn't this mean that the two heads have to be literally identical in their outputs? It seems like at this point your prior is "generate parameters randomly under the constraint that the two heads are identical", which seems basically equivalent to having a single head and generating parameters randomly, so it seems unintuitive that this can do anything useful.

(Disclaimer: I skimmed the post because I found it quite challenging to read properly, so it's much more likely than usual that I failed to understand a basic point that you explicitly said somewhere.)

Comment by rohinmshah on Refactoring Alignment (attempt #2) · 2021-07-27T08:06:47.987Z · LW · GW

I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that's done.

But how? In prosaic AI, only on-distribution behavior of the loss function can influence the end result.

I can see a few possible responses here.

  1. Double down on the "correct generalization" story: hope to somehow avoid the multiple plausible generalizations, perhaps by providing enough training data, or appropriate inductive biases in the system (probably both).
  2. Achieve objective robustness through other means. In particular, inner alignment is supposed to imply objective robustness. In this approach, inner-alignment technology provides the extra information to generalize the base objective appropriately.

I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense. For (1), I'd note that we need to get it right on the situations that actually happen, rather than all situations. We can also have systems that only need to work for the next N timesteps, after which they are retrained again given our new understanding of the world; this effectively limits how much distribution shift can happen. Then we could do some combination of the following:

  1. Build neural net theory. We currently have a very poor understanding of why neural nets work; if we had a better understanding it seems plausible we could have high confidence in when a neural net would generalize correctly. (I'm imagining that neural net theory goes from how I imagine physics looked before Newton to how it looked after Newton.)
  2. Use techniques like adversarial training to "robustify" the model against moderate distribution shifts (which might be sufficient to work for the next N timesteps, after which you "robustify" again).
  3. Make these techniques work better through interpretability / transparency.
  4. Use checks and balances. For example, if multiple generalizations are possible, train an ensemble of models and only do something if they all agree on it. Or train an actor agent combined with an overseer agent that has veto power over all actions. Or an ensemble of actors, each of which oversees the other actors and has veto power over them.

These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for "all situations" that avoids the no-free-lunch theorem, though I suppose you could get a probabilistic definition based on the simplicity prior).

Comment by rohinmshah on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-27T07:34:46.063Z · LW · GW

I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts.  This appears to contradict the "Second" point above somewhat.

The idea with the "Second" point is that the certification would be something like "we certify that company X has a process Y for analyzing and fixing potential problem Z whenever they build a new algorithm / product", which seems like it is consistent with your belief here? Unless you think that the process isn't enough, you need to certify the analysis itself.

Comment by rohinmshah on Re-Define Intent Alignment? · 2021-07-27T07:30:03.146Z · LW · GW

One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. 

I don't hope this; I expect to use a version of "acceptable" that uses intent. I'm happy with "acceptable" = "trying to do what we want".

If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason?

I'm pessimistic about mesa-objectives existing in actual systems, based on how people normally seem to use the term "mesa-objective". If you instead just say that a "mesa objective" is "whatever the system is trying to do", without attempting to cash it out as some simple utility function that is being maximized, or the output of a particular neuron in the neural net, etc, then that seems fine to me.

One other way in which "acceptability" is better is that rather than require it of all inputs, you can require it of all inputs that are reasonably likely to occur in practice, or something along those lines. (And this is what I expect we'll have to do in practice given that I don't expect to fully mechanistically understand a large neural network; the "all inputs" should really be thought of as a goal we're striving towards.) Whereas I don't see how you do this with a mesa-objective (as the term is normally used); it seems like a mesa-objective must apply on any input, or else it isn't a mesa-objective.

I'm mostly not trying to make claims about which one is easier to do; rather I'm saying "we're using the wrong concepts; these concepts won't apply to the systems we actually build; here are some other concepts that will work".

Comment by rohinmshah on Teaching ML to answer questions honestly instead of predicting human answers · 2021-07-26T10:30:36.383Z · LW · GW

I realize that unit-type-checking ML is pretty uncommon and might just be insane

Nah, it's a great trick.

The two parameter distances seem like they're in whatever distance metric you're using for parameter space, which seems to be very different from the logprobs.

The trick here is that L2 regularization / weight decay is equivalent to having a Gaussian prior on the parameters, so you can think of that term as log p(θ) under that Gaussian (minus an irrelevant additive constant), where σ is set to imply whatever hyperparameter you used for your weight decay.

This does mean that you are committing to a Gaussian prior over the parameters. If you wanted to include additional information like "moving towards zero is more likely to be good" then you would not have a Gaussian centered at θ1, and so the corresponding log prob would not be the nice simple "L2 distance to θ1".
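As a quick numerical check of the identity I'm leaning on here (the σ and dimension below are arbitrary choices): the Gaussian log density is exactly minus the scaled squared L2 distance to θ1, plus a constant.

```python
import math
import torch
from torch.distributions import Normal

d, sigma = 10, 0.5
theta1 = torch.randn(d)          # the "center" of the prior
theta = torch.randn(d)

log_prob = Normal(theta1, sigma).log_prob(theta).sum()
l2_term = -((theta - theta1) ** 2).sum() / (2 * sigma ** 2)
constant = -0.5 * d * math.log(2 * math.pi * sigma ** 2)
print(torch.allclose(log_prob, l2_term + constant))   # True: weight decay toward theta1
                                                      # is a Gaussian prior centered there
```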

My admittedly-weak physics intuitions are usually that you only want to take an exponential (or definitely a log-sum-exp like this) of unitless quantities, but it looks like it maybe has the unit of our distance in parameter space. That makes it weird to integrate over possible parameters, which introduces another unit of parameter space, and then take the logarithm of it.

I think this intuition is correct, and the typical solution in ML algorithms is to empirically scale all of your quantities such that everything works out (which you can interpret from the unit-checking perspective as "finding the appropriate constant to multiply your quantities by such that they become the right kind of unitless").

Comment by rohinmshah on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-26T08:59:55.725Z · LW · GW

I agree that regulation is harder to do before you know all the details of the technology, but it doesn't seem obviously doomed, and it seems especially-not-doomed to productively think about what regulations would be good (which is the vast majority of current AI governance work by longtermists).

As a canonical example I'd think of the Asilomar conference, which I think happened well before the details of the technology were known. There are a few more examples, but overall not many. I think that's primarily because we don't usually try to foresee problems because we're too caught up in current problems, so I don't see that as a very strong update against thinking about governance in advance.

Comment by rohinmshah on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-25T12:50:32.992Z · LW · GW

Sure, but don't you agree that it's a very confusing use of the term?

Maybe? Idk, according to me the goal of alignment is "create a model that is motivated to help us", and so misalignment = not-alignment = "the model is not motivated to help us". Feels pretty clear to me but illusion of transparency is a thing.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now.

(Also the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that you're implying agency to different people)

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯ 

I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious?

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

 Almost all specification gaming examples are subtle, or tricky, or exploiting bugs.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Comment by rohinmshah on Decoupling deliberation from competition · 2021-07-24T06:28:57.706Z · LW · GW

Planned summary for the Alignment Newsletter:

Under a [longtermist](https://forum.effectivealtruism.org/tag/longtermism) lens, one problem to worry about is that even after building AI systems, humans will spend more time competing with each other than figuring out what they want, which may then lead to their values changing in an undesirable way. For example, we may have powerful persuasion technology that everyone uses to persuade people to their line of thinking; it seems bad if humanity’s values are determined by a mix of effective persuasion tools, especially if persuasion significantly diverges from truth-seeking.

One solution to this is to coordinate to _pause_ competition while we deliberate on what we want. However, this seems rather hard to implement. Instead, we can at least try to _decouple_ competition from deliberation, by having AI systems acquire <@flexible influence@>(@The strategy-stealing assumption@) on our behalf (competition), and having humans separately thinking about what they want (deliberation). As long as the AI systems are competent enough to shield the humans from the competition, the results of the deliberation shouldn’t depend too much on competition, thus achieving the desired decoupling.

The post has a bunch of additional concrete details on what could go wrong with such a plan that I won’t get into here.

Comment by rohinmshah on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-24T05:06:54.573Z · LW · GW

Like Luke I'm going to take longtermism as an axiom for most purposes (I find it decently convincing given my values), though if you're interested in debating it you could post on the EA Forum. (One note: my understanding of longtermism is "the primary determinant of whether an action is one of the best that you can take is its consequences on the far future"; you seem to be interpreting it as a stronger / more specific claim than that.)

Also, focusing on AI governance is a bit of a strange way to influence AI safety

You're misunderstanding the point of AI governance. AI governance isn't a subset of AI safety, unless you interpret the term "AI safety" very very broadly. Usually I think of AI safety as "how do we build AI systems that do what their designers intend"; AI governance is then "how do we organize society so that humanity uses this newfound power of AI for good, and in particular doesn't use it to destroy ourselves" (e.g. how do we prevent humans from using AI in a way that makes wars existentially risky, that enforces robust totalitarianism, that persuades humans to change their values, etc). I guess part of governance is "how do we make sure no one builds unsafe AI", which is somewhat related to AI safety, but that's not the majority of AI governance.

A lot of these issues don't seem to become that more clarified even with a picture of how AGI will come about, e.g. I have such a picture in mind, and even if I condition on that picture being completely accurate (which it obviously won't be), many of the relevant questions still don't get resolved. This is because often they're primarily questions about human society rather than questions about how AI works.

Comment by rohinmshah on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-24T04:49:39.162Z · LW · GW

Right now it looks like your 19-21 scale corresponds to something like log(# of parameters) (basically every scaling graph with parameters you'll see uses this as its x-axis). So it still requires exponential increase in inputs to drive a linear increase on that scale.
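As a concrete illustration (assuming the scale really is log10 of parameter count): going from 19 to 21 on the scale means going from 10^19 to 10^21 parameters, i.e. a 100x increase in model size for a 2-point gain, and each further point costs another factor of ~10.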

Comment by rohinmshah on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-24T04:43:24.610Z · LW · GW

I think that this is a very good example where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for.

Where do you see any assumption of agency/goals?

(I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

You can obviously say "it's an agent that does really care about the context", but it doesn't look like it adds anything to the picture,

Agreed, which is why I didn't say anything like that?

Comment by rohinmshah on Re-Define Intent Alignment? · 2021-07-23T12:42:52.916Z · LW · GW

Are you the historical origin of the robustness-centric approach?

Idk, probably? It's always hard for me to tell; so much of what I do is just read what other people say and make the ideas sound sane to me. But stuff I've done that's relevant:

  • Talk at CHAI saying something like "daemons are just distributional shift" in August 2018, I think. (I remember Scott attending it.)
  • Talk at FHI in February 2020 that emphasized a risk model where objectives generalize but capabilities don't.
  • Talk at SERI conference a few months ago that explicitly argued for a focus on generalization over objectives.

Especially relevant stuff other people have done that has influenced me:

(My views were pretty set by the time Evan wrote the clarifying inner alignment terminology post; it's possible that his version that's closer to generalization-focused was inspired by things I said, you'd have to ask him.)

Comment by rohinmshah on Progress on Causal Influence Diagrams · 2021-07-22T21:53:30.568Z · LW · GW

Planned summary for the Alignment Newsletter:

Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the _wrong incentives_. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are <@graphical models@>(@Understanding Agent Incentives with Causal Influence Diagrams@) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has. This can then be used for several applications:

1. We can analyze [what happens](https://arxiv.org/abs/2102.07716) when you [intervene](https://arxiv.org/abs/1707.05173) on the agent’s action. Depending on whether the RL algorithm uses the original or modified action in its update rule, we may or may not see the algorithm disable its off switch.

2. We can <@avoid reward tampering@>(@Designing agent incentives to avoid reward tampering@) by removing the connections from future rewards to utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its _current_ reward function.

3. A [multiagent version](https://arxiv.org/abs/2102.05008) allows us to recover concepts like Nash equilibria and subgames from game theory, using a very simple, compact representation.
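As a minimal illustration of the kind of object involved (the node names and edges here are my own toy example, not from the papers), a CID is just a DAG whose nodes are tagged as chance, decision, or utility nodes; incentive concepts are then defined in terms of paths in this graph:

```python
# Toy CID for a recommender: the agent picks what to show (decision), which
# affects user clicks (chance), which determine reward (utility).
cid_nodes = {
    "UserInterests": "chance",
    "ShownContent": "decision",
    "UserClicks": "chance",
    "Reward": "utility",
}
cid_edges = [
    ("UserInterests", "ShownContent"),   # information link: observed before acting
    ("UserInterests", "UserClicks"),
    ("ShownContent", "UserClicks"),
    ("UserClicks", "Reward"),
]

def parents(node):
    return [a for a, b in cid_edges if b == node]

# Because ShownContent -> UserClicks -> Reward is a directed path from the decision
# to a utility node, the agent has an incentive to influence UserClicks.
print(parents("Reward"))   # ['UserClicks']
```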

Comment by rohinmshah on Re-Define Intent Alignment? · 2021-07-22T19:30:30.919Z · LW · GW

(Meta: was this meant to be a question?)

In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.

I don't think this is actually a con of the generalization-focused approach. From the post you link, one of the two questions in that approach (the one focused on robustness) is:

How do we ensure the model generalizes acceptably out of distribution?

Part of the problem is to come up with a good definition of "acceptable", such that this is actually possible to achieve. (See e.g. the "Defining acceptable" section of this post, or the beginning of this post.) But if you prefer to bake in the notion of intent, you could make the second question

How do we ensure the model continues to try to help us when out of distribution?

Comment by rohinmshah on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-22T14:37:53.266Z · LW · GW

what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

You're going to need the ease of specification condition, or something similar; else you'll probably run into no-free-lunch considerations (at which point I think you've stopped talking about anything useful).

Comment by rohinmshah on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-22T14:35:33.325Z · LW · GW

The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn.

?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?

EDIT: Ah, you mean the fourth bullet point in ESRogs response. I was thinking of that as one example of how such reasoning could go wrong, as opposed to the only case. So in that case the model_1 predicts a treacherous turn confidently, but this is the wrong epistemic state to be in because it is also plausible that it just "fills in words" instead.

Seems to me the conclusion of this argument is that "In general it's not true that the AI is trying to achieve its training objective." 

Isn't that effectively what I said? (I was trying to be more precise since "achieve its training objective" is ambiguous, but given what I understand you to mean by that phrase, I think it's what I said?)

we have no idea what it'll do; treacherous turn is a real possibility because that's what'll happen for most goals it could have, and it may have a goal for all we know.

This seems reasonable to me (and seems compatible with what I said)

Comment by rohinmshah on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-21T08:57:35.082Z · LW · GW

Yeah, I agree with all this. I still think the pretraining objective basically doesn't matter for alignment (beyond being "reasonable") but I don't think the argument I've given establishes that.

I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention (and thus Claim 4 as well).

Comment by rohinmshah on Fractional progress estimates for AI timelines and implied resource requirements · 2021-07-20T17:20:25.692Z · LW · GW

Planned summary:

One [methodology](https://www.overcomingbias.com/2012/08/ai-progress-estimate.html) for forecasting AI timelines is to ask experts how much progress toward human-level AI their subfield has made over the last T years, and then extrapolate linearly to see when 100% of the problem will be solved. The post linked above collects such estimates; a typical estimate is that 5% of a subfield's problem was solved in the twenty-year period between 1992 and 2012. Overall these estimates imply a timeline of [372 years](https://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/).

This post provides a reductio argument against this combination of methodology and estimate. The core argument is that if you linearly extrapolate, you are effectively saying "assume that business continues as usual; then how long does it take?" But "business as usual" over the last 20 years involved an increase in the amount of compute used by AI researchers by a factor of ~1000, so this effectively says that we'll get to human-level AI after a 1000^{372/20} ≈ 10^56 increase in the amount of available compute. (The authors do a somewhat more careful calculation that breaks apart improvements in compute prices and growth of GDP, and get 10^53.)
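
As a quick sanity check on the arithmetic, here is a minimal sketch; the ~1000x-per-20-years and 372-year figures are taken from the summary above, and the rest is just exponentiation.

```python
# Rough reproduction of the extrapolation described above.
# Assumptions: ~1000x growth in AI compute over the 20-year reference period,
# and a 372-year linearly extrapolated timeline.
import math

compute_growth_per_period = 1000   # ~1000x over 20 years
period_years = 20
timeline_years = 372               # linear extrapolation from expert estimates

periods = timeline_years / period_years                  # 18.6 periods
total_growth = compute_growth_per_period ** periods      # implied compute increase
print(f"Implied compute increase: 10^{math.log10(total_growth):.1f}")
# => Implied compute increase: 10^55.8  (i.e. ~10^56)
```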

This is a stupendously large amount of compute: it far dwarfs the amount of compute used by evolution, and even dwarfs the maximum amount of irreversible computing we could have done with all the energy that has ever hit the Earth over its lifetime (the bound comes from [Landauer’s principle](https://en.wikipedia.org/wiki/Landauer%27s_principle)).

Given that evolution _did_ produce intelligence (us), we should reject the linear extrapolation. But what should we make of the expert estimates then? One interpretation is that “proportion of the problem solved” behaves more like an exponential, because the inputs are growing exponentially, and so the time taken to do the last 90% can be much less than 9x the time taken for the first 10%.

Planned opinion:

This seems like a pretty clear reductio to me, though it is possible to argue that this argument doesn’t apply because compute isn’t the bottleneck, i.e. even with infinite compute we wouldn’t know how to make AGI. (That being said, I mostly do think we could build AGI if only we had enough compute; see also <@last week’s highlight on the scaling hypothesis@>(@The Scaling Hypothesis@).)

Comment by rohinmshah on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-20T06:59:36.798Z · LW · GW

Yes, that's right, sorry about the confusion.

Comment by rohinmshah on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-17T05:07:40.246Z · LW · GW

Wrote a separate comment here (in particular I think claims 1 and 4 are directly relevant to safety)

Comment by rohinmshah on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-17T05:06:10.766Z · LW · GW

This comment is inspired by a conversation with Ajeya Cotra.

As a simple example of how the scaling hypothesis affects AI safety research, it suggests that the training objective (“predict the next word”) is relatively unimportant in determining properties of the trained agent; in contrast, the dataset is much more important. This suggests that analyses based on the “reward function used to train the agent” are probably not going to be very predictive of the systems we actually build.

To elaborate on this more:

Claim 1: Scaling hypothesis + abundance of data + competitiveness requirement implies that an alignment solution will need to involve pretraining.

Argument: The scaling hypothesis implies that you can get strong capabilities out of abundant effectively-free data. So, if you want your alignment proposal to be competitive, it must also get strong capabilities out of effectively-free data. So far, the only method we know of for this is pretraining.

Note that you could have schemes where you train an actor model using a reward model that is always aligned; in this case your actor model could avoid pretraining (since you can generate effectively-free data from the reward model) but your reward model will need to be pretrained. So the claim is that some part of your scheme involves pretraining; it doesn't have to be the final agent that is deployed.

Claim 2: For a fixed 'reasonable' pretraining objective, there exists some (possibly crazy and bespoke but still reasonably-sized) dataset which would make the resulting model aligned without any finetuning.

(This claim is more of an intuition pump for Claim 3, rather than an interesting claim in its own right)

Argument 1: As long as your pretraining objective doesn't do something unreasonable like say "ignore the data, always say 'hello world'", given the fixed pretraining objective each data point acts as a "constraint" on the parameters of the model. If you have D data points and N model parameters with D > N, then you should expect these constraints to approximately determine the model parameters (in the same way that N linearly independent equations on N variables uniquely determine those variables). So with the appropriate choice of the D data points, you should be able to get any model parameters you want, including the parameters of the aligned model.
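
As a toy illustration of the linear-equations analogy (my own sketch, not part of the original argument): for a linear model with N parameters, N independent data points pin the parameters down exactly, so choosing the data is equivalent to choosing the model you end up with.

```python
# Toy version of Argument 1: for a linear model y = X @ w with N parameters,
# N linearly independent data points determine w exactly. So by choosing the
# data points appropriately, you can make training recover *any* target w
# (here standing in for "the parameters of the aligned model").
import numpy as np

rng = np.random.default_rng(0)
N = 5
w_target = rng.normal(size=N)        # the parameters we want training to recover

X = rng.normal(size=(N, N))          # N data points, almost surely independent
y = X @ w_target                     # labels generated to be consistent with w_target

w_recovered = np.linalg.solve(X, y)  # "training" = solving the constraints exactly
assert np.allclose(w_recovered, w_target)
print("Recovered the target parameters from N data points.")
```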

Argument 2: There are ~tens of bits going into the choice of pretraining objective, and ~millions of bits going into the dataset, so in some sense nearly all of the action is in the dataset.

Argument 3: For the specific case of next-word prediction, you could take an aligned model, generate a dataset by running that model, and then train a new model with next-word prediction on that dataset.
I believe this is equivalent to model distillation, which has been found to be really unreasonably effective, including for generalization (see e.g. here), so I’d expect the resulting model would be aligned too.
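
To make Argument 3 concrete, here is a toy distillation-style sketch (my own illustration, with made-up bigram "models" standing in for GPT-style networks): a student trained purely by next-token prediction on text sampled from a teacher ends up matching the teacher's behavior.

```python
# Toy next-token "distillation": sample text from a teacher model, then fit a
# student model by next-token prediction on the sampled text. With enough data
# the student's conditional distributions approach the teacher's.
import numpy as np

rng = np.random.default_rng(0)
vocab = 4

# Teacher: an arbitrary bigram model (rows = current token, cols = next-token probs).
teacher = rng.dirichlet(np.ones(vocab), size=vocab)

# Generate a dataset by running the teacher.
tokens = [0]
for _ in range(200_000):
    tokens.append(rng.choice(vocab, p=teacher[tokens[-1]]))

# Student: maximum-likelihood next-token prediction (bigram counts) on the
# teacher-generated data.
counts = np.zeros((vocab, vocab))
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev, nxt] += 1
student = counts / counts.sum(axis=1, keepdims=True)

print("max |teacher - student| =", np.abs(teacher - student).max())  # small, e.g. < 0.01
```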

Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.

Argument: Basically the same as for Claim 2: by far most of the influence on which model you get out is coming from the dataset.

(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn't mean that the pretraining objective can't have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)

Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.

Argument: There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there's a corresponding model saying that the resulting GPT-N would "try" to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by Claim 3 it doesn't matter much which pretraining objective you use, so most of these models would be wrong.

Comment by rohinmshah on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-16T19:55:35.679Z · LW · GW

Wait, people are doing this, instead of just turning words into numbers and having 'models' learn those? Anything GPT sized and getting results?

Not totally sure. There are advantages to character-level models, e.g. you can represent Twitter handles (which a word-embedding-based approach can have trouble with). People have definitely trained character-level RNNs in the past. But I don't know enough about NLP to say whether people have trained large models at the character level. (GPT uses byte pair encoding.)
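
To illustrate the Twitter-handle point, here is a minimal sketch (not tied to any particular library): a fixed word-level vocabulary collapses unseen strings to an unknown token, while a character-level representation preserves them exactly.

```python
# Word-level vs character-level tokenization of an unseen Twitter handle.
word_vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}   # tiny fixed word vocabulary

def word_tokenize(text):
    # Any word not in the vocabulary collapses to <unk>, losing its identity.
    return [word_vocab.get(w, word_vocab["<unk>"]) for w in text.split()]

def char_tokenize(text):
    # Every string is representable as a sequence of character codes.
    return [ord(c) for c in text]

handle = "@some_new_handle_2021"
print(word_tokenize(f"the cat sat {handle}"))  # handle becomes <unk> (id 3)
print(char_tokenize(handle))                   # handle is fully preserved
```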

Why this can't be rescued with counterfactuals isn't clear.

I suspect Alex would say that it isn't clear how to define what a "counterfactual" is given the constraints he has (all you get is a physical closed system and a region of space within that system).

Comment by rohinmshah on rohinmshah's Shortform · 2021-07-16T07:10:17.120Z · LW · GW

But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting?

I'd be pretty surprised if that happened. GPT-3 already knows way more facts than I do, and can mimic far more writing styles than I can. It seems like by the time it can write any good posts (without cherrypicking), it should quickly be able to write good posts on a variety of topics in a variety of different styles, which should let it scale well past 20 posts.

(In contrast, a specific person tends to write on 1-2 topics, in a single style, without optimizing that hard for karma, and many still write tens of high-scoring posts.)

Comment by rohinmshah on rohinmshah's Shortform · 2021-07-15T19:53:14.923Z · LW · GW

20,000 LW karma: Holy shit that's a lot of karma for one year.

I was thinking 365 posts * ~50 karma per post gets you most of the way there (18,250 karma), and you pick up some additional karma from comments along the way.  50-karma posts are good but don't have to be hugely insightful; you can also get a lot of juice by playing to the topics that tend to get lots of upvotes. Unlike humans, the bot wouldn't be limited by writing speed (hence my restriction of one post per day). AI systems should be really, really good at writing, given how easy it is to train on text. And a post is a small, self-contained thing that doesn't take very long to create (i.e. it has short horizons), and there are lots of examples to learn from. So overall this seems like a thing that should happen well before TAI / AGI.

I think I want to give up on the research area example, seems pretty hard to operationalize. (But fwiw according to the picture in my head, I don't think I'd count AlphaFold.)

Comment by rohinmshah on Model-based RL, Desires, Brains, Wireheading · 2021-07-15T07:29:27.100Z · LW · GW

I note that even experts sometimes sloppily talk as if RL agents make plans towards the goal of maximizing future reward—see for example Pitfalls of Learning a Reward Function Online.

Fwiw, I think most analysis of this form starts from the assumption "the agent is maximizing future reward" and then reasons out from there. I agree with you that such analysis probably doesn't apply to RL agents directly (since RL agents do not necessarily make plans towards the goal of maximizing future reward), but it can apply to e.g. planning agents that are specifically designed that way.

(Idk what the people who make such analyses actually have in mind for what sorts of agents we'll actually build; I wish they would be clearer on this.)

I still don’t think it’s a good idea to control a robot with a literal remote-control reward button; I just don’t think that the robot will necessarily want to grab that remote from us. It might or might not want to. It’s a complicated and interesting question.

+1, and I think the considerations are pretty similar to those in model-free RL.

Comment by rohinmshah on rohinmshah's Shortform · 2021-07-15T07:09:30.333Z · LW · GW

Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post.

This seems more feasible, because you can cherrypick a single good example. I wouldn't be shocked if someone on LW spent a lot of time reading AI-written blog posts on rationality and posted the best one, and I liked that more than a typical >30 karma post. My default guess is that no one tries to do this, so I'd still give it < 50% (maybe 30%?), but conditional on someone trying I think probably 80% seems right.

Name something that you think won't happen, but you think I think will.

I spent a bit of time on this but I think I don't have a detailed enough model of you to really generate good ideas here :/

Otoh, if I were expecting TAI / AGI in 15 years, then by 2030 I'd expect to see things like:

  • An AI system that can create a working website with the desired functionality "from scratch" (e.g. a simple Twitter-like website, an application that tracks D&D stats and dice rolls for you, a simple Tetris game with an account system, etc.). The system allows even non-programmers to create these kinds of websites (so it cannot depend on having a human programmer step in to e.g. fix compiler errors).
  • At least one large, major research area in which human researcher productivity has been boosted 100x relative to today's levels thanks to AI. (In calculating the productivity we ignore the cost of running the AI system.) Humans can still be in the loop here, but the large majority of the work must be done by AIs.
  • An AI system gets 20,000 LW karma in a year, when limited to writing one article per day and responses to any comments it gets from humans.
  • Productivity tools like todo lists, memory systems, time trackers, calendars, etc. are made effectively obsolete (or at least their user interfaces are made obsolete); the vast majority of people who used to use these tools have replaced them with an Alexa / Siri style assistant.

Currently, I don't expect to see any of these by 2030.

Comment by rohinmshah on rohinmshah's Shortform · 2021-07-14T11:30:06.294Z · LW · GW

What won't we be able to do by (say) the end of 2025? (See also this recent post.) Well, one easy way to generate such answers would be to consider tasks that require embodiment in the real world, or tasks that humans would find challenging to do. (For example, “solve the halting problem”, “produce a new policy proposal that has at least a 95% chance of being enacted into law”, “build a household robot that can replace any human household staff”.) This is cheating, though; the real challenge is in naming something where there’s an adjacent thing that _does_ seem likely (i.e. it’s near the boundary separating “likely” from “unlikely”).

One decent answer is that I don’t expect we’ll have AI systems that could write new posts _on rationality_ that I like more than the typical LessWrong post with > 30 karma. However, I do expect that we could build an AI system that could write _some_ new post (on any topic) that I like more than the typical LessWrong post with > 30 karma. This is because (1) 30 karma is not that high a filter and includes lots of posts I feel pretty meh about, (2) there are lots of topics I know nothing about, on which it would be relatively easy to write a post I like, and (3) AI systems easily have access to this knowledge by being trained on the Internet. (It is another matter whether we actually build an AI system that can do this.) Note that there is still a decently large difference between these two tasks -- the content would have to be quite a bit more novel in the former case (which is why I don’t expect it to be solved by 2025).

Note that I still think it’s pretty hard to predict what will and won’t happen, so even for this example I’d probably assign, idk, a 10% chance that it actually does work out (if we assume some organization tries hard to make it work)?

Comment by rohinmshah on What will the twenties look like if AGI is 30 years away? · 2021-07-14T08:08:54.187Z · LW · GW

Ajeya describes a "virtual professional" and says it would count as TAI; some of the criteria in the virtual professional definition are superhuman speed and subhuman cost. I think a rough definition of AGI would be "The virtual professional, except not necessarily fast and cheap." How does that sound as a definition?

I'm assuming it also has to be a "single system" (rather than e.g. taking instructions and sending them off to a CAIS-like distributed network of AI systems that then do the thing). We may not build AGI as defined here, but if we instead talk about when we could build it (at reasonable expense and speed), I'd probably put that around or a bit later than the TAI estimate, so 2050 seems like a reasonable number.

Hmm, a billion users who use it nearly every day is quite a lot. I feel like just from a reference class of "how many technologies have a billion users who use it every day" I'd have to give a low probability on that one.

Google Search has 5.4 billion searches per day, which is a majority of the market; so I'm not sure if web search has a billion users who use it nearly every day.

Social media as a general category does seem to have > 1 billion users who use it every day (e.g. Facebook has > 2 billion "daily active users").

On the other hand, Internet access and usage is increasing, e.g. the most viewed YouTube video today probably has an order of magnitude more views than the most viewed video 8 years ago. Also, it seems not totally crazy for chatbots to significantly replace social media, such that "number of people who use social media" is the right thing to be thinking about.

Still, overall I'd guess no, we probably won't have a billion people talking to a chatbot every day. Will we have a chatbot that's fun to talk to? Probably.

At least ten people you know will regularly talk to chatbots for fun

That seems quite a bit more likely; I think I do expect that to happen (especially since I know lots of people who want to keep up to date with AI, e.g. I know a couple of people who use GPT-3 for fun).

What about AI-powered prediction markets and forecasting tournament winners?

I don't know what you mean by this. We already use statistics to forecast tournament winners, and we already have algorithms that can operate on markets (including prediction markets when that's allowed). So I'm not sure what change from the status quo you're suggesting.

Comment by rohinmshah on What will the twenties look like if AGI is 30 years away? · 2021-07-14T07:50:47.977Z · LW · GW

Yeah, I'm just talking about the US and probably also Europe for the medical diagnosis part. I don't have a strong view on what will happen in China.

Comment by rohinmshah on What will the twenties look like if AGI is 30 years away? · 2021-07-13T13:43:56.834Z · LW · GW

I haven't thought about it that carefully, but Ajeya's paragraph sounds reasonable to me. Intuitively, I feel more pessimistic about medical diagnosis (because of regulations, not because of limitations of AI) and more optimistic about AI copy-editors (in particular, I think they'll probably be quite helpful for argument construction). I'm not totally sure what she means about AIs finding good hyperparameter settings for other AIs; under the naive interpretation, that's been around for ages (e.g. population-based training, Bayesian optimization, gradient-based hyperparameter optimization, or just plain old grid search).
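
For reference, the "naive interpretation" can be as simple as a grid search loop. A minimal sketch, with a made-up `evaluate` function standing in for "train a model with these settings and report its validation score":

```python
# Minimal hyperparameter grid search: try every combination and keep the best.
from itertools import product

def evaluate(learning_rate, batch_size):
    # Placeholder score; in practice this would train and validate a model.
    return -abs(learning_rate - 3e-4) - abs(batch_size - 64) / 1000

grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64, 128],
}

best_score, best_config = float("-inf"), None
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print("best config:", best_config)
```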

I'd expect this all to be at "startup-level" scale, where the AI systems still make errors (like, 0.1-50% chance) that startups are willing to bear but larger tech companies are not. For reference, I'd classify Copilot as "startup-level", but it probably doesn't yet meet the bar in Ajeya's paragraph (though I'm not sure; I haven't played around with it). If we're instead asking for more robust AI systems, I'd probably add another 5ish years on that to get to 2030.

High uncertainty on all of these. By far the biggest factor determining my forecast is "how much effort are people going to put into this"; e.g. if we don't get good AI copy-editors by 2030, my best explanation would be "no competent organization tried to do it". Probably many of them could be done today by a company smaller than OpenAI (but with similar levels of expertise and similar access to funding).

But she thinks it'll probably take till around 2050 for us to get transformative AI, and (I think?) AGI as well.

I'm similar on TAI, and want to know what you mean by AGI before giving a number for that.

Comment by rohinmshah on BASALT: A Benchmark for Learning from Human Feedback · 2021-07-13T06:47:18.241Z · LW · GW

That makes sense, though I'd also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general.

Oh yeah, it totally is, and I'd be excited for that to happen. But I think that will be a single project, whereas the benchmark reporting process is meant to apply to settings where there will be lots of projects that you want to compare in a reasonably apples-to-apples way. So when designing the reporting process, I'm focused more on the small-scale projects that aren't GPT-N-like.

It's also possible this has already been done and I'm unaware of it

I'm pretty confident that there's nothing like this that's been done and publicly released.

Comment by rohinmshah on rohinmshah's Shortform · 2021-07-13T06:43:26.623Z · LW · GW

https://docs.google.com/spreadsheets/d/1PwWbWZ6FPqAgZWOoOcXM8N_tUCuxpEyMbN1NYYC02aM/edit#gid=0

Yes (or more specifically, the private version from which that public one is automatically created).

Comment by rohinmshah on Problems facing a correspondence theory of knowledge · 2021-07-12T14:45:26.086Z · LW · GW

Planned summary for the Alignment Newsletter:

Probability theory can tell us about how we ought to build agents that have knowledge (start with a prior, and perform Bayesian updates as evidence comes in). However, this is not the only way to create knowledge: for example, humans are not ideal Bayesian reasoners. As part of our quest to <@_describe_ existing agents@>(@Theory of Ideal Agents, or of Existing Agents?@), could we have a theory of knowledge that specifies when a particular physical region within a closed system is “creating knowledge”? We want a theory that <@works in the Game of Life@>(@Agency in Conway’s Game of Life@) as well as the real world.

This sequence investigates this question from the perspective of defining the accumulation of knowledge as increasing correspondence between [a map and the territory](https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation), and concludes that such definitions are not tenable. In particular, it considers four possibilities, and demonstrates counterexamples to all of them:

1. Direct map-territory resemblance: Here, we say that knowledge accumulates in a physical region of space (the “map”) if that region of space looks more like the full system (the “territory”) over time.

Problem: This definition fails to account for cases of knowledge where the map is represented in a very different way that doesn’t resemble the territory, such as when a map is represented by a sequence of zeros and ones in a computer.

2. Map-territory mutual information: Instead of looking at direct resemblance, we can ask whether there is increasing mutual information between the supposed map and the territory it is meant to represent.

Problem: In the real world, nearly _every_ region of space will have high mutual information with the rest of the world. For example, by this definition, a rock accumulates lots of knowledge, as photons incident on its face affect the properties of specific electrons in the rock, giving it lots of information. (See the sketch after this list.)

3. Mutual information of an abstraction layer: An abstraction layer is a grouping of low-level configurations into high-level configurations such that transitions between high-level configurations are predictable without knowing the low-level configurations. For example, the zeros and ones in a computer are the high-level configurations of a digital abstraction layer over low-level physics. Knowledge accumulates in a region of space if that space has a digital abstraction layer, and the high-level configurations of the map have increasing mutual information with the low-level configurations of the territory.

Problem: A video camera that constantly records would accumulate much more knowledge by this definition than a human, even though the human is much more able to construct models and act on them.

4. Precipitation of action: The problem with our previous definitions is that they don’t require the knowledge to be _useful_. So perhaps we can instead say that knowledge is accumulating when it is being used to take action. To make this mechanistic, we say that knowledge accumulates when an entity’s actions become more fine tuned to a specific environment configuration over time. (Intuitively, they learned more about the environment, and so could condition their actions on that knowledge, which they previously could not do.)

Problem: This definition requires the knowledge to actually be used to count as knowledge. However, if someone makes a map of a coastline, but that map is never used (perhaps it is quickly destroyed), it seems wrong to say that during the map-making process knowledge was not accumulating. 
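
As a rough illustration of definition 2 and its problem (my own sketch, not from the sequence): mutual information between a "map" region and the territory can be estimated from joint samples, and a passive physical imprint scores just as high as anything we would want to call a map.

```python
# Estimate mutual information I(map; territory) from joint samples.
# A "rock" that passively copies the territory's state (plus a little noise)
# gets high mutual information, which is the problem with definition 2.
import numpy as np

def mutual_information(x, y, bins=10):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float((pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])).sum())

rng = np.random.default_rng(0)
territory = rng.normal(size=100_000)                 # some feature of the territory
rock = territory + 0.1 * rng.normal(size=100_000)    # passive physical imprint ("rock")
unrelated = rng.normal(size=100_000)                 # region with no imprint at all

print("I(rock; territory)      ≈", round(mutual_information(rock, territory), 2), "bits")
print("I(unrelated; territory) ≈", round(mutual_information(unrelated, territory), 2), "bits")
# The rock's mutual information is high even though it has no usable "knowledge".
```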

Comment by rohinmshah on rohinmshah's Shortform · 2021-07-12T14:06:24.043Z · LW · GW

I often search through the Alignment Newsletter database to find the exact title of a relevant post (so that I can link to it in a new summary), often reading through the summary and opinion to make sure it is the post I'm thinking of.

Frequently, I read the summary normally, then read the first line or two of the opinion and immediately realize that it wasn't written by me.

This is kinda interesting, because I often don't know what tipped me off -- I just get a sense of "it doesn't sound like me". Notably, I usually do agree with the opinion, so it isn't about stating things I don't believe. Nonetheless, it isn't purely about personal writing styles, because I don't get this sense when reading the summary.

(No particular point here, just an interesting observation)

(This shortform prompted by going through this experience with Embedded Agency via Abstraction)