Posts

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024-01-12T19:51:01.021Z
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research 2023-08-08T01:30:10.847Z
Engineering Monosemanticity in Toy Models 2022-11-18T01:43:38.623Z
ELK Proposal - Make the Reporter care about the Predictor’s beliefs 2022-06-11T22:53:40.746Z

Comments

Comment by Nicholas Schiefer (schiefer) on Prize for Alignment Research Tasks · 2022-06-01T06:57:00.444Z · LW · GW

Task: Suggest surprising experiments that challenge assumptions

Context: A researcher is considering an alignment proposal that hinges on some key assumptions. They would like to see suggestions for experiments (either theoretical thought experiments or actual real-world experiments) that could challenge those assumptions. If an experiment has already been done, the output should report its results.

Input type: An assumption about a powerful AI system

Output type: a suggestion for an experiment that could challenge that assumption, along with the results if the experiment has already been done.

 

Instance 1:

Input: The performance of a model is impossible to predict, so we can’t hope to have an idea of a model’s capabilities before it is trained and evaluated.

Output: A key measure of a model's performance, such as the loss, might scale predictably with model size. This was investigated by Kaplan et al. (https://arxiv.org/abs/2001.08361), who found that the loss tends to follow a power law.
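
As a rough illustration of what "scale predictably" means here, the paper fits the loss as a power law in the (non-embedding) parameter count N; the form below is a sketch of that finding rather than a quotation, and the constants are empirical fit parameters.

```latex
% Power-law fit for loss as a function of model size N (data and compute have
% analogous fits in the paper); N_c and \alpha_N are empirical constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```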

Instance 2:

Input: Suppose a model is trained on data that is mixed with some noise (as in https://arxiv.org/pdf/2009.08092.pdf ). The model will necessarily learn that the data was mixed with some noise, rather than learn a really complex decision boundary.

Output: Suppose you fine-tune one of these models on data that doesn’t have the noise. If it is very slow to adapt, that would suggest it learned the complex decision boundary after all. (This experiment hasn’t been done.)
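
A minimal sketch of how such an experiment might look on a toy problem, assuming a synthetic task with uniform label noise; all names and hyperparameters below are illustrative, not taken from the linked paper.

```python
# Pretrain a small classifier on noisy labels, then fine-tune on clean labels and watch
# how quickly it adapts. Slow adaptation would be (weak) evidence that it learned a
# complex decision boundary rather than "clean concept + noise".
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n, label_noise=0.0):
    x = torch.randn(n, 2)
    y = (x[:, 0] + x[:, 1] > 0).long()        # simple linear concept
    flip = torch.rand(n) < label_noise        # mix in uniform label noise
    y = torch.where(flip, 1 - y, y)
    return x, y

def train(model, x, y, steps, lr=1e-2, log_every=100):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        if step % log_every == 0:
            acc = (model(x).argmax(-1) == y).float().mean().item()
            print(f"step {step}: loss {loss.item():.3f}, acc {acc:.3f}")

model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))

# Phase 1: train on data whose labels are 30% noise.
x_noisy, y_noisy = make_data(2000, label_noise=0.3)
train(model, x_noisy, y_noisy, steps=500)

# Phase 2: fine-tune on clean data and see how quickly accuracy recovers.
x_clean, y_clean = make_data(2000, label_noise=0.0)
train(model, x_clean, y_clean, steps=500)
```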

Instance 3:

Input: It's impossible to train a neural network without non-linearities like ReLU or a sigmoid.

Output: That is true for idealized neural networks, but real neural networks are trained using floating-point numbers, whose arithmetic is inherently non-linear. These imperfections might be enough to train a competent model. Jakob Foerster ran this experiment and found that they were indeed enough: https://openai.com/blog/nonlinear-computation-in-linear-networks/
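
As a toy illustration of the underlying point that floating-point arithmetic is not exactly linear (this snippet is mine, not from the post, and the post's own construction exploits similar rounding behavior at a different scale):

```python
# In exact arithmetic, f(x) = (x + c) - c is the identity and hence linear. In float32,
# with c = 2**24, x gets rounded to the coarse precision available near c, so f becomes
# a step-like nonlinear function of x; for example f(1.0) == 0.0.
import numpy as np

c = np.float32(2.0 ** 24)
xs = np.linspace(-2, 2, 9, dtype=np.float32)
f = (xs + c) - c
for x, y in zip(xs, f):
    print(f"x = {x:+.1f}  ->  f(x) = {y:+.1f}")
```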

Comment by Nicholas Schiefer (schiefer) on Prize for Alignment Research Tasks · 2022-06-01T06:55:53.096Z · LW · GW

Task: Apply an abstract proposal to a concrete ML system


Context: A researcher is reading a highly theoretical alignment paper and is curious about whether/how it might apply to a real-world machine learning system, like a large transformer trained using SGD. They would like to see what parts of the ML system would change under this proposal.

Input type: a theoretical alignment proposal and a description of an ML system

Output type: a description of how the ML system would change under the given proposal

Info constraints: none

Instance 1:

Input:

Abstract proposal: The description of the complexity regularizer from the ELK report:

ML system description: a GPT-style model trained on natural language using SGD, with a GPT-style reporter trained on the GPT model’s weights.

Output: This is tricky because the circuit complexity of a neural network is largely fixed by its architecture. As we sweep over different architectures/hyperparameters for the reporter model, we can add a regularization term to the hyperparameter optimization based on the total number of weights in the model.

Other possible output: Since circuit complexity for a neural network is largely fixed by its architecture, we will consider the “minimal complexity” of the trained neural network to be the number of non-zero parameters, and will regularize on that to encourage sparsity in the weights.
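
A hedged sketch of what the second output could look like in code, assuming a PyTorch reporter and using an L1 penalty as the usual differentiable surrogate for the non-zero-parameter count; the names `reporter`, `elk_loss`, and `batch` are placeholders, not anything from the ELK report.

```python
import torch

def regularized_loss(reporter, elk_loss, batch, lam=1e-4):
    """Reporter training loss plus a sparsity penalty on the reporter's weights."""
    base = elk_loss(reporter, batch)                        # ordinary reporter loss
    l1 = sum(p.abs().sum() for p in reporter.parameters())  # differentiable surrogate
    return base + lam * l1                                  # for the non-zero count

def num_nonzero_params(reporter, tol=1e-6):
    # The quantity the output actually proposes to minimize: (near-)non-zero weights.
    return sum(int((p.abs() > tol).sum()) for p in reporter.parameters())
```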

Instance 2:

Input:

Abstract proposal: The Iterated Amplification and Distillation proposal, as described in https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616

ML system description: a GPT-style model trained on natural language using SGD.

Output: Any of the project proposals in https://www.alignmentforum.org/posts/Y9xD78kufNsF7wL6f/machine-learning-projects-on-ida, or something similar

Comment by Nicholas Schiefer (schiefer) on Prize for Alignment Research Tasks · 2022-06-01T06:54:05.984Z · LW · GW

Task: Convert mathematical expressions into natural language

Context: A researcher is reading a paper about alignment that contains a lot of well-specified but dense mathematical notation. They would like to see a less terse and more fluent description of the same idea that’s easier to read, similar to what a researcher might say to them at a blackboard while writing the math. This might involve additional context for novices.

Input type: a piece of mathematically-dense but well-specified text from a paper

Output type: a fluent, natural-language description of the same mathematical objects

Info constraints: none

Instance 1:

Input: the section “The circuit distillation prior” from https://www.alignmentforum.org/posts/7ygmXXGjXZaEktF6M/towards-a-better-circuit-prior-improving-on-elk-state-of-the

Output: Consider a predictive model that predicts the output of a video camera given some sensors in the world. As in ELK, our goal will be to find a function that looks at the sensors, the model, and some questions, then returns answers to those questions using the model’s latent knowledge.
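
One hedged way to read that description as a type signature; all of the names below are illustrative rather than taken from the linked post.

```python
from typing import Callable, List

SensorReadings = List[float]                             # observations from the camera/sensors
Predictor = Callable[[SensorReadings], SensorReadings]   # the predictive model
Question = str
Answer = str

# The reporter: looks at the sensors, the model, and some questions, and returns
# answers drawn from the model's latent knowledge.
Reporter = Callable[[SensorReadings, Predictor, List[Question]], List[Answer]]
```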

Instance 2:

Input: the section “Our model of proxy misspecification” from https://www.alignmentforum.org/posts/tWpgtjRm9qwzxAZEi/proxy-misspecification-and-the-capabilities-vs-value

Output: Alice has n things that she values: given any set of these items, she always values the set at least as much after adding another one. A robot is given a proxy for this utility, but the proxy depends on only a strict subset of the items. The robot optimizes its proxy subject to some resource constraints. It’s a theorem that the robot will not pick the items that weren’t included in its proxy.
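
A hedged formalization of the same setup, with notation that is mine rather than the linked post's:

```latex
% Items 1..n; Alice's utility U is monotone; the proxy \tilde{U} only sees a strict
% subset S of the items; the robot optimizes the proxy under a resource constraint.
U(A \cup \{i\}) \ge U(A) \quad \text{for all sets } A \text{ and items } i, \qquad
\tilde{U}(A) = \tilde{U}(A \cap S) \quad \text{with } S \subsetneq \{1, \dots, n\}, \qquad
A^{*} \in \operatorname*{arg\,max}_{A :\, c(A) \le B} \tilde{U}(A).
% The claimed theorem is then that the optimizer spends nothing on items outside S,
% i.e. A^{*} \subseteq S (given that items outside S have positive cost).
```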

Instance 3:

Input: The paragraph “vector valued preferences” from https://www.alignmentforum.org/posts/oheKfWA7SsvpK7SGp/probability-is-real-and-value-is-complex

Output: We think of events in the sense of the sigma-algebras used in the formalization of probability theory. Each event has a probability and an expected utility assigned to it. We are interested in the product of these two, which Vladimir Nesov called “shouldness”.
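
Written out as a formula (notation mine), the product the output refers to is:

```latex
% For an event E with probability P(E) and expected utility U(E), the quantity
% Vladimir Nesov called "shouldness" is the product of the two:
\operatorname{shouldness}(E) = P(E) \cdot U(E)
```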

 

This feels tractable in large part because mathematical notation tends to involve a lot of context, which a language model could probably digest.