Posts

Towards AI Safety Infrastructure: Talk & Outline 2024-01-07T09:31:12.217Z
Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation 2023-12-04T07:31:48.726Z
Elements of Computational Philosophy, Vol. I: Truth 2023-07-01T11:44:22.154Z
Cataloguing Priors in Theory and Practice 2022-10-13T12:36:40.477Z
Boolean Primitives for Coupled Optimizers 2022-10-07T18:02:54.688Z
(Structural) Stability of Coupled Optimizers 2022-09-30T11:28:35.698Z
Interlude: But Who Optimizes The Optimizer? 2022-09-23T15:30:06.638Z
Representational Tethers: Tying AI Latents To Human Ones 2022-09-16T14:45:38.763Z
Ideological Inference Engines: Making Deontology Differentiable* 2022-09-12T12:00:43.836Z
Oversight Leagues: The Training Game as a Feature 2022-09-09T10:08:03.266Z
Benchmarking Proposals on Risk Scenarios 2022-08-20T10:01:53.248Z
Steelmining via Analogy 2022-08-13T09:59:28.866Z
[Linkpost] diffusion magnetizes manifolds (DALL-E 2 intuition building) 2022-05-07T11:01:55.967Z
[Linkpost] Value extraction via language model abduction 2022-05-01T19:11:15.790Z

Comments

Comment by Paul Bricman (paulbricman) on Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation · 2023-12-05T06:05:17.351Z · LW · GW

That's an interesting idea. In its simplest form, the escrow could have draconian NDAs with both parties, even if it doesn't have the technology to prove deletion. In general, I'm excited about techniques that influence the kinds of relations players can have with each other.

However, one logistical difficulty is getting a huge model from the developer's (custom) infra onto the hypothetical escrow's infra... It'd be very attractive if the model could somehow just stay with the dev...

Comment by Paul Bricman (paulbricman) on Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation · 2023-12-04T11:43:56.248Z · LW · GW

Thanks for the reply! I think the flaw you suggested is closely related to the "likelihood prioritization" augmentation of dictionary attacks from Section 3.2.1. It's definitely something to keep in mind, though one general countermeasure against dictionary attacks is slow hashing: the practice of configuring a hashing function to be unusually expensive to compute.

For instance, with the current configuration, it would take you a couple of years to process a million hashes if all you had was one core of a modern CPU, and this can be tuned arbitrarily. The slow hashing algorithm currently used also imposes a memory burden, so as to make it harder to parallelize on accelerators.
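
To give a feel for what slow, memory-hard hashing looks like in practice, here's a minimal Python sketch using scrypt from the standard library. The choice of algorithm and the parameters below are illustrative assumptions on my part, not hashmarks' actual configuration (which is tuned to be far heavier).

```python
import hashlib
import os

def slow_hash(prompt: str, salt: bytes) -> str:
    """Memory-hard slow hash of a canonicalised prompt (illustrative parameters)."""
    return hashlib.scrypt(
        prompt.strip().lower().encode("utf-8"),  # canonicalise before hashing
        salt=salt,
        n=2**14,   # CPU/memory cost factor; raise it to make hashing arbitrarily slow
        r=8,       # block size; together with n it sets the ~128*n*r-byte memory burden
        p=1,       # parallelisation factor
        dklen=32,
    ).hex()

salt = os.urandom(16)
print(slow_hash("Some sensitive canary question?", salt))
```

The memory requirement is the part that makes it awkward to farm the computation out to accelerators, since each guess in a dictionary attack has to pay both the CPU and the RAM cost.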

Curious if you have other ideas on this!

Comment by Paul Bricman (paulbricman) on Elements of Computational Philosophy, Vol. I: Truth · 2023-09-18T09:37:00.169Z · LW · GW

Thanks for the interest! I'm not really sure what you mean, though. By components, do you mean circuits, or shards, or something else? I'm also not sure what you mean by clarifying or deconfusing components; that sounds like interpretability, but there's not much interpretability going on in the linked project. Feel free to elaborate, though, and I'll try to respond again.

Comment by Paul Bricman (paulbricman) on Elements of Computational Philosophy, Vol. I: Truth · 2023-09-18T09:32:23.532Z · LW · GW

Thanks a lot for the feedback!

All I want for christmas is a "version for engineers." Here's how we constructed the reward, here's how we did the training, here's what happened over the course of training.

For sure, I greatly underestimated the importance of legible and concise communication in the increasingly crowded and dynamic space that is alignment. Future outputs will at the very least include an accompanying paper-overview-in-a-post, and will in general reflect a stronger focus on self-contained papers. I see the booklet as a preliminary, highly exploratory piece of work that focused more on the conceptual and theoretical than on the applied, a goal for which I think the format was well suited (e.g. introducing an epistemological theory with direct applications to alignment).

My current impression is that the algorithm for deciding who wins an argument is clever, if computationally expensive, but you don't have a clever way to turn this into a supervisory signal, instead relying on brute force (which you don't have much of).

You mean ArgRank (i.e. PageRank on the argument graph)? The idea was simply to use ArgRank to assign rewards to individual utterances, then use the resulting context-utterance-reward triples as experiences for RL. After collecting experiences, update the weights and repeat. Now, though, I'd rather do PEFT on the top utterances as a kind of expert iteration, which would also make it feasible to store previous model versions for league training (e.g. by just storing LoRA weight diffs).
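
To make the reward-assignment step concrete, here's a rough sketch in Python using networkx's PageRank. The toy graph, the edge orientation, and the damping factor are assumptions of mine, not necessarily what ArgRank actually uses.

```python
import networkx as nx

# Toy argument graph. I'm orienting edges u -> v to mean "u is successfully
# rebutted by v", so that rank mass flows toward the stronger rebuttal; the
# actual ArgRank edge semantics may differ.
utterances = {
    "a1": "The sky is blue because the ocean reflects onto it.",
    "a2": "Landlocked regions also see a blue sky, so reflection can't be the cause.",
    "a3": "Rayleigh scattering explains the colour regardless of nearby water.",
}
G = nx.DiGraph()
G.add_edges_from([("a1", "a2"), ("a2", "a3")])

# PageRank yields a per-utterance score...
scores = nx.pagerank(G, alpha=0.85)

# ...which can then serve as the reward in (context, utterance, reward)
# triples for RL, or to pick the top utterances for PEFT-style expert iteration.
for uid in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[uid]:.3f}  {utterances[uid]}")
```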

I didn't see where you show that you managed to actually make the LLMs better arguers.

Indeed, the preliminary results are poor, and the bar was set pretty low at "somehow make these ideas run in this setup." For now, I'd drop ArgRank and instead use traditional methods from computational argumentation on an automatically encoded argument graph (see 5.2), then run PEFT on the winning parties. But I'm also interested in extending CCS-like tools to improve ArgRank (see 2.5). I'm applying to AISC9 for related follow-up work (among other things), and I'd find it really valuable if you could send me some feedback on the proposal summary. Could I send you a DM with it?

Connection between winning an argument and finding the truth continues to seem plenty breakable both in humans and in AIs.

Is that because of obfuscated arguments and deception, or because of some other fundamental issue?

Comment by Paul Bricman (paulbricman) on Cataloguing Priors in Theory and Practice · 2022-10-14T11:46:15.795Z · LW · GW

I feel a lot of the problem relates to an Extremal Goodhart effect, where the popular imagination views simulations as not equivalent to reality.

That seems right, but aren't all those heuristics prone to Goodharting? If your prior distribution is extremely sharp and you barely update from it, it seems likely that you run into all those various failure modes.

However my guess is that simplicity, not speed or stability priors are the default.

Not sure what you mean by default here. Likely to be used, most effective, or something else?

Comment by Paul Bricman (paulbricman) on Cataloguing Priors in Theory and Practice · 2022-10-14T11:39:47.241Z · LW · GW

Thanks a lot for the reference, I hadn't come across it before. Would you say that it focuses on gauging modularity?

Comment by Paul Bricman (paulbricman) on Linda Linsefors's Shortform · 2022-10-03T14:16:57.639Z · LW · GW

You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?

Comment by Paul Bricman (paulbricman) on Oversight Leagues: The Training Game as a Feature · 2022-09-14T11:22:37.014Z · LW · GW

Hm, I think I get the issue you're pointing at. I guess the argument for the evaluator learning accurate human preferences in this proposal is that it can make use of infinitely many examples of inaccurate human preferences, supplied by the agent as negative examples. However, the argument against can be summed up in the following comment from Adam:

I get the impression that with Oversight Leagues, you don't necessarily consider the possibility that there might be many different "limits" of the oversight process, that are coherent with the initial examples. And it's not clear you have an argument that it's going to pick one that we actually want.

Or in your terms:

Not just any model will do

I'm indeed not sure whether the agent's pressure would force the evaluator all the way to accurate human preferences. The fact that GANs get significantly closer to the illegible distributions they model, and away from random noise, while following a legible objective feels like evidence for; the fact that they still produce artifacts feels like evidence against. I'm also not sure how GANs fare against purely generative models trained on the positive examples alone (e.g. VAEs), which would be evidence on whether the adversarial regime helps point at the underlying distribution.
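
To make the analogy explicit, here's a minimal sketch of the evaluator-as-discriminator update under that GAN reading: a fixed pool of human positives against a growing pool of agent-generated negatives. The architecture, dimensions, and random data are placeholders I made up; this is not the proposal's actual training setup.

```python
import torch
import torch.nn as nn

# Evaluator plays the role of a GAN discriminator (arbitrary small MLP).
evaluator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(evaluator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

human_positives = torch.randn(128, 16)   # stand-in for human-specified examples
agent_negatives = torch.randn(256, 16)   # stand-in for the agent's latest exploits

for _ in range(100):
    logits_pos = evaluator(human_positives)
    logits_neg = evaluator(agent_negatives)
    # Push positives toward 1 and agent-proposed negatives toward 0.
    loss = bce(logits_pos, torch.ones_like(logits_pos)) + \
           bce(logits_neg, torch.zeros_like(logits_neg))
    opt.zero_grad()
    loss.backward()
    opt.step()
```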

Comment by Paul Bricman (paulbricman) on Oversight Leagues: The Training Game as a Feature · 2022-09-12T13:37:32.526Z · LW · GW

Thanks a lot for the feedback!

How are you getting the connection between the legible property the evaluator is selecting for and actual alignment?

Quoting from another comment (not sure if this is frowned upon):

1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the human's true objective through a fixed set of positive examples (initial behaviors or outcomes specified by humans) and a growing set of increasingly nuanced negative examples (specified by the improving agent).


As it stands, this seems like a way to train a capable agent that's hyperspecialized on some particularly legible goal.

I'm not entirely sure what you mean by legible. Do you mean a deterministic reward model which runs on a computer, even though it might have a gazillion parameters? As in, legible with respect to the human's objective?


Or to color your thinking a little more, how is the evaluator going to interact with humans, learn about them, and start modeling what they want?

In this scheme, the evaluator is not actively interacting with humans, which indeed appears to be a shortcoming in most ways. The main source of information it gets to use in modeling what humans want is the combination of initial positive examples and ever trickier negative examples posed by the agent. Hm, that gets me thinking about ways of complementing the agent as a source of negative examples with CIRL-style querying of humans for positive examples, among other things.

Comment by Paul Bricman (paulbricman) on Oversight Leagues: The Training Game as a Feature · 2022-09-12T13:28:08.072Z · LW · GW

In general could someone explain how these alignment approaches do not simply shift the question from "how do we align this one system" to "how do we align this one system (that consists of two interacting sub-systems)"

Thanks for pointing out another assumption I didn't even consider articulating. The way this proposal answers the second question is:

1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the human's true objective through a fixed set of positive examples (initial behaviors or outcomes specified by humans) and a growing set of increasingly nuanced negative examples (specified by the improving agent).
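
Here's a toy numerical sketch of how these two steps could interleave, purely to illustrate the shape of the loop. The evaluator is a logistic regression, the "agent" is a crude search over candidate behaviors, and everything here (function names, dimensions, the ground-truth oracle) is an assumption of mine rather than part of the original proposal.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_GOAL = np.array([1.0, 1.0])   # stand-in for what humans actually want

def humans_approve(x):
    # Ground-truth oracle, used here only to decide which agent behaviors
    # get labeled as negative examples between rounds.
    return np.linalg.norm(x - TRUE_GOAL) < 0.3

# Fixed set of positive examples specified by humans up front (step 2's anchor).
positives = TRUE_GOAL + 0.1 * rng.standard_normal((20, 2))
negatives = [rng.standard_normal(2) for _ in range(20)]   # initial crude negatives

def train_evaluator(pos, neg):
    """Fit a logistic-regression evaluator on positives vs negatives (step 2)."""
    X = np.vstack([pos, neg])
    y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    Xb = np.c_[X, np.ones(len(X))]
    w = np.zeros(3)
    for _ in range(500):   # plain gradient ascent on the log-likelihood
        p = 1 / (1 + np.exp(-Xb @ w))
        w += 0.1 * Xb.T @ (y - p) / len(y)
    return lambda x: 1 / (1 + np.exp(-(np.r_[x, 1.0] @ w)))

def best_response(evaluator, n=2000):
    """Crude stand-in for step 1: the agent seeks whatever the evaluator scores highest."""
    candidates = 3 * rng.standard_normal((n, 2))
    return max(candidates, key=evaluator)

for round_idx in range(5):
    evaluator = train_evaluator(positives, np.array(negatives))
    exploit = best_response(evaluator)
    if not humans_approve(exploit):
        negatives.append(exploit)   # the agent's exploit becomes a new negative example
    print(round_idx, exploit, float(evaluator(exploit)))
```

The open question from the quoted comment maps onto whether this loop converges to the region humans would actually approve of, or merely to one of many boundaries consistent with the initial positives.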

The very weak evaluator e runs a very simple algorithm. It avoids being gamed if the agent it is evaluating has the same source code as A.

Oh, that's interesting. I think this is indeed the most fragile of the assumptions invoked here. Though I'm wondering whether you could actually obtain such an evaluator using the described procedure.

Comment by Paul Bricman (paulbricman) on Benchmarking Proposals on Risk Scenarios · 2022-08-21T11:03:19.246Z · LW · GW

Thanks for the pointer to the paper, saved for later! I think this task of crafting machine-readable representations of human values is a thorny step in any CEV-like/value-loading proposal which doesn't involve the AI inferring them itself IRL-style.

I was considering sifting through the literature to form a model of the ways people have tried to do this in an abstract sense. For instance, some approaches aim at a fixed normative framework. Others involve an uncertain seed which is collapsed into a likely framework. Others involve extrapolating from a fixed initial framework to an uncertain distribution over the places it might have drifted towards. Does this happen to ring a bell about any other references?

Comment by Paul Bricman (paulbricman) on [Linkpost] Value extraction via language model abduction · 2022-05-02T11:30:31.660Z · LW · GW

We tried using (1) subjectivity detection (based on a simple bag-of-words), and (2) zero-shot text classification (NLI-based) to help us sift through years of tweets in search of bold claims. (1) seemed a pretty poor heuristic overall, and (2) was still super noisy (e.g. it would identify "that's awesome" as a bold claim, which is not particularly useful). The second problem was that even when tweets were identified as containing bold claims, they were often heavily contextualized within a reply thread, so we decontextualized them manually to increase the signal-to-noise ratio. Also, we were initially really confident that we'd use our automatic negation pipeline (i.e. few-shot prompting + DALL-E-like reranking of generations based on detected contradictions and minimal token edit distance), though in practice it would have taken way longer than manual labeling given our non-existent infra.
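
For reference, the zero-shot NLI step (2) looks roughly like the following with an off-the-shelf Hugging Face pipeline; the checkpoint and candidate labels below are illustrative choices, not necessarily the exact ones we used.

```python
from transformers import pipeline

# Zero-shot classifier built on an NLI model (illustrative checkpoint).
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

tweets = [
    "That's awesome!",
    "Raising the minimum wage always reduces employment.",
]
labels = ["bold claim", "casual remark"]   # hypothetical label set

for tweet in tweets:
    result = classifier(tweet, candidate_labels=labels)
    top_label, top_score = result["labels"][0], result["scores"][0]
    print(f"{top_score:.2f}  {top_label}: {tweet}")
```

The noise we ran into came precisely from short, enthusiastic tweets like the first example scoring surprisingly high on claim-like labels.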

I agree that all those manual steps are huge sources of experimenter bias, though. Doing it the way you suggested would improve replicability, but also increase noise and compute demands.