Posts

Using GPT-4 to Understand Code 2023-03-24T00:09:41.746Z
What sci-fi books are most relevant to a future with transformative AI? 2023-01-24T15:30:21.753Z

Comments

Comment by sid (sidhire) on Takeaways from our robust injury classifier project [Redwood Research] · 2023-11-06T18:36:37.133Z · LW · GW

Are there any plans to repeat this work using larger models which now exist?

Comment by sid (sidhire) on More information about the dangerous capability evaluations we did with GPT-4 and Claude. · 2023-04-04T22:34:45.886Z · LW · GW

"see the inputs of running that code"

Should this be "outputs"?

Comment by sid (sidhire) on Using GPT-4 to Understand Code · 2023-03-24T16:06:57.687Z · LW · GW

Nice! Exercises are a good idea, especially for bite-sized things like einsum. It could also give personalized feedback on your solutions to exercises from a textbook.

Randomized flashcards like you've described would be really really cool. I'm just dipping my toes in the water with having it generate normal flashcards. It has promise, but I'm not sure of the best way to do it yet. One thing I've tried is prompting it with a list of principles the flashcards ought to adhere to, and then having it say, for each flashcard, which of the principles that card exhibits.
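
Concretely, the prompt I've been experimenting with looks roughly like the sketch below. The specific principles and the helper function are just illustrative placeholders, not a settled recipe:

```python
# Rough shape of the prompt; the principles listed are placeholders.
PRINCIPLES = [
    "1. Each card tests exactly one fact or idea.",
    "2. The question is unambiguous without seeing the answer.",
    "3. The answer is short enough to recall verbatim.",
]

def build_flashcard_prompt(source_text: str) -> str:
    """Ask the model for cards plus, for each card, which principles it exhibits."""
    principles = "\n".join(PRINCIPLES)
    return (
        "Generate spaced-repetition flashcards from the text below.\n"
        f"Follow these principles:\n{principles}\n\n"
        "For each card, output:\n"
        "Q: <question>\n"
        "A: <answer>\n"
        "Principles exhibited: <principle numbers>\n\n"
        f"Text:\n{source_text}"
    )
```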

And one more: I've been prompting it to teach me things via the Socratic method. So it asks questions and I have to answer them. Most usefully, it's not just a "yes man"—it actually tells me if I'm wrong.

Comment by sid (sidhire) on What sci-fi books are most relevant to a future with transformative AI? · 2023-01-24T20:38:45.485Z · LW · GW

Can you point to a particular one? I've read Player of Games but I don't think it's relevant.

Comment by sid (sidhire) on What sci-fi books are most relevant to a future with transformative AI? · 2023-01-24T15:36:13.398Z · LW · GW

Worth the Candle by Alexander Wales

Comment by sid (sidhire) on What sci-fi books are most relevant to a future with transformative AI? · 2023-01-24T15:34:56.457Z · LW · GW

Permutation City by Greg Egan has humans living in a simulation.

Comment by sid (sidhire) on What sci-fi books are most relevant to a future with transformative AI? · 2023-01-24T15:31:15.247Z · LW · GW

Diaspora by Greg Egan features human-like beings living in a virtual world, similar to the digital people described here.

Comment by sid (sidhire) on Debate update: Obfuscated arguments problem · 2022-12-13T16:49:01.536Z · LW · GW

In the RSA-2048 example, why is it infeasible for the judge to verify every one of the honest player's arguments? (I see why it's infeasible for the judge to check every one of the dishonest player's arguments.)

Comment by sid (sidhire) on AI Safety via Debate · 2022-12-12T17:52:05.203Z · LW · GW

I was trying to get a clearer picture of how training works in debate so I wrote out the following. It is my guess based on reading the paper, so parts of it could be incorrect (corrections are welcome!), but perhaps it could be helpful to others.

My question was: is the training process model-free or model-based? After looking into it more and writing this up, I'm convinced it's model-based, but I think maybe either could work? (I'd be interested if anyone has a take on that.)

In the model-free case, I think it would not be trained like AlphaGo Zero but instead with something like PPO. In the model-based case it would be more similar to AlphaGo Zero: training would use Monte Carlo tree search, with debate serving as a policy improvement operator, making it IDA. Or does it not matter? (N.b. I'm using model-free and model-based in the RL sense here, where the "model" is not the ML model but rather a model of the game that allows the network to simulate the game in its mind.)

More details on the two approaches (a toy sketch of both training loops follows this list):

  • Model-free — During training, the network gets reward for winning the game, and a policy gradient algorithm (e.g. PPO) updates it to take more winning moves in the future.
  • Model-based — During training, the network's policy head is trained to better predict the result of the amplification process, i.e. what move it would make after simulating the debate in its mind. Edit: I'm not sure how you would compute the distance between two possible utterances, though, to constitute the loss. Maybe something like what is used in RLHF fine-tuning for LLMs, but I'm not familiar with that.
  • Both — In both cases, [my guess is that] the network is outputting arbitrary utterances, which could be position-taking sentences or argumentative sentences.
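
To make the contrast concrete, here is the toy sketch of the two training loops mentioned above. Everything in it (the game, the judge, and the tabular "policy") is a made-up placeholder rather than anything from the paper; it is only meant to show where the learning signal comes from in each case:

```python
import random

# Toy stand-ins so the sketch runs end to end: a "debate" is a fixed-length
# sequence of integer utterances and the judge checks a trivial property.
VOCAB = list(range(10))
DEBATE_LENGTH = 4  # players 0 and 1 alternate utterances

def judge(transcript):
    """Placeholder judge: player 0 wins iff the utterances sum to an even number."""
    return 0 if sum(transcript) % 2 == 0 else 1

class Policy:
    """Tabular utterance preferences, standing in for the shared network."""
    def __init__(self):
        self.prefs = {u: 1.0 for u in VOCAB}

    def sample(self, transcript):
        total = sum(self.prefs.values())
        r, acc = random.uniform(0, total), 0.0
        for u, w in self.prefs.items():
            acc += w
            if r <= acc:
                return u
        return VOCAB[-1]

    def reinforce(self, utterance, reward, lr=0.1):
        # Model-free flavour: nudge preferences toward moves that led to a win.
        self.prefs[utterance] = max(1e-3, self.prefs[utterance] + lr * reward)

    def imitate(self, utterance, lr=0.1):
        # Model-based flavour: move toward the utterance chosen after search.
        self.prefs[utterance] += lr

def play_debate(policy):
    transcript = []
    for _ in range(DEBATE_LENGTH):
        transcript.append(policy.sample(transcript))
    return transcript

def model_free_step(policy):
    """Self-play plus policy gradient: the only signal is the judge's verdict."""
    transcript = play_debate(policy)
    winner = judge(transcript)
    for i, utterance in enumerate(transcript):
        reward = 1.0 if i % 2 == winner else -1.0
        policy.reinforce(utterance, reward)

def model_based_step(policy, n_rollouts=20):
    """Search as policy improvement: simulate debates "in its head", then distil
    the searched-for move back into the policy (the AlphaGo Zero / IDA flavour)."""
    transcript = []
    for _ in range(DEBATE_LENGTH):
        player = len(transcript) % 2

        def win_rate(u, transcript=transcript, player=player):
            wins = 0
            for _ in range(n_rollouts):
                sim = transcript + [u]
                while len(sim) < DEBATE_LENGTH:
                    sim.append(policy.sample(sim))
                wins += judge(sim) == player
            return wins / n_rollouts

        best = max(VOCAB, key=win_rate)  # crude stand-in for MCTS
        policy.imitate(best)             # supervised target produced by the search
        transcript.append(best)

if __name__ == "__main__":
    model_free_policy, model_based_policy = Policy(), Policy()
    for _ in range(50):
        model_free_step(model_free_policy)
        model_based_step(model_based_policy)
```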

The relevant paper sections I found on this are:

  • "...we propose training agents via self play on a zero sum debate game." AND "We can approximate optimal play by training ML systems via self play, which has shown impressive performance in games such as Go, chess, shogi, and Dota 2 [Silver et al., 2016, 2017a,b, OpenAI, 2017]." AND "Similarly, the deep networks used in Silver et al. [2017b] are convolutional residual networks unrelated to the game tree of Go, though the training process does involve the tree via MCTS."
    • Strongly implies the model-based approach given the self-play and MCTS. But maybe either approach can be viewed as self play? (In the model-free case if it's playing a copy of itself.)
  • "The equivalence is far from exact: the feedback for a debate is about the whole game and the feedback for amplification is per step, debate as presented uses reinforcement learning while the easiest versions of amplification use supervised learning, and so on. However all these features can be adjusted in either direction."
    • Not exactly sure, but maybe this is saying that either approach works?
  • "In contrast to a legal argument or a typical competitive debate, the two players in this game are allowed to choose what they are arguing for, including both arguing for the same thing."
    • Hence the "arbitrary utterances" thing above.
  • "At test time it suffices to stop after step 2: we do not need to run the debate (though agents could simulate debates at test time to strengthen answers)."
    • The parenthetical implies that the model-based approach would be used. However, under both approaches I think it would be valid to not run the debate at test time. Whatever "opening move" the network takes would be its stance on the proposition (e.g. on the "Where should we go on vacation?" question, its first utterance would likely be something like "Aruba").

Comment by sid (sidhire) on The alignment problem from a deep learning perspective · 2022-12-11T21:14:35.907Z · LW · GW

Do you think it's possible we end up in a world where we're mostly building AIs by fine-tuning powerful base models that are already situationally aware? In this world we'd be skipping right to phase 2 of training, at least on the particular task, thereby losing any of the alignment benefits to be gained from phase 1 on that task.

Concretely, suppose that GPT-N (N > 3) is situationally aware, and we are fine-tuning it to take actions that maximize nominal GDP. It knows from the get-go that printing loads of money is the best approach, but since it's situationally aware it also knows that we would modify it if it did that. Thus, it takes agreeable actions during training, but once deployed pivots to printing loads of money. (In this hypothetical the hazard of just printing money doesn't occur to us humans, but you could imagine replacing it with something that more plausibly wouldn't occur to us.)

Comment by sid (sidhire) on Solstice 2022 Roundup · 2022-12-05T18:00:05.194Z · LW · GW

I think the date is Dec 17 per the Facebook event?

Comment by sid (sidhire) on Trying to Make a Treacherous Mesa-Optimizer · 2022-11-12T17:53:47.397Z · LW · GW

On page 8 of the paper they say, "our work does not demonstrate or address mesa-optimization". I think it's because none of the agents in their paper has learned an optimization process (i.e. is running something like a search algorithm on the inside).

Comment by sid (sidhire) on The alignment problem from a deep learning perspective · 2022-10-28T15:25:57.997Z · LW · GW

It says that the first head predicts the next observation. Does this mean that head is first predicting what action the network itself is going to take, and then predicting the state that will ensue after that action is taken?

(And I guess this means that the action is likely getting determined in the shared portion of the network—not in either of the heads, since they both use the action info—and that the second head would likely just be translating the model's internal representation of the action to whatever output format is needed.)
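
For concreteness, here is a minimal sketch of the two-headed structure I'm imagining. This is my guess at the shape, not something taken from the paper, and the layer sizes and names are made up:

```python
import torch
import torch.nn as nn

class TwoHeadedAgent(nn.Module):
    """Guess at the structure: a shared trunk that (implicitly) settles on an
    action, an action head that just decodes it, and an observation head that
    predicts the next state on the assumption that the action gets taken."""

    def __init__(self, obs_dim=32, hidden_dim=64, n_actions=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Second head: translate the trunk's internal representation into an action.
        self.action_head = nn.Linear(hidden_dim, n_actions)
        # First head: predict the next observation from the same shared features,
        # which already encode the to-be-taken action.
        self.obs_head = nn.Linear(hidden_dim, obs_dim)

    def forward(self, obs):
        features = self.trunk(obs)
        return self.action_head(features), self.obs_head(features)

agent = TwoHeadedAgent()
action_logits, predicted_next_obs = agent(torch.randn(1, 32))
```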

Comment by sid (sidhire) on Prizes for ELK proposals · 2022-01-09T17:49:34.083Z · LW · GW

If the predictor AI is in fact imitating what humans would do, why wouldn’t it throw its hands up at an actuator sequence that is too complicated for humans—isn’t that what humans would do? (I'm referring to the protect-the-diamond framing here.)

Comment by sid (sidhire) on larger language models may disappoint you [or, an eternally unfinished draft] · 2022-01-03T17:05:37.917Z · LW · GW

There is some point at which it’s gaining a given capability for the first time though, right? In earlier training stages I would expect the output to be gobbledygook, and then at some point it starts spelling out actual words. (I realize I’m conflating parameters and training compute, but I would expect a model with few enough parameters to output gobbledygook even when fully trained.)

So my read of the de-noising argument is that at current scaling margins we shouldn’t expect new capabilities—is that correct? Part of the evidence is that GPT-3 doesn’t show new capabilities over GPT-2. This also implies that capability gains are all front-loaded at lower parameter counts.

As an aside that people might find interesting: this recent paper shows OpenAI Codex succeeding on college math problems simply by prompting it in a particular way. So in this case the capability was there in GPT-3 all along…we just had to find it.