Distillation Of DeepSeek-Prover V1.5

post by IvanLin (matthewshing) · 2024-10-15T18:53:11.199Z

Contents

  TL;DR
  Overview
  Pre-Training and SFT
  RLPAF
  MCTS
  Results and Evaluations

TL;DR

DeepSeek-Prover-V1.5 is an improved open-source language model for theorem proving in Lean 4.

The paper continues pre-training DeepSeekMath-Base, a math foundation model, then does supervised fine-tuning on a dataset of Lean 4 proofs (including incomplete ones), followed by reinforcement learning with feedback sourced from the Lean 4 proof assistant. At inference time, DeepSeek-Prover-V1.5 finds proofs using truncate-and-resume, which combines the two existing proof-generation techniques, together with a novel Monte-Carlo Tree Search (MCTS) algorithm for superior performance.

Overview

Definitions:

- Lean 4: an interactive theorem prover (proof assistant) in which proofs are written as code and mechanically checked.
- Tactic: a Lean command that transforms the current proof state, e.g. by rewriting a goal or applying a lemma.
- Proof state: the remaining goals and available hypotheses at a given point in a proof, which Lean reports after each tactic.

Language models trained for theorem proving typically do one of two things:

- Proof-step generation: produce one tactic at a time, submit it to the proof assistant, observe the resulting proof state, and repeat until the proof closes. This gives dense feedback but requires many interactions with the prover.
- Whole-proof generation: produce the entire proof in a single pass. This is cheaper per attempt but receives no intermediate feedback from the proof assistant.

DeepSeek-Prover-V1.5 combines proof-step and whole-proof generation, naming the combination truncate-and-resume. Truncate-and-resume starts with whole-proof generation. If Lean detects an error in the proof, all code following the error is removed (truncate), and the verified proof steps are used as a prompt for generating the next proof segment (resume). This truncate-and-resume mechanism is used in conjunction with MCTS to search over possible proof states, and a reward-free exploration algorithm that uses intrinsic motivation to explore the tactic space is introduced to address reward sparsity.
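Below is a minimal sketch of the truncate-and-resume loop, assuming hypothetical `generate(prompt)` (samples whole-proof code from the model) and `verify(code)` (runs the Lean 4 checker and reports the first failing line) helpers; it illustrates the idea rather than the paper's implementation.

```python
def truncate_and_resume(theorem, generate, verify, max_attempts=8):
    """Sketch of truncate-and-resume whole-proof generation.

    `generate(prompt)` samples Lean 4 tactic code continuing the prompt (hypothetical).
    `verify(code)` runs Lean 4 and returns (ok, error_line), where `error_line`
    is the 0-indexed line of the first error when `ok` is False (hypothetical).
    """
    prefix = theorem + " := by\n"                 # verified proof prefix so far
    for _ in range(max_attempts):
        attempt = prefix + generate(prefix)       # whole-proof generation from the prefix
        ok, error_line = verify(attempt)
        if ok:
            return attempt                        # a complete, Lean-verified proof
        # Truncate: drop the failing line and everything after it,
        # keeping only the tactics Lean accepted, then resume from there.
        prefix = "\n".join(attempt.splitlines()[:error_line]) + "\n"
    return None                                   # no verified proof within the budget
```

In the paper this mechanism is not run as a flat retry loop but serves as the expansion step of the tree search described below.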

Pre-Training and SFT

Pre-training: the authors continue pre-training DeepSeekMath-Base 7B on further math- and code-heavy data, with an emphasis on formal languages such as Lean, Isabelle, and Metamath, producing DeepSeek-Prover-V1.5-Base.

Supervised fine-tuning: the base model is fine-tuned on a Lean 4 proof dataset built from DeepSeek-Prover-V1 data and augmented in two ways: intermediate tactic states reported by Lean are inserted into proofs as comments (so the model learns to continue from a truncated, partially verified proof), and natural-language chain-of-thought comments are added before the proof code. This produces DeepSeek-Prover-V1.5-SFT.
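As a rough illustration, a proof annotated with a tactic-state comment might look like the following; this is a hypothetical example, and the exact annotation format used in the paper's data may differ.

```lean
-- Hypothetical illustration of a Lean 4 proof annotated with a tactic-state comment;
-- the exact format used in the SFT data may differ.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  /- tactic state:
     a b : Nat
     ⊢ a + b = b + a -/
  exact Nat.add_comm a b
```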

RLPAF

Reinforcement Learning from Proof Assistant Feedback (RLPAF): the SFT model is further trained with reinforcement learning, using the Lean 4 verifier as the reward source: a sampled proof gets reward 1 if Lean verifies it and 0 otherwise. Training uses the GRPO algorithm, which samples a group of candidate proofs per theorem and scores each candidate relative to the rest of its group instead of relying on a separate critic model. This yields DeepSeek-Prover-V1.5-RL.

Thought-Augmented Proof Generation: the model is prompted to write out a natural-language chain of thought (as comments) before the Lean proof code; results are reported both in this CoT mode and in a plain non-CoT mode.
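A rough sketch of how a verifier-based reward and GRPO-style group-relative advantages can be computed; `lean_verifies` is a hypothetical stand-in for the Lean 4 checker, and this illustrates the idea rather than reproducing the paper's training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(candidate_proofs, lean_verifies):
    """Binary proof-assistant reward with GRPO-style group normalization.

    `candidate_proofs` is a group of proofs sampled for the same theorem;
    `lean_verifies(proof)` (hypothetical) returns True if Lean 4 accepts it.
    The returned advantages are what the policy-gradient update is weighted by.
    """
    rewards = [1.0 if lean_verifies(p) else 0.0 for p in candidate_proofs]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        # Every candidate succeeded or every candidate failed:
        # the group carries no relative learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```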

MCTS

To run MCTS on top of whole-proof generation, the paper uses a proof-tree abstraction in which nodes correspond to tactic states and edges to single tactic applications: each generated proof is parsed into its constituent tactics, and the proof search tree is constructed at the tactic level, with each edge representing a single tactic-state transition.

The process then iterates the standard MCTS phases, adapted to truncate-and-resume (see the sketch after the next paragraph):

- Selection: starting from the root (the theorem statement), a tree policy walks down the existing tree and chooses a node, i.e. a verified partial proof, to expand.
- Expansion: the chosen node's partial proof is used as the prompt, the model generates a whole-proof continuation, Lean verifies it, the output is truncated at the first error, and the newly verified tactics are added to the tree as new nodes.
- Backpropagation: the resulting reward (extrinsic if the proof is completed, plus the intrinsic reward described below) is propagated back along the path to the root, updating the statistics used by the selection policy.

Intrinsic Rewards: extrinsic rewards for MCTS are sparse, occurring only when a proof is completely solved. To address this, the RMaxTS algorithm is introduced[1], incorporating intrinsic rewards that encourage the search agent to explore and gather information about the environment, balancing the optimization of extrinsic rewards with the acquisition of general knowledge about the interactive proof environment. Concretely, an expansion earns an intrinsic reward when it adds at least one previously unseen tactic state to the tree.
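Putting the pieces together, here is a compact, hypothetical sketch of the tactic-level tree search with an RMax-style novelty bonus. It reuses the `generate`/`verify` stand-ins from the earlier sketch, assumes the theorem statement fits on one line and that any Lean error falls in the newly generated portion, and uses plain UCB1 as a stand-in for the paper's selection rule.

```python
import math
from collections import defaultdict

class ProofSearchTree:
    """Tactic-level MCTS sketch: a node is a verified proof prefix,
    stored as a tuple of tactic lines; the root is the empty tuple."""

    def __init__(self, exploration=2.0):
        self.children = defaultdict(set)      # node -> set of child nodes
        self.visits = defaultdict(int)        # node -> visit count
        self.total_reward = defaultdict(float)
        self.exploration = exploration

    def ucb(self, parent, child):
        # UCB1 score; the paper uses its own selection rule, this is a stand-in.
        if self.visits[child] == 0:
            return float("inf")
        exploit = self.total_reward[child] / self.visits[child]
        explore = self.exploration * math.sqrt(
            math.log(self.visits[parent] + 1) / self.visits[child])
        return exploit + explore

    def select(self):
        # Selection: walk down the tree, always taking the highest-UCB child.
        node = ()
        while self.children[node]:
            node = max(self.children[node], key=lambda c: self.ucb(node, c))
        return node

    def expand(self, node, theorem, generate, verify):
        # Expansion via truncate-and-resume: prompt with the node's verified
        # prefix, sample a whole-proof continuation, keep what Lean accepts.
        prefix = theorem + " := by\n" + "".join(t + "\n" for t in node)
        attempt = prefix + generate(prefix)
        ok, error_line = verify(attempt)          # error_line is 0-indexed
        lines = attempt.splitlines()
        path = tuple(lines[1:] if ok else lines[1:error_line])
        # RMax-style intrinsic reward: 1 if this expansion discovered a new node.
        is_new = any(path[:i] not in self.children[path[:i - 1]]
                     for i in range(len(node) + 1, len(path) + 1))
        for i in range(len(node), len(path)):
            self.children[path[:i]].add(path[:i + 1])
        reward = (1.0 if ok else 0.0) + (1.0 if is_new else 0.0)
        return path, reward, ok

    def backpropagate(self, path, reward):
        # Update statistics along the whole path from the new node to the root.
        for i in range(len(path), -1, -1):
            self.visits[path[:i]] += 1
            self.total_reward[path[:i]] += reward


def search(theorem, generate, verify, iterations=64):
    tree = ProofSearchTree()
    for _ in range(iterations):
        leaf = tree.select()
        path, reward, ok = tree.expand(leaf, theorem, generate, verify)
        tree.backpropagate(path, reward)
        if ok:
            return theorem + " := by\n" + "\n".join(path)
    return None                                   # search budget exhausted
```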

Results and Evaluations

The model is evaluated on the following benchmarks:

- miniF2F (test split): competition-level (high-school and olympiad) problems formalized in Lean 4.
- ProofNet: undergraduate-level mathematics problems formalized in Lean 4.

Metrics: results are reported as pass@K accuracy, i.e. whether at least one Lean-verified proof of a theorem is found within a budget of K generation attempts (for the tree-search variants, the budget counts model generations made during the search).
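For concreteness, a tiny sketch of computing pass@K over a benchmark, assuming a hypothetical `prove(theorem, budget)` that runs the prover and returns a verified proof or `None`:

```python
def pass_at_k(theorems, prove, k):
    """Fraction of benchmark theorems solved within a budget of k attempts.
    `prove(theorem, budget)` is a hypothetical helper that runs the prover
    and returns a Lean-verified proof string, or None if none was found."""
    solved = sum(1 for thm in theorems if prove(thm, budget=k) is not None)
    return solved / len(theorems)
```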

They compare DeepSeek-Prover-V1.5 to prior state-of-the-art provers; with RMaxTS it sets new state-of-the-art results on both benchmarks (63.5% on the miniF2F test set and 25.3% on ProofNet).

 

  1. ^

    RMax applied to Tree Search
