A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology
post by Cosmia_Nebula · 2025-01-30T09:53:16.152Z
This is a link post for https://rentry.co/DeepSeek-interview-2025-01-27/
This document has been translated at ChinaTalk.media, but the critical technical section (points 18 to 48) is paywalled, so I translated that part myself.
## Technical Detail 1: SFT
> “There's no need to do SFT at the inference level anymore.”
18. The biggest shock brought by DeepSeek is not open source or low cost, but that there is no need to do SFT. (Note: SFT: Supervised Fine-Tuning, a technique to improve the performance of a pretrained model on a specific task or domain using labeled data.) But this holds only for logical tasks; non-logical tasks may still require SFT. It is interesting to discuss this point -- does this present a new paradigm or architecture that makes training models more sample-efficient? Or would the models simply iterate faster?
19. DeepSeek-R1 shows the benefits of using SFT for distillation. DeepSeek-R1 did do *some* SFT, but only in the third step, followed by RLHF (Reinforcement Learning from Human Feedback) for the final alignment step.
20. R1 is SFT-trained on synthetic data generated by RLHF-trained models, which means there is no need to use a particularly complex method: as long as there is a good enough method, you only need to distill it with standard SFT.
21. The essence of GRPO lies in the fact that the base model must be smart enough. One prompt gets 16 rollouts, because it takes a dozen or so attempts to get even one right answer. R1 showed us that this works: a good base model plus a verifier (a minimal sketch of the group-relative scoring is given below). Math and coding are good starting domains because they are easy to verify, but theoretically similar processes can be run for other scenarios and tasks, and eventually a generalist RL model will be realized.
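A minimal sketch of that group-relative scoring, assuming a binary verifier reward and treating the sampling itself as already done; the function and numbers below are illustrative, not DeepSeek's actual training code:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its own group of rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # all-equal rewards give zero advantages
    return [(r - mean) / std for r in rewards]

# Toy example: 16 rollouts for one math prompt, each scored 1 by the verifier
# if the extracted answer matches the reference, else 0. Suppose only two of
# the sixteen attempts were correct:
rewards = [0] * 14 + [1] * 2
print(grpo_advantages(rewards))
# Correct rollouts get a positive advantage, wrong ones a small negative one,
# so the policy update pushes toward whatever produced the right answers.
```

Note that if all 16 rollouts are wrong, the advantages are all zero -- the failure mode discussed under point 44 below.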
22. R1-Zero got CoT to emerge without SFT, and the CoT just got longer and longer during training. This emergence is very meaningful; SFT is just a help: without SFT we still get a model, with SFT we get a model much faster.
23. This incident shows that the many small companies in the AI model game can now use SFT to distill from large models and get good small models. However, [though R1 is not a small model,] SFT was not fully abandoned in training R1.
24. Consider a set of infinitely long CoTs generated by an LLM. That set can theoretically be viewed as a Turing machine, and arbitrarily complex computational tasks can be solved by it. The non-infinite CoT that you actually get is essentially just an intermediate computational trace -- an optimized way to iteratively sample potential outputs. It might get the right result sometimes, and [every time it does, it] nudges the model towards the right result. In essence, the model has to do some irreducible amount of computation to accomplish a task, and the CoT is simply the intermediate computation that the model has to go through. We can call the final answer an "emergence", but we can also say this is just what computation *is*.
25. There is no mention of long context in DeepSeek's paper, but from the vibes, the effective context window increased a lot between R1-preview and R1. I guess it is because they used better Long2Short CoT -- including the CoT used in the third-stage SFT, which was also finally removed during generation. [Comment: I have no idea what this sentence means. It originally says "包括在第三阶段的 SFT 用的 CoT 最终在 generation 的时候也被去掉".] The final released version of R1 may have used even cleaner CoT data for its SFT.
26. There are several kinds of data for SFT. The first is the cold-start data, which is more like giving the model a good strategy and a better initialization so that it can explore better; after all, in the GRPO objective there is that term [the KL penalty term, written out below] which encourages the policy to stay close to the starting policy. The second is the synthetic data generated after RL was done [on R1-Zero], which was then combined with other data and used to train `DeepSeek-V3-Base` by SFT. Essentially, each domain has its own data processing pipeline, and the abilities that can be learned from this synthetic data were ultimately sourced from the base model; the distillation lost nothing. Putting together data from multiple domains may have led to generalization.
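For reference, here is a simplified sequence-level form of the GRPO objective (the DeepSeekMath paper writes it per token, but the structure is the same); the final β-weighted term is the KL penalty mentioned above, pulling the policy π_θ back towards the reference policy π_ref, i.e. the cold-start SFT initialization:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\,\hat{A}_i,\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i\right)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
\qquad \hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}.
$$

This is also where the cold-start data matters: the KL term anchors exploration near that initialization.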
27. I'm not sure about the sample-efficiency of training R1. I guess OpenAI has done similar things for sample-efficiency, such as fine-tuning. [For R1, they actually trained twice. The first RL-trained R1 was an internal model,] not the final R1; it was used simply to generate training data. That generated data was then used to do SFT on `DeepSeek-V3-Base` again, and *that* led to R1. The synthetic data contained 600K reasoning examples and 200K non-reasoning examples. In the second stage, the model might sometimes have received a problem that required reasoning but fell outside the example domains; in those cases it might still have solved the problem and thus yielded reasoning data. The non-reasoning data is part of the V3 SFT data, produced by letting V3 impute a CoT. 800K examples in total -- still pretty small, pretty efficient.
## Technical Detail 2: Data
> “DeepSeek takes data labeling very seriously.”
28. Scale.AI won't necessarily fail. Now for RL on various domains, most commonly math and coding, we still need expert labels. Data labeling may become more complex, but the market will exist.
29. For training, we hardly see the benefit of multimodal data; in other words, the cost is too high. Today there is no evidence it is useful. In the future, the opportunities may be bigger.
30. DeepSeek attaches great importance to data labeling, and I heard that Liang Wenfeng himself did labeling. In addition to algorithms and skills in AI, the accuracy of the data is also very critical. The cost of Tesla's labeling is almost 20 times that of China's self-driving efforts. China's self-driving data effort began as a large and comprehensive thing, then kept becoming more and more refined, until the final discovery that they needed people with special driving experience and ability -- which is what Tesla was doing from the very beginning. Tesla's robot's movements are labeled by people with very healthy cerebellums, so the smoothness is very good, while the smoothness of labels from the people hired for China's self-driving effort is very poor. So DeepSeek's investment in data labeling is one of the keys to its model efficiency.
## Technical Detail 3: Distillation
> “The bad thing about distillation is that model diversity goes down.”
31. If you avoid understanding the biggest technical pain points in model training, by just doing distillation, you may be trapped by some technical problem when the next generation of technology is proposed.
32. Big-model and small-model capabilities are not matched. Distillation from big model to small model is real distillation, teacher to student. If you try to distill Chinese data from a model that does not know Chinese at all, the performance may drop. But in fact distillation into small models does lead to a very obvious performance improvement: R1-distilled models would be much better at RL, because they are trained on data that comes from beyond the small model itself.
33. The disadvantage of distillation is that the diversity of the model decreases, which affects the upper limit of the model and prevents it from surpassing the strongest model. However, in the short term, distillation is a way forward.
34. There will be some hacky things during distillation. When using RL to train an instruction-tuned model, it would, during the early stage, first make up useless ideas and then suddenly answer the question correctly at the end. The reason is that many of these odd RL hacks have subtle causes: the model may have memorized a lot of the questions during pre-training, so even when it is pretending to think, it is in fact just nudging itself towards the problems it memorized. This is the hidden danger of distillation. If we distill without annotation, then when we do Reinforcement Learning with Verifiable Rewards (RLVR), the model ends up solving the problem in a simpler way instead of thinking about the problem. OpenAI has not solved this either. It may be a flaw of this generation of technology.
35. You can take shortcuts -- instead of working out the technical solution from your own vision, you can just reproduce someone else's. But this has a hidden downside in the long run. For example, our generation of technology assumes there is no qualitative change in `long context`; under that assumption, the ceiling on problem solving may be limited. R1-Zero may be the right direction, and it may be better to do R1-Zero from the start and to avoid starting with o1-like data. Following someone else's technical solution may not be good. One should explore more.
36. Other models can also get pretty good results from distillation. In the future, there may be a distinction between the roles of teacher and student in the model ecosystem. Producing models that can be a good student might become a viable business model.
37. In terms of distillation and technical route, R1 brings less of a shock than AlphaGo did, but in business terms, its potential to become a breakout success [outside of the AI circle] is much greater than AlphaGo's.
38. Distillation is divided into two phases. If you just distill o1 or R1 without establishing your own system and verifiable reward, it will lead to more and more reliance on distillation, but it is impossible to distill your way to a generalist model, because you don't get the reward signal and you don't get the special CoT. Moreover, the first stage of distillation leaves traces: a model distilled from OpenAI's models may carry many annealing scars from OpenAI. Why did R1-Zero gain such powers during pure RL? It is directly related to the self-reflection ability of the base model, obtained after annealing.
39. I don't really believe that a model pretrained on purely Internet data and no annealing can achieve such behavior, because there is almost no high quality data on the internet.
40. There are probably only a few top labs exploring exactly how much data and what data ratios are needed for the annealing phase. Either distillation or no distillation -- both can be thought of as RL. After all, distillation is just [Behavior Cloning](https://en.wikipedia.org/wiki/Imitation_learning#Behavior_Cloning), a form of unlimited RL (the loss is written out below), but SFT-only has a very low performance ceiling and compromises diversity.
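To make the "distillation is just behavior cloning" reading concrete (my gloss, not a formula from the session): behavior cloning here is nothing more than the usual SFT negative log-likelihood, computed on teacher-generated outputs rather than human-written ones,

$$
\mathcal{L}_{\mathrm{BC}}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_{\mathrm{teacher}}(\cdot\mid x)}\left[\sum_{t}\log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)\right].
$$

There is no reward signal and no exploration beyond what the teacher already produces, which is one way to read the low-ceiling, low-diversity complaint.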
41. The primary-market startups got very excited by DeepSeek. If DeepSeek can follow up with more model iterations, then for a public company that's not one of the big ones, using AI allows great flexibility. DeepSeek also distilled a few small versions that can run on a phone. If this direction is proven out, it would raise the performance ceiling on many AI applications.
42. To distill, it is very important to determine what the goal is. OpenAI did no data distillation. You definitely can't get a better model than OpenAI by distillation.
43. In the future, the model may need to learn to skip steps to answer questions, like human beings. Can it increase the ceiling performance of the model under the fixed context length?
## Technical Detail 4: Process Reward
> “The upper limit of process supervision is the human. As for the limit of the model itself? That's outcome supervision.”
44. Process Reward might still work, but Process Reward may be susceptible to reward hacking, i.e. the model doesn't learn anything, yet still makes the reward very high. If you are solving a math problem and you use a model to generate 1000 generations, there may be none that is close to the correct answer. In that case, any RLVR-like method would fail to train anything (a quick back-of-the-envelope version of this failure mode is given below). If there is an OK Process Reward at this point, you may be able to nudge the model in roughly the right direction, and in that case Process Reward is helpful. It depends on how hard the problem is, how reliable the process reward is, and so on.
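That failure mode, under the simplifying assumption that rollouts are independent with single-attempt success probability p: a verifier-only reward is uniformly zero whenever every rollout in the group is wrong, which happens with probability

$$
\Pr[\text{all } G \text{ rollouts wrong}] = (1-p)^G, \qquad \text{e.g. } p = 10^{-4},\ G = 1000 \;\Rightarrow\; (1-10^{-4})^{1000} \approx 0.90,
$$

so for a sufficiently hard problem most groups carry no gradient signal at all; a decent process reward is then the only thing still pointing the model in a useful direction.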
45. For the process reward in PRM estimation, if it deviates from the real reward, then it's quite easy to hack. Process supervision is theoretically possible, but the problem lies in the strength of the process and how to assign reward based on that strength. Right now, Outcome Reward Modeling is merely the simplest method: extract the final answer from the model output and match it against the ground-truth label (a minimal sketch is given below). Nobody has a very mature way to get a neural-network reward model that can't be easily hacked, and self-iteration by the model itself leads to the easiest reward hacking. Labelling the process data isn't too hard -- we can just enumerate it exhaustively; it's just that people don't want to do it. It may be a promising direction.
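A minimal sketch of the kind of outcome reward just described: extract the final answer and string-match it against the label. The `\boxed{}` convention and the normalization are illustrative assumptions, not any lab's actual verifier:

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final answer, assuming it is wrapped in \\boxed{...}
    (an illustrative convention); fall back to the last number in the text."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def outcome_reward(completion: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted answer matches the label
    after trivial normalization, else 0.0. Nothing about the reasoning
    process is scored, which is exactly the limitation discussed above."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    normalize = lambda s: s.replace(" ", "").rstrip(".").lower()
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0

print(outcome_reward("... so the total is \\boxed{42}.", "42"))  # 1.0
print(outcome_reward("I think the answer is 41.", "42"))         # 0.0
```

Anything that makes this matcher output 1.0 gets full reward, whether or not the reasoning was sound -- which is the hackability being discussed.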
46. The upper limit of process supervision is the human. Humans can't imagine many weird corner cases. As for the limit of the model itself? That's outcome supervision.
47. The reason why AlphaZero is more effective is that it can judge winner and loser at the end of the game, and the whole reward can be calculated from the win rate. An LLM doesn't know whether the stream of generation will eventually reach the answer, which is a little bit similar to a genetic algorithm. The upper limit may be higher, but it's also possible that it can't be hacked towards.
48. One of the advantages of AlphaGo over AlphaZero is that the rules of Go are fixed. So now models start from math and coding because they are easier to verify. Whether the verification is good enough will affect the quality of the final RL. The rules have to be good enough, otherwise the model will reward-hack -- the model can satisfy the rules, but the result is not what we want.