Interviews with Moonshot AI's CEO, Yang Zhilin
post by Cosmia_Nebula · 2025-01-31 · LW
This is a link post for https://rentry.co/Moonshot-AI-interview/
<https://news.qq.com/rain/a/20240208A05KFR00>
# Exclusive interview with Moonshot's Yang Zhilin: How does a new AGI startup surpass OpenAI?
Overseas Unicorn
Published on Beijing Overseas Unicorn official account at 2024-02-21 11:25.
Interviewers: 天一、penny、guangmi
Editor: 天一
Typesetting: Scout
"Lossless long context is everything." This is the point we remember most deeply after a two-hour conversation with Yang Zhilin.
This technical judgment was already conveyed in October 2023 when Moonshot AI, founded by Yang Zhilin, released its first model, moonshot, and the smart assistant Kimi, supporting an input of 200,000 characters. The focus on "long" stems from Yang Zhilin's belief that the ultimate value of AI-Native products is providing personalized interactions, and lossless long context is the foundation for achieving this. He argues that model finetuning shouldn't exist in the long run, that the interaction history between a user and the model should be the best form of personalization, and that every generation of technology in history has increased context length.
Yang Zhilin has been labeled as a genius AI scientist, serial entrepreneur, and more. In this in-depth interview, he once again proves himself to be a startup founder who truly "understands" large models. Thus, this article presents many counter-consensus views: Yang Zhilin believes that finetuning will eventually become obsolete, and tokenizers may not necessarily be essential in the end. While large model trainers in Silicon Valley worry about data bottlenecks and energy limitations, he believes that all problems are interconnected. Multimodal approaches can alleviate data shortages, and synthetic data can address energy issues by changing the compute paradigm.
This article also attempts to answer a question of wide outside interest: how can a newly established AGI startup surpass OpenAI? Yang Zhilin's answer is "tech vision". The person at the top must be able to make technical judgments and also execute them decisively. A specific example is that Moonshot AI aims to be more user-centric than OpenAI, because Yang Zhilin judges that the scale-up effect of user data will ultimately surpass that of the base model itself.
Yang Zhilin is also confident that using the underlying principles of the transformer probability model will lead to AGI. In his words, "If you have a context length of 1 billion, today's problems will cease to be problems."
## 01. AGI: AI is fundamentally a pile of scaling laws
Overseas Unicorn: We liken the training of LLMs to a moonshot, thus Moonshot AI's name. What do you think about LLM training at startups today? With limited GPU and compute resources, is it still possible to achieve a "moonshot"?
Yang Zhilin: There are several different production factors for a "moonshot." Compute power is definitely a core one, but there are others. You need an architecture that simultaneously satisfies both scalability and generality. However, many architectures today don't meet these two criteria. The transformer architecture meets these two in the known token space, but when expanded to a more general scenario, it's not quite the case. Data is also a production factor, including the digitalization of the entire world and data from users.
So, among many core factors of production, we can increase the compute utilization rate by changing other factors of production. At the same time, for a "moonshot," compute power must continue to grow. The best models we see today are on the scale of 1e25 to 1e26 FLOPs. This magnitude will definitely continue to grow, so I think compute power is a necessary condition. Consider: machine learning or AI has been studied for 70--80 years, and the only thing that truly works is just the scaling law, which is to scale up these several production factors.
We are actually quite confident that within a one-year timeframe, we can achieve models at the scale of 1e26 FLOPs. Resources will eventually be allocated reasonably.
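The FLOPs magnitudes above can be illustrated with the common back-of-envelope approximation C ≈ 6·N·D for dense-transformer training compute (N parameters, D training tokens). The specific model size and token count below are illustrative assumptions, not figures from the interview.

```python
# Back-of-envelope training-compute estimate using the common
# C ≈ 6 * N * D approximation (N = parameters, D = training tokens).
# The numbers below are illustrative, not figures from the interview.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

# e.g. a hypothetical 70B-parameter model trained on 2T tokens:
flops = training_flops(70e9, 2e12)
print(f"{flops:.1e}")  # 8.4e+23
```

On this rough accounting, today's frontier runs at 1e25 to 1e26 FLOPs correspond to another one-to-two orders of magnitude beyond this example.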
Overseas Unicorn: It's estimated that OpenAI is using at least 100,000 H100 GPUs to train its next-generation model, with a single cluster reaching 30,000 GPUs. OpenAI is clearly pursuing a "moonshot", but perhaps doesn't focus as much on user and customer experience. Where will Moonshot AI's path of differentiation from OpenAI lie? What can Moonshot AI do that OpenAI can't?
Yang Zhilin: In the short term, a key point is that everyone's tech vision is not entirely the same. Many areas are not OpenAI's core competencies. For example, in image generation, DALL-E 3 is at least one generation behind Midjourney. GPT's long-context is also not state-of-the-art. Our lossless long-context technology that we developed recently performs better than OpenAI in many specific scenarios because it uses lossless compression technology. You can use it to read a very long article, and it can restore specific details well, and it can also do reasoning with the content. Users will also discover many scenarios, such as throwing 50 resumes at it and asking it to analyze and screen them according to your requirements.
To achieve differentiation, I think it's about seeing how large the tech space is. The larger the tech space, the greater the differentiation that can be achieved at the technology, product, and business levels. If the technology has already converged, then everyone can only race with each other, an involution among equalized competitors.
And I'm actually quite optimistic because there is still a huge tech space now. AGI technology can be divided into three layers:
The first layer is the combination of scaling law and next-token-prediction. This foundation is the same for everyone, and the process of catching up gradually converges. On this path, OpenAI is currently doing better because they have invested the corresponding resources over the past four or five years.
The second layer now has two core problems. The first is how to represent the world generally? By true "generality" I mean that of a computer, which uses 0s and 1s to represent the entire world. For transformer-based language models, they can represent a book, an article, or even a video, but it's still difficult to represent a larger 3D world or all the files on your hard drive. It hasn't achieved token-in-token-out, and there is still a gap from the so-called unified representation. This problem is what architecture is supposed to actually solve.
Another problem in the second layer: overcoming the bottleneck of not enough data through AI self-improvement. Today's AI is actually like a black box. This black box has two inputs: a power cord and a data cord. After inputting these two things, the box can produce intelligence. Then, everyone realizes that the input of the data cord is limited. This is the so-called data bottleneck problem. The next generation of AI needs to unplug the data cord and achieve a state where, as long as electricity is continuously input, it can continuously output intelligence.
These two core problems lead to a huge space in the third layer, including long-context, the generation of different modalities, the model's multi-step planning ability, instruction following ability, various functions that agents should do, etc.
These upper-level things will be hugely differentiated, because of those two important technical variables in the middle. I think there lies our opportunity.
In addition to the technical level, we are a bit differentiated from OpenAI in our values: we hope that in the next era, we can become a company that combines OpenAI's techno-idealism with the philosophy of commercialization exemplified by ByteDance. This Eastern pragmatism has its merits. If you don't care about commercial value at all, it is actually very difficult to truly create a great product, or to make an already great technology even greater.
Overseas Unicorn: What story do you think model companies should tell? Like OpenAI, should they tell a story of pursuing AGI, or a story of a super app? Would there be a contradiction between the two, and how do you balance them?
Yang Zhilin: How to tell a story depends on the mindset of the investors. For us, it is more important to understand the relationship between the two.
AGI and product are not a means-and-end relationship for us. Both are end-goals. At the same time, in the pursuit of AGI, I think the so-called data flywheel is very important, even though it is an old concept.
Products like ChatGPT haven't fully established continuous learning based on user data. I think this is largely because the base model can still improve. Once a generation evolves, the previous generation's user data is of little use. This is related to the stage of development -- now it's still "feasting upon" the base model's scaling law, and in the future, it may have to "feast upon" the scaling law of user data instead.
Basically, all internet products in history that have taken off have ultimately relied on the scale of user data. Today we can see some signs of this in Midjourney, which can outperform the scale-up of base models by "feasting upon" the scaling law of users; but if we look only at language models and text, the scaling effect of base models is still far greater than that of users. I think this will eventually shift to the scaling law of users; it's just a matter of time.
This is especially important now in the face of the data bottleneck. Especially human preference data, which is very limited, but you can't do without it. I think this is one of the most important problems that every AI-Native product should be thinking about now. Therefore, a company that doesn't care enough about users may not be able to achieve AGI in the end.
Overseas Unicorn: What do you think of MoE? There is a saying that MoE is not a true scale up, and only scaling up dense models will improve the model's capabilities.
Yang Zhilin: You can consider that there are two scaling laws, one with MoE and one without. Essentially, the scaling law depicts the relationship between loss and the number of parameters. MoE changes this function, allowing you to use more parameters while keeping the FLOPs constant. Synthetic data changes another relationship, allowing the data scale to increase while keeping the FLOPs constant.
It is a known certainty that we all march along the scaling law. Everyone is merely trying to change the specific relationship in the scaling law to obtain higher efficiency. The extra efficiency is their respective differentiated advantages.
Now many people think that if you have MoE you can achieve GPT-4. I think this is a one-sided view. In the end, the more substantive thing may still be how to have a unified representation space and scalable data production.
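The MoE point above can be sketched with a toy parameter count: total parameters grow with the number of experts, while per-token FLOPs track only the *active* parameters. All expert counts and sizes here are made-up illustrations, not figures from the interview.

```python
# Sketch of the MoE tradeoff: more total parameters at roughly
# constant per-token FLOPs, since FLOPs track *active* parameters.
# All numbers are illustrative assumptions.

def moe_params(n_experts, experts_per_token, expert_params, shared_params):
    """Return (total, active) parameter counts for a routed MoE layer stack."""
    total = shared_params + n_experts * expert_params
    active = shared_params + experts_per_token * expert_params
    return total, active

# Dense baseline: a single "expert", always active.
dense_total, dense_active = moe_params(1, 1, 13e9, 1e9)
# MoE: 8 experts, each token routed to 2 of them.
moe_total, moe_active = moe_params(8, 2, 13e9, 1e9)

print(dense_total, dense_active)  # 1.4e10 total, 1.4e10 active
print(moe_total, moe_active)      # 1.05e11 total, 2.7e10 active
```

Synthetic data changes the other axis of the same function: it lets the data scale grow without a matching growth in naturally collected tokens.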
Overseas Unicorn: If there were enough compute power, would someone want to make a trillion-parameter dense model?
Yang Zhilin: It depends on the rate at which inference cost decreases, but I think there definitely will be. Now everyone is making tradeoffs because the inference cost is too high. But in the end, directly training a trillion-parameter dense model will definitely be better than a model with only hundreds of billions of parameters.
Overseas Unicorn: Anthropic has been emphasizing the explainability of models, which is actually quite controversial. How do you think about explainability? Because you just mentioned that the model is a black box, and in fact, humans have not yet figured out how their own brains work.
Yang Zhilin: Explainability is a matter of trust at its core. Building a trustable rapport is very important. The corresponding application scenarios may even be different from ChatGPT, such as the combination of long-context and search.
When the model does not hallucinate at all, or the probability is very low, there is no need for explanation because what it says is all correct. And explanation may also just be part of alignment, for example, chain-of-thought can also be considered a form of explanation.
Hallucination can be solved through scaling law. But not necessarily in the pre-training stage, because in fact, alignment also has scaling law. It can definitely be solved as long as you can find the right data. AI is fundamentally a pile of scaling laws.
Overseas Unicorn: What is your expectation for AGI? Transformer is essentially a statistical probability model. Can it lead to AGI?
Yang Zhilin: There is nothing wrong with a statistical model. When next token prediction is good enough, it can balance creativity and factuality.
Factuality is generally a challenge for statistical models, but today's language models can have very sharp distributions. If you ask it "What is the capital of China?" [北京], the model would start with over 99% probability on the character "北". At the same time, if I ask it to write a novel today, the probability distribution of the next word may be very uniform. Probability is actually a universal representation. In essence, there is a lot of entropy in this world. Grasp the things that can be made certain, and let the inherently chaotic things be.
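The sharp-versus-uniform contrast can be made concrete with Shannon entropy; the two distributions below are invented for illustration.

```python
import math

# Shannon entropy (in bits) of a next-token distribution. A factual
# question yields a near-deterministic (low-entropy) distribution;
# open-ended generation yields a much flatter (high-entropy) one.
# Both distributions below are invented for illustration.

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

factual = [0.99] + [0.01 / 99] * 99   # one token dominates
creative = [1 / 100] * 100            # uniform over 100 tokens

print(round(entropy(factual), 3))   # 0.147 bits
print(round(entropy(creative), 3))  # 6.644 bits (= log2 100)
```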
If we are going towards AGI, long-context will be a very important point. All problems are long-context problems: all architectural evolutions in history have essentially been about increasing the effective context length. word2vec recently won the NeurIPS Test of Time award. Ten years ago, it used a word to predict the surrounding words, which is equivalent to a context length of about 5. RNNs increased the effective context length to about 20. LSTMs went up to the tens. The transformer reached several thousand. Now we can reach several hundred thousand.
If you have a context length of 1 billion, today's problems will cease to be problems.
In addition, lossless compression is actually learning certainty in a chaotic environment. An extreme example is an arithmetic sequence. Given the first two numbers, each subsequent number is determined. There is no chaos, so a perfect model can restore the entire sequence. But there is a lot of noise in real-world data. We need to filter out this noise and allow the model to only learn what it can learn. In this process, we also need to allocate enough probability to the uncertain possibilities. For example, if you want to generate a picture, its loss will be higher than generating a piece of text. This is because the picture contains more chaos and information. But you only need to capture the part that you can grasp with certainty, and leave the remaining part to probability. For example, the color of a water glass can be green or red, which has a probability of happening, but the color information will not change "what the water glass looks like". So what needs to be focused on learning is the shape of the water glass. As for its color, a probability distribution needs to be made.
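The arithmetic-sequence example can be sketched directly: once the first two terms are known, a perfect predictor recovers every later term with certainty, so the rest of the sequence carries no residual information to encode.

```python
# The arithmetic-sequence example from above: given the first two
# terms, every subsequent term is fully determined, so a perfect
# model assigns it probability 1 (zero extra bits to encode).

def predict_next(seq):
    """Perfect predictor for an arithmetic sequence."""
    step = seq[1] - seq[0]
    return seq[-1] + step

seq = [3, 7]            # the two "given" terms
for _ in range(5):
    seq.append(predict_next(seq))

print(seq)  # [3, 7, 11, 15, 19, 23, 27]
```

Real-world data sits at the other end: the model should capture the deterministic structure (the "shape of the glass") and spread probability mass over what is genuinely chaotic (its color).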
Overseas Unicorn: What are the patterns in the improvement of context length? Is it technically predictable?
Yang Zhilin: I personally feel that there is a Moore's Law for context length. But it needs to be emphasized: the accuracy rate under a given length is also very important. Both length and accuracy (lossless compression) need to be optimized at the same time.
I think the increase in context length will likely be exponential, even if we require that the model's capability and intelligence do not degrade along the way.
## 02. Multimodal: Most architectures don't deserve to be scaled up
Overseas Unicorn: Everyone expects multimodal to explode in 2024. Compared to text, where will the technical difficulties of multimodal lie?
Yang Zhilin: The FLOPs of state-of-the-art video generation models are actually an order of magnitude less than those of language models. It's not that people don't want to scale up, but rather that most architectures are not worth scaling. In 2019, the most popular architecture was BERT. Later, people asked why no one was scaling BERT. The reason is that an architecture that deserves to be scaled needs to have both scalability and generality. I don't think BERT lacks scalability, but you can clearly see that it lacks generality: no matter how much you scale it, it can't write an article for you. Multimodal has also been stuck on architecture for the past few years, lacking a truly generalist model that people would want to scale. Diffusion is clearly not it; even if you scale it to the moon, it can't be AGI. Today, the auto-regressive architecture has brought some new possibilities, sacrificing some efficiency in exchange for generality.
Auto-regression itself is scalable, but the tokenizer is not necessarily scalable, or perhaps we won't need a tokenizer in the end. This is the core problem of 2024.
Overseas Unicorn: If the tokenizer is not scalable, do we need a completely new architecture beyond the transformer?
Yang Zhilin: Just speaking of the transformer itself, I don't think there's a big problem. The core issue is still solving the tokenizer problem. The transformer architecture has actually undergone many changes. Today, we're doing long-context and MoE, which are not standard transformers. But the soul/idea of the Transformer will certainly exist for a long time. The key is how to solve more problems based on this idea.
Overseas Unicorn: Actually, if the context length is infinitely long, we won't even need a tokenizer anymore?
Yang Zhilin: That's right. Essentially, if the model is strong enough, it can process any token, pixel, or byte. With an infinitely long context length, you can directly feed it everything on your hard drive. It will become your true new computer, taking action based on this context.
Overseas Unicorn: Leading model companies like OpenAI and Anthropic believe that a major bottleneck in 2024 will be data, so they have high expectations for how to use synthetic data. What do you think of synthetic data?
Yang Zhilin: An architecture worthy of being scaled up is the foundation. This architecture must first support the continuous addition of more data before data would truly become a bottleneck. The so-called "data bottleneck" we're talking about now will be encountered in 2024 in the text modality, but the introduction of multimodal data will postpone this problem by 1--2 years.
If the bottleneck of video and multimodal cannot be solved, then the text data bottleneck will become critical. We also have some progress on this point - if the problem is limited, such as mathematics or writing code, data is relatively easy to generate. There isn't a complete solution to general problems yet, but there are some directions to explore.
Overseas Unicorn: Will energy be the bottleneck in 2025? Because by then, the scale of individual clusters will be very large, posing a challenge to energy consumption.
Yang Zhilin: These issues are actually interconnected. Ultimately, multimodal might solve the data bottleneck, and synthetic data might solve the energy problem.
By the GPT-6 generation, those who have mastered synthetic data technology will show a significant advantage. Because there are actually two types of data: one for pre-training, and the other for alignment, which is much more expensive to acquire. If data generation technology is mastered, the cost of alignment might decrease by several orders of magnitude, or the same investment could generate several orders of magnitude more data, and the landscape will change.
I think 2025 and 2026 might be very important milestones – the majority of a model’s computation will happen on data that the model itself generates.
By 2026, the computation for inference may far exceed the training itself. We might spend 10 times the cost on inference, and then spend one time the cost on training afterwards. A new compute paradigm will emerge, where *inference is training*, and this inference is not for serving any users, but solely for generating synthetic data for itself.
If this happens, the energy problem will also be solved, because inference can be distributed. And it doesn’t violate the laws of physics; it's essentially still energy conservation. It’s just that we’ve changed the compute paradigm, allowing energy to be utilized in a distributed manner.
## 03. Super-app: Model finetuning might eventually disappear
Overseas Unicorn: Google and Douyin's search and recommendation systems have a strong data flywheel effect. The algorithms can provide real-time feedback based on user behavior, and the user experience continuously improves. LLMs currently cannot provide real-time feedback on user behavior. What will the data flywheel effect of AI-Native products be?
Yang Zhilin: I've thought deeply about this question. The core value of AI-Native products ultimately lies in personalized interaction. This is something previous technologies didn't do well, so this question is really about personalization—how to ensure that the more users use your product, the more they get a highly personalized interactive experience. Today, for many products, this level of personalization is almost zero. Previously, we could only do personalized recommendations, but now, users can interact with the product. This interaction is highly anthropomorphized and personalized. How do we achieve this?
I think this is actually a technical problem. In the traditional AI era, achieving personalization required continuously updating models and using small models to solve individual problems. In the era of large models, one way to achieve personalization is through finetuning, but I believe finetuning might not be the fundamental method, and in the long run, model finetuning might eventually disappear. Why? When your model's instruction-following ability, reasoning ability, and contextual consistency become strong enough, everything only needs to be kept in memory. For example, your large model's memory holds a bunch of "prefix"-like material that it can follow, and the cost can be made very low. In the end, your entire interaction history with the model *is* the personalization, a collection of your preferences and feedback. This feedback will be more direct than in the products of the previous era, because it is generated entirely through a dialogue interface.
Based on this judgment, we can think further: how can we technically achieve long-context-based customization to completely replace finetuning?
I think we are moving in this direction now. In the future, models won't need finetuning; they will solve these problems through powerful context-consistency and instruction-following. It seems likely that the long-term trend is for the base-level technology itself to become personalized, and this will be a very important change.
For example, the new compute paradigm brought by GPT-4: creating GPTs doesn't require finetuning [just in-context learning]. Previously, customization was achieved through programming. Today, it's actually done by making the model's prefix very complex, in order to locate a specific capability within this highly generalist model. This is the new, AI-native way of personalization. Attaching a traditional recommendation engine on top will definitely be eliminated by the new way.
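A minimal sketch of this prefix-based personalization replacing finetuning, assuming a hypothetical assistant product; all function names and strings here are invented for illustration, not an actual Kimi or GPTs API.

```python
# Toy illustration of "the prefix is the personalization": instead of
# finetuning per user, preferences and interaction history are packed
# into the context that precedes each new request. Everything named
# here is hypothetical.

def build_prompt(preferences, history, new_request, max_history=20):
    """Assemble a personalized prefix plus the new request."""
    prefix = ["You are a personal assistant. User preferences:"]
    prefix += [f"- {p}" for p in preferences]
    prefix.append("Recent interaction history:")
    prefix += history[-max_history:]   # bounded only by context length
    prefix.append(f"User: {new_request}")
    return "\n".join(prefix)

prompt = build_prompt(
    preferences=["replies in English", "prefers concise answers"],
    history=["User: summarize this paper", "Assistant: (summary...)"],
    new_request="screen these 50 resumes against the job description",
)
print(prompt.splitlines()[0])  # You are a personal assistant. User preferences:
```

As instruction following and contextual consistency improve, `max_history` stops being a real constraint: the whole interaction history becomes the standing prefix.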
Overseas Unicorn: Pursuing lossless long-context first of all -- how did you decide to use that as the corporate strategy?
Yang Zhilin: I think the most important thing is to think while keeping the end goal in mind. Large models will become the new computer, so they will definitely need a lot of context-memory; the memory of conventional computers has also grown by at least several orders of magnitude over the past few decades, from very little at the start. The second point is that the ultimate value of AI is personalization.
Overseas Unicorn: OpenAI also has some long-context capabilities.
Yang Zhilin: It doesn't yet see the user interaction as a source of personalization. For example, if we prompt ChatGPT with something, whether it’s today or tomorrow, as long as the model version is the same, the result will be basically the same. This is what I mean by a lack of personalization.
Ultimately, everything is about instruction following. It's just that your instructions will become more and more complex. Today, your instructions might start with 10 words, but later, they might be 10,000 words or 1 million words.
Overseas Unicorn: Chatbot has always been the "white moonlight" [something that is wished for, but not achievable] for AI scientists. If each user has hundreds of conversations with a chatbot every day, will the chatbot system be able to collect and understand more user context, eventually greatly surpassing the matching accuracy of search and recommendation systems? Just like our interactions with colleagues and family, where a single word or even eye-contact is all that's needed to say what we mean.
Yang Zhilin: The core is crossing the trust barrier.
I think the ultimate measure of the long-term value of an AI product is how much personal information users are willing to input into it, and then lossless long-context and personalization are responsible for turning these inputs into valuable things.
New hardware forms may also be needed, but I think the bottleneck right now is still in models and software. And let's think one level deeper: the prerequisite for users to input a lot of information is trust. You need to have a sufficiently engaging and human-like AI. It shouldn't be that I set up certain product functions just to extract your information. The ultimate effect should be that the user and the AI become friends, such that the user feels safe saying whatever is on their mind to the AI.
Inflection Pi's motivation is actually very good, wanting to build strong trust, but Pi might need to go one step further: How do you build user trust? It might not be socially acceptable to just appoint a "lifelong partner" to each user, as that's somewhat against human nature.
Overseas Unicorn: Moonshot AI wants to create a super-app. What does your ideal super app look like? How big does it need to be to be considered "super"?
Yang Zhilin: It still depends on the breakout level. When all the relatives around you are using it, then you've truly become a super-app. And I believe that the improvement of AI capabilities will precede the product's breakout. For example, if, hypothetically, Character.ai were backed by a perfect multimodal model today, then I think the probability of it breaking out would be at least 10 times higher. Ultimately, the upper limit of an application is measured by how much connection can be made between a human and an AI with a context length of many years.
## 04. The best people must be able to unlearn
Overseas Unicorn: What should the ideal CEO talent profile for an AGI company look like?
Yang Zhilin: On one hand, they need to have a "tech vision". They can't just keep doing things that others have already proven. A true AGI company must have its own unique technical judgment, and this judgment should influence the company's overall direction. It won't work if the person at the top can't make decisions. We were already working on auto-regressive multimodal and lossless long-context at the start of the year, even though these only became very popular in the last month or two. Even today, lossless long-context is still not a consensus view. But if you only see it today, there won't be enough time to iterate, and you will end up a follower.
The second point is to have a deep understanding of the new way of producing AI-Native products, and then to adapt an organization around it. In the past, product development involved understanding user needs first, then designing features. In the new era, design has to be completed while the thing is being made. ChatGPT emerged through the making of it, without first designing a bunch of use cases and then finding the algorithms to accomplish them. Kimi users uploading their resumes and then filtering them is also a use case that just emerged, one we hadn't even known to test for before we launched it.
Resource acquisition is also certainly very important. The main thing that burns money is compute. In the early stages, it relies on financing, and later it needs more product commercialization. Commercialization also cannot simply copy mature things from the previous era, so a good CEO and team should have a certain amount of experience, but also very strong learning and iteration capabilities.
Overseas Unicorn: But it's possible that investors can't discern whose tech vision is the most advanced.
Yang Zhilin: I'm not too worried about this issue. The current situation is the best allocation method, closer to a free market, and ultimately will have the highest allocation efficiency. What we need to prove to others isn't our vision anyway, because vision is an abstract thing. To prove ourselves, we must ship concrete models and products. After Anthropic shipped models like Claude, they immediately received more resources. The market is fair.
Overseas Unicorn: From the perspective of establishing product and company competitive moats, the industrial age emphasized economies of scale, the internet age emphasized network effects, will there be a new paradigm in the AGI era?
Yang Zhilin: In the short term, changes in organizational methods bring about technical improvements -- you achieve better technology through better organization, and then directly convey a better experience in the product.
In the long term, it's likely still network effects. The question is, how will network effects manifest themselves? For example, the two-sided networks of the internet era may still exist, but they may not be between users and creators. The two-sided networks of AI-Native products may be reflected in personalization; there is a co-creation relationship between users and AI.
So what I see as worthy of exploration are two points: the continuous improvement of model capabilities, and the other is two-sided effects. They will bring about a new paradigm in the new era. MidJourney has already seen an explosion in this two-sided effect. Stable Diffusion, as an open-source model, is in an awkward position because the user side is so dispersed, and it can only rely on improvements to the base model.
Overseas Unicorn: From a hiring perspective, how do you define good talent?
Yang Zhilin: I would break it down into experience and learning. Learning is a general ability: not just to learn, but also to unlearn, especially past successes. Suppose you went from 0 to 1 and built YouTube; doing AI products now might be harder for you than for others, because you have to unlearn a lot of things. Learning is more important than experience. Maybe in 5 more years, the AI industry will have many job titles in a mature state. Today, I think dividing up functions doesn't really make sense; everyone needs to be multifaceted.
Overseas Unicorn: What kind of researcher possesses tech vision?
Yang Zhilin: The core lies in two points: one is to focus on the big picture and let go of the small details, and the other is to have an end-game mindset. I’ve collaborated with many researchers, and a common issue is excessive "detailed woodcarving." They tend to see many things that can be optimized in specific areas. For example, we found that the transformer solved the context length problem of LSTM, but if you go one level deeper, you'll realize that essentially every generation of technology is just about increasing context length over the previous generation.
Overseas Unicorn: How many more people like that does Moonshot AI need?
Yang Zhilin: Objectively speaking, our limitations are definitely on the supply side. The scarcity of AGI talent right now is due to experience, but there are still many talented individuals with learning abilities.
However, from a demand perspective, the entire organization cannot be too large. If we turn ourselves into a big company, we will lose many organizational advantages. So we will definitely keep the organization lean and efficient. A key judgment of mine is that AGI doesn't require that many people. And in the long run, once the dependence on human data is truly removed, models at the GPT-6 level and beyond can self-improve entirely on their own. Only then can we break through the bounds of current human capability.
Overseas Unicorn: What do you think about the difficulty and time it takes to catch up with GPT-4?
Yang Zhilin: It's very easy to "Goodhart the benchmark rankings" to reach GPT-4 levels, but achieving its real performance is definitely difficult, and it doesn't just come down to resources; Google has already proven this. Actually, the training cost of GPT-4 isn't that high. Tens of millions of dollars isn't a scary number, which is good for us, and we've already made good progress.
The most important thing is having the underlying tech vision to predict what GPT-5 and GPT-6 should look like and then proactively execute and accumulate. Otherwise, you will never surpass OpenAI. Much of OpenAI's advantage comes from predicting ahead of time. They believed around 2018 that they were exploring the right direction and spent a long time accumulating.
Overseas Unicorn: If you were to create an image generation product, how would you do it? How would you balance language understanding and image quality?
Yang Zhilin: MidJourney has already done a great job in the single task of image generation. If I were to do it, I would want it to do many tasks while also doing some of those tasks very well. This is also OpenAI's thinking, but they haven't actually succeeded.
An AGI company should become the entry point: the product users turn to by default. In addition, certain groups will have specific needs and pursue the absolute best performance, so there are still opportunities in the market for companies like Midjourney. However, when AGI's generalist model becomes powerful enough, many users will switch. If today I repackaged all of Photoshop into a prompt, turning it into an all-around designer that everyone can outsource to, then fewer people would use Midjourney.
Midjourney's current position is due to its first-mover advantage, which got the "data flywheel" running. The tricky thing is whether there will be this window of opportunity in the future. If there isn't, then it's likely to be directly crushed by the generalist model.
Overseas Unicorn: Following this entry-point logic, how many entry points do you think there will be in the future?
Yang Zhilin: There are at least two: one that is useful and one that is fun.
The "information entry point" may no longer exist, because when we search for information, we are essentially hoping to complete a task end to end. The intelligent entry point will likely subsume information entry points like search engines. Obtaining information is not an ultimate need; it has merely been artificially defined as a "need". Sometimes we want to complete a task, and sometimes we want to learn something. The AGI entry point should simply help users do the tasks they really want to do, rather than merely help them "obtain information".
Overseas Unicorn: How much money is still needed to achieve your ideal AGI from today?
Yang Zhilin: True AGI still needs hundreds of billions of dollars. But it is not a one-step process; you need to start a cycle where the business itself generates the corresponding resources. The hundred-billion-dollar estimate comes from the fact that scaling up requires at least 2 to 3 more orders of magnitude of compute. Of course, cost optimization will accompany the process.
Overseas Unicorn: What should the business model of an AGI company be? Will it still be seat-based or usage-based?
Yang Zhilin: The value of each task that AGI completes for you is different. It may be similar to an outsourcing service, priced per task. In addition, advertising will definitely play an important role in the task-solving process. With personalized interactions and dialogue, advertising may become much more efficient at monetization than it is now.
Overseas Unicorn: If the training cost of GPT-4.5, Claude-3, and Gemini-2.0 is around $300 million, and the training cost for the next generation of models in 2025 might rise to several billion dollars, then exploring AGI will be a multi-hundred-billion-dollar gamble. Have you considered its ultimate impact on human society?
Yang Zhilin: One relatively certain thing is a tangible improvement in productivity. Now when you use a piece of software, it corresponds to the intelligence of 1,000 programmers, and it is fixed. In the future, the applications we use may correspond to the intelligence of 1 million people, and it is iterating every day.
Looking at the possibilities, everything will change. Training so many languages together will impact culture and values. People's time allocation may also change a lot. Fewer people may work for money, and more time may be spent in the spiritual world. Finally, there may be a huge virtual spiritual space. To achieve the Metaverse, we may actually need to first achieve AI.
Also, I believe that AGI will ultimately be globalized.
Overseas Unicorn: But now, we judge that the leading models are both strong and cheap, which will have a strong Matthew effect, and the final pattern will still be very convergent.
Yang Zhilin: Within a 5-year window, there'll still be an overt "Superstar premium". But after 50 years, I believe that AGI will definitely be made fungible, like electricity today.
[Superstar premium: The No. 1, even if only slightly better than No. 2, is way more popular than No. 2.]
----
<https://mp.weixin.qq.com/s?__biz=Mjc1NjM3MjY2MA==&mid=2691539716&idx=1&sn=d0630dc55f1569f866b9cf485bd283e3>
# Moonshot's Yang Zhilin looks back to a year of LLM startup: March towards the endless uncharted snow-capped mountains
Author: Zhang Xiaojun
Published at Tencent News "Periscope"
2024-02-29 16:07
> If everyone thinks you are normal, and your dream is imaginable by everyone, then you have failed to expand the dreamscape of humanity.
Just a year ago, AI scientist Yang Zhilin made a precise calculation in Silicon Valley. He realized that if he decided to launch a large model startup targeting AGI, he would need to raise over $100 million in capital within the next few months.
However, this was just an entry ticket. A year later, that number has increased 13-fold.
The competition among large model companies is less a scientific competition and more a brutal battle of capital. With investors tightening their belts, you need to outpace your rivals in finding more money, buying more GPUs, and grabbing more talent.
“It requires talent aggregation and capital aggregation,” said Yang Zhilin, founder and CEO of Moonshot AI, a large model company established on 2023-03-01.
Over the past year, domestic large model companies seem to be on a precarious edge of survival. On the surface, they each hold large sums of money. But on one hand, they have to immediately invest the newly raised funds into extremely expensive research to catch up with OpenAI—first catching up to GPT-3.5, and before catching up to GPT-4, Sora arrived; on the other hand, they have to relentlessly search for potential application scenarios to validate that they are a company, and not just a research institute that devours capital. And that’s not enough, the way out for each project, whether it's an IPO or a merger, is even more unclear.
Among Chinese large model founders, Yang Zhilin is the youngest, born in 1992. Industry insiders describe him as a steadfast AGI believer and a founder with technological appeal. His academic and professional experience is largely related to general AI, with over 22,000 paper citations.
Regarding large models, the Chinese tech community abruptly shifted from fervor to coolness in mid-2023, entering a practical utilitarianism mainstream of accelerating real-world applications. This inevitably puts large model CEOs in a fierce tug-of-war between ideals and reality. In the Chinese AI ecosystem where everyone is shouting for PMF (Product/Market Fit) and commercialization, this founder with an AI researcher background is not so anxious.
Moonshot AI is the smallest of the leading domestic large model companies, with 80 employees. Unlike its competitors, it did not pursue safe to-business products or look for applications in niche scenarios such as healthcare or gaming. Instead, it has done one thing and one thing only: a to-consumer product, the intelligent assistant Kimi, which supports a context length of 200,000 Chinese characters. Kimi is also Yang Zhilin's English name.
Yang Zhilin tends to see his company as building a system that combines science, engineering, and business. You can imagine it as an AI experimental platform he is building in the sky above the human world: on one side he runs experiments, and on the other he drops cutting-edge technology into the real world, finding application opportunities through interaction with humans and delivering them to consumers. Ideally, the former burns billions or tens of billions in capital, while the latter earns those funds back hundreds or thousands of times over, which sounds like a dangerous "tightrope walk."
"AI is not about finding some PMF in the next year or two, but about how it will change the world in the next ten to twenty years," he said.
Such abstract and idealistic thinking inevitably makes people worry for him: Can a young AI scientist find room to survive in the pragmatic China?
In February 2024, Moonshot AI bucked the trend and completed a large round of financing, a B round of over $1 billion at a pre-investment valuation of $1.5 billion, with Alibaba leading the investment and Lishi Capital and Xiaohongshu co-investing. After this transaction, Moonshot AI’s post-investment valuation is approximately $2.5 billion—making it the highest-valued unicorn in the Chinese large model arena at this stage. (They declined to respond or comment on this matter.)
During the process of this third round of financing, we talked with Yang Zhilin about his startup journey over the past year. It’s also a microcosm of the one-year sprint in the domestic large model race.
His company is not located in the Beijing Sohu Network Building, a gathering place for large model companies. For a company with a total financing amount of about 9 billion RMB, this office in the Quantum Chip Building [Note: 量子芯座大厦, just a building with a cool name. It's not actually doing quantum computing.] seems simple and old. There isn’t even a company logo at the door; just a white piano standing guard at the entrance.
The meeting room is in a corner, dark due to the small windows, with the hum of the air conditioner sending warm air during winter. In the dim light, Yang Zhilin described his perception over the past year: "It's a bit like driving on the road, with continuous snow-capped mountains in front of you, but you don't know what's inside. You're just moving forward step by step."
The following is the full transcript of the interview with Yang Zhilin. (For ease of reading, the author made some textual optimizations.)
## Stand at the start
> Have to ride the wave.
Tencent News "Periscope": How have you been lately?
Yang Zhilin: Busy, lots of things going on. But still very excited. Standing at the beginning of an industry, there's a huge space for imagination.
Tencent News "Periscope": I just saw a pure white piano at the entrance of your company.
Yang Zhilin: There's also a Pink Floyd album on it. I don't even know who put it there. I suddenly saw it a couple of days ago and haven't had time to ask. (Pink Floyd is the British rock band that released the album Dark Side of the Moon.)
[Note: The Chinese name of Moonshot AI is 月之暗面, which literally means "Dark Side of the Moon"]
Tencent News "Periscope": What were you doing on the day ChatGPT was released in November 2022?
Yang Zhilin: I was preparing for this, looking for people to form a team, and brainstorming some new ideas. I was very excited when I saw ChatGPT. Three to five years ago, even in 2021, it would have been unbelievable. This kind of high-level reasoning ability was very difficult to achieve in the past.
I predicted that many variables would shift in the market: capital on one hand, talent on the other. These are the core factors of production for doing AI. If those variables held, then we might have a real chance of starting a company to do this -- an organization built for AGI, from 0 to 1 -- a great satori. An independent company simply makes more sense, but it's not something you can do at just any moment. ChatGPT stimulated these variables and completed the factors of production. We still had to ride the wave.
Tencent News "Periscope": After deciding to establish an AGI company, what preparations did you make? How did you gather the two production factors of capital and talent?
Yang Zhilin: It was a winding process. ChatGPT's spread took time. Some people knew early, some knew late, some initially doubted, then became shocked, and finally believed. Finding people and raising money was closely tied to timing.
We started to focus on the first round of financing in February 2023. If we had delayed until 2023-04, we basically would have had no chance. But if we had done it in 2022-12 or 2023-01, we also would have had no chance. There was still the pandemic, and people hadn't reacted - so the real window was only one month.
At that time, one night in the US, I made precise calculations. After calculating, I felt that we needed to raise at least $100 million within a few months. Many people in the market had not started fundraising, and many thought you couldn't raise that much money. But it later proved that it was possible, even more than that.
Liquidity began to flow in the talent market. Inspired by ChatGPT, many people in 2023-03 or 2023-04 reached the same realization: this is the only thing worth doing for the next ten years. We had to actively reach the right people at the right time. In the previous two years, the concentration of talent would not have been this high; back then, more people were doing traditional AI or AI-adjacent businesses, not generalist AI.
Tencent News "Periscope": To summarize, February was the window for financing, and March and April were the window for hiring?
Yang Zhilin: Pretty much.
Tencent News "Periscope": Where in the US did you calculate those numbers that night? How exactly did you calculate it?
Yang Zhilin: I stayed in the US for a month or two from the end of 2022 to the beginning of 2023, looking for people to talk to.
Where I was staying, I calculated the FLOPs required, the training cost, the inference cost, and the user base.
Tencent News "Periscope": What kind of mood was Silicon Valley immersed in at that time?
Yang Zhilin: This product began to have many early adopters, concentrated in the tech circle. We ourselves were in that circle, so we felt it more deeply. Big tech companies in Silicon Valley have performance reviews every six months, and many people started using ChatGPT to write them. Some people's writing usually wasn't very professional, but with ChatGPT, everyone's writing suddenly became extremely professional.
Dark undercurrents swelled. Many people were considering where to go for their next job or to start a business. Many friends who talked to us later started their own companies. Also, there was a strong FOMO (Fear of Missing Out) feeling. Everyone was losing sleep every day. Whether it was 12, 1, or 2 am, if you looked, everyone was always up. A little anxious, a little FOMO, but also very excited.
Tencent News "Periscope": How late did you stay up, on that night when you figured out you needed to raise $100 million?
Yang Zhilin: It wasn't too late. The calculation process didn't take too long.
But after calculating, I couldn't tell too many people. If I did, no one would have thought this was possible.
## Learned the techniques from the masters
> Liberating myself from the endless "detailed woodcarving".
Tencent News "Periscope": In the VC industry, people say this about you, "The founder is very smart, has technical appeal, and the team also has many technical stars." So, before discussing large model startups, I'd like to first talk about your academic background.
You did your undergraduate studies in the Computer Science Department at Tsinghua University, and your PhD at Carnegie Mellon's School of Computer Science. Was your focus always on AI?
Yang Zhilin: I was born in 1992, entered Tsinghua in 2011, and have been in this field for over a decade, since my sophomore year. At the beginning, it was more of a broad exploration, looking around, and I did some work with graph learning and multimodal learning. In 2017, I converged on language models – at the time, I felt that language models were a very important problem, and later I felt that it was the only important problem.
Tencent News "Periscope": In 2017, what was the general understanding of language models in the AI industry, and how did it evolve later?
Yang Zhilin: At the time it was a model used for reranking in speech recognition. (Laughs) After you've recognized a segment of speech, there are many candidate results. You use a language model to see which one has the higher probability and output the most likely result. Its applications were very limited.
But you realize that it is a fundamental problem, because you are modeling the probabilities of this world. Language is limited, but it is a projection of the world. And theoretically, if you make the token space (the space of all possible tokens) large enough, you can build a general world model. How everything in the world comes into being and develops can be assigned a probability; all problems can be reduced to estimating probabilities.
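The reranking idea above is easy to make concrete. As a hedged illustration (a toy counting model, nothing like a production LM), the sketch below builds a bigram language model and uses it to rank two candidate sentences by probability:

```python
# A minimal sketch (not Moonshot's actual method): a language model is a
# probability estimator over token sequences. This toy bigram model assigns
# P(sequence) = prod_i P(token_i | token_{i-1}), estimated by counting,
# with add-one smoothing so unseen pairs get nonzero probability.
from collections import Counter

def train_bigram(corpus):
    """Estimate P(next | prev) from a list of token lists."""
    pair_counts, prev_counts = Counter(), Counter()
    vocab = set()
    for tokens in corpus:
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    V = len(vocab)
    def prob(prev, nxt):
        return (pair_counts[(prev, nxt)] + 1) / (prev_counts[prev] + V)
    return prob

def sequence_prob(prob, tokens):
    """Probability of a whole sequence under the bigram model."""
    p = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p *= prob(prev, nxt)
    return p

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
prob = train_bigram(corpus)
# "the cat" appears twice in training, "the dog ran" never, so the model
# ranks the first candidate higher -- exactly the ASR reranking use case.
assert sequence_prob(prob, ["the", "cat", "sat"]) > sequence_prob(prob, ["the", "dog", "ran"])
```

Scaling this same "estimate probabilities of sequences" objective from bigram counts to huge neural networks over a larger token space is the trajectory Yang describes.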
Tencent News "Periscope": Your academic mentors are very famous. Your PhD advisors were Ruslan Salakhutdinov, the head of AI at Apple, and William W. Cohen, the chief scientist of Google AI. They work both in industry and academia.
Yang Zhilin: The combination of industry and academia has been increasing in the past few years, and now the trend is changing: more valuable breakthroughs will occur in industry. This is an inevitable law of development. It starts with exploratory research and gradually shifts to a more mature industrialization process, but that doesn't mean that research is not needed during the industrialization process, it's just that pure research will have a hard time making valuable breakthroughs.
Tencent News "Periscope": What did you learn from these prestigious mentors?
Yang Zhilin: I learned the most at Google, where I interned for a long time. I started working on Transformer-based language models at the end of 2018. The biggest learning was liberating myself from the endless "detailed woodcarving" -- that's crucial.
You should look at what the big directions and the big gradients are. When you have ten paths in front of you, most people think about how to brake if there's a pedestrian in front of them on this path. These are short-term details, but the most important thing is which of these ten paths to choose.
This field had this problem before. For example, on a dataset with only one or two million tokens, you'd work out how to reduce the perplexity (a measure of the model's uncertainty), how to lower the loss (the model's error during training), and how to improve accuracy. You get caught up in endless "detailed woodcarving." Some people have devised all kinds of weird and arcane architectures; that's "detailed woodcarving". After the woodcarving, it might improve on that dataset, but it misses the essence of the problem.
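For readers unfamiliar with the two metrics just mentioned: perplexity and loss are really the same quantity in different units. A minimal sketch (my own illustration, not from the interview):

```python
# Perplexity is the exponential of the average per-token loss (cross-entropy
# in nats), so "lowering the loss" and "reducing perplexity" are the same
# woodcarving target on different scales.
import math

def perplexity(token_nlls):
    """token_nlls: negative log-likelihood (nats) the model assigned to each token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns probability 1/4 to every token has loss ln(4) per token
# and therefore perplexity exactly 4: "as uncertain as a 4-way choice".
nlls = [math.log(4)] * 10
assert abs(perplexity(nlls) - 4.0) < 1e-9
```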
The essence lies in analyzing what this field lacks. What are the First Principles?
Why can scaling law become a First Principle? You just need to find a framework that satisfies two conditions: first, it's general enough; and second, it is scalable. Generality means you cast all problems into this framework, and scalability means that as long as you throw in enough compute, it can improve.
This is the thinking I learned at Google: If it can be explained by something at the deep level, you shouldn't carve wood in the superficial layers. There's an important saying I agree with: if you can solve a problem by scaling, don't solve it with a new algorithm. The biggest value of a new algorithm is how it makes scaling better. When you have liberated yourself from the "detailed woodcarving", you can just see more.
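The "general and scalable" framing can be made concrete with the usual power-law form of a scaling law: loss falling as a power of compute. A toy sketch with synthetic data (all parameters here are made up for illustration):

```python
# Illustrative sketch, not real measurements: a scaling law posits
# L(C) = a * C**(-alpha) + b, loss as a power law in compute C with an
# irreducible floor b. In log-log space, (L - b) vs C is a straight line,
# so a least-squares fit recovers the exponent alpha.
import numpy as np

a_true, alpha_true, b_true = 10.0, 0.3, 0.5   # assumed toy parameters
C = np.logspace(18, 24, 20)                   # compute budgets in FLOPs
L = a_true * C ** (-alpha_true) + b_true      # noiseless synthetic losses

# Fit log(L - b) = log(a) - alpha * log(C), assuming the floor b is known.
slope, intercept = np.polyfit(np.log(C), np.log(L - b_true), 1)
alpha_hat, a_hat = -slope, np.exp(intercept)
assert abs(alpha_hat - alpha_true) < 1e-6
```

The point of the fit is extrapolation: once the line is trusted, "throw in enough compute and it improves" becomes a quantitative prediction rather than a hope.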
Tencent News "Periscope": Was Google also a follower of the scaling law at that time? How did it implement First Principles?
Yang Zhilin: There were already many such ideas, but Google didn't implement it very well. It had this way of thinking, but it couldn't organize it into a real moonshot effort. More often, you'd have 5 people pursuing their version of the First Principles over here, and 5 other people pursuing theirs over there. There was nothing top-down.
Tencent News "Periscope": During your PhD studies, you co-authored papers with Turing Award winners Yann LeCun and Yoshua Bengio, and you were the first author on those papers. How did these academic collaborations come about? - I mean, they are Turing Award winners, not your advisors, so how did you attract them?
Yang Zhilin: The academic world is very open. As long as you have good ideas and meaningful problems, it's all good. What two or n brains can produce is more than what one brain can. This can also be used when developing AGI. An important AI strategy is called "ensemble" (combining the predictions or results of multiple different models or methods to achieve better performance). It's essentially doing the same thing. When you have diverse viewpoints, you can generate many new things. Collaboration is very beneficial.
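The "ensemble" strategy he refers to can be as simple as majority voting. A minimal sketch (the model outputs are hypothetical, purely illustrative):

```python
# Combine several imperfect predictors by majority vote. With reasonably
# independent errors, the ensemble tends to beat any single member --
# the same "diverse viewpoints" logic applied to models instead of people.
from collections import Counter

def majority_vote(predictions):
    """predictions: list of labels from different models for one input."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical models disagree on some inputs; the vote smooths it out.
model_outputs = [
    ["cat", "cat", "dog"],   # model A, per-example predictions
    ["cat", "dog", "dog"],   # model B
    ["cat", "cat", "cat"],   # model C
]
ensembled = [majority_vote(votes) for votes in zip(*model_outputs)]
assert ensembled == ["cat", "cat", "dog"]
```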
Tencent News "Periscope": Was it like, you first had an idea, and then you asked them if they were interested?
Yang Zhilin: Pretty much like that.
Tencent News "Periscope": Which is more difficult, winning over academic leaders or winning over venture capital leaders in fundraising? What are the similarities?
Yang Zhilin: "Win over" is not a good word. The essence behind it is collaboration. Collaboration means win-win, because a win-win is the premise of collaboration. So there's not much difference. You need to provide a unique value to others.
Tencent News "Periscope": How do you make them trust you? What do you think your talent is?
Yang Zhilin: Not really talent, just working hard.
## The old system no longer applies
> AGI needs a new form of organization.
Tencent News "Periscope": You just said, "More valuable breakthroughs will happen in the industry," including startups and the AI labs of the incumbents?
Yang Zhilin: Labs are history. Google Brain used to be the biggest AI lab in the industry, but it was a research organization placed within a big company. This kind of organization can explore new ideas, but it's hard for them to produce great systems – it can produce Transformer, but it can't produce ChatGPT.
The current development approach is evolving into a process where you need to build a huge system, requiring new algorithms, solid engineering, and even a lot of product and commercialization work. It's like in the early 21st century, you couldn't research information retrieval in a lab. You needed to put it into the real world, with a huge system and a product with users, like Google. So, the role of research or education systems will shift, focusing on cultivating talent.
Tencent News "Periscope": How would you describe this new system form? Is OpenAI its prototype?
Yang Zhilin: OpenAI is the most mature organization right now, and it’s still gradually evolving.
Tencent News "Periscope": It can be understood that this is an organization established for humanity's grand scientific goals?
Yang Zhilin: I want to emphasize that it’s not purely science; it’s a combination of science, engineering, and business. It has to be a commercialized organization, a company, not a research institute. But this company is built from the ground up because AGI requires a new organizational structure – first, its production methods are different from those of the internet; second, it will transform from pure research to a combination of research, engineering, product, and business.
The core is that it should be a moonshot AI project, with a lot of top-down planning, but with room for innovation within that plan. Not all the technology is set in stone; there are bottom-up elements within a top-down framework. Such an organization didn't exist before, but the organization needs to adapt itself to the technology, because the technology determines the mode of production. If the organization doesn't match the technology, effective output is impossible. We believe there's a high probability that the organizational form needs to be redesigned.
Tencent News "Periscope": Last year, during the OpenAI coup, Sam Altman had an option to join Microsoft and lead a new Microsoft AI team. What is the fundamental difference between that and being CEO of OpenAI?
Yang Zhilin: It’s very difficult to create a new organization within an old culture.
Tencent News "Periscope": You want to create the “Chinese OpenAI,” is that right to say?
Yang Zhilin: Not very accurate. We don’t want to do “China’s” anything, and we don't necessarily want to do OpenAI.
First, true AGI will definitely be global. There won’t be an AGI company that can only operate in a regional market because of market protection mechanisms. It won't exist in the long term -- globalization, AGI, and a product with a large user base are all necessary conditions ultimately.
Second, regarding whether it's OpenAI – if you look at 2017--2018, OpenAI had a very bad reputation. People in our circle generally considered working at places like Google. After talking with Ilya Sutskever (OpenAI's Chief Scientist), many people thought he was crazy, too arrogant and sure of himself -- people viewed OpenAI as either crazy or a scam. But they invested early, found a non-consensus point, and discovered the only First Principle that works for AI right now: scaling through next token prediction.
I believe that a company even greater than OpenAI will exist. A truly great company can combine techno-idealism [with commercial success] and co-create it with users through a great product. AGI will ultimately be something that is produced in co-work with all users. So, it’s not just about technology; it also requires utilitarianism and pragmatic pursuits. It's about finding the perfect combination between the two.
However, we should learn from OpenAI’s techno-idealism. If everyone thinks you are normal, and your dream is imaginable by everyone, then you have failed to expand the dreamscape of humanity.
## First step to the moon: long context. What's the second?
> Next up are two milestones.
Tencent News "Periscope": Going back to the moment you decided to start your own business, did you immediately launch your first round of financing after returning to China?
Yang Zhilin: It started in 2023-02 in the US, with some remote activity as well. Ultimately, it was primarily domestic investors.
Tencent News "Periscope": The first round raised $100 million?
Yang Zhilin: The first round was less than that, but it later exceeded that number. We completed two rounds in 2023, totaling nearly 2 billion RMB.
Now we are in the third round. We haven't officially announced the financing, so I can't comment on it at this time.
Tencent News "Periscope": Some people say that starting in the second half of 2023, no one is willing to invest in foundational large model companies anymore. Are they wrong?
Yang Zhilin: There are still investors. You can definitely see a shift in sentiment, but it's not like no one is investing. At least, there's still a lot of investment interest in the market right now.
Tencent News "Periscope": Besides capital and people, what key decisions did you make in 2023?
Yang Zhilin: On what to do. This is the advantage of companies like ours – the highest-level decisions are guided by a tech vision.
We're focusing on long context, which requires a judgment about the future. You need to know what's fundamental and what's the next direction. It's still about First Principles, “the process of not doing detailed woodcarving”. If you focus on “detailed woodcarving”, you can only look at what OpenAI has already done, and see how you can replicate it.
You'll find that in the Kimi AI assistant, we're doing lossless compression of long text, which gives a unique product experience. When reading English literature, it can help you understand it very well; if you use Claude or GPT-4 today, they might not do as well. This requires early planning. We've been working on it for more than half a year. That is very different from spotting the long-context trend today, scrambling to assemble a couple of teams, and hacking together a long-context model as fast as you can.
Of course, this is just the beginning of the marathon. There will be more differentiation going forward, and this requires you to anticipate what is a "valid non-consensus."
Tencent News "Periscope": When was the decision to do this made?
Yang Zhilin: 2023-02 or 2023-03. It was decided as soon as the company was established.
Tencent News "Periscope": Why is long text the first step of the moonshot?
Yang Zhilin: It's very fundamental. It's a new kind of computer memory.
The memory of old computers has increased by several orders of magnitude over the past few decades, and the same thing will happen with the new computer. Long context can solve many current problems. For example, current multimodal architectures still need tokenizers, but when you have lossless compression of long context, you don't need them; you can just put the raw data in. Furthermore, it turns the new compute paradigm into a more general foundation.
Old computers can represent everything with 0s and 1s, and everything can be digitized. But today’s new computers can't do that yet. The context isn't large enough; it’s not as general. To become a general world model, you need long context.
Secondly, it enables personalization. The core value of AI is personalized interaction, and its value lies in personalization. AGI will be even more personalized than the previous generation of recommendation engines.
But personalization isn't achieved through finetuning; it's about supporting very long context. All your history with the machine is context, and this context defines the personalization process, and it cannot be replicated. It will be a more direct dialogue, and dialogue creates information.
Tencent News "Periscope": How much room for scalability is there?
Yang Zhilin: It's very large. On one hand, the window itself can be improved, and there's a long way to go; there will be several orders of magnitude.
On the other hand, you can’t just extend the window. You can’t just look at the numbers. Whether it's millions or billions of tokens today is not meaningful. You have to look at the inference capability it can achieve within that window, the faithfulness to the original information, and the instruction following ability. You shouldn’t just pursue single metrics, but rather combine metrics with capabilities.
If both of these dimensions continue to improve, you can do a lot of things. It might be possible to follow an instruction that is tens of thousands of words long. The instruction itself would define many agents, which would be highly personalized.
Tencent News "Periscope": Is the work of developing long text and catching up to GPT-4 technology reusable? Are they the same thing?
Yang Zhilin: I don't think so. It's more of an elevation to a new dimension that GPT-4 doesn't have.
Tencent News "Periscope": Many people say that the work of the domestic large model companies is similar – catching up with GPT-3.5 in 2023 and GPT-4 in 2024. Do you agree with this assessment?
Yang Zhilin: There are definitely key targets for comprehensive capability improvement. This statement is correct to some extent – if you're a latecomer, there's definitely a catching-up process. But it's also a one-sided view. In addition to comprehensive capabilities, there are many areas where you can create unique abilities and achieve state-of-the-art results. Long context is one. DALL-E 3's image generation is completely outclassed by Midjourney V6. So you have to do both.
Tencent News "Periscope": What proportion of time and resources are spent on comprehensive capabilities and new dimensions respectively?
Yang Zhilin: It needs to be combined. New dimensions can’t exist independently of comprehensive capabilities, so it’s hard to give a direct proportion. However, you need to make a substantial investment to do well in new dimensions.
Tencent News "Periscope": Will these new dimensions be incorporated into Kimi for you?
Yang Zhilin: That’s definitely a very important product for us, but we will also have other experiments.
Tencent News "Periscope": What do you think about Li Guangmi’s [founder of Shixiang Technologies 拾象科技, a Chinese VC firm] statement that the technical recognizability of Chinese large model companies is not yet very high?
Yang Zhilin: I think it's okay. We've already created a lot of differentiation. It's because of time. You should be able to see more dimensions this year. Last year, everyone was focused on building the basic framework and getting things running.
Tencent News "Periscope": The first step of the moonshot is long textual length, what is the second step?
Yang Zhilin: There will be two major milestones next. One is a truly unified world model that unifies modalities. A truly scalable and general architecture.
The second is to enable AI to continuously improve without human data input.
Tencent News "Periscope": How long will it take to reach these two milestones?
Yang Zhilin: Two to three years, possibly faster.
Tencent News "Periscope": So, in three years we’ll see a world completely different from today.
Yang Zhilin: Given the current pace of development, yes. Technology is in its nascent, rapidly developing phase.
Tencent News "Periscope": Can you imagine what will happen in three years?
Yang Zhilin: There will be some degree of AGI. Many of the things we're doing today can also be done by AI, and it might even do them better. But the key is how we use it.
Tencent News "Periscope": For you, for Moonshot AI, what is the second step?
Yang Zhilin: We will pursue these two things. Many other questions are derived from these two factors. Today, when we talk about reasoning and agents, they are all products of solving these two problems. We need to do some more "detailed woodcarving," but there are no fundamental blockers.
Tencent News "Periscope": Will you all-in to catch up with GPT-4?
Yang Zhilin: GPT-4 is a necessary step on the path to AGI. The core thing is, you can't just be satisfied with achieving GPT-4's performance. First, you need to think about what the real non-consensus is right now. What is next after GPT-4? What should GPT-5 and GPT-6 look like? Second, you need to see what unique abilities you have within that. This is more important.
Tencent News "Periscope": Other large model companies release their model capabilities and rankings, you don't seem to be doing this?
Yang Zhilin: Goodharting on benchmarks is meaningless. The best benchmark is the user. You should let users vote. Many benchmarks have problems.
Tencent News "Periscope": Is reaching GPT-4 the fastest in the competition among Chinese large model companies your goal? Is there a difference between speed and slowness?
Yang Zhilin: Yes, there definitely is. If you put it on a long enough timeframe, eventually everyone can reach it. But it depends on how long your timeline is. A period of six months or longer is meaningful, and it also depends on what you can do with that period.
Tencent News "Periscope": When do you expect to reach GPT-4?
Yang Zhilin: It should be soon, but we can’t disclose the specific timing.
Tencent News "Periscope": Will you be the fastest?
Yang Zhilin: Our chances change over time, but we do have a chance.
Tencent News "Periscope": After launching Kimi, what is your guiding north star?
Yang Zhilin: Today, it's about making the product better, with more new dimensions. For example, we shouldn’t just involute in the search use case. Search is just a small part of the value of this product. The product should have more room for growing the pie [rather than involuting and competing around the same pie]. Being 10% or 20% better than traditional search engines is not valuable. Only something disruptive is worthy of the title of AGI.
Intelligence with room for growth -- that's the unique value. You have to seize this focal point. Intelligence is always the core incremental value. If AI contributes only 10--20% of your product's core value, then it's invalid.
## I'm not at all anxious about making a safe landing
> User scaling and model scaling must be done concurrently.
Tencent News "Periscope": 2023 mid-year was a huge watershed moment, with the market rapidly turning from mad-bull to ice-cold. What was your perception?
Yang Zhilin: I don’t completely agree with that assessment. We did complete a round of financing in the second half of the year. Moreover, new things continue to emerge. Today’s model capabilities were unimaginable at the end of last year. The user base and revenue of an increasing number of AI companies continue to rise. This constantly proves the value.
Tencent News "Periscope": What were the different feelings for you between the first half and the second half of the year?
Yang Zhilin: There wasn't a huge change. Variables certainly exist, but again, First Principles -- how to provide users with a good product. Ultimately, we must satisfy user needs, not win a competition. We are not a company built for the sake of competing.
Tencent News "Periscope": The industry believes that a significant difference between the first and second half of 2023 is the shift in focus. The first half focused more on AGI, while the second half started talking about how to implement it and how to commercialize it. Did you do this?
Yang Zhilin: Of course, I want to do AGI. This is the only meaningful thing to do in the next 10 years. But it's not like we aren't doing applications. Or rather, we shouldn't define it as an "application".
"Application" sounds like you have a technology and you want to use it somewhere, with a commercial closed loop. But "application" is inaccurate. It's complementary to AGI. It's a means to achieve AGI and also the purpose of achieving AGI. "Application" sounds more like a goal: I want to make it useful. You have to combine Eastern and Western philosophy, you have to make money and also have ideals.
Today, users have helped us discover many use cases that we had never imagined. They use it to filter resumes, which we didn't think of when designing the product, but it naturally works. User input, in turn, makes the model better. Why is Midjourney so effective? It did scaling on the user end -- user scaling and model scaling need to happen simultaneously. Conversely, if you only focus on applications and not on model capability iteration or AGI, your contribution will be limited.
Tencent News "Periscope": Zhu Xiaohu (Managing Partner of GSR Ventures) only invests in applications of large models. He has a point of view: the most difficult thing is AI-Generated Content's PMF – if ten people can't find a PMF, investing in a hundred people won't find it either. It has nothing to do with the number of people or cost, don't throw money at it. He says, "Training LLaMA for two or three months can at least achieve the top-30 level of humans, and immediately replace humans." What do you think of his view?
Yang Zhilin: AI for me is not about finding some PMF in the next year or two, but how to change the world in the next ten to twenty years – these are two different ways of thinking.
We are firm long-termists. When you achieve AGI or stronger intelligence, everything today will be rewritten. PMF is certainly important, but if you rush to find PMF, you are very likely to be subjected to a "dimensionality reduction attack". Dimensionality reduction attacks have happened too many times. Many people used to do customer service, dialogue systems, and slot filling [an NLP task that extracts different parameters of the user's query from a running dialog] [and then GPT solved it all with pure autoregression, making their work useless.] Some of these companies were of a decent scale. But they were all hit by dimensionality reduction attacks, and it was very tough.
[Note: 降维打击 ("dimensionality reduction attack") is a term from The Three-Body Problem: Death's End. A weapon collapses 3D space into an expanding 2D plane, destroying the solar system; all of humanity's defenses are simply bypassed. It is used metaphorically in general Chinese to mean a solution that solves the problem far more effectively, without following the preconceived rules.]
It's not that it's invalid. Suppose you find a scenario today, using current technical capabilities, where the incremental value from 0 to 1 is huge, but the space from 1 to n is not so large. This kind of scenario is OK. Midjourney is like that, and so is copywriting generation -- relatively simple tasks where the effect from 0 to 1 is very obvious. These are opportunities that only require a focus on applications. However, the biggest opportunities are not here. If your goal is commercialization, you can't think separately from AGI. If you're like "I am only doing applications for now", then, okay, you might be crushed in a year.
Tencent News "Periscope": You could secretly upgrade the underlying base model though.
Yang Zhilin: But you can't make it bigger than them. Technology is the only new variable in this era; other variables haven't changed. Returning to First Principles, AGI is the core of everything. Based on this, we deduce that super-apps must have the strongest technical capacity.
Tencent News "Periscope": Can you use open-source models? (The latest news is that Google announced the open-sourcing of the Gemma model.)
Yang Zhilin: Open source is behind closed source, that's also a fact.
Tencent News "Periscope": Could it be just temporarily behind?
Yang Zhilin: It doesn't look like it at the moment.
Tencent News "Periscope": Why can't open source catch up to closed source?
Yang Zhilin: Because the way open source is developed is different from the past. Previously, everyone could contribute to open source. Now, open source itself is still centralized. Many contributions to open source may not have been validated with compute. Closed source will have talent and capital gathering, and in the end, closed source will definitely be better. It's market consolidation.
If I have a leading model today, open-sourcing it is most likely unreasonable. Instead, laggards might do this, or open-source small models to disrupt the market because it's already worthless even if they don't open source it.
Tencent News "Periscope": How do you deal with the anxiety in China? They say that if large model companies cannot quickly create a safe landing by making products that can realize investors' expectations, it will be difficult to raise the next round of funding.
Yang Zhilin: You need a balance between the long-term and the short-term. Having absolutely no users and no revenue is definitely not okay.
We can see that from GPT-3.5 to GPT-4, many applications have been unlocked; from GPT-4 to GPT-4.5 and then to GPT-5, it's highly likely that more, even exponential, applications will continue to be unlocked. The so-called "Moore's Law of use cases" is that the number of use cases will increase exponentially with time. We need to improve model capacity while finding more use cases. We need this balance.
It’s a spiral. It depends on how much you allocate to the short term and how much to the long term. You need to pursue the long term while being able to survive. But the long term must exist, otherwise, you will miss the entire historical stage. It's too early to draw conclusions today.
Tencent News "Periscope": Do you agree with Wang Huiwen (co-founder of Meituan and founder of Guangnian Zhiwai) who proposed "dual-wheel drive" corporate strategy?
[Note: Wang Huiwen is the founder of Meituan, a food takeout delivery system. After ChatGPT, he returned from retirement to start an AI company "Light Years Beyond" (光年之外) with $50 million of starting capital, and then sold it back to Meituan at no cost. Most invested money was returned to the investors, Meituan took on much of the debt of Light Years Beyond, while Wang's $50 million was simply lost.]
Yang Zhilin: That's a good question. To some extent, it's true. But how you actually do it makes a big difference. Can you really do some "non-consensus but possibly true" things?
Tencent News "Periscope": I understand that they mean by "dual-wheel drive" that one needs to quickly find some new use cases, otherwise, it's hard to know how the technology can make a safe landing.
Yang Zhilin: It’s still the difference between model scaling and user scaling.
Tencent News "Periscope": In China, besides you, who else has a model scaling mindset?
Yang Zhilin: I'm not in a position to comment on that.
Tencent News "Periscope": Most people may have a user scaling mindset. Or, can we say that this is the difference between the academic camp and the commercial implementation camp?
Yang Zhilin: We are not academics, academics definitely won't work.
Tencent News "Periscope": Many large model companies will focus on to-business to make a safe landing (after all, to-business is more certain). Do you do that?
Yang Zhilin: We don't. We decided to do to-consumer from day one.
It depends on what you want. If you know something is not what you want, you won't FOMO. Because even if you get it, it's nothing.
Tencent News "Periscope": Have you been anxious in the past year?
Yang Zhilin: More excitement and thrill. Because I've been thinking about this for a very long time. We may be the earliest people to explore the dark side of the moon. Today you realize that you are actually building a rocket, and every day you are discussing what fuel to add to the rocket to make it run faster, and how to prevent it from exploding.
Tencent News "Periscope": Summarizing the "non-consensus decisions with a probability" you have made, besides to-consumer and long text length, are there any others?
Yang Zhilin: More are in the process, and we hope to ship those to everyone as soon as possible.
Tencent News "Periscope": The previous generation of Chinese entrepreneurs reaped benefits from applications and use cases, so they pay more attention to products, users, and the data flywheel. Can the new generation of AI entrepreneurs, represented by you, represent the new future?
Yang Zhilin: We also pay close attention to users, users are our ultimate goal, but it is also a co-creation process. The biggest difference is that this time it will be more technology-driven – it’s still the issue of the horse-drawn carriage vs the car – we are still in the middle of that leap from the horse-drawn carriage to the car, and we should try our best to think about how to provide users with a car.
Tencent News "Periscope": Do you feel lonely?
Yang Zhilin: Hahaha...that's an interesting question. I think it's okay, because we still have dozens, or 100 people fighting together.
## We haven't even caught up with GPT-4 and now there's Sora
> It's like GPT-3.5 for video generation. It's a leap in capability.
Tencent News "Periscope": The sudden appearance of Sora this year, how much was within your expectations, and how much was beyond your expectations?
Yang Zhilin: For genAI to achieve this, that was expected, but the timing was unexpected; it came earlier than previously estimated. This also reflects how fast AI is developing now, and much of the scaling overhang has not yet been fully digested.
Tencent News "Periscope": Last year, the industry judged that in 2024, large models would definitely involute in multimodal narratives, and the video generation effect would improve rapidly, just like text-to-image in 2023. Is Sora's technical capability exceeding, meeting, or falling short of your expectations?
Yang Zhilin: It solved many previously difficult problems. For example, it can maintain the consistency of generation within a relatively long time window, which is a key point and a huge leap.
Tencent News "Periscope": What is its significance for the global industrial landscape? What new narratives will emerge for large models in 2024?
Yang Zhilin: First, the short-term application value; it can further improve efficiency in production links. Of course, we also look forward to more extensions based on current capabilities. Second, its combination with other modalities. It is essentially modeling the world, and with this knowledge, it is a very good supplement to existing text. On this basis, there are many spaces and opportunities, whether in agents or in the connection with the physical world.
Tencent News "Periscope": What is your overall judgment of Sora?
Yang Zhilin: We were also planning a similar direction and have been working on it for some time. In terms of direction, there was no big surprise, but more in the technical details.
Tencent News "Periscope": What are the technical details that should be learned?
Yang Zhilin: Many things OpenAI has not fully explained. It has given a general idea, but some key details are missing that need to be judged from its effect, from existing information, and from our previous experiments. At least for us, we will add more data points and have more data input during development.
Tencent News "Periscope": Compared with text generation, what were the main bottlenecks for video generation before? What solutions can we see that OpenAI has found this time?
Yang Zhilin: The main bottleneck is still the data; how do you scale up to fit this data? It hasn't been verified before. Especially when your actions are relatively complex, and the generation effect is photorealistic. This time it solved the problem of scaling up under such conditions.
Some remain to be solved. For example, it needs a unified architecture. The DiT [Diffusion Transformer] architecture is still not very general. In simply modeling the marginal probability of visual signals, it can do very well, but how do you generalize it into a general-purpose new computer? It still needs a more unified architecture, and there is still room for that.
Tencent News "Periscope": Have you read the Sora report published by OpenAI - "Video generation models as world simulators"? What key points are worth highlighting?
Yang Zhilin: I have read it. Given the current state of competition, they definitely didn't write down the most important points. But it is still worth studying, since it contains information that was dearly paid for. Previously you would have had to spend money on many experiments to learn these things; now some of that knowledge is available, and you can get a general idea without running the experiments yourself.
Tencent News "Periscope": What key signals did you extract from it?
Yang Zhilin: That this thing is somewhat scalable. In addition, it also gives relatively specific details on how to implement the architecture. But it's also possible that different architectures do not necessarily have such a fundamental difference in this matter.
Tencent News "Periscope": Do you agree with their statement that "scaling video generation models is a promising path towards building general purpose simulators of the physical world"?
Yang Zhilin: I very much agree. These two things are optimizing the same objective function. There is no big doubt.
Tencent News "Periscope": What do you think about Yann LeCun jumping up again to go against GenAI? His quote: "Modeling the world for action by generating pixel is as wasteful and doomed to failure as the largely-abandoned idea of "analysis by synthesis". Generation happens to work for text because text is discrete with a finite number of symbols. Dealing with uncertainty in the prediction is easy in such settings. Dealing with prediction uncertainty in high-dimension continuous sensory inputs is simply intractable." [Source: [His Twitter on 2024-02-18](https://x.com/ylecun/status/1759486703696318935)]
Yang Zhilin: I now think that by modeling the marginal probability of videos, you are essentially doing lossless compression, which is not fundamentally different from the next token predictions of language models. As long as you compress well enough, you can explain everything in the world that can be explained.
But at the same time, there are important things that have not yet been done: How does it combine with existing, already compressed capabilities?
You can think of it as two different kinds of compression. One is compressing the original world, which video models are doing. The other is compressing the behavior generated by humans, because the behavior generated by humans has passed through the human brain, which is the only thing in the world that can produce intelligence. You can think of video models as doing the first, and text models as doing the second. Of course, video models also contain the second to some extent; some videos created by people contain the creator's intelligence.
Ultimately, it may be a mix, needing to learn from different perspectives through these two ways, but in the end, it will be helpful to the growth of intelligence.
Therefore, generation may not be the purpose; it is just compressing this function. If you compress well enough, the generation effect will be very good. Conversely, if the model cannot generate, can it possibly compress well? Doubtful. It is possible that very good generation is a necessary condition for very good compression.
[Note: "marginal probability of video" refers to the idea of latent generative models of image. Consider for example video generation. You can do it by latent modelling in a latent space $y$, which then gets mapped to a conditional probability $Pr(x|y)$ where $x$ stands for the videos. The marginal probability then is what you get by marginalizing over $x$ by $Pr(x) = \int Pr(x|y) Pr(y) dy$. Yann LeCun's criticism is that this doesn't work. ]
Tencent News "Periscope": Compared to ChatGPT last year, Sora represents a different milestone. Which is more significant?
Yang Zhilin: Both are very important. It's now a bit like GPT-3.5 for video generation; it's a leap in capability. Its model is also relatively small, and it is foreseeable that there will be larger models, which will certainly improve capacity.
Tencent News "Periscope": Some people also commented that for doing multimodal, Google Gemini's breakthrough is more important.
Yang Zhilin: Gemini follows the GPT-4V route, incorporating this understanding as well. They are all important, but ultimately, it is necessary to put these things in the same model, which has not yet been solved.
Tencent News "Periscope": Why is it so difficult to put them in the same model?
Yang Zhilin: Nobody knows how to do it yet. There is no validated architecture.
Tencent News "Periscope": What will be produced by Sora + GPT?
Yang Zhilin: Sora can be used in video production processes immediately, but if combined with a language model, it may be possible to break through the walls separating the digital and the physical world. In addition, you can complete tasks more end-to-end because your modeling of the world is better than before. It can even be used to improve your understanding of multimodal inputs, so you can do more mode-switching.
In summary, your understanding of the world is better; you can do more end-to-end tasks in the digital world, and even build a bridge to connect with the physical world, to complete some tasks in the physical world. This is the starting point. For example, autonomous driving or some household chores are theoretically all concepts of connecting to the physical world.
So, the breakthrough in the digital world is certain, but it also has the potential to lead to the physical world.
Tencent News "Periscope": What does Sora mean for domestic large model companies? What are the countermeasures?
Yang Zhilin: There is no difference. This is a determined direction.
Tencent News "Periscope": Domestic large models have not caught up with GPT-4 yet, and Sora has come out. How do you view this? The two worlds seem to be getting further apart. Do you feel anxious?
Yang Zhilin: This is just an objective fact. But the actual gap may still be shrinking, which is the law of technological development.
Tencent News "Periscope": What do you mean? That the technology curve is a sigmoid. Steep at first, then it winds down?
Yang Zhilin: Yes. I wasn't very surprised; OpenAI has always been working on the next-generation models. But objectively, the gap will continue to exist for some time. Even the gap between different companies within China will continue for some time. It is now still a period of technological explosion.
But in another two or three years, it is possible that the top companies in China can do more infrastructural work in this field, including the technical infrastructure, talent reserves, and the accumulation of organizational culture. After laying such refined groundwork, there is a greater possibility of leading in some aspects, but it requires some patience.
Tencent News "Periscope": Is it possible that the US and China will eventually form completely different AI technology ecosystems?
Yang Zhilin: The ecosystems may be different, if you look from a product and commercialization perspective. But from a technical perspective, the general-purpose capabilities will not be completely different technical routes. The basic general-purpose capabilities will definitely be similar. But because the AGI space is large, it is more likely that there will be differentiation on the basis of general-purpose capabilities.
Tencent News "Periscope": There has always been a debate in Silicon Valley: "one model rules all" versus "many specialized (smaller) models." What do you think?
Yang Zhilin: My view is the first one.
Tencent News "Periscope": On this point, will China and the United States be significantly different?
Yang Zhilin: I don't think so in the end.
## I accept the probability of failure.
> It has already changed my life.
Tencent News "Periscope": Large model startups are a rather peculiar phenomenon in the context of China. You've raised so much money, but it seems a large chunk of it is going towards scientific experiments. How do you convince investors to keep funding you under these circumstances?
Yang Zhilin: It's no different than in the US. The money we've received today isn't particularly large. Therefore, we need to learn even more from OpenAI.
Tencent News "Periscope": I'd like to know how much more money is needed to achieve GPT-4? How much to achieve Sora?
Yang Zhilin: Neither GPT-4 nor Sora requires that much. The money we're raising now is more for preparing ourselves for the next generation, or even the generation after that, of models, and for cutting-edge exploration.
Tencent News "Periscope": Although Chinese large model startups have received money from the incumbent megacorps, these megacorps are also training their own models. How do you view the relationship between large model startups and the incumbents?
Yang Zhilin: There's both competition and collaboration here. The primary goals of the incumbents and the startups are different. If you look at the primary goals of each large company today, they are different from the primary goals of AGI companies. The primary goal will influence actions and results, and ultimately lead to different relationships within the ecosystem.
Tencent News "Periscope": Why do megacorps invest a little money in multiple large model companies instead of placing a heavy bet on one company?
Yang Zhilin: That's a question of stage. There will be more consolidation of resources, and fewer companies, in the future stages.
Tencent News "Periscope": Some say the end game for large model companies is to be acquired by the megacorps. Do you agree?
Yang Zhilin: I don't think that's certain, but they are likely to have very deep collaborations.
Tencent News "Periscope": For example, how could they collaborate?
Yang Zhilin: OpenAI and Microsoft are a typical collaboration model. There's much to learn from that, and also things that can be optimized.
Tencent News "Periscope": Over the past year, where do you see the setbacks in your startup journey?
Yang Zhilin: There are many external variables - capital, talent, GPUs, product, R&D, technology. There have been highs and lows, and difficulties to overcome. For example, GPUs.
There was a lot of back and forth. There were periods where it was very tight, and periods where supply improved. The most extreme was when the price changed every day. One day a machine cost 260, then 340 the next day, and then it dropped back down. It was a dynamically changing process. You have to pay close attention to this. Since the price kept changing, our strategy also had to change -- what channels to use, whether to buy or rent. There were many different options.
Tencent News "Periscope": What influences this dynamic factor?
Yang Zhilin: There are geopolitical reasons, chip production itself comes in batches [of varying quality], and it's also affected by changes in market sentiment. We observed many companies starting to sell off GPUs because they realized they didn't necessarily need to train that model. Market sentiment and everyone's decisions change, and the supply-demand relationship changes accordingly. The good news is that the overall market supply has improved a lot recently. My personal judgment is that at least for the next one to two years, GPUs won't be a major bottleneck.
Tencent News "Periscope": You seem to be constantly thinking about organization. How are you approaching team building?
Yang Zhilin: Our approach to hiring has changed somewhat. There's a very limited amount of AGI talent in the world, and few people with experience. Our earliest talent profile focused on finding geniuses with the right skills. That proved very successful. Those with the ability to operate on models, and direct experience training large-scale models, could make progress quickly. Including the release of Kimi, both capital efficiency and organizational efficiency were actually very high.
Tencent News "Periscope": How much money was spent on that?
Yang Zhilin: A pretty small number. Compared to many other expenses, it was a case of doing a lot with a little. For a long time, we were at 30-40 people. Now we're at 80. We strive for talent density.
The talent profile changed later. In the earliest days, we hired geniuses because we thought their upper limit was high, and the company's upper limit was determined by the upper limit of the people. But later we filled out more dimensions -- people on the product operations side, leader-types, people who can take things to the extreme. Now we have a more well-rounded, resilient, and combat-ready team.
Tencent News "Periscope": In your year of large model startup experience in China, how do you evaluate the current stage achievements?
Yang Zhilin: We built a rocket prototype, and now we're doing test flights. We've accumulated a team, figured out some of the fuel recipe, and more or less can see the beginning of a PMF.
You could say that we took the first step towards a moon landing.
Tencent News "Periscope": How do you view Yann LeCun's statement that he's not optimistic about the current technical route, believing that self-supervised language models cannot learn true knowledge of the real world, and that as the model size increases, the probability of errors, or machine hallucination, will increase? He proposed the concept of a "world model".
Yang Zhilin: There's no fundamental bottleneck. When the token space is large enough, it becomes a new type of computer that can solve general problems. Then, it's a general world model.
Why did he say that? An important reason is that everyone can see the current limitations. But the solutions do not necessarily require a completely new framework. The only thing that has ever worked in AI is next-token prediction + the scaling law. As long as the token space is complete enough, it can all be done. Of course, the problems he points out exist today, but they can be solved if you make the token space general enough.
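[Note: "next token prediction + scaling law" can be pictured with a toy model. The sketch below is purely illustrative -- a character-level bigram counter, nothing Moonshot has described; a real LLM replaces the count table with a transformer and the characters with tokens drawn from whatever space you choose:]

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count next-token frequencies: the simplest possible next-token predictor."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent next token and its estimated probability."""
    following = counts[token]
    total = sum(following.values())
    best = max(following, key=following.get)
    return best, following[best] / total

corpus = "to be or not to be"
model = train_bigram(corpus)
print(predict_next(model, "t"))  # → ('o', 0.666...)
```

[The loop never cares what the symbols are -- bytes, pixels, or action codes would train the same way, which is the sense in which a "complete enough" token space makes this a general-purpose predictor.]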
Tencent News "Periscope": He's exaggerating the limitations.
Yang Zhilin: I think so. But there's nothing wrong with the underlying first principles; it's just that some small technical problems haven't been solved yet.
Tencent News "Periscope": How do you view Geoffrey Hinton, the father of deep learning, repeatedly calling for AI Safety?
Yang Zhilin: His concern for safety actually indicates that he has great confidence in the growth of technological capacity in the future. He's the opposite of Yann LeCun.
Tencent News "Periscope": How do you solve the hallucination problem?
Yang Zhilin: Still a scaling law, but it scales something different.
Tencent News "Periscope": What is the probability that the scaling law will ultimately not work?
Yang Zhilin: About 0.
Tencent News "Periscope": How do you view your CMU alumnus [Lu Qi [陆奇]](https://en.wikipedia.org/wiki/Lu_Qi_(computer_scientist))'s point of view: that OpenAI will definitely be bigger than Google in the future, the only question is whether it's one, five or ten times bigger?
Yang Zhilin: The most successful AGI company of the future will definitely be bigger than all of today's companies; there's no question about that. It may ultimately be two or three times bigger. It may not be OpenAI -- it could be another company -- but there will definitely be such a company.
Tencent News "Periscope": If you happen to become the CEO of this AI empire, what would you do to protect humanity?
Yang Zhilin: Thinking about this now lacks some preconditions. But we are definitely willing to cooperate with different roles in society and to improve, including building more safety measures into our models.
Tencent News "Periscope": What are your goals for 2024?
Yang Zhilin: First is technical breakthrough. We should be able to make models that are much better than those of 2023. Second is users and product. I hope we can have more users and stickiness at scale.
Tencent News "Periscope": What are your predictions for the global large model industry in 2024?
Yang Zhilin: This year even more capabilities will emerge, but the landscape won't change too much from today; the top few will still be leading. In terms of capability, there should be some major breakthroughs in the second half of this year, and many will come from OpenAI -- it definitely has another generation of models in hand, perhaps 4.5 or 5, which seems very likely. Video generation models will definitely continue to scale.
Tencent News "Periscope": What are your predictions for the domestic large model industry in 2024?
Yang Zhilin: First, we'll see new and unique capabilities emerge. You will see domestic models, because of the previous investments and with suitable teams, creating world-leading capabilities in some dimensions. Second, there will be more products with larger user bases, which is highly probable. Third, there will be further consolidation and a differentiation in the choice of paths.
Tencent News "Periscope": What is the one thing you fear most about starting a company?
Yang Zhilin: Not much, you just have to charge ahead without fear.
Tencent News "Periscope": What would you like to say to your peers in the industry?
Yang Zhilin: *Ganbatte*.
Tencent News "Periscope": What is a question about the large model industry that you don't know the answer to but would most like to know?
Yang Zhilin: I don't know what the upper limit of AGI is like, what kind of company it will produce, and what kind of products that company can generate. This is what I most want to know now.
Tencent News "Periscope": If AGI continues to develop like this, what is the one thing you would least like to see?
Yang Zhilin: I'm rather optimistic about this. I believe it can help human civilization develop to the next stage.
Tencent News "Periscope": Has anyone ever said you were too much of a techno-idealist?
Yang Zhilin: We're also very down-to-earth, we've actually done a lot of things, not just talking about them.
Tencent News "Periscope": If the money you have today is the last bit of funding you will ever get, how would you spend it?
Yang Zhilin: I hope that will never happen, because we will need a lot of money in the future.
Tencent News "Periscope": If you don't achieve something, would you consider yourself a failure?
Yang Zhilin: It's not that big of a deal. I accept that there is a chance of failure.
This whole endeavor has completely changed my life, and I'm very grateful for that.
----
<https://k.sina.cn/article_1642720480_61e9ece00270195by.html>
# Conversation with Moonshot's founder Yang Zhilin: The most important ability for future AI is thinking and interaction
Released on the Aifaner official account, 2024-11-19, Guangdong.
Moonshot AI and Yang Zhilin are probably the most closely watched domestic large model company and founder recently. The arbitration dispute and the news of overseas product contraction have put them in the eye of the storm.
[Note: 暗涌Waves reported on 2024-11-11 that CEO Yang Zhilin and CTO Zhang Yutao had been taken to arbitration in Hong Kong by investors of Circular Intelligence, the company where the two worked before starting Moonshot AI. According to insiders, the applicants came from Circular Intelligence and five of its seven investors: Jinshajiang Venture Capital, Jingya Capital, Boyu Capital, Huashan Capital and Wanwu Capital.
It was speculated that these early backers regretted not having invested more during the angel round, as other investors did, and that now that Moonshot AI had grown beyond their expectations, they wanted more equity.]
[Note: Multiple Chinese large model companies tried launching overseas services in 2024. MiniMax's Talkie reached 20.62 million MAU in 2024-11, with a growth rate exceeding Character.AI's. Moonshot AI's Ohai was likewise a competitor to Character.AI, while Noisee was a video generator. However, these overseas products were not growing as expected, so Moonshot AI stopped investing in them to focus on Kimi. Then, in 2024-11, several people in the company who had worked on these products left to start their own ventures.]
More important, of course, is that Moonshot AI's Kimi is the leading AI application product in China: Kimi's monthly active users (MAU) currently exceed 36 million.
On the one-year anniversary of the full release of Kimi Chat, Kimi officially launched its new generation mathematical reasoning model, k0-math, benchmarked against OpenAI's o1 series.
Moonshot AI founder, Yang Zhilin, believes that the most suitable scenario for AI to train its thinking ability is mathematics. When introducing k0-math to media outlets like APPSO, he quoted a passage from Galileo:
"If the Universe is understood as a grand book, then it is written in the language of mathematics, which is the language for expressing the Universe."
Benchmark tests show that Kimi k0-math's mathematical ability is comparable to the globally leading OpenAI o1 series' two publicly available models: o1-mini and o1-preview.
In four mathematical benchmarks -- the middle school entrance exam, the college entrance exam, the postgraduate entrance exam, and MATH (which includes introductory competition questions) -- the first generation of k0-math outperformed the o1-mini and o1-preview models.
In the two more challenging competition-level math question banks, OMNI-MATH and AIME, the first generation of k0-math achieved 90% and 83% of o1-mini's best performance, respectively.
Yang Zhilin demonstrated some of the processes of k0-math solving math problems. For example, when faced with a difficult competition question, it can go through a lot of attempts. It might try eight or nine different approaches and ultimately find that none of them truly lead to the final solution.
However, after multiple attempts, it might suddenly discover that it can combine the previous two or three different ideas to arrive at a correct answer.
To give the AI deep thinking ability, k0-math does not rely on many pre-designed prompt templates. Yang Zhilin hopes the AI will develop its own way of thinking during learning, different for each question, and that it will go through a lot of reflection and validation.
However, k0-math still has some limitations. For example, when asked a simple question like 1+1, it overthinks. K0-math's answer is roughly like this:
It says that the question seems simple, but you can't be too careful -- there might be a trick -- so it starts to analyze it, even does a visualization, and draws an analogy to two apples.
That's still not enough; it needs to check again. If it holds for apples, then switch to hours: one hour plus one hour equals two hours. It confirms many times and finally says, okay, it can confirm that 1+1=2.
The actual effect still needs to be tested in use. Yang Zhilin revealed that the k0-math model and the more powerful Kimi Exploration Edition will be launched in batches on the Kimi web version and the Kimi Smart Assistant APP in the coming weeks, to help everyone solve more challenging math and search/research tasks.
We also hope to achieve stronger reasoning capabilities, because I think the most important ability for the future development of AI products and AI technology is deeper reasoning -- upgrading today's short-chain, simple question-and-answer into longer-chain combinatorial operations.
APPSO, Aifaner's AI-focused media outlet, was invited to attend this Kimi communication meeting and put some questions to Yang Zhilin about the company and its products. The following is a partial transcript of the exchange:
Q: What do you think about AI startups being acquired, and talent flowing back to large companies? Have you experienced any talent loss recently?
Yang Zhilin: We haven't experienced any talent loss.
We haven't encountered this issue, but it's possible that other companies have. Because the industry's development has entered a new phase, it has gone from many companies to fewer now. From now on, everyone's work will differentiate. I think this is an inevitable pattern.
Actually, we've actively chosen to reduce our business scope. Among the large model startups, we have always maintained the smallest number of people and the highest GPU-to-people ratio. I think this is crucial.
We don't want to expand our team too much. Overexpansion kills innovation. If you want to keep your team at a certain size, the best way is to reduce your business scope.
Initially, we did try to do several products simultaneously. This might have been effective for a certain period. But later, we realized it's important to focus. Doing one product well to the absolute limits is the most important thing.
Because cutting business lines is essentially also controlling the number of people; you don't want the number of people to grow too rapidly. For example, if we were doing three businesses at the same time right now, we would essentially do what a megacorp incumbent is doing, and we wouldn't have any advantage there.
Q: When did the idea of focusing on Kimi (reducing product lines) emerge? What factors made you reconsider and re-strategize?
Yang Zhilin: Around February or March of this year. One was based on our judgment of the US market, and the other was based on our own observations; these were the main two points. Also, in terms of doing the work itself, we really had to do subtraction, not crazy addition.
Q: What do you think is the most core task now?
Yang Zhilin: The most core task is to improve user retention -- to use user retention as a KPI -- because I think it is basically positively correlated with how advanced your technology is. So this is the most important thing for us currently, and I think there is still a lot of room for improvement.
Q: What level of retention would you be satisfied with?
Yang Zhilin: Never ever.
Q: After o1 was released, some people felt that the deep reasoning, including the math model you mentioned today, was relatively far from the average user. How do you see the relationship between this function and users?
Yang Zhilin: Actually, it's not far. For math, I see two kinds of value. First, it has great value in education products today, and it plays a very important role in our overall traffic.
Second, it is a technical iteration and verification, and we can bring this technology into more scenarios -- for example, as just mentioned, the Exploration Edition uses it heavily for search. So I think it carries these two levels of meaning.
Q: It's said that Sora will be released soon. Why haven't you been working on multimodal (多模态) capabilities?
Yang Zhilin: We are working on it too. Several of our multimodal capabilities are in internal testing. I think the most important capabilities for AI next are thinking and interaction.
The importance of thinking is far greater than that of interaction. It's not that interaction is unimportant, but I think reasoning capacity determines the upper limit. Of course interaction is necessary -- vision, for example: without vision capabilities, you cannot interact. But the two are different. Look at the task you want done and ask how hard it is to label: does labeling require a PhD, or can anyone do it? Whichever is harder to find people for -- that defines the upper limit of AI.
So I think multimodal is definitely necessary, but I think reasoning determines its upper limit.
Q: How do you view the competition between Kimi and Doubao [豆包, a chatbot made by ByteDance]?
Yang Zhilin: I prefer to focus on how we can truly provide value to users. I don't want us to focus too much on competition itself, because competition itself does not generate value.
How to provide better technology and products, this is our core issue now. We will focus more on how to improve the model's thinking and reasoning capabilities, and use this to bring greater value to users. We simply need to do the right things, and not artificially focus on things that would merely differentiate us.
I believe that no matter who achieves AGI, as long as it's achieved, it'd be a great outcome.
Q: When will an AI super app emerge?
Yang Zhilin: ChatGPT has already exceeded 500 million MAU. Is it a super app? At least half of one, right? 500 million people use it every month, so this question has mostly been answered.
Q: How do you view the recent discussions about large models encountering pre-training bottlenecks? Has the scaling law hit a wall?
Yang Zhilin: I think pre-training still has room -- maybe half a generation to one full generation of models -- and next year that room will be opened up for the taking; next year's leading models will push pre-training to an extreme stage. Even today's best models still leave that room unexploited.
But we judge that the next key things will be in reinforcement learning. The compute paradigm will change, but it is still scaling. It’s not that you don’t need to scale, just that you will scale through different methods. This is our judgment.
You ask whether the scaling law will hit a ceiling. I am relatively optimistic about this. The core point is that originally you used static datasets, which is a relatively simple and crude way of using data. Now, in many reinforcement learning setups, people are involved in the process, but humans cannot label that much data -- it is impossible to label a chain of thought (CoT) for every question -- so you are actually using the AI itself to add leverage to the human element.
For example, if you label 100 pieces of data, you can produce a very large effect, because the rest is AI thinking by itself. I think more of this approach will be used to solve this. I think it is highly likely that it can be done in this way, so I think its upper limit is very high.
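[Note: the leverage described here is essentially rejection sampling over model-generated reasoning chains. A minimal sketch, assuming a hypothetical `propose_chain` sampler in place of a real model; this is an illustration, not Moonshot's actual pipeline:]

```python
import random

# The scarce human contribution: a handful of verified question/answer labels.
verified = {"12*7": 84, "9+16": 25, "30-4": 26}

def propose_chain(question, rng):
    """Hypothetical stand-in for a model sampling a chain of thought.
    It reaches the right answer only some of the time, like a real sampler."""
    truth = eval(question)  # acceptable here: questions are fixed arithmetic strings
    answer = truth if rng.random() < 0.3 else rng.randint(0, 100)
    return {"question": question, "chain": f"compute {question} step by step", "answer": answer}

def self_generate(verified, samples_per_question=200, seed=0):
    """Rejection sampling: keep only chains whose final answer matches the
    human label. The kept chains become new training data, so the model's
    own attempts amplify a small set of human labels."""
    rng = random.Random(seed)
    kept = []
    for question, label in verified.items():
        for _ in range(samples_per_question):
            chain = propose_chain(question, rng)
            if chain["answer"] == label:
                kept.append(chain)
    return kept

data = self_generate(verified)
print(len(data), "verified chains kept out of", len(verified) * 200, "samples")
```

[Each human label filters hundreds of machine attempts, so compute, rather than human labeling, becomes the scaling axis.]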
Q: How far are we from AGI?
Yang Zhilin: I think we are still in the early stages. Of course, there is fairly large progress every year -- if you used last year's product today, you might find it unbearable.
But many things are still lacking. As I just said, the reasoning ability is not strong enough, and the interaction is not rich enough, so what it can do in terms of interaction is still limited. Interaction has two dimensions: interaction with users, and interaction with the objective world itself. I think there is still a lot of room for improvement in both.
----
<https://news.qq.com/rain/a/20240208A05KFR00>
# 专访月之暗面杨植麟:一家新创立的 AGI 公司如何超越 OpenAI?
海外独角兽
2024-02-21 11:25发布于北京海外独角兽官方账号
采访:天一、penny、guangmi
编辑:天一
排版:Scout
Lossless long context is everything。这是我们跟杨植麟聊完两个小时后记忆最深刻的一个观点。
这个技术判断在 23 年 10 月已经被传递出来,当时杨植麟创立的月之暗面发布了首个模型 moonshot 和智能助手 Kimi,支持 20 万字的输入。做“长”是因为杨植麟判断 AI-Native 产品的终极价值是提供个性化的交互,而 lossless long-context 是实现这一点的基础 —— 模型的微调长期不应该存在,用户跟模型的交互历史就是最好的个性化过程,历史上每一代技术都是在提升 context length。
杨植麟身上的标签有天才 AI 科学家、连续创业者……在这次深度访谈中,他再次证明自己是个真正“懂”大模型的创业者,所以本文中有许多反共识的观点:杨植麟觉得微调最终会不存在,tokenizer 最后也不一定是必须的;硅谷大模型训练者们担心数据瓶颈和能源限制,他反而觉得所有问题都是互相关联的,多模态可以缓解数据短缺,合成数据则可以通过改变计算范式解决能源问题。
本文还试图回答另一个外界普遍关心的问题:一家新创立的 AGI 公司如何超越 OpenAI?杨植麟的答案是 tech vision,一号位要能做出技术判断,同时还能拍板执行。一个具体的例子是,月之暗面希望比 OpenAI 更关心用户,原因是杨植麟判断用户数据的 scale up 的效果最终会超越 base model 自身。
杨植麟对于用 transformer 这个概率模型的思想基础走向 AGI 也很有信心,用他的话说“如果你有 10 亿的 context length,今天看到的问题都不是问题”。
## 01. AGI:AI 本质就是一堆 scaling law
海外独角兽:我们把 LLM 的训练比作登月,月之暗面的名字也和登月相关。你怎么看现在创业公司的 LLM 训练,在 GPU 和算力资源有限的条件下,还能实现登月吗?
杨植麟:“登月”有几个不同的生产要素,算力肯定是一个核心,但还有其他的。
你需要一个同时满足 scalability 和 generality 这两点的架构,但今天其实很多架构已经不满足这两条了。transformer 在已知的 token space 符合这两条,但放大到一个更通用的场景,也不太符合。数据也是一个生产要素,包括整个世界的数字化,和来自用户的数据。
所以在很多核心生产要素中,通过改变其他的生产要素,可以让算力利用率变高。
同时,针对“登月”,算力肯定要持续增长。今天能看到最好的模型是 10 的 25 到 26 次方 FLOPs 这种规模。这个数量级接下来肯定还会持续增长,所以我认为算力是个必要条件,因为机器学习或者 AI 研究了七八十年,唯一 work 的东西其实是 scaling Law,就是放大这几种生产要素。
我们其实比较有信心,在一年的时间窗口,能够达到 10 的 26 次方这样规模的模型,资源最终会得到合理分配的。
海外独角兽:OpenAI 训下一代模型,我们推测有至少 10 万张 H100,单个集群也能达到 3 万张。OpenAI 显然是追求“登月”的,不足可能是没那么注重用户和客户体验。月之暗面和 OpenAI 的差异化路径会在哪儿?有什么是月之暗面能做而 OpenAI 不做的?
杨植麟:短期内关键的一点在于大家的 tech vision 不完全相同。很多领域并不是 OpenAI 的核心竞争力,比如图片生成,DALL-E 3 至少比 Midjourney 落后一代。GPT 的 long-context 也并不是 state-of-the-art。我们前段时间做出来的 lossless long-context 技术在很多具体场景上要比 OpenAI 效果更好,因为用了无损压缩的技术。你可以用它去读一篇很长的文章,它可以很好地还原一些具体细节,还可以内容做推理。用户自己还会发现很多场景,比如扔给它 50 个简历,让它根据你的要求做分析和筛选。
要做差异化,我认为就是去看这里面的 tech space 有多大,tech space 越大,技术、产品、商业层面能实现的差异化就越大。如果技术已经收敛了,那大家只能去追赶,就是同质化内卷。
然后我其实比较乐观,因为现在仍有巨大的 tech space。AGI 技术可以分为三层:
第一层是 scaling law 结合 next-token-prediction。这个基础对所有人都是一样的,追赶过程逐渐收敛。在这个路径上, OpenAI 现在做得更好,因为他们过去四五年投入了相应的资源。
第二层现在有两个核心问题。首先是如何通用地表示这个世界?真正的“通用”是像计算机一样,用 0 和 1 就能表示整个世界。对于基于 transformer 的语言模型来说,它能表示一本书、一篇文章、甚至一个视频,但表示一个更大的 3D 世界或你硬盘上的所有文件还有难度,没做到 token-in-token-out,离所谓的 unified representation 其实有差距。架构其实解决的是这个问题。
通过 AI 自我进化克服数据稀缺性的瓶颈是第二层的另一个问题。今天的 AI 其实像一个黑盒,这个黑盒有两个输入:电源线和数据线,输入这两个东西后,盒子就能产出智能。随后大家意识到,数据线的输入是有限的,这就是所谓的数据瓶颈问题,下一代 AI 需要拔掉数据线,做到只要源源不断地输入电力,就能源源不断地输出智能。
这两个核心问题导致在第三层有巨大的空间,包括 long-context、不同模态的生成、模型多步规划的能力、指令遵循的能力、各种 agent 的功能等。
这些上层的东西都会有巨大的差异化,因为中间存在两个重要的技术变量。我认为这是我们的机会。
除了技术层面,价值观上我们有一点和 OpenAI 不同:我们希望在下一个时代,能成为一家结合 OpenAI 技术理想主义和字节所展现的商业化哲学观的公司。东方的效用主义我认为有一定的可取之处。完全不关心商业价值的话,你其实很难真的做出来一个伟大的产品,或者让一个本身很伟大的技术变得更伟大。
海外独角兽:你觉得模型公司应该讲什么故事?像 OpenAI 一样讲追求 AGI,还是超级应用的故事?两者会有矛盾吗,怎么来平衡?
杨植麟:如何讲故事取决于投资人的心态。对我们来说,更重要的是理解两者之间的关系。
AGI 和产品对我们来说并不是手段和目的的关系,两个都是目的。同时,在追求 AGI 的过程中,我认为所谓的数据飞轮是很重要的,尽管它是一个老套的概念。
像 ChatGPT 这样的产品,还没有完全建立起基于用户数据的持续进化。我觉得这很大程度上是 base model 还在进化,进化了一代,之前的用户数据就没什么用了。这跟发展阶段有关系 —— 现在“吃”的是 base model 的 scaling law,未来可能会去“吃”用户这个数据源的 scaling law。
历史上基本所有的互联网产品要跑出来,最终都要靠用户数据的 scale。今天 MidJourney 已经能看到一些迹象,它通过“吃”用户的 scaling law 可以胜过 base model 的 scale up,但如果只看语言模型和文本,base model 的 scaling 效果仍然远远超过用户的,但我认为最终会转移到用户的 scaling law,只是个时间问题。
现在面对数据瓶颈,这一点尤为重要。特别是人类偏好数据,它非常有限,但没有它又不行。我觉得这也是每一个AI-Native 产品现在最值得思考的问题之一。所以,一个不足够关心用户的公司最终可能也没法实现 AGI。
海外独角兽:怎么看 MoE?有一种说法是 MoE 不是真正的 scale up,只有 scale up dense model 才会提升模型的能力。
杨植麟:你可以认为带 MoE 和不带 MoE 是两条 scaling law。本质上 scaling law 刻画的是 loss 跟参数量之间的关系。MoE 改变了这个函数,让你能够用更大的参数,但同时 FLOPs 不变。合成数据改变的是另一个关系,FLOPs 不变的情况下让数据规模增长。
沿着 scaling law 一直走是个有确定性的事情,大家通过试图改变 scaling law 里的具体关系来获得更高的 efficiency,多出来的 efficiency 就是各自的优势。
现在很多人觉得做出 MoE 就可以实现 GPT-4。我觉得这是片面的说法,最终更实质的可能还是如何有一个统一的表示空间以及可规模化的数据生产。
海外独角兽:如果算力足够,会有人想做一个万亿参数的 dense model 吗?
杨植麟:取决于推理成本的下降速度,但我觉得肯定会有。现在大家是因为推理成本太高,所以都在做 tradeoff。但是最终直接训练一个万亿的 dense model 肯定效果会比一个只有千亿参数的模型要好。
海外独角兽:Anthropic 一直在提模型的可解释性,这一点其实有蛮多争论。你是如何思考可解释性的?因为刚刚你也提到了模型是一个黑盒,并且其实人类到现在还没有弄清楚自己的大脑是怎么工作的。
杨植麟:可解释性核心是个信任的问题。建立一个信任的心智是很重要的,对应的应用场景甚至可能和 ChatGPT 的也会不同,比如 long-context 和搜索的结合。
当模型完全不 hallucinate 或者概率非常低,就不需要解释了,因为它说的东西都是对的。而且解释有可能也只是 alignment 的一部分,比如说 chain-of-thought 也可以被认为是一种解释。
Hallucination 是可以通过 scaling law 来解决。但不一定是在 pre-training 环节,因为其实 alignment 也有 scaling law,它肯定是可以被解决的,只要你能找到对的数据。AI 本质就是一堆 scaling law。
海外独角兽:你对 AGI 的预期是什么?transformer 本质还是一个统计概率模型,它能通往 AGI 吗?
杨植麟:统计模型没有什么问题。当 next token prediction 足够好的时候,它能够平衡创造性和事实性。
事实性一般是对统计模型的挑战,但是今天的语言模型可以有非常尖峰的分布。让它回答“中国的首都”,模型对“北”这个字能给出 99% 的概率。同时,如果我今天让它写一本小说,那它可能下一个词的概率分布就会很均匀。概率其实是一个通用的表示方式。本质上这个世界上有大量的熵,抓住确定性的东西,让本身是混沌的东西继续混沌。
通往 AGI 的话,long-context 会是一个很重要的点。所有问题都是 long-context 的问题 —— 历史上所有的架构演进本质上都是在提升有效的 context length。word2vec 最近拿了 NeurIPS 的 Test of Time 奖。它在 10 年前用一个词去预测周围的词,相当于 context length 是 5。RNN 把有效的 context length 提升到了 20。LSTM 涨到大几十。transformer 到了几千。现在我们能做到几十万。
如果你有 10 亿的 context length,今天看到的问题都不是问题。
此外,其实无损压缩就是在一片混沌中学习确定性。一个极端的例子是等差数列,给定前两个数,接下来每一个数都是确定的,不存在混沌,所以一个完美的模型可以还原整个数列。但真实世界的很多数据都存在噪声,我们需要过滤掉这些噪声,让模型只学能学习到的内容。在这个过程中,对于那些不确定的可能性,也要分配足够的概率。举个例子,如果要生成一张图片,那么它的 loss 会比生成一段文字更高,这是因为图片包含了更多的混沌和信息量,但只需捕捉其中你能掌握的部分,剩余的部分可以认为是有概率发生的。比如,水杯的颜色是绿色还是红色就是有概率会发生的,但颜色这个信息不会改变“水杯长什么样”这件事,所以这里面需要重点学习的就是水杯的形状,至于它的颜色,就要做一个概率分配。
海外独角兽:context length 的提升存在什么规律?有技术可预见性吗?
杨植麟:我自己感觉存在 context length 的摩尔定律。但需要强调:给定长度下的准确率也非常重要,需要同时优化长度和准确率(无损压缩)两个指标。
在保证模型能力和智商的情况下,我觉得大概率 context length 的提升是指数级增长的。
## 02. 多模态:大部分架构不值得被 scale up
海外独角兽:大家都期待多模态会在 2024 年爆发,相比文本,多模态的技术难度会在哪里?
杨植麟:现在 state-of-the-art 的视频生成模型的 FLOPs 其实比语言模型少一个数量级以上,并不是大家不想 scale up,而是大部分架构不值得这么做。
19 年最流行的是架构是 BERT,后来大家问为什么没有人去 scale BERT,其实是因为值得被 scale 的架构需要具备 scalability 和 generality 这两个条件。我不认为 BERT 没有 scalability,但是你能明显看到它没有 generality —— 不管 scale 到多大,它都不可能给你写一篇文章。多模态过去几年也是卡在架构上,缺少真正通用的、有人愿意去 scale 的模型。Diffusion 明显不是,scale 上天了它也不可能是 AGI。今天 auto-regressive 的架构带来了一些新的可能,牺牲了一些效率解决了通用性。
Auto-regressive 本身是 scalable 的,但是 tokenizer 不一定,或者最后就不需要 tokenizer,这是 24 年的核心问题。
海外独角兽:如果 tokenizer 不 scalable ,我们需要一个 transformer 之外全新的架构吗?
杨植麟:光说 transformer 本身,我觉得问题不大。核心还是解决 tokenizer 的问题。transformer 架构其实已经发生很多变化了,今天做 long-context、做 MoE,都不是标准的 transformer。但是 transformer 的灵魂或者思想肯定还会存在很长时间,核心是怎么在这个思想基础上解决更多问题。
海外独角兽:其实 context length 无限长的话,我们也不需要 tokenizer 了?
杨植麟:对。本质上模型足够强的话,它可以处理任何的 token、pixel、byte。有了无限长的 context length,你可以直接把硬盘上所有的东西都输给它,它会变成你真正的新计算机,根据这些 context 采取行动。
海外独角兽:OpenAI、Anthropic 等领先的模型公司觉得 2024 年的一大瓶颈会是数据,所以他们对怎么用合成数据期待比较高,你怎么看合成数据?
杨植麟:一个值得被 scale up 的架构是基础,这个架构首先得支持不断加入更多数据,然后数据才会真的成为瓶颈。我们现在说的数据瓶颈,从文本模态上,2024 年就会遇到,但多模态数据的引入进来会把这个问题推迟 1-2 年。
如果视频和多模态的卡点解决不了,那文本的数据瓶颈就会很关键。这点上其实我们也有些进展 —— 如果限定了问题,比如数学或者写代码,数据是相对好生成的。通用的问题现在还没有完全的解法,但是存在一些方向可以去探索。
海外独角兽:2025 年的瓶颈会是能源?因为到时候单个集群规模很大,对能源带来挑战。
杨植麟:这些问题其实是连在一起的,最后可能是多模态解决数据问题,合成数据解决能源问题。
到了 GPT-6 这一代,掌握合成数据技术的玩家会体现出明显差距。因为数据其实有两种,一种是做 pre-training 的数据,另外一种是获取成本更高的 alignment 数据。如果掌握了数据生成技术,alignment 的成本可能会降低好几个数量级,或者能用一样的投入产生更大的几个数量级的数据,格局就会发生变化。
我觉得 2025、2026 年可能是很重要的 milestone —— 模型的大部分计算量会发生在模型自己生成的数据上。
26 年的时候也许模型用于推理的计算量会远远大于训练本身,可能花 10 倍的成本去推理,推理完之后花一倍的成本来训练。会出现新的范式,推理即训练,而且这个推理不是为任何用户服务的,只为自己本身的合成数据服务。
出现这种情况的话,能源的问题也解决了,因为推理是可以分布式的。而且它不违背定律,本质还是个能源守恒。只不过我把计算范式改变了,让能源能够以分布式的方式解决。
## 03. 超级应用:模型的微调可能最终不存在
海外独角兽:Google 和抖音背后的搜索和推荐有很强的飞轮效应,算法能根据用户的行为实时反馈,用户体验也能不断提升。LLM 现在无法实时反馈用户行为,AI-Native 产品的飞轮效应会是什么?
杨植麟:我深入思考过这个问题。AI-Native 产品最终的核心价值是个性化交互,这是以前技术实现得不好的,所以这个问题其实是关于个性化的 —— 怎么让用户使用你的产品多了之后,获得高度个性化的互动体验。今天对许多产品来说,这个个性化程度几乎为零。以前我们只能做个性化的推荐,但现在,用户可以与产品进行互动。这种互动是高度拟人化和个性化的。怎么实现这一点?
我觉得这背后实际上是个技术问题。传统 AI 时代,要实现个性化,需要持续更新模型,用小模型解决单点问题。大模型时代,实现个性化的一种方式是微调,但我认为微调可能不是本质的方法,长期来看可能不会存在模型的微调。为什么?当你的模型指令跟随能力、推理能力、上下文一致性能力越来越强时,所有东西只需要放在内存里就可以。比如你的大模型内存有一堆 prefix 这样的东西用来 follow,成本可以降到非常低。最终,你对模型个性化的过程实际上就是你所有的交互历史,也是一个包含了你的偏好和反馈的集合,这些反馈会比上个时代的产品更直接,因为它是完全通过对话界面产生的。
基于这个判断,进一步就会想:如何在技术层面实现基于 long-context 的定制化去完全取代微调?
我认为现在正在往这个方向走,未来模型不需要微调,而是通过强大的上下文一致性和指令跟随能力来解决问题,长期趋势应该是底层技术个性化,这会是一个很重要的变化。
比如,GPT-4 带来的新的计算范式,创建 GPTs 并不需要微调。以前的定制化是通过 programming 实现的,今天实际上是通过让模型的 prefix 变得非常复杂,从这个通用的集合中抽出你想要的东西。通过这种方式实现个性化才是 AI-native 的个性化,外挂一个传统的推荐引擎肯定会被新方式淘汰。
海外独角兽:你们先做 lossless long-context 这个决策是怎么做出来的?
杨植麟:我觉得最重要的还是以终为始地思考这个事。大模型作为新的计算机肯定也需要很大的内存,因为旧的计算机的内存在过去几十年的时间里面至少增长了几个数量级,而且旧的计算机也是一开始的时候只有很少的内存。第二点就在于 AI 的终极价值是个性化。
海外独角兽:OpenAI 其实也有一定的 long-context 了。
杨植麟:它还没有把用户的交互过程真正视为个性化的场景。比如,如果我们去 ChatGPT prompt 某个东西, 不管是今天还是明天,只要模型版本相同,可能效果基本上差不多,这就是我说的缺乏个性化。
最终所有东西都是指令遵循。只不过你的指令会越来越复杂。今天你的指令一开始可能是 10 个词,但是你到后面有可能它就是 1 万个词、 100 万个词。
海外独角兽:Chatbot 一直是 AI 科学家的白月光,如果每个用户每天和 Chatbot 对话几百条,Chatbot 系统能采集和理解更多的用户 context,最终会大幅超越搜索和推荐系统的匹配准确率吗?就像我们和同事家人之间的互动,只需要一句话甚至一个眼神对方就懂你的意思。
杨植麟:核心是跨越信任这一步。
我觉得最终衡量一个 AI 产品的长期价值,就是看用户愿意在它上面输入多少个人化的信息,然后 lossless long-context 和个性化负责把这些输入变成有价值的东西。
可能也还需要新的硬件形态,但我觉得模型和软件现在也还是个瓶颈。因为要再往下钻一层,让用户输入很多信息的前提是 trust,是你需要有足够 engaging 和 human like 的AI。不能说是我为了得到你的信息所以专门设置了一些产品功能。最终效果应该是用户和 AI 成为了朋友,那所有事情都可以跟它说。
Inflection Pi 的 motivation 其实是很好的,想要建立强信任,只是 Pi 可能要再往前推一步,到底怎样跟用户去建立信任,人类社会可能并不接受指派一个终身搭档的做法,这有点反人性。
海外独角兽:月之暗面想做超级应用,你自己理想中的超级应用长什么样子?多大才算超级?
杨植麟:还是看破圈程度。周围的亲戚都在用,你才真正成为超级应用。而且我认为 AI 能力的提升会领先于产品破圈。比如假设今天 character.ai 是非常完美的多模态模型,那我觉得它破圈的概率至少会大 10 倍。最终一个应用的上限体现在以年为维度的 AI 和人的 connection 的增加。
## 04. 月之暗面:最好的人才需要 unlearn 能力
海外独角兽:AGI 公司最理想的 CEO 画像应该是什么样的?
杨植麟:一方面需要有 tech vision。不能一直做别人已经证明过的东西。真的 AGI 公司必须有自己独特的技术判断,而且这个判断应该影响到公司的整体方向。如果一号位不能拍板也不行。我们年初已经在做 auto-regressive 的多模态、lossless long-context 了,但它们都是最近一两个月才变得非常火,甚至即使今天,lossless long-context 仍然不是一个共识。但如果今天才看到这个事情,已经没有足够多的时间去迭代,最后会变成跟随者。
第二点是能够很深刻的理解 AI-Native 产品的开发方式,然后基于新的生产方式适配一套组织。以前做产品是通过了解用户的需求设计功能,新时代需要在制造的过程中完成设计。ChatGPT 就是通过制造完成设计,并没有先设计出来一堆场景再找对应的算法。Kimi 的用户自己去上传简历然后做筛选,也是我们上线之前完全没有测试过的用例。
资源获取肯定也很重要。其中主要烧钱的是算力。早期靠融资,到后面就需要更多的产品商业化。商业化也不能照搬上一个时代成熟的东西创新,所以好的 CEO 和团队应该有一定经验,但同时也有很强的学习和迭代能力。
海外独角兽:但有可能投资人分辨不出来到底谁的 tech vision 是最领先的。
杨植麟:我不太担心这个问题。现在就是最好的分配方式,更接近一个自由市场,最后会有最高的分配效率。我们要跟别人证明的也不是我们的 vision,因为 vision 是一个抽象的东西,还是要通过真实的 deliver 模型和产品。Anthropic 放出 Claude 这些模型之后,马上就得到了更多的资源。市场是公平的。
海外独角兽:从建立产品和公司竞争壁垒的角度,工业时代讲究规模效应,互联网时代讲究网络效应,AGI 时代会有新范式吗?
杨植麟:短期是组织方式的变化带来技术上的提升 —— 你通过更好的组织带来更好的技术,然后在产品上直接传递出更好的体验。
长期大概率还是网络效应。问题在于网络效应的体现方式是什么?比如以前互联网的双边网络可能仍然会存在,但并不是用户和创作者双边。AI-Native 产品的双边网络可能体现在个性化上,用户和 AI 存在一种共创的关系。
所以我现在看到值得探索的是两点:模型能力的持续提升,另一个是双边效应。它们会在新时代带来新的范式。现在 Midjourney 在双边效应上已经爆发了,Stable Diffusion 作为开源模型就尴尬在单边太分散,只能依赖 base model 的提升。
海外独角兽:从招聘角度,你怎么定义好的人才?
杨植麟:我会拆成经验和学习来看。学习是一个通用的能力,不光是 learn,还要 unlearn,特别是以前的成功经验。假设你是从 0 到 1 做了 YouTube,现在做 AI 产品可能比别人更难,因为要 unlearn 很多东西。学习比经验重要。可能再过 5 年的话, AI 行业会培养出来很多所谓的成熟职能。今天我觉得其实划分职能没有什么意义,需要每个人都很多面。
海外独角兽:什么样的 researcher 才会有 tech vision?
杨植麟:核心是两点,一个是抓大放小,一个是终局思维。我跟很多 researcher 合作过,容易出现的一个问题就是过分雕花,容易在局部里看到有很多可以优化的东西,比如我们发现 transformer 解决了 LSTM 的 context length 问题,但如果再跳出来一层,就会发现本质上每一代技术都是在提升 context length。
海外独角兽:你觉得月之暗面还需要多少这样的人才?
杨植麟:客观上来说,限制我们的肯定还是供给。现在 AGI 的人才稀缺在于经验,但其实拥有学习能力的人才还是很多的。
但是需求角度,整个组织不能太大 —— 把自己活生生又弄成了大厂的话,很多组织优势就丢失了。所以我们肯定还是会维持一个精简高效的组织。我觉得一个核心判断是 AGI 不需要那么多人。而且长期来看,真的“拔掉了数据”之后,GPT-6 水平之后的模型完全可以自我进化,这样才能突破人类已有能力的边界。
海外独角兽:你怎么看追平 GPT-4 的难度和时间?
杨植麟:Benchmarking 刷到 GPT-4 非常简单,但是达到它的实际效果肯定有难度的,而且靠的不只是资源,Google 已经验证了这一点。其实 GPT-4 的训练成本也没那么高,大几千万美元不是一个很吓人的数字,对我们来说是好事,并且我们已经有比较好的进展。
最重要的还是底层有 tech vision 去预判 GPT-5 和 GPT-6 应该是什么样,然后提前去执行和积累,不然永远都不可能超越 Open AI。OpenAI 的很多红利也在于提前预判,它在 2018 年就大概相信自己在探索正确的方向,花了很长时间积累。
海外独角兽:让你来做图片生成这种产品的话,你会怎么做?怎么兼顾语言理解和图片质量?
杨植麟:现在 Midjourney 在图片生成这个单一任务已经做得特别好了,我来做的话会希望它能做很多任务,同时在其中的一些任务也能做得很好。这其实也是 OpenAI 的思路,只是它其实没做成功。
AGI 公司应该是入口逻辑,让用户默认用你,此外特定人群会有一些特殊需求和对极致效果的追求,所以市场里还存在 Midjourney 之类公司的机会。但是 AGI 的通用性足够强大时,很多用户也会转移 —— 如果今天我把 Photoshop 整个软件都重新封装成一个 prompt,它变成大家一个外包的全能设计师,那会有更少的人用 Midjourney。
Midjourney 今天的地位在于它通过先发优势让飞轮跑起来了。比较 tricky 的是未来还会不会有这种时间窗口,如果没时间窗口,那很可能直接被通用模型碾压。
海外独角兽:沿着入口逻辑的话,你觉得未来会有几个入口?
杨植麟:至少有两个,一个是有用的,一个是好玩的。
信息入口可能不存在了,因为我们搜寻信息本质上是希望端到端完成一个任务。智能的入口以后大概率会覆盖搜索引擎这类信息入口。人获取信息并不是终极需求,它只是一直被强行定义成一种需求。有些时候我们是希望完成一件事,有些时候是希望学习某个东西,AGI 的入口应该直接帮用户完成任务,而不是帮他们获取信息。
海外独角兽:从今天到实现你理想中的 AGI 还需要多少钱?
杨植麟:严格的 AGI 还需要百亿美元级别。但是它不是一步到位,你需要跑起来一个循环,业务能够自己产出对应的资源。这个百亿美元推论的原因是 scale up 的规模还需要至少 2-3 个数量级。当然,过程中会伴随着成本的优化。
海外独角兽:AGI 公司的商业模式应该是什么样的?还会是 seat-based 或者 usage-based 吗?
杨植麟:AGI 帮你完成的每个任务对应的价值不一样。它可能类似一个外包,按照每个任务定价。除此之外,在任务解决过程中,广告肯定还会扮演重要角色,基于个性化互动和对话的行为,广告的变现效率可能比现在要高很多。
海外独角兽:假如 GPT-4.5、Claude-3、Gemini-2.0 的训练成本是 3 亿美元左右,再往后到 2025 年下一代模型的训练成本可能要涨到几十亿美元,那要探索出 AGI 会是一场千亿美元豪赌,你思考过它最终对人类社会的影响吗?
杨植麟:相对确定的一点是实打实的生产力提升。现在用一个软件,其实对应 1000 个程序员的智能,是固定的,以后我们用的应用背后可能对应 100 万个人的智能,而且每天都在迭代。
看可能性的话,今天的一切都会变化。这么多语言被训练到一起,对文化、价值观都有影响。人的时间分配可能也会产生很多变化,真正为了钱工作的人可能会变少,更多时间可能花在精神世界里面,最后可能会有一个巨大的虚拟的精神空间。要实现 Metaverse,可能其实是要先实现 AI。
另外,我相信 AGI 最终是全球化的。
海外独角兽:但是现在我们判断领先的模型又强又便宜,会有很强的马太效应,最后格局还是很收敛。
杨植麟:5 年的时间窗口的话,头部效应还是会明显。但是 50 年之后,我相信 AGI 肯定是同质化的,跟今天的电没有什么区别。
----
<https://mp.weixin.qq.com/s?__biz=Mjc1NjM3MjY2MA==&mid=2691539716&idx=1&sn=d0630dc55f1569f866b9cf485bd283e3>
# 月之暗面杨植麟复盘大模型创业这一年:向延绵而未知的雪山前进
Original 张小珺 腾讯科技
2024年02月29日 16:07
> 如果所有人都觉得你正常,你的理想是大家都能想到的,它对人类的理想总量没有增量。
作者 | 张小珺
出品 | 腾讯新闻《潜望》
就在一年以前,AI科学家杨植麟在硅谷做了一笔精确的计算。他意识到,如果决定启动一场以AGI为目标的大模型创业,要在未来几个月立马筹措超1亿美金资本。
然而,这仅仅只是一张入场券。一年后,这个数字翻了13倍。
大模型公司的竞争,与其说是一场科学竞争,不如说首先是一场残酷的金钱角力。在资本方捂紧口袋的情况下,你要领先对手找到更多的钱,购买更多的卡,抢夺更多的人才。
“它需要人才聚集、资本聚集。”成立于2023年3月1日的大模型公司月之暗面(Moonshot AI)创始人兼CEO杨植麟说。
过去一年,国产大模型公司似乎处在一种紧迫而逼仄的生存边缘。看上去,他们每个都手握重金。但一方面,他们要把刚融的钱,立马投入极高昂的科研中追赶OpenAI——先是追齐GPT-3.5,没等追上GPT-4,Sora又来了;另一方面,他们要马不停蹄在落地场景上找可能,自我验证你是一家公司、而不是只会吞噬资本金的研究所;这还不够,每个项目不管是上市还是并购,出路更是毫不明朗。
在中国大模型创始人中,杨植麟年纪最轻,于1992年出生。业界评价他是坚定的AGI信徒和有技术号召力的创始人。他的学习与工作履历很多与通用AI相关,论文引用超22000次。
对于大模型,中国科技界于2023年中从狂热骤然转冷,进入加速落地的实用主义主旋律。这不免让大模型CEO们处于理想与现实的剧烈拉扯之间。在人人喊PMF(Product/Market Fit,产品/市场契合)、人人喊商业化的中国AI生态里,这位AI研究员出身的创始人倒不那么着急。
月之暗面是头部国产大模型公司中,人数最少的一家,为80人。他没有像他的对手那样,做更稳妥的to B生意,或是在医疗、游戏等细分场景中找落地,而是做且只做了一款to C产品——智能助手Kimi,支持20万汉字输入。Kimi也是杨植麟的英文名。
杨植麟倾向于将他的公司看作是,构建一个结合科学、工程和商业的系统。你可以想象成,他要在人类世界上空,架起一张AI实验台,一手做实验,一手将尖端技术落进真实世界,通过与人类互动找到应用机会,再将应用送入消费者手中。理想状况是,前者烧掉数以十亿、百亿计资本;后者再把这些钱数成百上千倍地挣回来——怎么听,都像“走钢丝”一样惊险。
“AI不是我在接下来一两年找到什么PMF,而是接下来十到二十年如何改变世界。”他说。
这种抽象和理想主义的思考,令人不免替他捏一把冷汗:一位年轻的AI科学家,在现实主义的中国能否找到生存空间?
2024年2月,月之暗面逆势完成一笔大额融资。据了解,它以15亿美金投前估值完成超10亿美元B轮,阿里领投,砺思资本、小红书等跟投,该笔交易完成后,月之暗面投后估值约25亿美元——由此,它成为中国大模型赛场上现阶段估值最高的一家独角兽。(他们拒绝回应和评论此事。)
就在第三笔融资的过程中,我们和杨植麟聊了聊他过去一年创业故事,这也是国产大模型抢跑一年的截面缩影。
他的公司没有选址在大模型企业聚集地,北京搜狐网络大厦。对于一家融资总额约90亿元人民币的公司,这间位于量子芯座的办公室,显得简陋又破旧。门口连公司logo都没有,只有一架白色钢琴守在门口。
会议室在一个角落,由于窗户小黑漆漆的,冬天送来暖风的空调机器嗡嗡作响。暗沉的光亮中,杨植麟形容自己过去一年的感知:“有点像开车在路上,前面有延绵的雪山,但你不知道里面是什么,你在一步一步往前走。”
以下是对杨植麟的访谈全文。(为方便阅读,作者做了一些文本优化)
## Standing at the beginning
> "You have to ride the wave"
Tencent News "Periscope": How have you been lately?
Yang Zhilin: Busy — there's a lot going on. But still very excited. Standing at the start of an industry, the room for imagination is enormous.
Periscope: On my way in I saw a pure white piano at your company's entrance.
Yang Zhilin: With a Pink Floyd album on top. I don't even know who put it there — I only noticed it a couple of days ago and haven't had a chance to ask. (Pink Floyd is the British rock band that released the album The Dark Side of the Moon.)
Periscope: On the day ChatGPT was released in November 2022, what were you doing?
Yang Zhilin: I was preparing for this — finding people, building a team, knocking new ideas around. Seeing ChatGPT was thrilling. Three to five years earlier, even in 2021, it would have been unthinkable. That kind of higher-order reasoning used to be very hard to achieve.
I sensed that many variables in the market were about to shift: capital on one side, talent on the other — the core factors of production for doing AI. If those variables came through, we could seriously build a company to do this. An organization built from zero to one for AGI became possible — that was a big epiphany. An independent company made more sense, but it's not something you can start the moment you want to. ChatGPT jolted the variables and brought the factors of production together. You still have to ride the wave.
Periscope: After deciding to found an AGI company, what preparations did you make? How did you assemble those two factors of production, capital and talent?
Yang Zhilin: It was a winding process. ChatGPT took time to diffuse. Some people learned early, some late; some were skeptical at first, then shocked, then convinced. Finding people and finding money were tightly bound to timing.
We began our first concentrated fundraising round in February 2023. If we had delayed to April, there would basically have been no chance. But December 2022 or January 2023 would not have worked either — the pandemic was still on and people hadn't come around yet. So the real window was one month.
Back then, one night in the United States, I did a precise calculation. The conclusion was that we had to raise at least $100 million within a few months. Many in the market hadn't started fundraising; many people thought you couldn't possibly raise that much. It later proved possible — even more than that.
The talent market started to move. Inspired by ChatGPT, many people had the realization in March or April 2023 that this was the only thing worth doing for the next decade. You have to actively reach the right people at the right time. Two years earlier, talent would not have concentrated like this — back then most people were doing traditional AI, or AI-adjacent businesses, none of it general-purpose AI.
Periscope: To summarize: February was the fundraising window, and March and April were the hiring window?
Yang Zhilin: Roughly.
Periscope: Where were you that night in the U.S. when you ran the numbers? How exactly did you calculate them?
Yang Zhilin: From late 2022 to early 2023 I spent a month or two in the U.S., talking to people.
It was at the place I was staying. You work out the FLOPs (floating-point operations) you'd need, the training cost, inference, user numbers.
Periscope: At that moment, what mood was Silicon Valley immersed in?
Yang Zhilin: The product was getting many early adopters, concentrated in the tech circle — we were in that circle ourselves, so we felt it keenly. At the big Silicon Valley companies everyone writes a performance review every half year, and many people started writing them with ChatGPT. Some people's everyday writing wasn't very professional; with ChatGPT, everyone suddenly sounded solemn and polished.
There were undercurrents everywhere. Many people were weighing their next job or a startup. Many friends we talked with went on to found companies. And there was intense FOMO (fear of missing out). Nobody could sleep. Midnight, 1 a.m., 2 a.m. — go looking and everyone was always up. A bit anxious, a bit of FOMO, and very excited.
Periscope: The night you calculated you needed to raise $100 million — how late did you stay up?
Yang Zhilin: Not that late — the calculation itself didn't take long.
But once I was done, I couldn't tell many people. Even if I had, nobody would have thought it was doable.
## Technical lineage
> "Freeing yourself from endless wood-carving"
Periscope: When venture investors mention you, they say "the founder is brilliant, with technical star power, and the team is full of technical stars." So before talking about the large-model startup, let's talk about your academic background.
You did your undergraduate degree in computer science at Tsinghua and your PhD at Carnegie Mellon's School of Computer Science. Has AI been your direction the whole time?
Yang Zhilin: I was born in 1992 and started undergrad in 2011; from my second year until now, more than a decade, I've been in this field. I started with fairly divergent exploration — looking around, doing some work on graphs and on multimodality — and in 2017 converged on language models. At the time I thought language modeling was a fairly important problem; later I came to think it was the only important one.
Periscope: In 2017, how did the AI field generally view language models, and how did that view evolve?
Yang Zhilin: Back then it was a model used to rank outputs for speech recognition. (laughs) When you finish recognizing a stretch of speech there are many candidate results; you use the language model to see which is most probable and output the likeliest one. The applications were very limited.
But then you realize it is the fundamental problem, because you are modeling the probability of the world. Language is limited, but it is a projection of the world; in theory, if you make the token space (the space of all possible tokens) large enough, you can build a general world model. For everything in the world — how it arises, how it evolves — you can assign a probability. Every problem can be reduced to how you estimate probabilities.
Periscope: Your academic advisors are famous: your PhD advisors were Ruslan Salakhutdinov, head of AI at Apple, and William W. Cohen, principal scientist at Google AI. Both straddle industry and academia.
Yang Zhilin: Industry and academia became more intertwined a few years ago, and now the trend is shifting: more of the valuable breakthroughs will happen in industry — an inevitable law of development. You start with exploratory research and gradually move into a more mature industrialization process. That doesn't mean industrialization needs no research; it's just that pure research will find it hard to produce valuable breakthroughs.
Periscope: What did you learn from these renowned advisors?
Yang Zhilin: I learned the most at Google, where I interned for a long time. Starting in late 2018 I worked on Transformer-based language models, and the biggest learning was to free myself from endless wood-carving [雕花 — polishing small details; the opposite of the Bitter Lesson]. That was crucial.
You should look at the big directions, the big gradients. When ten roads lie before you, most people think about how to brake for the pedestrian on the road they're on — a short-term detail. But which of the ten roads to choose matters most.
The field used to have this problem. For example, on a dataset of only one or two million tokens, people worked on driving perplexity (a measure of a model's uncertainty when predicting a sequence) lower, driving the loss (the value of the training objective) lower, squeezing out accuracy — you sink into endless carving. People invented all sorts of bizarre architectures; those are carving tricks. After the carving, things might improve on that dataset, but you haven't seen the essence of the problem.
The essence is to analyze what the field is missing. What is the first principle?
Why can the scaling law be a first principle? You only need to find a structure satisfying two conditions: it is general enough, and it is scalable. General means you can model every problem within the framework; scalable means that as long as you pour in enough compute, it gets better.
That is the way of thinking I learned at Google: if something can be explained by a lower-level principle, don't over-carve at the upper level. There's an important line I strongly agree with: if a problem can be solved by scale, don't solve it with a new algorithm — the greatest value of a new algorithm is making things scale better. Once you free yourself from the carving, you can see much more.
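The "scalable" condition Yang names is usually checked empirically: if loss falls as a power law of compute, the points lie on a straight line in log-log space. A minimal sketch of fitting such a curve — the constants 2.0 and 0.05 and the compute values are made up for illustration, not real measurements:

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares in log-log space.

    A straight line in log-log coordinates is the signature of a
    scaling law: log L = log a - b * log C.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

# Synthetic loss measurements following L(C) = 2.0 * C**(-0.05)
compute = [1e18, 1e19, 1e20, 1e21]
loss = [2.0 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)
```

With a fit like this, "pour in enough compute and it gets better" becomes a concrete extrapolation along the fitted line.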
Periscope: Was Google also a follower of the scaling law back then? How did it act on this first principle?
Yang Zhilin: The ideas were already widespread, but Google didn't execute on them very well. It had the mindset but couldn't organize around it and turn it into a true moonshot. It was more like: here are 5 people pursuing my first principle, there are 5 people pursuing theirs. There was nothing top-down.
Periscope: During your PhD you co-authored papers with Turing Award winners Yann LeCun and Yoshua Bengio, and you were first author on both. How did those collaborations come about? — I mean, they're Turing Award winners and not your advisors; what did you attract them with?
Yang Zhilin: Academia is very open. As long as you have good ideas and meaningful problems, it's fine. Two brains — or n brains — produce more than one. That applies to developing AGI too. An important strategy in AI is the "ensemble" (combining the predictions of several different models or methods to get better performance); in essence it does the same thing — when you have diverse viewpoints you can strike many new sparks. Collaboration pays off greatly.
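The ensemble idea Yang mentions can be shown in miniature: several imperfect predictors that err on different inputs, combined by majority vote, beat any one of them alone. A toy sketch — the parity task and the "models" are invented purely for illustration:

```python
def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def make_model(flip_on):
    # A "model" that predicts the parity of x correctly,
    # except that it flips its answer on one particular input.
    return lambda x: (1 - x % 2) if x == flip_on else x % 2

# Three models whose errors fall on disjoint inputs (0, 1, 2).
models = [make_model(0), make_model(1), make_model(2)]
data = list(range(12))
truth = [x % 2 for x in data]

def majority_vote(x):
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

single_acc = [accuracy([m(x) for x in data], truth) for m in models]
ensemble_acc = accuracy([majority_vote(x) for x in data], truth)
```

Because each model's mistakes land on different inputs, the majority is right everywhere even though no individual model is — the numerical analogue of diverse viewpoints striking new sparks.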
Periscope: Did you start with an idea and take it to them to see if they were interested?
Yang Zhilin: Roughly that process.
Periscope: Which is harder — winning over academic heavyweights or winning over big-name investors? What do the two have in common?
Yang Zhilin: "Winning over" isn't a good phrase; what's underneath is collaboration. Collaboration means both sides win — a win-win is the precondition for collaboration. So there's no real difference: you need to offer people unique value.
Periscope: How do you earn their trust? What do you think your gift is?
Yang Zhilin: No particular gift. Just hard work.
## The old system no longer fits
> "AGI needs a new way of organizing"
Periscope: You just said "more valuable breakthroughs will happen in industry" — does that include startups and the giants' AI labs?
Yang Zhilin: Labs are history. Google Brain used to be the biggest AI lab in industry, but it was a research organization embedded in a big company. That kind of organization can explore new ideas, but it struggles to produce great systems — it can produce the Transformer, but it cannot produce ChatGPT.
The way of developing will evolve: you are building a giant system that needs new algorithms, solid engineering, and even a lot of product and commercialization. It's like the early 2000s: you couldn't research information retrieval in a lab; it had to live in the real world, as a giant system, as a product with users — like Google. So research and education systems will shift their function toward cultivating talent.
Periscope: How would you describe this new kind of system? Is OpenAI its prototype?
Yang Zhilin: It is the most mature such organization today, and it is still evolving.
Periscope: Can we understand it as an organization set up for a grand scientific goal of humanity?
Yang Zhilin: I want to stress that it is not pure science; it is a combination of science, engineering, and business. It has to be a commercial organization — a company, not a research institute. But this company is built from zero to one, because AGI needs a new way of organizing. First, its mode of production differs from the internet's; second, it shifts from pure research to research, engineering, product, and business combined.
The core is that it should be a moonshot: lots of top-down planning, but with room for innovation inside the plan, because not all the technology is settled. Bottom-up elements within a top-down frame. No such organization existed before, but the organization has to fit the technology, because technology determines the mode of production — without the fit there is no effective output. We believe it most likely has to be designed afresh.
Periscope: During last year's OpenAI coup, one option for Sam Altman was to join Microsoft and lead a new Microsoft AI team. What's the essential difference between that and being CEO of OpenAI?
Yang Zhilin: You would need to grow a new organization inside an old culture. That is very hard.
Periscope: You want to be "China's OpenAI" — is that fair to say?
Yang Zhilin: Not quite. We don't want to be China's anything, and we don't necessarily want to be OpenAI.
First, true AGI will certainly be global. Long term, there is no such thing as an AGI company confined to a regional market by market-protection mechanisms. Globalization, AGI, and a product with a very large user base — those three are ultimately necessary conditions.
Second, is it OpenAI? Look at 2017-2018: OpenAI's reputation was poor. People in our circle looking for jobs generally considered places like Google. Many who talked with Ilya Sutskever (OpenAI's chief scientist) came away thinking the man was crazy, too full of himself — that OpenAI was either madmen or frauds. But they committed very early, found the non-consensus, and found the only first principle of AI that works today: scaling through next-token prediction.
I believe companies greater than OpenAI will exist. A truly great company combines technological idealism with a great product co-created with its users; AGI will ultimately be something produced in co-work with all its users. So it's not just technology — it also needs utilitarianism and realist pursuit, and in the end a perfect union of the two.
Still, we should learn OpenAI's technological idealism. If everyone thinks you are normal, your ideal is one that anyone could have thought of, and it adds nothing to humanity's total stock of ideals.
## The first step of the moon landing is long context — what's the second?
> "There will be two milestones next"
Periscope: Back to the moment you decided to start the company. Did you launch the first round right after returning to China?
Yang Zhilin: It started in the U.S. in February (last year), some of it remote. In the end it was mostly domestic investors.
Periscope: Did the first round raise $100 million?
Yang Zhilin: Not the first round; the total later exceeded that. We completed two rounds in 2023, close to RMB 2 billion in total.
Now we're on the third round. We haven't formally announced the financing, so I can't comment right now.
Periscope: Some say that from the second half of 2023, no one was willing to invest in foundation-model companies anymore. Are they wrong?
Yang Zhilin: Investors are still there. You can indeed see the mood shift, but it's not that no one invests — at least right now there's plenty of investment interest in the market.
Periscope: Besides capital and people, what key decisions did you make in 2023?
Yang Zhilin: Deciding what to do. That's the advantage of a company like ours — having technical vision at the highest level of decision-making.
Our work on long context required judgment about the future; you have to know what is fundamental, what the next direction is. Again, first principles — "the process of removing the carving." If you focus on carving, you can only look at what OpenAI has already done and figure out how to reproduce it.
You'll find that doing lossless compression of long text inside Kimi (our AI assistant) gives the product a distinctive experience. Reading English literature, it helps you understand very well. Try Claude or GPT-4 today and they don't necessarily do it well — you need to position early. We had been at it for over half a year. That is very different from spotting a long-context trend today and scrambling two teams to develop it as fast as possible.
Of course the marathon has only just begun; there will be more differentiation ahead, and that requires you to predict in advance what the "tenable non-consensus" is.
Periscope: In which month did you decide to do this?
Yang Zhilin: February or March — decided at the company's founding.
Periscope: Why is long text the first step of the moon landing?
Yang Zhilin: It is fundamental. It is the memory of a new kind of computer.
Old computer memory grew by several orders of magnitude over the past decades; the same will happen on the new computer. It can solve many of today's problems. For example, multimodal architectures still need a tokenizer, but once you have lossless long context you don't — you can feed in the raw input. More broadly, it turns the new computing paradigm into a more general foundation.
The old computer could represent everything with 0s and 1s; everything could be digitized. Today's new computer can't yet — there isn't enough context, so it isn't that general. To become a general world model, it needs long context.
Second, it enables personalization. The core value of AI is personalized interaction; the value ultimately lands on personalization, and AGI will be far more personalized than the previous generation's recommendation engines.
But personalization won't be achieved through finetuning; it will come from supporting very long context. Your entire history with the machine is context; that context defines the personalization process, and it cannot be replicated. It will be a more direct dialogue, and dialogue produces information.
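The personalization-through-context idea above can be sketched as a prompt assembler: instead of finetuning weights per user, keep as much of that user's most recent history as fits the window. A minimal sketch — the character-count budget stands in for a real token budget, and all names and messages are illustrative:

```python
def build_context(history, new_message, budget):
    """Assemble a prompt from conversation history under a size budget.

    Walks backward from the newest turn, keeping turns until the
    budget (a character count here, a token count in practice) is
    exhausted, then restores chronological order.
    """
    kept = []
    used = len(new_message)
    for turn in reversed(history):  # newest turns are most relevant
        if used + len(turn) > budget:
            break
        kept.append(turn)
        used += len(turn)
    kept.reverse()  # back to chronological order
    return "\n".join(kept + [new_message])

history = ["u: hi", "a: hello", "u: I prefer metric units", "a: noted"]
prompt = build_context(history, "u: how tall is Everest?", budget=60)
```

As the window grows, the truncation loop fires later and later — which is why a longer (and lossless) context directly translates into more of the user's history shaping each reply.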
Periscope: How much room does it have to expand?
Yang Zhilin: Enormous. On one hand there's the growth of the window itself — a long road, several orders of magnitude to go.
On the other hand, you can't just grow the window or stare at the number; whether the window is a few million or however many billion today is meaningless by itself. You have to look at what it can achieve within that window: the reasoning ability, the faithfulness (fidelity to the original information), the instruction following. You shouldn't chase a single metric, but combine the metric with capability.
If both dimensions keep improving, you can do a great deal. It may become possible to follow an instruction tens of thousands of words long; the instruction itself would define many agents, highly personalized.
Periscope: Is the work on long context reusable for catching up to GPT-4? Are they the same thing?
Yang Zhilin: I don't think so. It's more a move up a dimension — a new dimension, one GPT-4 doesn't have.
Periscope: Many say China's large-model companies are all doing roughly the same thing — chasing GPT-3.5 in 2023 and GPT-4 in 2024. Do you accept that?
Yang Zhilin: There are of course key targets for improving overall capability, so the claim is partly right — as a late mover you inevitably go through a catch-up phase. But it's also one-sided. Beyond overall capability, there's plenty of space to build unique capabilities and reach the state of the art in certain directions. Long context is one. DALL-E 3's image generation is thoroughly beaten by Midjourney V6. So you have to do both.
Periscope: What proportion of time and resources go to overall capability versus new dimensions?
Yang Zhilin: They have to be combined — a new dimension can't exist apart from overall capability, so it's hard to give a ratio. But making a new dimension good takes serious investment.
Periscope: Will these new dimensions all be carried by Kimi?
Yang Zhilin: It's certainly a very important product for us, and there will be some other experiments too.
Periscope: What do you make of Li Guangmi (founder of Shixiang) saying that China's large-model companies don't yet have much technical distinctiveness?
Yang Zhilin: I think it's fine — we've already built plenty of differentiation today. It's a matter of time; this year should reveal more dimensions. Last year everyone was putting up the scaffolding and getting things running.
Periscope: The first step of the moon landing is long context. What's the second?
Yang Zhilin: There will be two big milestones next. One is a truly unified world model: one that unifies all the different modalities — a truly scalable and general architecture.
Two is AI that keeps evolving without human data input.
Periscope: How long until these two milestones?
Yang Zhilin: Two to three years, possibly faster.
Periscope: So three years from now we'll be looking at a world completely unlike today's.
Yang Zhilin: At today's pace of development, yes. The technology is in its infancy, in a phase of rapid growth.
Periscope: Can you imagine what will appear three years from now?
Yang Zhilin: There will be a certain degree of AGI. Much of what we do today, AI will also be able to do — perhaps better. The key is how we use it.
Periscope: And for you, for Moonshot — what's your second step?
Yang Zhilin: We will go after those two things. Most of the remaining problems are derived from those two factors. What people call reasoning and agents today are products of solving those two problems. There's still some carving to do, but no fundamental blocker.
Periscope: Will you go all in on catching up with GPT-4?
Yang Zhilin: (GPT-4) is a necessary step on the road to AGI. The key is not to be satisfied with matching GPT-4's results. One, think about what the real non-consensus is now: beyond GPT-4, what comes next? What should GPT-5 and GPT-6 look like? Two, look at what unique capabilities you have in all this — that matters more.
Periscope: Other large-model companies publish their model capabilities and rankings; you seem not to.
Yang Zhilin: Chasing leaderboards means little now. The best leaderboard is the users — let users vote. Many leaderboards have problems.
Periscope: Is being the fastest in China's large-model race to reach GPT-4 your goal? Does fast versus slow matter?
Yang Zhilin: It certainly matters. Over a long enough horizon everyone gets there eventually, but look at how long "sooner or later" is. A gap of half a year or more is meaningful — and it depends on what you can do with that period.
Periscope: When do you expect to reach GPT-4?
Yang Zhilin: It should be soon; I can't give a date publicly yet.
Periscope: Will you be the fastest?
Yang Zhilin: That has to be watched dynamically, but we have a real chance.
Periscope: After launching Kimi, what is your North Star metric?
Yang Zhilin: Today it's making the product better and adding more new dimensions. For example, we shouldn't just grind away at a search scenario; search will eventually be only a small part of this product's value, which should have far greater upside. Being 10% or 20% better than a traditional search engine isn't worth much — only something disruptive deserves the three letters AGI.
The unique value is the intelligence you add. Seize that point: intelligence is always the core incremental value. If only 10-20% of your product's core value comes from AI, it doesn't hold up.
## I'm not the least bit anxious about deployment
> "User scaling and model scaling must happen together"
Periscope: Mid-2023 was a great divide — the market swung from fever to chill. How did it feel to you?
Yang Zhilin: I don't fully accept that judgment; we did close a round in the second half. And new things kept coming. Today's model capabilities were unimaginable at the end of last year. More and more AI companies' user numbers and revenue keep rising. The value keeps being proven.
Periscope: How did the first half and second half of the year feel different to you?
Yang Zhilin: Not much changed. Variables certainly exist, but return to first principles: how to give users a good product. Ultimately we exist to meet user needs, not to win a race. We are not a company founded for the sake of competition.
Periscope: The industry view is that a marked difference between the two halves of 2023 was a shift of focus: the first half was more about AGI, the second about deployment and commercialization. Did you make that shift?
Yang Zhilin: Of course I'm doing AGI — it's the only meaningful thing for the next decade. But that doesn't mean we don't do applications. Or rather, it shouldn't be defined as an "application."
"Application" sounds as if you have a technology and want to find somewhere to use it, closing a commercial loop. But "application" isn't the accurate word. It and AGI reinforce each other: it is both the means of achieving AGI and the end of achieving AGI. "Application" sounds more like a goal — I want it to be useful. You have to combine Eastern and Western philosophy: make money, and have ideals.
Today users have discovered many scenarios we never considered. Someone uses it to screen résumés — we never thought of that when designing the product, but it naturally works. User input in turn makes the model better. Why is Midjourney so good? It scaled on the user side — user scaling and model scaling must happen together. Conversely, if you only focus on applications and ignore model capability iteration and AGI, your contribution is limited too.
Periscope: Zhu Xiaohu (managing partner at GSR Ventures) invests only in large-model applications. His view: the hardest core problem is AIGC's PMF — if ten people can't find PMF, a hundred won't either; it has nothing to do with headcount or cost, so don't burn money. He says, "Train on LLaMA for two or three months and you can at least reach the top 30% of humans — ready to replace people right away." What do you make of that?
Yang Zhilin: AI is not about what PMF I find in the next year or two, but about how it changes the world over the next ten to twenty years — two different mindsets.
We are firm long-termists. When you reach AGI or stronger intelligence, everything today gets rewritten. PMF matters, of course, but if you rush to find PMF, you will likely be hit from a higher dimension again. Dimensional overmatch has happened too many times. People used to build customer service, dialogue systems, slot filling — some companies of decent scale. Then it was all overrun from above. Painful.
It's not that it can never work. Suppose you find a scenario today where current technology suffices, the 0-to-1 incremental value is huge, and the 1-to-n space isn't that big — that kind of scenario is fine. Midjourney is one; or copywriting generation, relatively simple tasks with an obvious 0-to-1 effect. Those are the application-only opportunities. But the biggest opportunity isn't there. If your goal is commercialization, you cannot think about it apart from AGI. If I only do applications — fine, in a year you may be steamrolled.
Periscope: You could quietly swap in an upgraded base model.
Yang Zhilin: But then you can't grow bigger than it. Technology is the only new variable of this era; the other variables haven't changed. Back to first principles: AGI is the core of everything. From that we derive: a super app will surely need the strongest technical capability.
Periscope: Could you use open-source models? (The latest news is that Google announced the open model Gemma.)
Yang Zhilin: Open source trails closed source. That is simply a fact.
Periscope: Might it be only temporary?
Yang Zhilin: It doesn't look that way for now.
Periscope: Why can't open source catch up with closed source?
Yang Zhilin: Because the open-source development mode is different from before. It used to be that everyone could contribute; now open source itself is still centralized, and many open-source contributions may never have been validated with compute. Closed source concentrates talent and capital, so in the end closed source will be better — it's a consolidation.
If I had a leading model today, open-sourcing it would most likely be irrational. It's rather the laggards who might do it, or open-source small models — to stir the pot, since keeping them closed has no value anyway.
Periscope: How do you resist the anxiety at home? People say that if a large-model company can't quickly deliver the scenarios and products that meet investors' expectations, it won't raise the next round.
Yang Zhilin: You need a balance of long term and short term. Zero users and zero revenue certainly won't do.
You can see that GPT-3.5 to GPT-4 unlocked many applications, and GPT-4 to GPT-4.5 to GPT-5 will most likely keep unlocking more, perhaps exponentially. Call it a "Moore's law of scenarios": the number of usable scenarios rises exponentially over time. We need to improve model capability while finding more scenarios — that balance.
It's a spiral. The question is how much you allocate to the short term and how much to the long. Pursue the long term while staying alive. You cannot do without the long term, or you will miss the entire era. It is truly too early to draw conclusions today.
Periscope: Do you endorse the "dual-wheel drive" proposed by Wang Huiwen (Meituan co-founder, founder of Light Years Beyond)?
Yang Zhilin: Good question. To a degree it's the same logic. But how you actually do it differs greatly. Can you really pursue some "probable non-consensus"?
Periscope: My reading of their dual-wheel drive is that it also requires quickly finding the new application scenario; otherwise you don't know where the technology lands.
Yang Zhilin: It's still the difference between model scaling and user scaling.
Periscope: Besides you, who else in China thinks in model-scaling terms?
Yang Zhilin: That I'd rather not judge.
Periscope: Most people probably think in user-scaling terms. Could we call this the divide between the academic school and the commercial-deployment school?
Yang Zhilin: We're not the academic school. The academic school absolutely does not work.
Periscope: Many large-model companies land through to-B (to-B has more certainty, after all). Do you?
Yang Zhilin: We don't. We decided on day one to do to-C.
It depends on what you want. If you know something isn't what you want, you won't feel FOMO — because even if you got it, so what.
Periscope: Have you been anxious this past year?
Yang Zhilin: More excited and thrilled, because I've thought about this for a very long time. We may have been among the earliest to want to explore the dark side of the moon. Now you find you really are building a rocket, discussing every day what fuel to add so it flies faster, and how to keep it from blowing up.
Periscope: To sum up the "probable non-consensus" decisions you've made — besides to-C and long context, any others?
Yang Zhilin: More are in the works; I hope to show them soon.
Periscope: The last generation of Chinese entrepreneurs tasted success in applications and scenarios, so they look first at product, users, and data flywheels. Can the new generation of AI founders you represent stand for a new future?
Yang Zhilin: We care deeply about users too — users are our ultimate goal, though it's also co-creation. The biggest difference is that this time it's more technology-driven. It's the horse-and-carriage versus car question again: we're in the leap from carriage to car, and we should think as hard as possible about how to hand users a car.
Periscope: Do you feel lonely?
Yang Zhilin: Haha... interesting question. Not really — there are still dozens of us, a hundred people, fighting together.
## GPT-4 hasn't been caught yet, and now Sora arrives
> "This is like the GPT-3.5 of video generation — a step-change improvement"
Periscope: How much of Sora's sudden appearance this year was within your expectations, and how much outside them?
Yang Zhilin: That generative AI could reach this level was expected; the surprise was the timing — earlier than prior estimates. It also shows how fast AI is moving: a lot of the scaling dividend has yet to be fully harvested.
Periscope: Last year the industry already predicted that 2024 would be the year large models compete on multimodality, with video generation improving as fast as text-to-image did in 2023. Did Sora's capability exceed, meet, or fall short of your expectations?
Yang Zhilin: It solved several previously hard problems. For example, maintaining generative consistency over a fairly long time window — that's the key point, a huge improvement.
Periscope: What does it mean for the global industry landscape? What new narratives will large models have in 2024?
Yang Zhilin: First, short-term application value: it can further raise efficiency in production pipelines, though I look forward even more to extensions beyond current capability. Second, combination with other modalities. It is itself a model of the world; that knowledge is an excellent complement to existing text. On that basis there's plenty of room and opportunity, whether in agents or in connecting to the physical world.
Periscope: How do you assess Sora overall?
Yang Zhilin: We had been planning a similar direction ourselves and working on it for a while. The direction held no great surprise; the surprises were in technical details.
Periscope: Which technical details should be learned from?
Yang Zhilin: Much of it OpenAI hasn't fully explained either. They gave the broad strokes, but there are key details. You have to judge from the results and the available information, combined with our own prior experiments. At minimum, for us, it adds more data points to feed into our development.
Periscope: Compared with text generation, what were the main bottlenecks in video generation, and what solutions did OpenAI find this time?
Yang Zhilin: The core bottleneck is data: how do you fit this data at scale? That hadn't been validated before — especially generating photo-realistic results when the motion is complex, and being able to scale under those conditions. That's what they solved this time.
What remains unsolved includes the need for a unified architecture. DiT is still not a very general architecture. For modeling the marginal probability of pure visual signals it can do very well, but how do you generalize it into a general-purpose new computer? That still needs a more unified architecture; there's still room there.
Periscope: Did you read OpenAI's Sora report, "Video generation models as world simulators"? What key points are worth highlighting?
Yang Zhilin: I read it. Given the competitive situation, the most important things are surely left out. But it's still worth studying. This would normally be paid content — you might have to spend money on many experiments to learn it — but now some of it you can know without running the experiments and get a rough picture.
Periscope: What key signal did you extract from it?
Yang Zhilin: That this thing is, to a degree, scalable. It also gave a fairly concrete account of how to build the architecture. Though it's also possible that different architectures don't differ that fundamentally on this problem.
Periscope: Do you endorse its line — "Scaling video generation models is a promising path towards building general purpose simulators of the physical world"?
Yang Zhilin: I strongly agree. The two optimize the same objective function; there's little doubt about it.
Periscope: What about Yann LeCun jumping out again to oppose generative AI? His view: "Modeling the world by generating pixels is wasteful and doomed to fail. Generation happens to work for text because text is discrete, with a finite number of symbols. In that case, handling uncertainty in prediction is easy; handling prediction uncertainty in high-dimensional continuous sensory input is intractable."
Yang Zhilin: My current view is that modeling the marginal probability of video is in essence lossless compression — no essential difference from a language model's next-token prediction. As long as you compress well enough, you can explain whatever in this world can be explained.
But there's also important unfinished work: how does it combine with the capabilities already compressed?
Think of it as two kinds of compression. One compresses the raw world — that's what video models do. The other compresses behavior produced by humans, because human behavior has passed through the human brain, the only thing in the world that produces intelligence. Video models do the first; text models do the second — though video models also contain some of the second, since videos people create contain their creators' intelligence.
In the end it will probably be a mix, learning from different angles through both routes, and both help intelligence grow.
So generation may not be the goal; it is merely the compression objective. If you compress well enough, the generations will come out well. Conversely, if a model cannot generate at all, could it still compress extremely well? That's doubtful. Generating very well may be a necessary condition of compressing very well.
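The prediction-equals-compression equivalence Yang invokes is concrete: by the source-coding theorem, a symbol the model assigns probability p costs -log2(p) bits to encode, so a better predictor is a better lossless compressor. A small sketch with two toy models (the models and the string are invented for illustration):

```python
import math

def code_length_bits(text, model):
    """Ideal code length of `text` in bits under a predictive model.

    `model(prefix)` returns a dict of next-character probabilities;
    each character then costs -log2(p) bits, so better prediction
    means a shorter code — the sense in which next-token prediction
    and lossless compression coincide.
    """
    bits = 0.0
    for i, ch in enumerate(text):
        p = model(text[:i]).get(ch, 1e-9)  # tiny floor for unseen chars
        bits += -math.log2(p)
    return bits

text = "abababababababab"

def uniform(prefix):
    # Knows nothing: every lowercase letter is equally likely.
    return {c: 1 / 26 for c in "abcdefghijklmnopqrstuvwxyz"}

def bigram(prefix):
    # Has learned the alternating structure: after 'a' comes 'b'.
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    return {"b": 1.0} if prefix[-1] == "a" else {"a": 1.0}

naive = code_length_bits(text, uniform)  # 16 * log2(26), about 75 bits
smart = code_length_bits(text, bigram)   # 1 bit: only the first char is uncertain
```

The model that has internalized the sequence's structure compresses it to a single bit — the toy version of "if you compress well enough, the generations come out well."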
Periscope: Sora and last year's ChatGPT are two different milestones. Which is bigger?
Yang Zhilin: Both are important. This is like the GPT-3.5 (of video generation) — a step change. The model is still fairly small; larger models are foreseeable, and the capability gains are a matter of certainty.
Periscope: Some also say that for multimodality, Google's Gemini breakthrough matters more.
Yang Zhilin: Gemini follows the GPT-4V line, folding understanding in as well. Both matter; ultimately these things need to go into one model, and that's unsolved.
Periscope: Why is putting them in one model so hard?
Yang Zhilin: Nobody knows how yet. No validated architecture exists.
Periscope: What would Sora + GPT produce?
Yang Zhilin: Sora can go straight into video production, but combined with a language model it could bridge the digital and physical worlds. You could also complete tasks more end to end, because your model of the world is better than before; it can even improve your understanding of multimodal input. So in the end you can switch fairly freely between modalities.
In sum: your understanding of the world improves, you can do more end-to-end tasks in the digital world, and you can even build a bridge to the physical world and complete tasks there. That's the starting point. Autonomous driving, say, or household chores — in theory both are instances of reaching into the physical world.
So the digital-world breakthrough is certain, and there is still the potential path to the physical.
Periscope: What does Sora mean for China's large-model companies? What's the response strategy?
Yang Zhilin: No difference — this was a certain direction all along.
Periscope: Domestic models haven't caught GPT-4 yet, and now Sora arrives. The two worlds seem to be drifting further apart. Are you anxious?
Yang Zhilin: That's just the objective fact. But the actual gap may be shrinking — that's how technology develops.
Periscope: Meaning the technology curve is steep at first, then gradually flattens?
Yang Zhilin: Yes. I wasn't very surprised — OpenAI has always been working on the next-generation model. But objectively the gap will persist for a while, and even the gaps among domestic companies will persist for a while; this is a period of technological explosion.
In two or three years, though, China's top companies may lay more of the groundwork here — technical infrastructure, talent reserves, the sedimentation of organizational culture. With that polish, there's a better chance of leading in certain respects. But it takes patience.
Periscope: Could China and the U.S. end up with completely different AI ecosystems?
Yang Zhilin: The ecosystems may differ, if you look from the product and commercialization angle. Technically, general capability will not run on completely different routes — the basic general capability will surely be about the same. But because the AGI space is vast, differentiation on top of general capability is more likely.
Periscope: Silicon Valley has a running debate: one model rules all, or many specialized (smaller) models — a single general model for every task, or many smaller models for specific tasks. What's your view?
Yang Zhilin: My view is the first.
Periscope: Will China and the U.S. diverge sharply on this point?
Yang Zhilin: Ultimately, I think not.
## I accept the probability of failure
> "It has already changed my life"
Periscope: Large-model entrepreneurship is an odd presence in China. You've raised so much money, yet a large share seems destined for scientific experiments. Under those conditions, how do you persuade investors to pay?
Yang Zhilin: No different from in the U.S. The money we've raised so far isn't even that much. So we still have more to learn from OpenAI.
Periscope: I want to know how much more money it takes to reach GPT-4. And to reach Sora?
Yang Zhilin: Neither GPT-4 nor Sora needs that much. The money now is mostly a reserve for the next generation, even the generation after that, and for frontier exploration.
Periscope: Chinese large-model startups take the giants' money, yet the giants also train their own models. How do you see the relationship between startups and giants?
Yang Zhilin: There's competition and there's cooperation. Giants and startups have different first goals. Look at each big company's first goal today versus an AGI company's. The first goal shapes actions and outcomes, and ultimately they hold different relationships within the ecosystem.
Periscope: Why do the giants put a little money into several large-model companies rather than placing a big bet on one?
Yang Zhilin: A matter of stage. There will be more consolidation ahead, with fewer companies.
Periscope: Some say the endgame for large-model companies is acquisition by giants. Do you agree?
Yang Zhilin: Not necessarily, but they may well form very deep cooperative relationships.
Periscope: For instance, how?
Yang Zhilin: OpenAI and Microsoft are the classic model of cooperation — much there to learn from, and some things to optimize.
Periscope: Looking back at the past year, where did the twists and turns of the venture show up?
Yang Zhilin: Many external variables — capital, talent, GPUs, product, R&D, technology. There were highlight moments and difficulties to overcome. Take GPUs.
There was a lot of back and forth. Tight for a while, then supply improved. At the most extreme, prices changed daily: a machine at 260 today, 340 tomorrow, back down a couple of days later — a dynamic process you had to watch closely. As prices kept changing, the strategy had to keep changing too: which channel, buy or rent — many different choices.
Periscope: What drives that dynamic?
Yang Zhilin: There are geopolitical reasons; production itself comes in batches; and market sentiment shifts. We observed many companies start returning GPUs after realizing they didn't actually need to train the model. As sentiment and decisions change, supply and demand change with them. The good news is that overall market supply has improved a great deal recently. My personal judgment is that for at least the next one to two years, GPUs won't be a major bottleneck.
Periscope: You seem to think constantly about organization. How have you built the team?
Yang Zhilin: The hiring approach has shifted over time. The world's AGI talent is very limited, and experienced people are few. Our earliest profile was to focus on finding geniuses in exactly the right areas. That proved very successful. People with hands-on ability to operate on models, with direct experience training ultra-large models, could deliver fast — including the Kimi launch, where capital efficiency and organizational efficiency were in fact high.
Periscope: How much did it cost?
Yang Zhilin: A rather small number; compared with much other spending, it was small money doing big things. For a long stretch we were 30-40 people. Now we're 80. We pursue talent density.
The talent profile later changed. Early on we hired geniuses for their high ceiling — a company's ceiling is set by its people's ceilings. Later we filled out more dimensions: product and operations people, leader types, people who push things to the extreme. Now it's a more complete, resilient, battle-ready team.
Periscope: After a year of large-model entrepreneurship in China, how do you rate your interim results?
Yang Zhilin: We've built a rocket prototype and are now igniting it for test flight. We've accumulated a team, worked out some fuel formulas, and can more or less see the embryo of a PMF.
You could say the moon landing has taken its first step.
Periscope: What of Yann LeCun's claim that he doesn't favor the current technical route — that self-supervised language models can't acquire true knowledge of the world, and that as models scale, the probability of error, i.e. machine hallucination, keeps rising? He has proposed the idea of a "world model."
Yang Zhilin: There is no essential bottleneck. Once the token space is large enough and it becomes a new kind of computer solving the generality problem, that's it — it is a general world model.
The important thing (in his saying so) is that everyone can see the current limitations. But the solution doesn't necessarily need an entirely new framework. The only thing that works in AI is next-token prediction plus the scaling law; as long as the tokens are complete enough, everything is doable. The problems he points to exist today, but they are solved by making the token space very general.
Periscope: He's magnifying the limitations.
Yang Zhilin: I think so. The underlying first principle is fine; there are just some small technical problems unsolved right now.
Periscope: What about Geoffrey Hinton (the father of deep learning) repeatedly sounding the alarm on AI safety?
Yang Zhilin: The safety concern actually shows he has enormous confidence in the coming gains in capability. The two go together, not against each other.
Periscope: How do you solve hallucination?
Yang Zhilin: Still the scaling law — just scaling something different.
Periscope: What's the probability that the scaling law turns out, at the very end, to be a dead end?
Yang Zhilin: Probably about zero.
Periscope: What about the view of your CMU schoolmate Lu Qi: OpenAI will surely be bigger than Google in the future; the only question is whether by 2x, 5x, or 10x.
Yang Zhilin: The most successful AGI company of the future will surely be bigger than every company today. No doubt there — ultimately it may be a matter of doubling or tripling GDP. It won't necessarily be OpenAI; it could be another company, but such a company will certainly exist.
Periscope: If you happened to become CEO of that AI empire, what would you do to protect humanity?
Yang Zhilin: Some preconditions for thinking about that question are still missing. But we would certainly be willing to cooperate with and improve alongside the different roles in society, including putting more safety measures into the models.
Periscope: What are your goals for 2024?
Yang Zhilin: First, technical breakthroughs — we should now be able to build models much better than 2023's. Second, users and product — I hope for more users at scale, with stickiness.
Periscope: Predictions for the global large-model industry in 2024?
Yang Zhilin: More capability will emerge this year, but the landscape won't differ much from today's — the top few will stay ahead. On capability there should be some big breakthroughs in the second half, many from OpenAI; it surely has a next-generation model — maybe 4.5, maybe 5 — that feels highly probable. Video generation models can certainly keep scaling.
Periscope: Predictions for China's large-model industry in 2024?
Yang Zhilin: One, new unique capabilities will emerge: you'll see domestic models, thanks to earlier investment and the right teams, reach world-leading capability along certain dimensions. Two, more products with larger user bases will appear — highly probable. Three, further consolidation and divergence of route choices.
Periscope: What's the one thing you fear most in this venture?
Yang Zhilin: It's fine — you just charge forward without fear.
Periscope: Anything to say to your peers?
Yang Zhilin: Let's work hard together.
Periscope: Name one question about the large-model industry you don't yet know the answer to but most want to.
Yang Zhilin: I don't know what AGI's ceiling looks like — what kind of company it will produce, and what products that company will produce. That's what I most want to know right now.
Periscope: As AGI develops, what's the one thing you least want to see?
Yang Zhilin: I'm fairly optimistic about this; it can carry human civilization to its next stage.
Periscope: Has anyone called you too idealistic?
Yang Zhilin: We're very down-to-earth too. We really have done some things — we're not just talking.
Periscope: If the money you have today were the last money you'll ever raise, how would you spend it?
Yang Zhilin: I hope that never happens, because we'll need a lot more money in the future.
Periscope: What would you have to fail at to consider yourself a failure?
Yang Zhilin: It doesn't matter that much. I accept the probability of failure.
This thing has already completely changed my life, and I am full of gratitude.
—————— End ——————
----
<https://k.sina.cn/article_1642720480_61e9ece00270195by.html>
# A conversation with Moonshot founder Yang Zhilin: AI's most important future capabilities are thinking and interaction
Published on the 爱范儿 (Aifaner) official account, 2024-11-19
Published from: Guangdong
Moonshot AI and Yang Zhilin are probably the most closely watched domestic large-model company and founder of late; an arbitration dispute and the retrenchment of its overseas products have put them in the eye of the storm.
The more important reason, of course, is that Moonshot's Kimi is a leading domestic AI application, with monthly active users now exceeding 36 million.
On the first anniversary of Kimi Chat's full public launch, Kimi officially released its new-generation mathematical reasoning model k0-math, benchmarked against OpenAI's o1 series.
Moonshot founder Yang Zhilin believes the scenario best suited to exercising AI's thinking ability is mathematics. Introducing k0-math to APPSO and other media, he quoted Galileo:
If you regard the universe as a great book, the universe is in fact written in mathematics; mathematics is the language in which the universe is expressed.
Benchmark tests show Kimi k0-math's mathematical ability is comparable to the two publicly available models of OpenAI's world-leading o1 series: o1-mini and o1-preview.
On four math benchmarks — the zhongkao (high-school entrance exam), gaokao (college entrance exam), kaoyan (graduate entrance exam), and MATH, which includes entry-level competition problems — the first-generation k0-math model outscored o1-mini and o1-preview.
On two harder, competition-level problem sets, the OMNI-MATH and AIME benchmarks, the first-generation k0-math model reached 90% and 83% of o1-mini's best scores, respectively.
Yang Zhilin demonstrated some of k0-math's problem-solving for us. Facing a very hard competition problem, for instance, it can try extensively — perhaps eight or nine different approaches, none of which reaches a final answer.
But after many attempts it will suddenly realize it can combine two or three of the earlier ideas to arrive at a correct answer.
To give the AI deep-thinking ability, k0-math was not pre-loaded with many templates. Yang Zhilin wants the AI to derive its own ways of thinking as it learns — different for each problem, with extensive reflection and verification along the way.
k0-math still has limitations, though. Ask it something as simple as 1+1 and it over-thinks. Its answer goes roughly like this:
It says the problem looks simple but you mustn't let your guard down, so it suspects a trap and starts analyzing, even doing a visualization. It goes so far as to analogize with two apples.
Not enough — it checks again: if it holds for apples, what about hours — one hour plus one hour makes two hours. It confirms many times over and finally says OK, we can confirm 1+1=2.
How well it actually works, you'll only know by using it. Yang Zhilin revealed that the k0-math model and the more powerful Kimi Explorer edition will roll out in batches to the Kimi web version and the Kimi assistant app over the next few weeks, to help people tackle more challenging math and search-research tasks.
"We also hope, through stronger reasoning — because I think the most important capability for AI products and AI technology going forward is deeper reasoning — to turn today's short-chain, simple Q&A into longer-chain, compositional task operations."
APPSO, the AI media outlet under Aifaner, was invited to this Kimi briefing and put questions to Yang Zhilin about the company and its products. Below is a partial transcript of the exchange:
Q: How do you view AI startups being acquired and talent flowing back to the big companies? Have you recently lost talent?
Yang Zhilin: We have had no talent loss.
We haven't encountered this problem, though some other companies may have. The industry has entered a new stage: it started with many companies doing this, and now fewer are. Next, what people do will gradually diverge. I think that's an inevitable pattern.
In fact, we proactively chose to subtract from our business lines. Among the several large-model startups, we have always kept the smallest headcount and the highest ratio of GPUs to people, which I think is crucial.
We don't want to expand the team too much — expansion deals a fatal blow to innovation. If you want to keep the team at a certain size, the best way is really to subtract from the business.
We did initially try doing several products at once, and for a period that can work, but later we found we had to focus: making one product the best it can be matters most.
Cutting business lines is essentially also headcount control; you don't want headcount to balloon. If we did three businesses at once now, I'd be turning us into a big company with my own hands, and we'd have no advantage at all.
Q: When did the idea of focusing on Kimi (trimming the product line) first appear? What made you rethink and redeploy?
Yang Zhilin: Around February or March this year. Partly a judgment based on the U.S. market, partly our own observation — mainly those two. Also, doing things well really does require subtraction, not frantic addition.
Q: What do you see as your core task right now?
Yang Zhilin: The core task is improving retention, or taking retention as a key metric. I think it basically correlates positively with the maturity or level of your technology. So it's the most important thing for us right now, and I think there's still a lot of room to improve.
Q: What retention level would satisfy you?
Yang Zhilin: There's no end to it.
Q: Since o1 shipped, people feel deep reasoning — including the math model you announced today — is remote from ordinary users. How do you see the relationship between this capability and users?
Yang Zhilin: It isn't actually remote. For math I see two kinds of value. First, it already has great value in education products today, and it plays a big role in our overall traffic.
Second, it is a technical iteration and validation, and we can put the technology into more scenarios — for example, the Explorer edition we just mentioned doing lots of search. So it carries those two layers of meaning.
Q: Word is Sora is about to ship. Why have you never done multimodality?
Yang Zhilin: We do — several of our multimodal capabilities are in internal testing. I think the two most important capabilities for AI going forward are thinking and interaction.
Thinking matters far more than interaction. Not that interaction is unimportant — thinking determines the ceiling, while interaction is a necessary condition. Take vision: without vision you can't interact. So the two differ in kind. Look at how hard it is to label the task you want done: do you need a PhD to label it, or can anyone? Whichever kind of person is harder to find — that is the ceiling of AI.
So multimodality is certainly necessary, but I think thinking determines the ceiling.
Q: How do you view the competition between Kimi and Doubao?
Yang Zhilin: I'd rather focus on how to give users real value. I don't want us to pay too much attention to competition itself, because competition itself produces no value.
How to deliver better technology and products — that's our core question now. We'll focus on improving the model's thinking and reasoning ability and deliver greater value to users through it. We should do the right thing, not deliberately do a different thing.
I believe that whoever achieves AGI, it's a very good outcome.
Q: When will AI's super app appear?
Yang Zhilin: ChatGPT's monthly actives already exceed 500 million. Is it a super app? At least half of one. With 500 million people using it every month, the question has largely been answered.
Q: What do you make of the recent discussion that large-model pretraining has hit a bottleneck — has the scaling law hit a wall?
Yang Zhilin: I think pretraining still has room — half a generation to one generation of models. That room will be released next year; next year's leading models will push pretraining to a fairly extreme stage. Looking at today's best models, that's roughly how much there is left to squeeze.
But our judgment is that the next focal point will be reinforcement learning — the paradigm will shift somewhat. It's still scaling; it's not that you stop scaling, only that you scale through different means. That's our judgment.
As to whether the scaling law is a ceiling or an upper bound, I'm relatively optimistic about that. The core point: you used to use static datasets, which are a fairly crude, simple way to use data. With reinforcement learning, in many cases a human is in the loop, but a human can't label that much data for you — you can't annotate the full chain of thought for every problem. So you use the AI itself to put a lever on human effort.
Say you label 100 examples: they can have an outsized effect, because for the rest the model is thinking on its own. I think this is largely how it will be done, so the ceiling is high.
Q: How far are we from AGI?
Yang Zhilin: I think we're still at an early stage. Of course every year brings big advances; if we used last year's product today, we'd probably find it unbearable.
But a lot is still missing. As I said, the thinking ability isn't strong enough and the interaction isn't rich enough, so the interactions it can do today are limited. Interaction spans two dimensions — with the user, and with the objective world itself — and I think both have a lot of room to improve.
----
* 月之暗面 - Moonshot AI
* 杨植麟 - Yang Zhilin
* 数据瓶颈 - data bottleneck
* 多模态 - multimodal
* 合成数据 - synthetic data
* 计算范式 - compute paradigm
* 算力利用率 - compute utilization rate
* 同质化内卷 - involution among equalized competitors
* 效用主义 - utilitarianism
* 数据飞轮 - data flywheel
* 推理成本 - inference cost
* 可解释性 - explainability
* 微调 - finetune
* 白月光 - "white moonlight" [something that is wished for, but not achievable]
* 破圈 - breakout [from the small circle of AI insiders, to the general public]
* 竞争壁垒 - competitive moat
* 雕花 - "detailed woodcarving" [Spending a lot of time to make a small thing better. The opposite of the Bitter Lesson.]
* 通用模型 - generalist model
* 入口逻辑 - being the user-ingest side of a product chain
* 信息入口 - information ingest
* 马太效应 - Matthew effect
* 头部效应 - the Superstar premium [The No. 1, even if only slightly better than No. 2, is way more popular than No. 2.]
* 卡 - GPUs [as a capital investment]
* 可规模化 - make it scalable
* 功利主义 - utilitarianism
* 技术理想主义 - techno-idealism
* 差异化 - differentiation
* 辨识度 - recognizability
* 刷榜 - Goodhart that benchmark ranking
* 卷 - involute
* 阶跃式提升 - gapping up
* to B - to business
* to C - to customer
* 北极星指标 - guiding northstar
* 边际概率 - marginal probability
* 无损压缩 - lossless compression
* 端到端 - end-to-end
* 巨头 - incumbent megacorp
* 创业公司 - startup
* 人才画像 - talent profile
* 退卡 - sell off GPUs
* PMF - Product-Market Fit
* 第一性原理 - First Principles
* 幻觉 - hallucination
* 一起努力 - Ganbatte
* 留存 - user retention
* 静态数据集 - static dataset [not updated. Similar to ImageNet. A dynamically updated dataset would be used in, for example, DAgger (Dataset Aggregation), and replay buffers.]
* 海外独角兽 - Overseas Unicorn
* 腾讯新闻《潜望》 - Tencent News "Periscope"
* 豆包 - Doubao AI
* 爱范儿 - Aifaner