Spotlight on DeepSeek: China's AI Research Lab
Following the virality of Chinese research lab DeepSeek's latest models, V3 and R1, ripple effects have spread throughout the technology ecosystem in both the public and private markets. Given the volume of inbound questions from investors, we have summarized the latest developments and our current thinking below.
In December 2024, DeepSeek released V3, a 671B-parameter sparse mixture-of-experts (MoE)[1] model that activates only 37B parameters per token. What does that mean?
It means DeepSeek was able to roughly match the capability of models trained on clusters of ~100k GPUs (roughly 11x more compute than DeepSeek used). Following the release of V3, the company launched its chatbot application in the Apple App Store on January 13th. Given that timeline, the US AI research community has been aware of many of these breakthroughs for over a month, and the DeepSeek chatbot has been downloadable in the US for nearly two weeks. These points are important because:
a) DeepSeek is not a new player in the game; they have been building open-source models since 2023, and
b) Many AI research labs already test reinforcement learning ("RL") techniques and fine-tuning approaches using DeepSeek's open-source models.
Last week, DeepSeek released R1, a reasoning model that nearly matches the performance of OpenAI's o1 while being dramatically more cost-effective. Notably, R1 was trained using pure RL on synthetic data generated by R1-Zero, a model trained without human-supervised fine-tuning. While this gets technical, the research implies clear breakthroughs in model distillation for reasoning models. In other words, DeepSeek could take a very complex and capital-intensive model from OpenAI and/or Anthropic and spend a fraction of the cost to build their own model that's nearly as good.
The result? V3 outperforms Meta's Llama models and matches the capabilities of GPT-4 and Anthropic's Claude models as they stand today.
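To make the distillation point above concrete, below is a minimal sketch of sequence-level distillation: a stronger "teacher" model generates answers, and a smaller "student" model is fine-tuned on those outputs with a standard language-modeling loss. This is a generic illustration using placeholder model names, not DeepSeek's actual pipeline.

```python
# Minimal sketch of sequence-level knowledge distillation (illustrative only).
# A strong "teacher" generates answers to prompts; a smaller "student" is then
# fine-tuned on those (prompt, answer) pairs with a next-token loss.
# Model names are placeholders, not DeepSeek's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "large-teacher-model"   # placeholder for a frontier model
student_name = "small-student-model"   # placeholder for a cheaper model

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student_tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Explain why the sky is blue.", "Solve: 12 * 17 = ?"]

# Step 1: the teacher generates synthetic training data.
synthetic_texts = []
for prompt in prompts:
    inputs = teacher_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = teacher.generate(**inputs, max_new_tokens=256)
    answer = teacher_tok.decode(out[0], skip_special_tokens=True)
    synthetic_texts.append(prompt + "\n" + answer)

# Step 2: the student is fine-tuned on the teacher's outputs.
student.train()
for text in synthetic_texts:
    batch = student_tok(text, return_tensors="pt", truncation=True)
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```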
DeepSeek claims that training V3 cost only $5.6M, orders of magnitude less than the estimated budgets for GPT-4 or Claude. However, it's critical to note that this figure captures only the final training run, not the extensive R&D costs incurred to develop the novel architectures and training procedures that enabled this efficiency. While the exact amount is unknown, that work likely involved many millions in additional investment. Multiple reports and informal analyses also suggest the lab has access to roughly 50K GPUs across a variety of SKUs. It's impossible to know the actual cluster size and costs, but what is clear is that they were forced to rely on a combination of last-generation and newer-generation GPUs to achieve this outcome because sanctions restrict their access to the latest H100 chips.
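For context on where a headline number like $5.6M comes from, the arithmetic below follows DeepSeek's own reported accounting for V3 (roughly 2.79M H800 GPU-hours priced at an assumed $2 per GPU-hour rental rate); hardware purchases and prior R&D are excluded, as noted above.

```python
# Back-of-envelope check on the reported V3 final-training-run cost.
# Figures follow DeepSeek's own accounting: ~2.788M H800 GPU-hours priced
# at an assumed $2 per GPU-hour rental rate; R&D and hardware excluded.
gpu_hours = 2.788e6          # reported H800 GPU-hours for the final run
cost_per_gpu_hour = 2.0      # assumed rental price, USD
training_cost = gpu_hours * cost_per_gpu_hour
print(f"Estimated final-run cost: ${training_cost / 1e6:.2f}M")  # ~= $5.58M
```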
Putting the cost debate aside, DeepSeek's breakthroughs in cost efficiency are extremely impressive. On the training side, they have pioneered MoE architecture refinements that dramatically reduce the number of parameters active for any given token, cutting the compute required to train the model. For inference, the models can be run on consumer hardware like Apple's M2 Ultra, making capable AI potentially accessible without high-end Nvidia GPUs.
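As a rough illustration of why sparse activation matters for cost, the back-of-envelope sketch below uses the standard approximation that a transformer forward pass costs about 2 FLOPs per active parameter per token; the resulting ~18x figure is an estimate under that assumption, not a measured benchmark.

```python
# Back-of-envelope: why activating 37B of 671B parameters matters for inference.
# Standard approximation: a transformer forward pass costs ~2 FLOPs per active
# parameter per token. These are estimates, not measured numbers.
total_params = 671e9
active_params = 37e9

flops_per_token_dense = 2 * total_params   # if every parameter were used per token
flops_per_token_moe = 2 * active_params    # sparse MoE: only routed experts run

print(f"Dense per-token compute : {flops_per_token_dense / 1e9:.0f} GFLOPs")
print(f"MoE per-token compute   : {flops_per_token_moe / 1e9:.0f} GFLOPs")
print(f"Reduction               : {flops_per_token_dense / flops_per_token_moe:.1f}x")
```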
Perhaps most impressive is how DeepSeek's R1 model develops its reasoning capabilities through pure reinforcement learning, without expensive human feedback. After training a base model to convergence with RL, DeepSeek trains the final model by combining synthetic data generated by that RL-trained model with a small amount of curated data to refine its outputs.
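To illustrate what "RL without expensive human feedback" can look like, the sketch below shows a rule-based reward in the spirit of what DeepSeek describes: the model is scored programmatically on whether its output follows the expected format and whether the final answer is correct, with no learned human-preference model. The tags and reward weights here are hypothetical, not DeepSeek's exact recipe.

```python
# Illustrative rule-based reward for reasoning RL (no human preference model).
# The <think>/<answer> tags and the reward weights are hypothetical; DeepSeek's
# published approach uses accuracy and format rewards in a similar spirit.
import re

def reward(model_output: str, ground_truth_answer: str) -> float:
    score = 0.0
    # Format reward: the model should show its reasoning, then a final answer.
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", model_output, re.S):
        score += 0.2
    # Accuracy reward: compare the extracted final answer to the known solution.
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.S)
    if match and match.group(1).strip() == ground_truth_answer.strip():
        score += 1.0
    return score

# Example: a correct, well-formatted completion earns the full reward.
sample = "<think>12 * 17 = 204</think> <answer>204</answer>"
print(reward(sample, "204"))  # 1.2
```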
Again, these leaps build on the work of OpenAI and Anthropic: DeepSeek appears to have leveraged model distillation to learn from those labs' models despite their efforts to restrict access. This demonstrates how difficult it is to preserve competitive moats against foreign adversaries as the diffusion of AI capabilities accelerates (a political and regulatory discussion for the new administration).
The release of V3 and R1 has five major ramifications that are being digested today:
1. It puts significant pressure on Nvidia by revealing a potential path to cutting-edge AI that does not depend on expensive, high-margin Nvidia GPUs. If model efficiency keeps increasing, pre-training costs could decrease (fewer GPUs needed), putting more pressure on Nvidia to win in inference. Based on some of the breakthroughs demonstrated by DeepSeek, there could be paths where CUDA isn't as strong a moat for inference workloads. Our work to date shows overwhelming demand for Nvidia on the inference side, but these question marks make the debate more contested today than it was yesterday.
2. The emergence of alternative training and inference approaches could dampen GPU demand. That said, recent headlines from Microsoft (~$80B in planned AI data center capex), Meta (~$65B in planned capex), China (a reported ~$150B capex commitment), and the Stargate project ($500B) clearly push back on that view. Nvidia also has yet to ship its latest Blackwell chips, meaning the gains from its newest technology have not yet been realized. We continue to see Nvidia's networking technology (NVLink) as best in class, which points positively to inference demand and market positioning going forward.
3. It narrows, at least temporarily, the moat of leading Western labs like OpenAI and Anthropic. With distilled models of comparable quality emerging, their differentiation will depend more on proprietary data, fine-tuning, and product than on model architecture. It's also worth remembering that Anthropic's latest Sonnet model was first released roughly six months ago.
Anthropic's enterprise solution isn't simply a self-serve API; it provides safety, security, customizability, and customer support to some of the largest enterprises in the world. Given how high-touch that offering is, we are in the camp that the narrative of LLMs as a homogeneous commodity is overbaked. As enterprises have built relationships with Anthropic over the last year, the company has invested across sales, marketing, and customer support to drive further entrenchment. This bottoms-up buildout is reminiscent of cloud infrastructure in the 2010s.
4. It highlights how US export controls on AI chips to China may have backfired. By constraining Chinese researchers, the US incentivized them to dramatically improve efficiency to enable powerful AI without high-end hardware. This seems to have inadvertently spurred China's domestic AI ecosystem.
5. DeepSeek's achievements herald a new era of global competition in AI. With the cost of training and inference showing continued signs of compression, we expect a further explosion of new companies and broader adoption. It's important to note that this is a net positive: the biggest winners are the consumers and businesses who can anticipate a future of effectively free AI products and services. Jevons Paradox will rule the day in the long run, and everyone who uses AI stands to benefit.
As investors, we are constantly re-examining our assumptions on the defensibility of AI models, the demand for high-performance AI chips, and the geopolitical balance of power in AI technology. The value will accrue to companies that can apply AI to solve customer problems rather than those focused on training frontier models. The key differentiators will be superior data, domain expertise, and customer access.
We will continue monitoring this space closely to identify the next generation of winners. As always, please feel free to reach out with any questions.
[1] A mixture of experts ("MoE") is an approach used in machine learning, particularly in large language models, to efficiently scale up the model size while keeping computational costs manageable. In an MoE architecture, the model is divided into several submodels called "experts." Each expert specializes in a particular subset of the task or data. When an input is given to the model, a gating network determines which experts are most relevant for processing that input. The input is then routed to those selected experts, and their outputs are combined to produce the final output.
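For readers who want to see the routing idea in code, here is a toy top-k MoE layer that mirrors the description above: a gating network scores the experts for each token, only the highest-scoring experts are run, and their outputs are combined. Dimensions, expert count, and top-k are arbitrary illustration values, not DeepSeek's actual architecture.

```python
# Toy mixture-of-experts ("MoE") layer: a gating network scores the experts for
# each token, only the top-k experts are run, and their outputs are combined
# using the normalized gate weights. Sizes are arbitrary illustration values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)           # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: [tokens, d_model]
        scores = self.gate(x)                               # [tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)      # keep top-k experts
        weights = F.softmax(weights, dim=-1)                # normalize their weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

tokens = torch.randn(10, 64)                                # 10 tokens, d_model=64
print(ToyMoE()(tokens).shape)                               # torch.Size([10, 64])
```

Because only top_k of n_experts run for each token, per-token compute scales with the active experts rather than the full parameter count, which is the efficiency property discussed above.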