Introduction
The release of DeepSeek R1 marks a pivotal moment in the history of artificial intelligence (AI). Unlike its predecessors, DeepSeek R1 challenges the status quo by adopting an open-weight model that breaks away from proprietary restrictions. This shift is comparable to major milestones in the software industry, such as the release of Linux and the push for Firefox as an alternative browser. These historical events allowed control of critical technologies to shift from centralized corporations to the broader community impacted by these technologies.
With this release, DeepSeek R1 marks a huge moment in democratizing AI, making it more accessible, affordable, and open for anyone to innovate with LLMs. This transformation is particularly important for smaller markets like Denmark, where businesses and researchers can leverage these advancements for local solutions, and it highlights the potential for Danish companies to build upon this monumental work to apply AI in transformative ways.
“At the core of DeepSeek R1 lie several architectural
and conceptual breakthroughs that redefine AI efficiency.”
Technical Innovations Powering DeepSeek R1
At the core of DeepSeek R1 lie several architectural and conceptual breakthroughs that redefine AI efficiency. A full technical breakdown is beyond the scope of this article, yet understanding the crucial developments that led to this moment is key to understanding the trends of the near future.
Where did DeepSeek come from?
The company behind DeepSeek, High-Flyer, is a hedge fund that began trading in 2016 after several years of discretionary trading experience by its co-founder Liang Wenfeng. In 2023, DeepSeek was created as a spin-off company after years of innovation in applying AI to the hedge-fund trading environment. The company is responsible for key innovations at the neural architecture level, as well as for a controversial approach to training its AI that has upended what Silicon Valley entrepreneurs have touted as “bleeding-edge” progress.
What did DeepSeek do differently?
The crucial innovation that DeepSeek fielded is the implementation of multi-head latent attention. This was a small change in how transformer-based neural networks, the key component behind technologies like ChatGPT and Google’s Gemini, compute which tokens (i.e. words) to generate in response to a prompt.
Multi-head latent attention changes how the KV cache (the LLM’s “working memory”) is stored: instead of storing its contents explicitly, the content is compressed into a “latent” vector, similar to how a hologram (a 3D photo) appears whole but does not store all the information that makes the original object whole. This lowers the storage requirements for the “working memory”.
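To make the principle concrete, here is a minimal, illustrative sketch in Python of what latent KV compression looks like. This is not DeepSeek’s actual implementation: the class name, dimensions, and projection layers are assumptions chosen purely to show the idea of caching a small latent vector and reconstructing keys and values from it on demand.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Illustrative sketch of latent KV compression (not DeepSeek's code).

    Instead of caching full per-head keys and values, we cache one small
    latent vector per token and reconstruct K/V from it when needed."""

    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # rebuild keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # rebuild values

    def compress(self, hidden):
        # hidden: (batch, seq, d_model) -> cache only (batch, seq, d_latent)
        return self.down(hidden)

    def expand(self, latent):
        # Recreate per-head keys and values on demand from the compact cache.
        return self.up_k(latent), self.up_v(latent)

x = torch.randn(1, 10, 4096)   # hidden states for 10 tokens
cache = LatentKVCache()
latent = cache.compress(x)     # 512 floats per token instead of 2 * 32 * 128 = 8192
k, v = cache.expand(latent)
```

In this toy configuration the “working memory” per token shrinks by a factor of sixteen, which is the kind of saving that makes long contexts affordable.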
The other innovation of DeepSeek R1’s development is its creative approach to avoiding the usual, pricier way of training LLMs. Historically, LLMs are trained in the following manner (a schematic sketch follows the list):
- Pretraining – Learning language structure and basic knowledge in a self-supervised manner by learning to predict the next token (i.e. word) in a sequence
- Supervised Fine-Tuning (SFT) – The model is refined with high-quality human-labeled data
- Reward Model Training – Training a separate model to predict human preferences
- Reinforcement Learning (RL) + Human Feedback (RLHF) – Further refining the model by rewarding it for generating tokens that humans would prefer
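For readers who think in code, the following runnable sketch mirrors those four stages. Every function here is a placeholder standing in for an entire training stage, invented purely for illustration; none of them is a real library API.

```python
def pretrain_next_token(corpus):
    """Stage 1: self-supervised next-token prediction over raw text."""
    return {"stage": "pretrained"}

def supervised_finetune(model, labeled_examples):
    """Stage 2: refine the model on high-quality, human-labeled responses."""
    return {**model, "stage": "sft"}

def train_reward_model(preference_pairs):
    """Stage 3: a separate model learns to predict which outputs humans prefer."""
    return {"stage": "reward-model"}

def rlhf(model, reward_model):
    """Stage 4: reinforcement learning (classically PPO) rewards the model
    for generating tokens the reward model scores highly."""
    return {**model, "stage": "rlhf"}

model = rlhf(
    supervised_finetune(pretrain_next_token("raw text corpus"), labeled_examples=[]),
    train_reward_model(preference_pairs=[]),
)
```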
“In order to achieve reasoning capabilities without the need for specific datasets,
DeepSeek has innovated a new approach to training LLMs in situations where
human-labeled data might be expensive to obtain …”
These are the usual key elements of training LLMs as we know them today. However, in order to achieve reasoning capabilities without the need for specific datasets, DeepSeek has innovated a new approach to training LLMs in situations where human-labeled data might be expensive to obtain: Group Relative Policy Optimization (GRPO).
GRPO was first applied to a model trained specifically to advance mathematical reasoning. The discovery that this approach can lead to an emergence of reasoning capabilities without human instruction was first made during the training of R1-Zero, a predecessor of the R1 we know. By forgoing the SFT step and relying purely on reinforcement learning, the model evolved the ability to reason independently, something that had not been done before. Through careful design of the reward policy for the reinforcement learning algorithm, the training of DeepSeek R1 benefitted from R1-Zero and allowed reasoning capabilities similar to those of OpenAI’s o1 and o3 to emerge in the published version of R1.
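The core trick of GRPO is simple enough to sketch: sample a group of answers for each prompt, score them with a cheap rule-based reward (for math, “is the final answer correct?”), and rank every answer against its own group rather than training a separate value network. The snippet below is a minimal illustration of that group-relative scoring, not DeepSeek’s training code; the full algorithm also weights token log-probabilities by these advantages and adds a KL penalty against a reference model.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative scoring at the heart of GRPO: each sampled answer is
    judged against the other answers in its group, removing the need for a
    learned value function. rewards: (n_groups, group_size) tensor."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Toy example: one prompt, four sampled answers, only the last two correct.
rewards = torch.tensor([[0.0, 0.0, 1.0, 1.0]])
print(grpo_advantages(rewards))
# Correct answers get a positive advantage, incorrect ones a negative one,
# so the policy update pushes probability mass towards reasoning paths that
# end in correct answers -- no human-labeled data required.
```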
The Distillation Controversy
Beyond its novel architecture, DeepSeek R1’s training approach has stirred debate. DeepSeek may have leveraged a technique called Model Distillation, where knowledge from an existing AI system is transferred into a new model using a “teacher-student” framework. This method allows a team to generate synthetic training data from an existing model and use it to train a new system at a fraction of the cost.
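In its classic formulation (Hinton et al.), distillation trains the student to match the teacher’s softened output distribution; when only an API is available, the same teacher-student idea is applied at the text level by fine-tuning the student on the teacher’s generated answers. The sketch below shows the classic logit-matching loss purely as an illustration of the concept, not as DeepSeek’s method.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Teacher-student distillation: KL divergence between the teacher's and
    the student's softened output distributions, scaled by T^2 as in the
    original formulation."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 2 positions over a 5-token vocabulary.
teacher_logits = torch.randn(2, 5)
student_logits = torch.randn(2, 5, requires_grad=True)
distillation_loss(student_logits, teacher_logits).backward()
```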
[OpenAI has suggested](https://www.inc.com/ben-sherry/openai-seems-concerned-that-deepseek-copied-their-work/91140698?t) that DeepSeek may have used their API for distillation purposes, though independent verification remains difficult. If true, this would raise ethical and competitive concerns, as it challenges the boundaries of intellectual property in AI development – which is in itself ironic, considering how most of the currently leading AI services have been trained on massive amounts of publicly accessible but not licensed data.
In a nutshell: DeepSeek got creative in the pursuit of knowledge for its models, distilling the knowledge of other models as well as employing a much more efficient attention mechanism. These two novelties were crowned with the creative innovation of the reward policy, which allowed R1 to “self-evolve” sophisticated reasoning patterns without being constrained by predefined supervised data. As a result, the model’s structure – represented by its weights – became inherently more efficient. Together, these effects form DeepSeek R1’s advantage.
Global Implications
DeepSeek R1’s open-weight release disrupts many preconceived notions about the course the evolution of AI will take. Glass ceiling after glass ceiling has been broken with the use of more computing power, both in training and now during inference. The leading U.S. tech giants, such as OpenAI, Google, Nvidia, and Microsoft, have made significant investment announcements with respect to the data centers they plan to build to serve the whole world with AI. These companies have aimed to control access to advanced AI by offering proprietary models that require financial and computational resources only they can muster. By challenging this oligopoly, DeepSeek R1 democratizes access to cutting-edge AI, allowing a broader spectrum of users to innovate without depending on expensive API access or restrictive licensing agreements. How exactly will that come about?
The key element is cost: the cost of training and the cost of inference, i.e. running the AI. The cost of training was already exorbitant for GPT-3 and has only gone up since. If the cost, and thus the effort, of training can be significantly reduced, training advanced models comes within reach of average businesses and their service providers, which in turn enables use-cases that are not well captured in the generalized datasets that global AI companies rely on.
What good is a cheap car if it needs 40 liters per 100 km? Not much. The same holds true for operating language models: the cheaper the price per prompt and per output token (running the model is called inference), the more experimentation is possible. The innovations outlined above have dropped the cost of inference so low that running R1 on a high-end desktop computer is possible.
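A quick back-of-envelope calculation shows why that matters. The prices below are invented for illustration only, not actual vendor rates:

```python
# Hypothetical prices per million tokens -- illustrative, not real quotes.
proprietary_api = 15.00   # USD, flagship proprietary model via API
self_hosted = 0.50        # USD, open-weight model on commodity hardware

tokens_per_experiment = 2_000_000  # prompts plus generated output for one test run

for name, price in [("proprietary API", proprietary_api), ("self-hosted", self_hosted)]:
    print(f"{name}: ${price * tokens_per_experiment / 1e6:.2f} per experiment")
# At a ~30x price gap, the same budget buys ~30x more experiments --
# which is exactly what makes local iteration attractive.
```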
Implications for Denmark
DeepSeek’s innovations will undoubtedly be adopted by all competitors, starting the AI race anew, this time with a more level playing field. We will see the use of distillation and model fusion grow significantly, creating a Cambrian explosion of cheap-yet-capable models that any country with a passable telecom infrastructure will be able to operate and improve.
This bodes well for developing countries caught in the web of geopolitical tensions that is on all the news channels as I write this blog post: leapfrogging the technologies that made the developed nations what they are in the first place is a hallmark of such countries. Before DeepSeek R1, those nations had to wonder: can I rely on long-term support for advanced infrastructure and technology from the U.S.? You know the answer by now. The leapfrogging effect will be amplified tenfold in those nations over the next decade.
In a few years, everyday institutions will be perfused with LLMs that can be run at home, inside a business, or in a government office, supported by regional datacenters or even hosted in-house. This would not be possible without the efficiency advancements published by DeepSeek, which make training and running state-of-the-art, reasoning-capable LLMs possible for the cost of a used family van. By leveraging economies of scale and running such models in the cloud on a pay-per-use basis, this cost will drop even lower.
The benefits are obvious: with hardware able to run such advanced models readily available, locally developed solutions will become ubiquitous.
Strong economic moats for international AI service providers will weaken, with value drifting towards specialized cloud providers that offer LLM infrastructure along with the deep, local trust needed to run and protect such specialized models.
“Think first-responders to crime-scenes, forensic experts,
data scientists and industrial designers:
what could they build if they could iterate at the speed of thought?”
At the same time, these technologies will allow “superteams” to emerge: groups of highly specialized subject-matter experts collaborating directly, where several layers of intermediaries were previously required to translate between them. Think first-responders to crime-scenes, forensic experts, data scientists and industrial designers: what could they build if they could iterate at the speed of thought? Or experts in special needs education combined with UI/UX designers and IT personnel? How fast could they arrive at solutions that meaningfully improve outcomes while addressing all possible concerns about a new solution before it is even first presented to anyone outside the team? In other words: the race for national and local AI systems has now begun, and centralized AI providers will need to rethink how far they can go in their push to dominate global markets. For the people forming these teams, AI will be a crucial enabler of rapid, exponential progress.
On the other hand, ubiquitous access to very convincing AI endangers everyone. That access matters because it starts an arms race in the information domain. Think about it: with advanced multilingual, video-understanding, and reasoning capabilities that can run efficiently on a gaming PC, any scam call-center operator can now replace the human workforce with a better, more consistent, reasoning-capable LLM that works 24/7/365. If you are a high-value target, e.g. a politically exposed person: expect all text communication to be an attack vector, including live speech! Live video is heavier to compute, but we have maybe a year or two before that becomes ubiquitous. After that, ensuring you are actually talking to a human being will be really hard for people who aren’t aware of this development. Thankfully, there has been significant research into such security nightmare scenarios over the last three decades, so at least it won’t be a cold shower if it happens.
For the global AI community, this signifies a shift towards decentralization, empowering universities, startups, and underrepresented regions to participate in AI innovation. On a geopolitical level, the ability to train high-performing AI models with less resource-intensive methods undermines the dominance of nations with vast computational resources. This decentralization fosters collaboration and mitigates the risk of technological inequality, enabling emerging economies to compete on a more level playing field. Personally, I find this exciting, because it empowers many more people, just as electrification improved the quality of life for many. But this time, there is tangible potential for abuse that can scale faster than any countermeasure can react.
We need to ask ourselves, then: given the risks of AI abuse, but also the incredible potential to achieve exponentially compounding improvements across the breadth of society, should we bring back the concept of sandboxes – the deregulated environments created for FinTech innovation in the 2010s – to give these superteams a chance at truly achieving exponential results under a clear mandate and rigorous oversight? To me, the answer is an obvious YES.
***
NB! We are well aware that DeepSeek is a highly controversial topic with regard to data sharing and China. In this article, we address only the technological aspect.