
Global Advisors | Quantified Strategy Consulting

Term: Reinforcement Learning (RL)

“Reinforcement Learning (RL) is a machine learning method where an agent learns optimal behavior through trial-and-error interactions with an environment, aiming to maximize a cumulative reward signal over time.” – Reinforcement Learning (RL)

Definition

Reinforcement Learning (RL) is a machine learning method in which an intelligent agent learns to make optimal decisions by interacting with a dynamic environment, receiving feedback in the form of rewards or penalties, and adjusting its behaviour to maximise cumulative rewards over time.[1] Unlike supervised learning, which relies on labelled training data, RL enables systems to discover effective strategies through exploration and experience without explicit programming of desired outcomes.[4]

Core Principles

RL is fundamentally grounded in the concept of trial-and-error learning, mirroring how humans naturally acquire skills and knowledge.[2] The approach is based on the Markov Decision Process (MDP), a mathematical framework that models decision-making over discrete time steps.[8] At each step, the agent observes its current state, selects an action based on its policy, receives feedback from the environment, and updates its knowledge accordingly.[1]
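
A minimal sketch of this observe-act-learn cycle in code, assuming a hypothetical environment object with reset() and step(action) methods in the style popularised by OpenAI Gym (the interface is an assumption, not taken from the sources):

```python
def run_episode(env, policy, learner=None):
    """Run one episode of the agent-environment loop.

    Assumed interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done).
    """
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                        # act according to the current policy
        next_state, reward, done = env.step(action)   # feedback from the environment
        if learner is not None:
            learner(state, action, reward, next_state)  # update the agent's knowledge
        total_reward += reward
        state = next_state
    return total_reward
```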

Essential Components

Four core elements define any reinforcement learning system:

  • Agent: The learning entity or autonomous system that makes decisions and takes actions.[2]
  • Environment: The dynamic problem space containing variables, rules, boundary values, and valid actions with which the agent interacts.[2]
  • Policy: A strategy or mapping that defines which action the agent should take in any given state, ranging from simple rules to complex computations.[1]
  • Reward Signal: Positive, negative, or zero feedback values that guide the agent towards optimal behaviour and represent the goal of the learning problem.[1]

Additionally, a value function evaluates the long-term desirability of states by considering future outcomes, enabling agents to balance immediate gains against broader objectives.[1] Some systems employ a model that simulates the environment to predict action consequences, facilitating planning and strategic foresight.[1]
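
In the standard formulation, the value of a state is the expected discounted sum of future rewards. A small worked example (the discount factor gamma is the usual convention; the value 0.99 is an illustrative assumption):

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_1 + gamma*r_2 + gamma^2*r_3 + ... for one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 1.0 ten steps away is worth less than 1.0 now,
# which is how agents trade immediate gains against long-term objectives:
print(discounted_return([0.0] * 9 + [1.0]))   # about 0.914 with gamma = 0.99
```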

Learning Mechanism

The RL process operates through iterative cycles of interaction. The agent observes its environment, executes an action according to its current policy, receives a reward or penalty, and updates its knowledge based on this feedback.[1] Crucially, RL algorithms can handle delayed gratification: optimal long-term strategies may require short-term sacrifices or temporary penalties.[2] The agent continuously balances exploration (attempting novel actions to discover new possibilities) with exploitation (leveraging known effective actions) to progressively improve cumulative rewards.[1]
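
One standard way to implement this balance is an epsilon-greedy rule, shown here as a minimal sketch (the technique is a common convention; none of the names come from the sources):

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """With probability epsilon, explore a random action; otherwise exploit
    the action with the highest estimated value in this state."""
    if random.random() < epsilon:
        return random.choice(actions)                                  # exploration
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))   # exploitation
```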

Mathematical Foundation

The self-reinforcement algorithm updates a memory matrix w(a,s) according to the following routine at each iteration:[5]

  1. Given situation s, perform action a.
  2. Receive the consequence situation s'.
  3. Compute the state evaluation v(s') of the consequence situation.
  4. Update the memory: w'(a,s) = w(a,s) + v(s').
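
A direct transcription of this routine into Python, assuming a discrete state and action space, a user-supplied evaluation function v, and a hypothetical env.step(a) that returns the consequence situation (a sketch, not a production implementation):

```python
from collections import defaultdict

def self_reinforcement_step(w, env, s, choose_action, v):
    """One iteration of the routine above: w'(a, s) = w(a, s) + v(s')."""
    a = choose_action(w, s)     # 1. given situation s, perform action a
    s_next = env.step(a)        # 2. receive the consequence situation s'
    w[(a, s)] += v(s_next)      # 3.-4. evaluate s' and update the memory
    return s_next

w = defaultdict(float)          # memory matrix, initialised to zero
```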

Practical Applications

RL has demonstrated transformative potential across multiple domains. Autonomous vehicles learn to navigate complex traffic environments by receiving rewards for safe driving behaviours and penalties for collisions or traffic violations.[1] Game-playing AI systems, such as chess engines, learn winning strategies through repeated play and feedback on moves.[3] Robotics applications leverage RL to develop complex motor skills, enabling robots to grasp objects, move efficiently, and perform delicate tasks in manufacturing, logistics, and healthcare settings.[3]

Distinction from Other Learning Paradigms

RL occupies a distinct position within machine learning’s three primary paradigms. Whereas supervised learning reduces errors between predicted and correct responses using labelled training data, and unsupervised learning identifies patterns in unlabelled data, RL relies on general evaluations of behaviour rather than explicit correct answers.[4] This fundamental difference makes RL particularly suited to problems where optimal solutions are unknown a priori and must be discovered through environmental interaction.

Historical Context and Theoretical Foundations

Reinforcement learning emerged from psychological theories of animal learning and played a pivotal role in early artificial intelligence systems.[4] The field has evolved to become one of the most powerful approaches for creating intelligent systems capable of solving complex, real-world problems in dynamic and uncertain environments.[3]

Related Theorist: Richard S. Sutton

Richard S. Sutton stands as one of the most influential figures in modern reinforcement learning theory and practice. Born in 1956, Sutton earned his PhD in computer science from the University of Massachusetts Amherst in 1984, where he worked with Andrew Barto, beginning a collaboration that would fundamentally shape the field.

Sutton’s seminal contributions include the development of temporal-difference (TD) learning, an algorithm that bridges classical conditioning from animal-learning psychology with modern computational approaches. TD learning enables agents to learn from incomplete sequences of experience, updating value estimates based on predictions rather than waiting for final outcomes. This breakthrough proved instrumental in TD-Gammon, Gerald Tesauro’s backgammon program of the early 1990s that reached the level of the world’s best players, demonstrating RL’s practical power.
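
The core of TD learning is a one-step update towards a bootstrapped target. A minimal TD(0) sketch (the step size alpha and discount gamma are standard parameters, assumed here):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Nudge V(s) towards the bootstrapped target r + gamma * V(s'),
    so the agent learns from each transition instead of waiting
    for the episode's final outcome."""
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)   # prediction-based target
    V[s] = v_s + alpha * (target - v_s)       # temporal-difference update
    return V
```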

In 1998, Sutton and Barto published Reinforcement Learning: An Introduction, which became the definitive textbook in the field.[10] This work synthesised decades of research into a coherent framework, making RL accessible to researchers and practitioners worldwide. The book’s influence is hard to overstate: it established the mathematical foundations, terminology, and conceptual frameworks that continue to guide contemporary research.

Sutton’s career has spanned academia and industry, including positions at the University of Alberta and Google DeepMind. His work on policy gradient methods and actor-critic architectures provided theoretical underpinnings for deep reinforcement learning systems that achieved superhuman performance in complex domains. Beyond specific algorithms, Sutton championed the view that RL represents a fundamental principle of intelligence itself-that learning through interaction with environments is central to how intelligent systems, biological or artificial, acquire knowledge and capability.

His intellectual legacy extends beyond technical contributions. Sutton advocated for RL as a unifying framework for understanding intelligence, arguing that the reward signal represents the true objective of learning systems. This perspective has influenced how researchers conceptualise artificial intelligence, shifting focus from pattern recognition towards goal-directed behaviour and autonomous decision-making in uncertain environments.

References

1. https://www.geeksforgeeks.org/machine-learning/what-is-reinforcement-learning/

2. https://aws.amazon.com/what-is/reinforcement-learning/

3. https://cloud.google.com/discover/what-is-reinforcement-learning

4. https://cacm.acm.org/federal-funding-of-academic-research/rediscovering-reinforcement-learning/

5. https://en.wikipedia.org/wiki/Reinforcement_learning

6. https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-reinforcement-learning

7. https://www.mathworks.com/discovery/reinforcement-learning.html

8. https://en.wikipedia.org/wiki/Machine_learning

9. https://www.ibm.com/think/topics/reinforcement-learning

10. https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf

"Reinforcement Learning (RL) is a machine learning method where an agent learns optimal behavior through trial-and-error interactions with an environment, aiming to maximize a cumulative reward signal over time." - Term: Reinforcement Learning (RL)

Quote: Yann LeCun

“Most of the infrastructure cost for AI is for inference: serving AI assistants to billions of people.”
— Yann LeCun, VP & Chief AI Scientist at Meta

Yann LeCun made this comment in response to the sharp drop in Nvidia’s share price on 27th January 2025, following the launch of DeepSeek-R1, a new AI model developed by DeepSeek. The model was reportedly trained at a fraction of the cost incurred by leading Western labs such as OpenAI, Anthropic, and Google DeepMind, raising questions about whether Nvidia’s dominance in AI compute was at risk.

The market reaction stemmed from speculation that the training costs of cutting-edge AI models—previously seen as a key driver of Nvidia’s GPU demand—could decrease significantly with more efficient methods. However, LeCun pointed out that most AI infrastructure costs come not from training but from inference, the process of running AI models at scale to serve billions of users. This suggests that Nvidia’s long-term demand may remain strong, as inference still relies heavily on high-performance GPUs.

LeCun’s view aligned with analyses from key AI investors and industry leaders. He supported the argument made by Antoine Blondeau, co-founder of Alpha Intelligence Capital, who described Nvidia’s stock drop as “vastly overblown” and “NOT a ‘Sputnik moment’”, pushing back on the suggestion that Nvidia’s market position was insecure. Additionally, Jonathan Ross, founder of Groq, shared a video titled “Why $500B isn’t enough for AI”, explaining why AI compute demand remains insatiable despite efficiency gains.

This discussion underscores a critical aspect of AI economics: while training costs may drop with better algorithms and hardware, the sheer scale of inference workloads—powering AI assistants, chatbots, and generative models for billions of users—remains a dominant and growing expense. This supports the case for sustained investment in AI infrastructure, particularly in Nvidia’s GPUs, which continue to be the gold standard for inference at scale.

Infographic: Four critical DeepSeek enablers

The DeepSeek team has introduced several high-impact changes to Large Language Model (LLM) architecture to enhance performance and efficiency:

  1. Multi-Head Latent Attention (MLA): This mechanism enables the model to process multiple facets of input data simultaneously, improving both efficiency and performance. MLA reduces the memory required to compute a transformer’s attention by a factor of 7.5x to 20x, a breakthrough that makes large-scale AI applications more feasible. Unlike Flash Attention, which improves data organization in memory, MLA compresses the KV cache into a lower-dimensional space, significantly reducing memory usage (down to 5% to 13% of traditional attention mechanisms) while maintaining performance; a simplified sketch follows this list.
  2. Mixture-of-Experts (MoE) Architecture: DeepSeek employs an MoE system that activates only a subset of its total parameters during any given task. For instance, in DeepSeek-V3, only 37 billion out of 671 billion parameters are active at a time, significantly reducing computational costs. This approach enhances efficiency and aligns with the trend of making AI models more compute-light, allowing freed-up GPU resources to be allocated to multi-modal processing, spatial intelligence, or genomic analysis. MoE models, as also leveraged by Mistral and other leading AI labs, allow for scalability while keeping inference costs manageable.
  3. FP8 Floating Point Precision: To enhance computational efficiency, DeepSeek-V3 utilizes FP8 floating point precision during training, which helps in reducing memory usage and accelerating computation. This follows a broader trend in AI to optimize training methodologies, potentially influencing the approach taken by U.S.-based LLM providers. Given China’s restricted access to high-end GPUs due to U.S. export controls, optimizations like FP8 and MLA are critical in overcoming hardware limitations.
  4. DeepSeek-R1 and Test-Time Compute Capabilities: DeepSeek-R1 is a model that leverages reinforcement learning (RL) to enable test-time compute, significantly improving reasoning capabilities. The model was trained using an innovative RL strategy, incorporating fine-tuned Chain of Thought (CoT) data and supervised fine-tuning (SFT) data across multiple domains. Notably, DeepSeek demonstrated that any sufficiently powerful LLM can be transformed into a high-performance reasoning model using only 800k curated training samples. This technique allows for rapid adaptation of smaller models, such as Qwen and Llama-70B, into competitive reasoners.
  5. Distillation to Smaller Models: The team has developed distilled versions of their models, such as DeepSeek-R1-Distill, which are fine-tuned on synthetic data generated by larger models. These distilled models contain fewer parameters, making them more efficient while retaining significant capabilities. DeepSeek’s ability to achieve comparable reasoning performance at a fraction of the cost of OpenAI’s models (5% of the cost, according to Pelliccione) has disrupted the AI landscape.
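
To make the MLA idea in item 1 concrete, here is a deliberately simplified, single-head NumPy sketch of low-rank key/value compression. All dimensions are illustrative, and real MLA, as described in DeepSeek’s papers, involves additional machinery such as decoupled rotary position embeddings:

```python
import numpy as np

d_model, d_latent, seq_len = 4096, 512, 1024
rng = np.random.default_rng(0)

# Learned projections; the per-token cache stores only the latent vector c.
W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # joint KV down-projection
W_uk = rng.standard_normal((d_latent, d_model)) * 0.02     # up-projection for keys
W_uv = rng.standard_normal((d_latent, d_model)) * 0.02     # up-projection for values

x = rng.standard_normal((seq_len, d_model))   # token hidden states

c = x @ W_down    # cached latent: seq_len x d_latent
K = c @ W_uk      # keys reconstructed on the fly at attention time
V = c @ W_uv      # values reconstructed on the fly at attention time

# Conventional attention caches K and V (2 * d_model floats per token);
# MLA-style compression caches only c (d_latent floats per token).
print(f"cache per token: {d_latent} vs {2 * d_model} floats "
      f"({d_latent / (2 * d_model):.1%} of the conventional size)")
```

With these illustrative dimensions the latent cache is about 6% of the conventional KV cache, consistent with the 5% to 13% range quoted above.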

The Impact of Open-Source Models:

DeepSeek’s success highlights a fundamental shift in AI development. Traditionally, leading-edge models have been closed-source and controlled by Western AI firms like OpenAI, Google, and Anthropic. However, DeepSeek’s approach, leveraging open-source components while innovating on training efficiency, has disrupted this dynamic. Pelliccione notes that DeepSeek now offers similar performance to OpenAI at just 5% of the cost, making high-quality AI more accessible. This shift pressures proprietary AI companies to rethink their business models and embrace greater openness.

Challenges and Innovations in the Chinese AI Ecosystem:

China’s AI sector faces major constraints, particularly in access to high-performance GPUs due to U.S. export restrictions. Yet, Chinese companies like DeepSeek have turned these challenges into strengths through aggressive efficiency improvements. MLA and FP8 precision optimizations exemplify how innovation can offset hardware limitations. Furthermore, Chinese AI firms, historically focused on scaling existing tech, are now contributing to fundamental advancements in AI research, signaling a shift towards deeper innovation.

The Future of AI Control and Adaptation:

DeepSeek-R1’s approach to training AI reasoners poses a challenge to traditional AI control mechanisms. Since reasoning capabilities can now be transferred to any capable model with fewer than a million curated samples, AI governance must extend beyond compute resources and focus on securing datasets, training methodologies, and deployment platforms. OpenAI has previously obscured Chain of Thought traces to prevent leakage, but DeepSeek’s open-weight release and published RL techniques have made such restrictions ineffective.

Broader Industry Context:

  • DeepSeek benefits from Western open-source AI developments, particularly Meta’s Llama model releases, which provided a foundation for its advancements. However, DeepSeek’s success also demonstrates that China is shifting from scaling existing technology to innovating at the frontier.
  • Open-source models like DeepSeek will see widespread adoption for enterprise and research applications, though Western businesses are unlikely to build their consumer apps on a Chinese API.
  • The AI innovation cycle is exceptionally fast, with breakthroughs assessed daily or weekly. DeepSeek’s advances are part of a rapidly evolving competitive landscape dominated by U.S. big tech players like OpenAI, Google, Microsoft, and Meta, who continue to push for productization and revenue generation. Meanwhile, Chinese AI firms, despite hardware and data limitations, are innovating at an accelerated pace and have proven capable of challenging OpenAI’s dominance.

These innovations collectively contribute to more efficient and effective LLMs, balancing performance with resource utilization while shaping the future of AI model development.

Sources: Global Advisors, Jack Clark – Anthropic, Antoine Blondeau, Alberto Pelliccione, infoq.com, medium.com, en.wikipedia.org, arxiv.org

Quote: Jack Clark

“The most surprising part of DeepSeek-R1 is that it only takes ~800k samples of ‘good’ RL reasoning to convert other models into RL-reasoners. Now that DeepSeek-R1 is available people will be able to refine samples out of it to convert any other model into an RL reasoner.” – Jack Clark, Anthropic

Jack Clark, co-founder of Anthropic, co-chair of Stanford University’s AI Index, and co-chair of the OECD working group on AI & Compute, shed light on the significance of DeepSeek-R1, a revolutionary AI reasoning model developed by China’s DeepSeek team. In an article posted in his newsletter on 27th January 2025, Clark highlighted that it only takes approximately 800k samples of “good” RL (Reinforcement Learning) reasoning to convert other models into RL-reasoners.

The Power of Fine-Tuning

DeepSeek-R1 is not just a powerful AI model; it also provides a framework for fine-tuning existing models to enhance their reasoning capabilities. By leveraging the 800k samples curated with DeepSeek-R1, researchers can refine any other model into an RL reasoner. This approach has been demonstrated by fine-tuning open-source models like Qwen and Llama using the same dataset.
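
In practice this kind of conversion is plain supervised fine-tuning on curated reasoning traces. A schematic PyTorch sketch; the model, optimizer, and data pipeline are placeholders rather than DeepSeek’s actual code:

```python
import torch.nn.functional as F

def sft_step(model, optimizer, input_ids, labels):
    """One supervised fine-tuning step on a (prompt + reasoning trace) sample.

    input_ids and labels: LongTensors of shape (batch, seq_len);
    model(input_ids) returns next-token logits of shape (batch, seq_len, vocab).
    """
    logits = model(input_ids)
    # Standard next-token cross-entropy: predict token t+1 from tokens <= t,
    # with prompt positions masked out via the conventional -100 label.
    loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```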

Implications for AI Policy

The release of DeepSeek-R1 has significant implications for AI policy and control. As Clark notes, if you need fewer than a million samples to convert any model into a “thinker,” it becomes much harder to control AI systems. This is because the valuable data, including chains of thought from reasoning models, can be leaked or shared openly.

A New Era in AI Development

The availability of DeepSeek-R1 and its associated techniques has created a new era in AI development. With an open weight model floating around the internet, researchers can now bootstrap any other sufficiently powerful base model into being an AI reasoner. This has the potential to accelerate AI progress worldwide.

Key Takeaways:

  • Fine-tuning is key: DeepSeek-R1 demonstrates that fine-tuning existing models with a small amount of data (800k samples) can significantly enhance their reasoning capabilities.
  • Open-source and accessible: The model and its techniques are now available for anyone to use, making it easier for researchers to develop powerful AI reasoners.
  • Implications for control: The release of DeepSeek-R1 highlights the challenges of controlling AI systems, as valuable data can be leaked or shared openly.

Conclusion

DeepSeek-R1 has marked a significant milestone in AI development, showcasing the power of fine-tuning and open-source collaboration. As researchers continue to build upon this work, we can expect to see even more advanced AI models emerge, with far-reaching implications for various industries and applications.

Quote: Marc Andreessen

“DeepSeek-R1 is AI’s Sputnik moment.” – Marc Andreessen, Andreessen Horowitz

In a 27th January 2025 post on X that sent shockwaves through the tech community, venture capitalist Marc Andreessen declared that DeepSeek’s R1 AI reasoning model is “AI’s Sputnik moment.” This analogy draws parallels between China’s breakthrough in artificial intelligence and the Soviet Union’s historic achievement of launching the first satellite into orbit in 1957.

The Rise of DeepSeek-R1

DeepSeek, a Chinese AI lab, has made headlines with its open-source release of R1, a revolutionary AI reasoning model that is not only more cost-efficient but also poses a significant threat to the dominance of Western tech giants. The model’s ability to reduce compute requirements by half without sacrificing accuracy has sent shockwaves through the industry.

A New Era in AI

The release of DeepSeek-R1 marks a turning point in the AI arms race, as it challenges the long-held assumption that only a select few companies can compete in this space. By making its research open-source, DeepSeek is empowering anyone to build their own version of R1 and tailor it to their needs.

Implications for Megacap Stocks

The success of DeepSeek-R1 has significant implications for megacap stocks like Microsoft, Alphabet, and Amazon, which have long relied on proprietary AI models to maintain their technological advantage. The open-source nature of R1 threatens to wipe out this advantage, potentially disrupting the business models of these tech giants.

Nvidia’s Nightmare

The news comes as a blow to Nvidia CEO Jensen Huang, who is ramping up production of the Blackwell microchip, a more advanced successor to the company’s industry-leading Hopper-series H100s. Nvidia controls roughly 90% of the AI semiconductor market, but R1’s ability to reduce compute requirements may render these chips less essential.

A New Era of Innovation

Perplexity AI founder Aravind Srinivas praised DeepSeek’s team for catching up to the West by employing clever solutions, including the switch to 8-bit floating-point (FP8) precision. This innovation not only reduces costs but also demonstrates that China is no longer just a copycat, but a leader in AI innovation.

Quote: Jeffrey Emanuel

“With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets.” – Jeffrey Emanuel

Jeffrey Emanuel’s statement (“The Short Case for Nvidia Stock” – 25th January 2025) highlights a groundbreaking achievement in AI with DeepSeek’s R1 model, which has made significant strides in enabling step-by-step reasoning without the traditional reliance on vast supervised datasets:

  1. Innovation Through Reinforcement Learning (RL):
    • The R1 model employs reinforcement learning, a method where models learn through trial and error with feedback. This approach reduces the dependency on large labeled datasets typically required for training, making it more efficient and accessible.
  2. Advanced Reasoning Capabilities:
    • R1 excels in tasks requiring logical inference and mathematical problem-solving. Its ability to demonstrate step-by-step reasoning is crucial for complex decision-making processes, applicable across various industries from autonomous systems to intricate problem-solving tasks.
  3. Efficiency and Accessibility:
    • By utilizing RL and knowledge distillation techniques, R1 efficiently transfers learning to smaller models (a common distillation recipe is sketched after this list). This democratizes AI technology, allowing global researchers and developers to innovate without proprietary barriers, thus expanding the reach of advanced AI solutions.
  4. Impact on Data-Scarce Industries:
    • The model’s capability to function with limited data is particularly beneficial in sectors like medicine and finance, where labeled data is scarce due to privacy concerns or high costs. This opens doors for more ethical and feasible AI applications in these fields.
  5. Competitive Landscape and Innovation:
    • R1 positions itself as a competitor to models like OpenAI’s o1, signaling a shift towards accessible AI technology. This fosters competition and encourages other companies to innovate similarly, driving advancements across the AI landscape.
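
The distillation mentioned in point 3 is often implemented as a soft-target loss that pulls the student’s output distribution towards the teacher’s. A minimal PyTorch sketch (the temperature value is illustrative; DeepSeek’s own distilled models were reportedly produced by fine-tuning on teacher-generated samples, a related but distinct recipe):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation: match the student's distribution to the teacher's.

    Both logit tensors have shape (batch, vocab); the temperature T softens
    the distributions so the student also learns the teacher's relative
    preferences among tokens.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```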

In essence, DeepSeek’s R1 model represents a significant leap in AI efficiency and accessibility, offering profound implications for various industries by reducing data dependency and enhancing reasoning capabilities.
