DeepSeek-R1: How Reinforcement Learning is Driving LLM Innovation

DeepSeek-R1 represents a significant advancement in the field of LLMs, particularly in enhancing reasoning capabilities through reinforcement learning (RL). This model, developed by DeepSeek-AI, distinguishes itself through its unique training pipeline and its capacity to achieve performance comparable to OpenAI’s o1-1217 on various reasoning tasks. The research focuses on exploring the potential of LLMs to develop strong reasoning abilities without relying on extensive supervised fine-tuning (SFT) data, instead using a pure RL approach.

Background
The OpenAI o1 series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought (CoT) reasoning process, significantly improving performance on tasks like mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains a significant research area.

The DeepSeek-R1 project aims to address this gap by improving LLMs’ reasoning capabilities using pure RL. It investigates whether LLMs can develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process.

Key Contributions

The core contributions of the DeepSeek-R1 project are threefold:

  • Large-Scale Reinforcement Learning: RL is directly applied to the base model without relying on SFT as a preliminary step. This approach allows the model to explore CoT for solving complex problems, resulting in the development of DeepSeek-R1-Zero. This model demonstrates capabilities such as self-verification, reflection, and generating long CoTs purely through RL.
  • Pipeline for DeepSeek-R1: It introduces a pipeline to develop DeepSeek-R1 that incorporates two RL stages to discover improved reasoning patterns and align with human preferences, as well as two SFT stages that serve as the seed for the model’s reasoning and non-reasoning capabilities.
  • Distillation: It demonstrates that reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models.

DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

  • DeepSeek-R1-Zero is a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step.
  • It was developed to explore the potential of LLMs to develop reasoning capabilities without any supervised data, emphasizing self-evolution through a pure RL process.

Training Process:

  • The base model for both DeepSeek-R1-Zero and DeepSeek-R1 is DeepSeek-V3-Base.
  • Group Relative Policy Optimization (GRPO) is used to perform RL. GRPO forgoes the critic model and instead estimates the baseline from group scores: for each question \(q\), it samples a group of outputs \( \{o_1, o_2, \ldots, o_G\} \) from the old policy, uses the group’s rewards to compute relative advantages, and then optimizes the policy to maximize those advantages (see the sketch after this list).
  • Reward Modelling: The reward system is essential for guiding the optimization direction of RL. The reward system for DeepSeek-R1-Zero consists of two types of rewards:
    • Accuracy Rewards (rule-based): These rewards evaluate whether the response is correct. For math problems, the model must provide the final answer in a specific format, enabling rule-based verification of correctness. For LeetCode problems, a compiler generates feedback based on test cases.
    • Format Rewards: A format reward requires the model to place its thinking process between <think> and </think> tags.
  • Training Template: The training template guides the base model to adhere to specified instructions. The template requires the model to first produce a reasoning process, followed by the final answer.
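
To make the group-relative baseline and the rule-based rewards concrete, here is a minimal Python sketch. It assumes a math-style setup where the final answer is extracted from a \boxed{} pattern and where the accuracy and format rewards are simply summed; the exact answer format, reward weights, and the clipped policy-gradient objective are simplified assumptions, not the authors’ released implementation.

```python
import re
import numpy as np

def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning in <think>...</think> tags, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """Rule-based check: extract the final \\boxed{...} answer and compare to the reference."""
    match = re.search(r"\\boxed\{(.+?)\}", response)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def group_relative_advantages(rewards):
    """GRPO baseline: normalize each output's reward against the group mean/std,
    so no separate critic (value) model is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# A group of G = 4 outputs sampled from the old policy for one question
outputs = ["<think>...</think> The answer is \\boxed{42}.",
           "The answer is \\boxed{41}.",
           "<think>...</think> \\boxed{42}",
           "<think>...</think> \\boxed{7}"]
rewards = [accuracy_reward(o, "42") + format_reward(o) for o in outputs]
advantages = group_relative_advantages(rewards)  # fed into the clipped PPO-style objective
print(rewards, advantages)
```

Because the baseline is the group’s own mean reward, an output is only favored when it beats its sibling samples for the same question, which is what pushes the policy toward longer, self-verifying chains of thought.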

Performance of DeepSeek-R1-Zero:

  • DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances.
  • The average pass@1 score on AIME 2024 significantly increases from 15.6% to 71.0%, reaching performance levels comparable to OpenAI-o1-0912.
  • The performance of DeepSeek-R1-Zero can be enhanced through majority voting; for example, on the AIME benchmark, the model’s performance increases from 71.0% to 86.7%, surpassing the performance of OpenAI-o1-0912.
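
The majority-voting (consensus) number above comes from sampling many responses per question and scoring only the most frequent final answer. A minimal sketch, assuming final answers have already been extracted from each sampled response:

```python
from collections import Counter

def majority_vote(final_answers):
    """Return the most frequent final answer among the k sampled responses."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical example: 5 sampled answers to one AIME problem
samples = ["204", "204", "113", "204", "72"]
print(majority_vote(samples))  # "204" is the consensus answer scored against the reference
```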

The model learns to allocate more thinking time to a problem by reevaluating its initial approach (the “aha moment” example highlighted in the original paper). This underscores the power of RL to let the model autonomously develop advanced problem-solving strategies, and it points to the potential of RL to unlock new levels of intelligence in artificial systems.

Drawbacks of DeepSeek-R1-Zero:
DeepSeek-R1-Zero faces several challenges such as (1) poor readability, and (2) language mixing. To make reasoning processes more readable, DeepSeek-R1 was developed utilizing RL with human-friendly cold-start data.

DeepSeek-R1: Reinforcement Learning with Cold Start

DeepSeek-R1 builds upon the foundations laid by DeepSeek-R1-Zero by incorporating a small amount of high-quality data as a cold start to improve reasoning performance and create a user-friendly model. The pipeline consists of two reinforcement learning (RL) stages and two supervised fine-tuning (SFT) stages:

  1. Cold Start Fine-tuning
  2. Reasoning-oriented Reinforcement Learning
  3. Rejection Sampling and Supervised Fine-Tuning
  4. Reinforcement Learning for all Scenarios

Cold Start Fine-tuning (1st SFT Stage)

  • DeepSeek-R1 uses a small amount of long CoT data to fine-tune the DeepSeek-V3-Base as the starting point for RL.
  • Collection Methods: This data was collected by
    • using few-shot prompting with long CoT examples,
    • directly prompting models to generate detailed answers with reflection and verification,
    • gathering DeepSeek-R1-Zero outputs in a readable format, and
    • refining the results through post-processing by human annotators.
  • Format and Content: The cold-start data was designed to be human-readable and followed a specific pattern: |special_token|<reasoning_process>|special_token|<summary> (see the sketch after this list). This included the CoT for the query (the reasoning process) and a summary of the reasoning results, ensuring the responses were clear and easy to understand. Responses that mixed languages or lacked user-friendly formatting were filtered out.
  • The advantages of cold start data compared to DeepSeek-R1-Zero include:
    • Improved Readability: DeepSeek-R1-Zero’s outputs are often hard to read, mixing languages or lacking the markdown formatting needed to highlight answers for users. The cold-start data was specifically designed to address these issues by enforcing a readable format with clear summaries and language consistency.
    • Better Performance: By carefully designing the cold-start data to incorporate human priors, the researchers observed better performance than with DeepSeek-R1-Zero.
    • Stable Start: Using cold-start data aimed to avoid the unstable early phases of RL training.
  • The purpose of this stage was to provide a stable and human-friendly starting point for the subsequent RL process, overcoming the unstable start of RL directly on the base model as seen in DeepSeek-R1-Zero.
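
As an illustration of the cold-start layout described above, the sketch below assembles one training target in the |special_token|<reasoning_process>|special_token|<summary> pattern. The concrete delimiter token is not given in the paper, so the placeholder string here is an assumption.

```python
# Placeholder: the actual delimiter token used by DeepSeek is not specified in the paper.
SPECIAL_TOKEN = "<|special_token|>"

def build_cold_start_sample(reasoning_process: str, summary: str) -> str:
    """Assemble one cold-start target: |special_token|<reasoning_process>|special_token|<summary>."""
    return f"{SPECIAL_TOKEN}{reasoning_process}{SPECIAL_TOKEN}{summary}"

sample = build_cold_start_sample(
    reasoning_process="First restate the problem, try an approach, then verify the candidate answer...",
    summary="The final answer is 42, obtained by ...",
)
print(sample)
```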

Reasoning-oriented Reinforcement Learning (1st RL Stage)

  • After fine-tuning on cold start data, large-scale RL is applied (same training process as in DeepSeek-R1-Zero).
  • This stage improves the model’s reasoning capabilities, particularly in coding, mathematics, etc.
  • A language consistency reward is introduced during RL training to mitigate language mixing. It is calculated as the proportion of target-language words in the CoT (see the sketch after this list). Although ablations show this reward slightly degrades benchmark performance, it aligns with human preferences by making outputs more readable.
  • RL training is conducted on a combined signal formed by directly summing the accuracy reward for reasoning tasks and the language consistency reward.
  • This stage used the GRPO algorithm, similar to DeepSeek-R1-Zero.
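
A minimal sketch of the language consistency reward described above. The paper defines it as the proportion of target-language words in the CoT but does not specify the word-level language detector, so the ASCII heuristic below is a stand-in.

```python
def is_target_language(word: str, target: str = "en") -> bool:
    """Crude stand-in detector: treat pure-ASCII alphabetic tokens as English."""
    return target == "en" and word.isascii() and word.isalpha()

def language_consistency_reward(cot: str, target: str = "en") -> float:
    """Proportion of target-language words in the chain of thought."""
    words = cot.split()
    if not words:
        return 0.0
    return sum(is_target_language(w, target) for w in words) / len(words)

def combined_reward(accuracy: float, cot: str) -> float:
    """Per the paper, the accuracy and language-consistency rewards are summed directly."""
    return accuracy + language_consistency_reward(cot)
```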

Rejection Sampling and Supervised Fine-Tuning (2nd SFT Stage)

  • This stage includes SFT data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks.
  • This SFT data is collected by curating the generated reasoning trajectories from the checkpoint of the previous RL training.
  • Reasoning data (600k) is curated by
    • rejection sampling of reasoning trajectories from the RL checkpoint, retaining only correct responses (see the sketch after this list),
    • incorporating additional data judged by a generative reward model, i.e. LLM-based judgement for samples where rule-based rewards cannot be applied,
    • filtering out reasoning trajectories with mixed languages, long paragraphs, and code blocks.
  • Non-reasoning data (200k), such as writing, factual QA, self-cognition, and translation, is taken from the DeepSeek-V3 pipeline and reuses portions of the SFT dataset of DeepSeek-V3.
  • DeepSeek-V3-Base is fine-tuned for two epochs using the curated dataset of about 800k samples.
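
The reasoning-data curation step can be sketched as plain rejection sampling. Here, `generate` and `verify` are hypothetical stand-ins for sampling from the RL checkpoint and for the rule-based (or LLM-judged) correctness check.

```python
from typing import Callable, List

def rejection_sample(prompt: str, reference: str,
                     generate: Callable[[str], str],
                     verify: Callable[[str, str], bool],
                     k: int = 16) -> List[str]:
    """Sample k reasoning trajectories for one prompt and keep only the correct ones."""
    kept = []
    for _ in range(k):
        trajectory = generate(prompt)          # sample from the previous RL checkpoint
        if verify(trajectory, reference):      # correctness / quality filter
            kept.append(trajectory)
    return kept

# Trajectories with mixed languages, very long paragraphs, or code blocks are
# additionally filtered out before the ~600k reasoning samples are finalized.
```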

Reinforcement Learning for all Scenarios (2nd RL Stage)

  • A second RL stage is implemented to improve the model’s helpfulness and harmlessness and to further refine its reasoning capabilities.
  • The model is trained using a combination of reward signals and diverse prompt distributions.
  • Reasoning data is trained with the rule-based reward methodology (as in DeepSeek-R1-Zero and the first RL stage), while general data is trained with reward models that capture human preferences, following the DeepSeek-V3 pipeline (a minimal reward-dispatch sketch follows this list).
  • The helpfulness reward focuses on the final summary, while the harmlessness reward evaluates the entire response of the model, to identify and mitigate any potential risks, biases, or harmful content.
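
A minimal dispatch sketch for the mixed reward signals in this stage. `rule_based_reward` and `preference_reward_model` are hypothetical callables standing in for the verifiable-answer check and the learned human-preference model.

```python
from typing import Callable, Optional

def reward_for(prompt_type: str, prompt: str, response: str,
               reference: Optional[str],
               rule_based_reward: Callable[[str, str], float],
               preference_reward_model: Callable[[str, str], float]) -> float:
    """Route each training prompt to the appropriate reward signal."""
    if prompt_type == "reasoning":
        # Verifiable domains (math, code): rule-based reward, as in DeepSeek-R1-Zero
        return rule_based_reward(response, reference)
    # General prompts: learned reward model capturing human preferences
    # (helpfulness judged on the final summary, harmlessness on the full response)
    return preference_reward_model(prompt, response)
```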

Overall Benefits of the Multi-Stage Approach

  • Enhanced Reasoning Capabilities: The iterative RL stages, coupled with the focused SFT stages, allowed DeepSeek-R1 to achieve a significant performance boost.
  • Improved Readability and Coherence: The introduction of cold-start data and the language consistency reward specifically addressed the readability and language mixing issues observed in DeepSeek-R1-Zero, making the model’s responses more user-friendly.
  • Greater Versatility: The second SFT stage, incorporating data from diverse domains, expanded DeepSeek-R1’s capabilities beyond reasoning and made it a more versatile general-purpose language model.
  • Alignment with Human Preferences: The final RL stage further refined the model to be helpful and harmless, improving its suitability for general use.

DeepSeek-R1’s multi-stage training approach combined targeted fine-tuning with iterative reinforcement learning to produce a model that was not only better at reasoning, but also more readable, versatile, and aligned with human values compared to DeepSeek-R1-Zero. Each stage of the pipeline played a critical role, leading to the improved overall performance of DeepSeek-R1.

Distillation: Empowering Small Models with Reasoning Capability

  • Qwen and Llama were fine-tuned using the 800k samples curated with DeepSeek-R1. (Models: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct)
  • This direct distillation method significantly enhances the reasoning abilities of smaller models.
  • For the distilled models, only SFT is applied, without any RL stage, in order to demonstrate the effectiveness of the distillation technique on its own (a minimal SFT sketch follows).
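
The distillation step is plain SFT of a smaller open model on the curated ~800k samples. Below is a minimal Hugging Face sketch; the dataset file name, sequence length, and training hyperparameters are illustrative assumptions, not the authors’ exact recipe.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "Qwen/Qwen2.5-Math-7B"  # one of the student models listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical JSONL file: one {"text": prompt + reasoning + summary} record per curated sample
dataset = load_dataset("json", data_files="r1_distill_800k.jsonl", split="train")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
                      batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distill-qwen-7b", num_train_epochs=2,
                           per_device_train_batch_size=1, bf16=True),
    train_dataset=dataset,
    # Causal-LM collator: labels mirror the input ids (standard next-token SFT)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
# trainer.train()  # no RL stage is applied on top of the distilled models
```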

DeepSeek-R1 Evaluation

  • DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3 on knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond due to enhanced accuracy in STEM-related questions.
  • It also performs well on FRAMES, a long-context-dependent QA task, and outperforms DeepSeek-V3 on the factual benchmark SimpleQA.
  • DeepSeek-R1 also delivers impressive results on IF-Eval and demonstrates strong performance on AlpacaEval2.0 and ArenaHard.
  • On math tasks, DeepSeek-R1 performs on par with OpenAI-o1-1217, surpassing other models.
  • A similar trend is observed on coding algorithm tasks, such as LiveCodeBench and Codeforces.

Distilled Model Evaluation:

  • The distilled models, such as DeepSeek-R1-7B, outperform non-reasoning models like GPT-4o-0513.
  • DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks.

Concluding Remarks

DeepSeek-R1 represents a successful effort in enhancing model reasoning abilities through RL. DeepSeek-R1-Zero highlights the potential of pure RL without the need for cold-start data, while DeepSeek-R1 improves on this by leveraging cold-start data alongside iterative RL fine-tuning. Distilling reasoning capabilities to smaller dense models is also highly effective, with DeepSeek-R1-Distill-Qwen-1.5B outperforming GPT-4o and Claude-3.5-Sonnet on math benchmarks.

Reference

DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
