Smoltalk: Dataset Behind SmolLM2’s Success

  • Hugging Face has unveiled the Smoltalk dataset, which contributed to the strong performance of its latest language model, SmolLM2.
  • It is a mix of synthetic and publicly available datasets designed for supervised fine-tuning (SFT) of LLMs.
  • It contains about 1 million samples.

Key Features of the Dataset

The dataset used to train SmolLM2 is a critical component of its success. Here are its main features:

  • Diversity: The dataset includes a wide range of text types, ensuring that the model can understand and generate diverse content.
  • Size: A substantial volume of data was utilized, allowing the model to learn from a rich variety of linguistic patterns and contexts.
  • Quality: The data was carefully curated to maintain high quality, which is essential for training effective machine learning models.

Dataset Composition

  • New Datasets
    • Smol-Magpie-Ultra (400k samples): Core component, focuses on diverse tasks (instruction following, editing, rewriting, summarization).
    • Smol-constraints (36k samples): Trains the model to follow specific formatting instructions.
    • Smol-rewrite (50k samples): Focuses on text rewriting tasks (e.g., tone adjustment).
    • Smol-summarize (100k samples): Specialized in email and news summarization.
  • Existing Public Datasets (for specific capabilities)
    • OpenHermes2.5 (100k samples): Improves benchmarks like MMLU, WinoGrande, BBH.
    • MetaMathQA (50k samples): Enhances math and reasoning skills.
    • NuminaMath-CoT (subset): Improves performance on math problems.
    • Self-Oss-Starcoder2-Instruct (subset): Improves coding abilities.
    • SystemChats2.0 (30k samples): Enhances model’s support for system prompts.
    • LongAlign (English, <16k tokens): Improves long-context understanding.
    • Everyday-conversations (subset): Multi-turn conversations for general understanding.
    • APIGen-Function-Calling (80k samples): Improves function calling skills.
    • Explore-Instruct-Rewriting (30k samples): Additional rewriting data.
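To get a feel for how such a mixture is weighted, the sketch below allocates a fixed training budget proportionally across the per-subset sample counts listed above. The allocation logic is purely illustrative and is not the actual recipe used to assemble Smoltalk; the counts are the approximate figures from the list.

```python
# Approximate per-subset sample counts from the composition above.
subset_sizes = {
    "smol-magpie-ultra": 400_000,
    "smol-constraints": 36_000,
    "smol-rewrite": 50_000,
    "smol-summarize": 100_000,
    "openhermes-2.5": 100_000,
    "metamathqa": 50_000,
    "systemchats-2.0": 30_000,
    "apigen-80k": 80_000,
    "explore-instruct-rewriting": 30_000,
}

def allocate_budget(sizes, budget):
    """Split a sample budget across subsets in proportion to their sizes
    (hypothetical mixing logic, for illustration only)."""
    total = sum(sizes.values())
    return {name: round(budget * n / total) for name, n in sizes.items()}

alloc = allocate_budget(subset_sizes, budget=10_000)
# Smol-Magpie-Ultra, the core component, takes the largest share.
print(max(alloc, key=alloc.get))
```

A real mixture would also deduplicate and shuffle across subsets, but proportional weighting like this is the usual starting point.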

You can load the dataset using the Hugging Face datasets library:

from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
# To load the train split of a specific subset, such as smol-magpie-ultra:
ds = load_dataset("HuggingFaceTB/smoltalk", "smol-magpie-ultra", split="train")
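Each sample stores a conversation as a list of role-tagged chat messages; for SFT you would normally render these with your tokenizer's chat template (e.g., tokenizer.apply_chat_template). The self-contained sketch below uses a hard-coded record and a simplified template so it runs without downloading anything; check the exact field layout against the dataset card.

```python
# A record shaped like Smoltalk's "messages" field (hard-coded here for illustration).
record = {
    "messages": [
        {"role": "user", "content": "Summarize: The meeting is moved to 3 pm on Friday."},
        {"role": "assistant", "content": "The meeting was rescheduled to Friday at 3 pm."},
    ]
}

def render_chat(messages):
    """Render role-tagged turns into a single training string
    (a simplified stand-in for a tokenizer chat template)."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|end|>"

text = render_chat(record["messages"])
print(text.splitlines()[0])  # first line is the opening role tag
```

In practice the template, special tokens, and end-of-turn markers come from the model's tokenizer, not from hand-written formatting like this.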

Impact on AI Development

The release of this dataset recipe highlights the importance of quality data in training AI models and sets a precedent for future research.
By sharing its methodology openly, the Smoltalk team encourages collaboration and innovation within the AI community.

Silpa brings 5 years of experience in working on diverse ML projects, specializing in designing end-to-end ML systems tailored for real-time applications. Her background in statistics (Bachelor of Technology) provides a strong foundation for her work in the field. Silpa is also the driving force behind the development of the content you find on this site.
