Smoltalk: Dataset Behind SmolLM2’s Success

  • Hugging Face has unveiled Smoltalk, the dataset behind the exceptional performance of its latest language model, SmolLM2.
  • It is a mix of synthetic and publicly available datasets designed for supervised fine-tuning (SFT) of LLMs.
  • It contains roughly 1 million samples.

Key Features of the Dataset

The dataset used to train SmolLM2 is a critical component of its success. Here are its main features:

  • Diversity: The dataset includes a wide range of text types, ensuring that the model can understand and generate diverse content.
  • Size: A substantial volume of data was utilized, allowing the model to learn from a rich variety of linguistic patterns and contexts.
  • Quality: The data was carefully curated to maintain high quality, which is essential for training effective machine learning models.

Dataset Composition

  • New Datasets
    • Smol-Magpie-Ultra (400k samples): Core component, focuses on diverse tasks (instruction following, editing, rewriting, summarization).
    • Smol-constraints (36k samples): Trains the model to follow specific formatting instructions.
    • Smol-rewrite (50k samples): Focuses on text rewriting tasks (tone adjustment).
    • Smol-summarize (100k samples): Specialized in email and news summarization.
  • Existing Public Datasets (for specific capabilities)
    • OpenHermes2.5 (100k samples): Improves benchmarks like MMLU, WinoGrande, BBH.
    • MetaMathQA (50k samples): Enhances math and reasoning skills.
    • NuminaMath-CoT (subset): Improves performance on math problems.
    • Self-Oss-Starcoder2-Instruct (subset): Improves coding abilities.
    • SystemChats2.0 (30k samples): Enhances model’s support for system prompts.
    • LongAlign (English, <16k tokens): Improves long-context understanding.
    • Everyday-conversations (subset): Multi-turn conversations for general understanding.
    • APIGen-Function-Calling (80k samples): Improves function calling skills.
    • Explore-Instruct-Rewriting (30k samples): Additional rewriting data.
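As a rough sketch, the sample counts stated above can be tallied to see how much of the mix comes from the newly created subsets. Subsets listed without an explicit count (NuminaMath-CoT, Self-Oss-Starcoder2-Instruct, LongAlign, Everyday-conversations) are omitted, so the totals are approximate:

```python
# Stated sample counts from the breakdown above; subsets whose size is
# given only as "subset" are left out, so this is an approximation.
new_subsets = {
    "smol-magpie-ultra": 400_000,
    "smol-constraints": 36_000,
    "smol-rewrite": 50_000,
    "smol-summarize": 100_000,
}
public_subsets = {
    "openhermes2.5": 100_000,
    "metamathqa": 50_000,
    "systemchats2.0": 30_000,
    "apigen-function-calling": 80_000,
    "explore-instruct-rewriting": 30_000,
}

new_total = sum(new_subsets.values())        # 586,000
public_total = sum(public_subsets.values())  # 290,000
total = new_total + public_total

print(f"new: {new_total:,}  public: {public_total:,}  counted: {total:,}")
print(f"share from new subsets: {new_total / total:.0%}")
```

By this count, about two thirds of the explicitly sized samples come from the four newly created subsets.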

You can load the dataset with the Hugging Face datasets library:

from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
# To load the train split of a specific subset, such as smol-magpie-ultra:
ds = load_dataset("HuggingFaceTB/smoltalk", "smol-magpie-ultra", split="train")

Impact on AI Development

The release of this dataset recipe not only highlights the importance of quality data in training AI models but also sets a precedent for future research.
By sharing the methodology behind Smoltalk, the team encourages collaboration and innovation within the AI community.
