
- Smoltalk dataset has been unveiled, which contributed to the exceptional performance of its latest language model “SmolLM2“.
- This is a mix of synthetic and publicly available dataset designed for supervised finetuning (SFT) of LLMs.
- It contains 1 Million samples.
Key Features of the Dataset
The dataset used to train Smollm2 is a critical component of its success. Here are the main features:
- Diversity: The dataset includes a wide range of text types, ensuring that the model can understand and generate diverse content.
- Size: A substantial volume of data was utilized, allowing the model to learn from a rich variety of linguistic patterns and contexts.
- Quality: The data was carefully curated to maintain high quality, which is essential for training effective machine learning models.
Dataset Composition
- New Datasets
- Smol-Magpie-Ultra (400k samples): Core component, focuses on diverse tasks (instruction following, editing, rewriting, summarization).
- Smol-contraints (36k samples): Trains model to follow specific formatting instructions.
- Smol-rewrite (50k samples): Focuses on text rewriting tasks (tone adjustment).
- Smol-summarize (100k samples): Specialized in email and news summarization.
- Existing Public Datasets (for specific capabilities)
- OpenHermes2.5 (100k samples): Improves benchmarks like MMLU, WinoGrande, BBH.
- MetaMathQA (50k samples): Enhances math and reasoning skills.
- NuminaMath-CoT (subset): Improves performance on math problems.
- Self-Oss-Starcoder2-Instruct (subset): Improves coding abilities.
- SystemChats2.0 (30k samples): Enhances model’s support for system prompts.
- LongAlign (English, <16k tokens): Improves long-context understanding.
- Everyday-conversations (subset): Multi-turn conversations for general understanding.
- APIGen-Function-Calling (80k samples): Improves function calling skills.
- Explore-Instruct-Rewriting (30k samples): Additional rewriting data.
You can load a dataset using
from datasets import load_dataset
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
# to load the train split of a specific subset such as smol-magpie-ultra, you can do
ds = load_dataset("HuggingFaceTB/smoltalk", "smol-magpie-ultra", split="train")
Impact on AI Development
The release of this dataset recipe not only highlights the importance of quality data in training AI models but also sets a precedent for future research.
By sharing insights into their methodology, Smoltalk encourages collaboration and innovation within the AI community.