Imagine you are reading an ancient text, and you come across a word you have never seen before, like “scribe-craft”. While you do not know its exact dictionary definition, you can infer its meaning. You see the parts “scribe” and “craft”, and you immediately think of the skill or art of writing. This is because you are not just looking at the word as a single, indivisible unit; you are analyzing its components.
Traditional word embedding models, like Word2Vec, are like a reader who can only recognize whole words. If they encounter a word not in their vocabulary, they are stumped. They have no way to guess its meaning. This is the “Out-of-Vocabulary” (OOV) problem, a significant challenge in NLP.
This is where FastText, a library developed by Facebook’s AI Research (FAIR) lab, changes the game. FastText takes a page from our own intuitive book: it learns to understand words by breaking them down into smaller pieces. This simple yet powerful idea allows it to generate surprisingly accurate representations for new, rare, or even misspelled words, making it an indispensable tool for real-world NLP applications.
In this article, we will dive deep into FastText. We will start with the intuition, build up to the technical details, and finish with practical code examples and best practices.
The Core Idea: Words are Made of Pieces
The breakthrough of FastText is its use of subword information. Instead of learning a distinct vector for each word, FastText represents a word as a collection of its character n-grams (Bojanowski et al., Enriching Word Vectors with Subword Information).
What is a character n-gram? It is simply a sequence of ‘n’ characters. For example, let us take the word “vector” and a character n-gram size of 3 (a trigram). The trigrams are: [“vec”, “ect”, “cto”, “tor”]
FastText also adds special boundary symbols, < and >, at the beginning and end of the word. This helps capture prefixes and suffixes. For “vector” with n=3, this would look like: [“<ve”, “vec”, “ect”, “cto”, “tor”, “or>”]
Finally, the full word itself, wrapped in the boundary symbols (here, "&lt;vector&gt;"), is also included as a special sequence.
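To make this concrete, here is a minimal sketch of such an extraction step; the char_ngrams helper is purely illustrative and is not part of the fasttext API:
def char_ngrams(word, n=3):
    """Return character n-grams of `word`, with < and > as boundary symbols."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    grams.append(wrapped)  # the full word itself is kept as a special sequence
    return grams
print(char_ngrams("vector"))
# ['<ve', 'vec', 'ect', 'cto', 'tor', 'or>', '<vector>']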

This bag of n-grams becomes the new representation for the word. The beauty of this approach is that different words can share n-grams, and this shared information helps the model understand their semantic relationship. For instance, “learning” and “teaching” both contain the n-gram “ing”, which hints at them being verbs or processes.
In supervised classification tasks, FastText can also incorporate entire word n-grams (sequence of whole words) as additional features to capture some word order information.
How FastText Learns: The Model Architecture
FastText’s learning process is heavily inspired by Word2Vec’s Skip-gram model. The core idea of Skip-gram is to use a word to predict its surrounding context words. In Skip-gram, you take a word from your text (the target word) and try to predict a nearby word (the context word). The model learns by adjusting its word vectors to get better at this prediction task.
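As a rough sketch (not fastText's internal code), this is how (target, context) training pairs could be generated from a tokenized sentence. A fixed window of 2 is used here for simplicity; the real implementation typically samples the window size per position.
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within a fixed window around each position."""
    for t, target in enumerate(tokens):
        for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if c != t:
                yield target, tokens[c]
sentence = "fasttext learns word vectors from text".split()
print(list(skipgram_pairs(sentence, window=2))[:5])
# [('fasttext', 'learns'), ('fasttext', 'word'), ('learns', 'fasttext'), ...]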
FastText adopts this very same architecture but with one crucial modification.
The FastText Twist
In a standard Skip-gram model, the input is a single vector representing the target word. In FastText, the input vector for a word is the sum of the vectors of all its character n-grams.
Let’s make this concrete.
- Let’s say the word is \(w\).
- We find all its character n-grams, which we will call the set \(G_w\).
- Each n-gram \(g\) in \(G_w\) has its own vector, \(v_g\).
- The final vector for the word \(w\), let’s call it \(v_w\), is simply the sum:
$$ v_w = \sum_{g \in G_w} v_g $$
This summed vector is then fed into the Skip-gram architecture to predict the context words. The model’s error is backpropagated to update the vectors \(v_g\) for each of the n-grams.
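Here is a minimal sketch of this composition step, reusing the char_ngrams helper from the earlier snippet and a hypothetical ngram_vectors lookup table; real fastText hashes n-grams into a fixed number of buckets instead of storing one vector per distinct n-gram.
import numpy as np
dim = 100
rng = np.random.default_rng(0)
# Hypothetical lookup table: one vector per n-gram seen so far.
ngram_vectors = {}
def ngram_vector(gram):
    """Return (and lazily initialize) the vector for a single n-gram."""
    if gram not in ngram_vectors:
        ngram_vectors[gram] = rng.normal(scale=0.1, size=dim)
    return ngram_vectors[gram]
def word_vector(word):
    """v_w = sum of the vectors of all character n-grams of the word."""
    return sum(ngram_vector(g) for g in char_ngrams(word))
v_learning = word_vector("learning")
print(v_learning.shape)  # (100,)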
Objective Function
Instead of forcing the model to normalize over the entire vocabulary with a softmax, FastText treats each target–context pair as an independent binary classification problem. For a target word at position \(t\), every true context word is a positive example; a small set of words sampled from a noise distribution are negative examples. The model simply learns to assign high probability to real context pairs and low probability to sampled negatives.
Let \(w_t\) be the target word and \(w_c\) be the observed (positive) context word. Draw \(K\) negative samples \({w_{n1}, \ldots, w_{nK}}\) from a noise distribution. Define the scoring function as the dot product between the target vector and a context vector:
$$ s(w_t, w) = v_{w}^T v_{w_t}, $$
where, in FastText, \(v_{w_t} = \sum_{g \in G_{w_t}} v_g\) (the sum of its n-gram vectors).
Using the binary logistic (negative sampling) loss, the negative log-likelihood for this training instance is
$$ L = -\log \sigma\big(s(w_t, w_c)\big) - \sum_{i=1}^{K} \log \sigma\big(-s(w_t, w_{ni})\big), $$
where \(\sigma(x) = \frac{1}{1 + e^{-x}}\). Intuitively, the first term pushes the model to increase the score for the true context, while the summed terms push the scores for sampled negatives down.
Substituting the score function, the loss for a single context word can be written as:
$$ L = -\log \sigma(v_{w_c}^T v_{w_t}) - \sum_{i=1}^K \log \sigma(-v_{w_{ni}}^T v_{w_t}) $$
$$ = \log\left(1 + e^{-v_{w_c}^T v_{w_t}}\right) + \sum_{i=1}^K \log\left(1 + e^{v_{w_{ni}}^T v_{w_t}}\right) $$
Thus, the overall cost function is a sum over all target–context pairs in the corpus (with \(C_t\) denoting the indices of the context words around position \(t\)):
$$ \sum_{t=1}^{T} \sum_{c \in C_t} \left[ \log(1 + e^{-v_{w_c}^T v_{w_t}}) + \sum_{i=1}^K \log(1 + e^{v_{w_{ni}}^T v_{w_t}}) \right] $$
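To tie the formulas together, here is a small numerical sketch of this loss for one target–context pair with K = 5 sampled negatives; the vectors are random stand-ins rather than trained embeddings:
import numpy as np
rng = np.random.default_rng(1)
dim = 100
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
v_target = rng.normal(size=dim)          # summed n-gram vector of the target word
v_context = rng.normal(size=dim)         # vector of the true context word
v_negatives = rng.normal(size=(5, dim))  # K = 5 sampled negative words
# L = -log sigma(s(w_t, w_c)) - sum_i log sigma(-s(w_t, w_ni))
loss = -np.log(sigmoid(v_context @ v_target))
loss -= np.sum(np.log(sigmoid(-(v_negatives @ v_target))))
print(float(loss))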
Beyond Embeddings: FastText for Text Classification
While FastText is famous for its subword embeddings, it is also an incredibly fast and effective text classifier. The architecture it uses for classification is surprisingly simple yet powerful.
The core idea is to represent a sentence or document by averaging the vectors of all the words (and their n-grams) within it. This single, averaged vector is then fed into a simple linear classifier (a fully connected layer with a softmax activation function) to predict the label.
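A stripped-down sketch of that pipeline, with a toy vocabulary and random weights standing in for what fastText would actually learn:
import numpy as np
rng = np.random.default_rng(2)
dim, n_labels = 100, 2
vocab = {"this": 0, "movie": 1, "was": 2, "awesome": 3}
embeddings = rng.normal(size=(len(vocab), dim))  # word (and n-gram) vectors
W = rng.normal(size=(n_labels, dim))             # linear classifier weights
def predict(tokens):
    """Average the token vectors, then apply a linear layer + softmax."""
    hidden = np.mean([embeddings[vocab[t]] for t in tokens], axis=0)
    logits = W @ hidden
    return np.exp(logits) / np.sum(np.exp(logits))
print(predict("this movie was awesome".split()))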
To keep the model fast, FastText uses a few tricks:
- Hierarchical Softmax: Instead of calculating the probability over every single class (which can be slow if you have thousands of classes), it uses a tree structure. This turns a massive single decision into a series of much faster binary decisions (a toy illustration follows this list).
- N-gram Features: To capture word order, which is lost when averaging embeddings, FastText can also treat entire word n-grams (e.g., “was awesome”) as additional features.
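Below is a toy illustration of the hierarchical softmax idea: a label's probability is the product of binary (sigmoid) decisions along its path from the root. The tree, node vectors, and label paths here are made up for illustration; fastText builds its actual tree from label frequencies (Huffman coding).
import numpy as np
rng = np.random.default_rng(3)
dim = 100
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
# Toy tree over 4 labels: 3 internal nodes, each with its own parameter vector.
node_vectors = rng.normal(size=(3, dim))
# For each label: list of (internal node index, direction), +1 = left, -1 = right.
paths = {
    "sports":   [(0, +1), (1, +1)],
    "politics": [(0, +1), (1, -1)],
    "tech":     [(0, -1), (2, +1)],
    "movies":   [(0, -1), (2, -1)],
}
def label_probability(hidden, label):
    """P(label) = product of sigmoid(direction * node . hidden) along the path."""
    p = 1.0
    for node, direction in paths[label]:
        p *= sigmoid(direction * (node_vectors[node] @ hidden))
    return p
hidden = rng.normal(size=dim)  # e.g. the averaged document vector
print({label: round(label_probability(hidden, label), 4) for label in paths})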
The result is a classifier that is often on par with deep learning models in terms of accuracy but is orders of magnitude faster to train and run.
Getting Started: A Practical Guide
Let’s move from theory to practice. Using FastText is straightforward, whether you are using a pre-trained model or training your own.
First, install the fasttext Python package.
pip install fasttext
Using Pre-trained Models
The FastText team provides pre-trained models for over 150 languages. These are a great starting point.
import fasttext
import fasttext.util
# Download the pre-trained model for English (this will take a while)
# The model file is several gigabytes, so make sure you have enough disk space
fasttext.util.download_model('en', if_exists='ignore')
# Load the model
ft = fasttext.load_model('cc.en.300.bin')
# Get the vector for a word
vector_learning = ft.get_word_vector('learning')
print(f"Dimension of 'learning' vector: {len(vector_learning)}")
# Get a vector for an out-of-vocabulary word
vector_oov = ft.get_word_vector('supercalifragilisticexpialidocious')
print(f"Successfully created a vector for an OOV word!")
# Find nearest neighbors
neighbors = ft.get_nearest_neighbors('awesome')
print("\nNeighbors of 'awesome':")
for score, neighbor in neighbors:
    print(f"- {neighbor} (Score: {score:.4f})")
Training Your Own Embeddings
Training your own model is useful when you have a domain-specific corpus (e.g., medical texts, legal documents).
First, you need a clean text file (data.txt) where each line is a sentence.
import fasttext
# Create a dummy data.txt file for demonstration
with open('data.txt', 'w') as f:
    f.write("my_word_from_corpus is a word from the corpus. Another word from the corpus is corpus.")
    f.write("\nThis is another sentence for training fastText model.")
# Train a Skip-gram model
# This will create a file 'model.bin' with the trained model
# Setting minCount=1 to avoid 'Empty vocabulary' error with small datasets
model = fasttext.train_unsupervised('data.txt', model='skipgram', minCount=1)
# Get a word vector
print(model.get_word_vector('my_word_from_corpus'))
print(model.get_word_vector('corpus'))
# You can also tune hyperparameters
# - dim: size of the vectors (default: 100)
# - ws: size of the context window (default: 5)
# - minn: min length of character n-gram (default: 3)
# - maxn: max length of character n-gram (default: 6)
model_tuned = fasttext.train_unsupervised('data.txt', model='skipgram', dim=150, ws=5, minn=2, maxn=5, minCount=1)
Training a Text Classifier
To train a classifier, your data file needs to be in a specific format. Each line should contain the text preceded by __label__ and the class name.
import fasttext
# Create dummy train_data.txt and test_data.txt for demonstration
# Format: __label__<label_name> <text>
with open('train_data.txt', 'w') as f:
    f.write('__label__positive This was a great film.\n')
    f.write('__label__negative This movie was terrible.\n')
    f.write('__label__positive I loved every moment of it.\n')
    f.write('__label__negative What a waste of time.\n')
    f.write('__label__positive The plot was captivating and the acting superb.\n')
    f.write('__label__negative I would not recommend this to anyone, completely boring.\n')
    f.write('__label__positive A truly enjoyable experience, definitely worth watching.\n')
    f.write('__label__negative So disappointing, a complete mess from start to finish.\n')
with open('test_data.txt', 'w') as f:
    f.write('__label__positive An excellent cinematic experience.\n')
    f.write('__label__negative Absolutely dreadful.\n')
    f.write('__label__positive Fantastic movie with a strong message.\n')
    f.write('__label__negative Avoid at all costs, a waste of precious time.\n')
# Train the supervised model with minCount=1 to prevent NaN issues with small datasets
# This will create a file 'classifier_model.bin'
classifier = fasttext.train_supervised(input='train_data.txt', minCount=1)
# Predict a label for a new sentence
prediction = classifier.predict("This was a great film.")
print(prediction)
# Predict with probability
prediction_k = classifier.predict("This was a great film.", k=2) # Get top 2 labels
print(prediction_k)
# Evaluate the model on a test set
test_metrics = classifier.test('test_data.txt')
print(f"\nTest Metrics:")
print(f"Precision: {test_metrics[1]:.4f}")
print(f"Recall: {test_metrics[2]:.4f}")Practical Tips and Best Practices
- Start with Pre-trained Models: Unless you have a very large and specific corpus, pre-trained models are a powerful baseline. They have been trained on massive web-scale datasets like Common Crawl.
- Preprocessing: FastText handles tokenization for you, but you should still perform basic cleaning like lowercasing your text and removing punctuation to ensure consistency; a minimal cleaning sketch follows this list.
- Hyperparameter Tuning: For classification, the learning rate (lr), epoch count (epoch), and word n-gram size (wordNgrams) are the most important parameters to tune. A higher wordNgrams value (e.g., 2 or 3) helps the model capture some word order and can significantly improve accuracy.
- Vector Dimension: For embeddings, a dimension between 100 and 300 is usually a good range. Larger dimensions can capture more nuance but require more data and are slower to train.
- Subword N-gram Range: The default range for character n-grams (minn=3, maxn=6) works well for most languages. For morphologically rich languages (like German or Turkish), you might want to experiment with a wider range.
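Putting a couple of these tips together, here is a minimal sketch that lowercases and strips punctuation before training, then fits a classifier with the parameters mentioned above. The clean helper, the train_clean.txt file, and the particular values of lr, epoch, and wordNgrams are illustrative choices, not prescribed settings.
import re
import fasttext
def clean(text):
    """Lowercase and strip punctuation so 'Great!' and 'great' map to the same token."""
    return re.sub(r"[^\w\s]", " ", text.lower())
# Write a cleaned copy of the training file (labels are kept as-is).
with open('train_data.txt') as src, open('train_clean.txt', 'w') as dst:
    for line in src:
        label, _, text = line.partition(' ')
        dst.write(f"{label} {clean(text).strip()}\n")
# lr, epoch, and wordNgrams are the usual first parameters to tune.
classifier = fasttext.train_supervised(
    input='train_clean.txt', lr=0.5, epoch=25, wordNgrams=2, minCount=1)
print(classifier.predict(clean("This was a great film!")))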
Conclusion
FastText is a testament to the power of simple ideas. By treating words as compositions of their parts, it elegantly solves the out-of-vocabulary problem and provides a robust way to represent language. Its speed and efficiency, for both embedding generation and classification, make it a go-to tool for NLP practitioners.
While more complex models like BERT and GPT have since emerged, they come with immense computational costs. FastText remains a highly competitive and practical choice for many applications, proving that sometimes, the most effective solutions are the ones that are both clever and incredibly efficient.
References
- P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
- A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
- fasttext documentation