Imagine you’re teaching a robot to write poetry. You give it a prompt, and it generates a poem. But how do you know if the robot’s poem is any good? You could compare it to a poem written by a human, but what if you have thousands of poems to check? This is where automatic evaluation metrics like BLEU and ROUGE come in—they help us measure how close a machine-generated text is to a reference (human-written) text.
Let’s break down these metrics, starting with simple analogies and building up to the math and code.
BLEU Score: The “Matching Phrases” Game
Intuitive Overview
Think of BLEU (Bilingual Evaluation Understudy) as a game of matching phrases. Imagine you have a bag of words and short phrases from a reference translation, and you want to see how many of those your machine translation can find. The more matches, the better the score.
- BLEU is like checking how many words and short phrases overlap between the predicted output and the reference.
- It’s most commonly used for evaluating machine translation, but also for summarization and other text generation tasks.
Example
Suppose the reference sentence is:
“The cat is on the mat.”
And the machine-generated sentence is:
“The cat sat on the mat.”
Let’s see how many words and short phrases (called n-grams) match.
Technical Details
1. N-gram Precision
BLEU calculates the precision of n-grams (contiguous sequences of n words) in the candidate sentence that appear in the reference.
- Unigram (1-gram): single words
- Bigram (2-gram): pairs of words
- Trigram (3-gram): triplets, etc.
For our example:
- Unigrams in candidate: the, cat, sat, on, the, mat
- Unigrams in reference: the, cat, is, on, the, mat
Count how many unigrams in the candidate appear in the reference (with a cap on how many times each word can be counted, to avoid cheating by repetition).
2. Modified Precision
To avoid inflating the score by repeating words, BLEU uses “modified precision”: each n-gram in the candidate is only counted up to the maximum number of times it appears in any reference.
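As a rough sketch of how this clipped counting works for unigrams (hand-rolled; the helper name clipped_precision is ours, not a library function):
from collections import Counter
def clipped_precision(candidate_tokens, reference_tokens):
    # Count each candidate token, but clip its count at the number of times
    # it appears in the reference, so repetition cannot inflate the score.
    cand_counts = Counter(candidate_tokens)
    ref_counts = Counter(reference_tokens)
    clipped = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return clipped / len(candidate_tokens)
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(clipped_precision(candidate, reference))  # 5/6 ≈ 0.83 ("sat" has no match)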
3. Brevity Penalty
If the candidate is much shorter than the reference, it gets penalized. This prevents the model from just outputting a few common words.
4. BLEU Formula
The BLEU score is calculated as:
\[
\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^N w_n \log p_n \right)
\]
- \( p_n \): modified precision for n-grams
- \( w_n \): weight for each n-gram (often uniform, e.g., 0.25 for up to 4-grams)
- BP: brevity penalty
\[
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{(1 - r/c)} & \text{if } c \leq r
\end{cases}
\]
where \( c \) is the length of the candidate, \( r \) is the length of the reference.
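Putting the pieces together, here is a minimal from-scratch sketch of the formula above (single reference, uniform weights, no smoothing; the function names ngrams, modified_precision, and bleu are ours, not a library API):
import math
from collections import Counter
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
def modified_precision(candidate, reference, n):
    # Clipped n-gram counts, as described above.
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)
def bleu(candidate, reference, max_n=4):
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c).
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    weights = [1 / max_n] * max_n
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision makes the whole score zero
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(bleu(candidate, reference, max_n=2))  # geometric mean of 5/6 and 3/5, BP = 1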
Practical Code Example (Python, NLTK)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# BLEU-1 (unigram)
bleu1 = sentence_bleu(reference, candidate, weights=(1.0,), smoothing_function=SmoothingFunction().method1)
# BLEU-4 (default)
bleu4 = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU-1: {bleu1:.2f}") # Output: 0.83 (Why? "the", "cat", "on", "the", "mat" from candidate matches with words from reference, thus 5/6 = 0.83)
print(f"BLEU-4: {bleu4:.2f}") # Output: 0.25 (Why? let's discuss.)
For BLEU-4, we look at all possible 4-grams in both sentences:
Reference 4-grams:
the cat is on
cat is on the
is on the mat
Candidate 4-grams:
the cat sat on
cat sat on the
sat on the mat
None of the candidate’s 4-grams match any of the reference’s 4-grams. Without smoothing, that zero 4-gram precision would drag the whole score to zero, because BLEU-4 is a geometric mean of the unigram, bigram, trigram, and 4-gram precisions. The smoothing function we specified (SmoothingFunction().method1) replaces the zero count with a small value, so the 4-gram precision is tiny but nonzero, and the resulting BLEU-4 score is 0.25.
In summary, the BLEU-4 score is low (0.25) because the candidate shares no 4-grams with the reference, and it avoids being exactly zero only because of smoothing. This demonstrates how BLEU-4 is much stricter than BLEU-1, which only considers single-word matches.
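If you want to see the individual n-gram precisions, NLTK exposes a modified_precision helper in nltk.translate.bleu_score (the same numbers can also be computed by hand, as in the earlier sketch):
from nltk.translate.bleu_score import modified_precision
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
for n in range(1, 5):
    # modified_precision returns a Fraction; convert it for printing.
    p_n = modified_precision(reference, candidate, n)
    print(f"p_{n} = {float(p_n):.2f}")  # p_1=0.83, p_2=0.60, p_3=0.25, p_4=0.00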
To correctly compute BLEU-1 with sentence_bleu, you must set weights=(1.0,). Without this, the function does not measure unigram overlap; it falls back to the default BLEU-4 calculation, whose default weights are weights=(0.25, 0.25, 0.25, 0.25). Similarly, to compute BLEU-2 you would use weights=(0.5, 0.5), as in the snippet below.
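For example, a quick BLEU-2 illustration on the same sentences (repeating the setup so the snippet is self-contained):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# BLEU-2: equal weight on unigram and bigram precision
bleu2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5), smoothing_function=SmoothingFunction().method1)
print(f"BLEU-2: {bleu2:.2f}")  # ≈ 0.71: geometric mean of 5/6 and 3/5, brevity penalty 1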
ROUGE Score: The “Recall-Oriented” Lens
Intuitive Overview
If BLEU is about “how much of my output matches the reference,” ROUGE (Recall-Oriented Understudy for Gisting Evaluation) asks, “how much of the reference did I cover in my output?” It’s especially popular for summarization tasks.
- ROUGE measures how much of the reference’s content is captured by the candidate.
- It’s recall-focused: did you remember to include the important stuff?
Example
Reference summary:
“The cat is on the mat.”
Candidate summary:
“The cat sat on the mat.”
ROUGE will check how many words and phrases from the reference appear in the candidate.
Technical Details
There are several ROUGE variants, but the most common are:
- ROUGE-N: Overlap of n-grams (like BLEU, but recall-based)
- ROUGE-L: Longest Common Subsequence (LCS) between candidate and reference
- ROUGE-S: Skip-bigram based ROUGE
1. ROUGE-N (Recall)
\[
\text{ROUGE-N} = \frac{\text{Number of overlapping n-grams}}{\text{Total n-grams in reference}}
\]
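A minimal hand-rolled sketch of ROUGE-N recall (the function names are ours; the rouge-score package used below also reports precision and an F-measure):
from collections import Counter
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
def rouge_n_recall(candidate, reference, n):
    # Overlapping n-grams (clipped counts) divided by the reference n-gram total.
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / sum(ref.values())
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(rouge_n_recall(candidate, reference, 1))  # 5/6 ≈ 0.83
print(rouge_n_recall(candidate, reference, 2))  # 3/5 = 0.60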
2. ROUGE-L (LCS)
Measures the length of the longest common subsequence between candidate and reference.
\[
\text{ROUGE-L} = \frac{\text{LCS length}}{\text{Total tokens in reference}}
\]
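The formula above is the recall form; packages such as rouge-score report an F-measure built from LCS-based precision and recall. A minimal hand-rolled sketch of the recall form:
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
lcs = lcs_length(candidate, reference)
print(lcs / len(reference))  # LCS is "the cat on the mat" (5 tokens), so 5/6 ≈ 0.83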
3. ROUGE-S (Skip-Bigram)
Measures the overlap of skip-bigrams between candidate and reference.
\[
\text{ROUGE-S} = \frac{\text{Number of overlapping skip-bigrams}}{\text{Total skip-bigrams in reference}}
\]
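A skip-bigram is any ordered pair of words from a sentence, allowing gaps in between. A minimal hand-rolled sketch of skip-bigram recall (unlimited gap size; real implementations often cap the gap):
from collections import Counter
from itertools import combinations
def skip_bigrams(tokens):
    # All ordered word pairs, with any number of words allowed between them.
    return Counter(combinations(tokens, 2))
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
cand, ref = skip_bigrams(candidate), skip_bigrams(reference)
overlap = sum(min(c, cand[p]) for p, c in ref.items())
print(overlap / sum(ref.values()))  # skip-bigram recall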
Practical Code Example (Python, rouge-score)
from rouge_score import rouge_scorer
reference = "The cat is on the mat."
candidate = "The cat sat on the mat."
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2'], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.2f}") # Output: 0.83 (Why? "the", "cat", "on", "the", "mat" from candidate matches with words from reference, thus 5/6 = 0.83)
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.2f}") # Output: 0.60 (Why? "the cat", "on the", "the mat" from candidate matches with words from reference, thus 3/5 = 0.60)
BLEU vs. ROUGE: When to Use Which?
- BLEU is precision-oriented and widely used for machine translation.
- ROUGE is recall-oriented and popular for summarization.
In practice: For translation, you want to avoid adding extra, incorrect information (precision), so BLEU is preferred. For summarization, you want to make sure you cover all the important points (recall), so ROUGE is preferred.
Summary Table
Metric | Focus | Measures
---|---|---
BLEU | Precision | n-gram overlap
ROUGE | Recall | n-gram / LCS overlap
Both BLEU and ROUGE are imperfect but useful tools. They give us a quick, automated way to compare machine-generated text to human references. But remember: the best evaluation is still a human reader!