Imagine you’re teaching a robot to write poetry. You give it a prompt, and it generates a poem. But how do you know if the robot’s poem is any good? You could compare it to a poem written by a human, but what if you have thousands of poems to check? This is where automatic evaluation metrics like BLEU and ROUGE come in—they help us measure how close a machine-generated text is to a reference (human-written) text.
Let’s break down these metrics, starting with simple analogies and building up to the math and code.
BLEU Score: The “Matching Phrases” Game
Intuitive Overview
Think of BLEU (Bilingual Evaluation Understudy) as a game of matching phrases. Imagine you have a bag of words and short phrases from a reference translation, and you want to see how many of those your machine translation can find. The more matches, the better the score.
- BLEU is like checking how many words and short phrases overlap between the predicted output and the reference.
- It’s most commonly used for evaluating machine translation, but also for summarization and other text generation tasks.
Example
Suppose the reference sentence is:
“The cat is on the mat.”
And the machine-generated sentence is:
“The cat sat on the mat.”
Let’s see how many words and short phrases (called n-grams) match.
Technical Details
1. N-gram Precision
BLEU calculates the precision of n-grams (contiguous sequences of n words) in the candidate sentence that appear in the reference.
- Unigram (1-gram): single words
- Bigram (2-gram): pairs of words
- Trigram (3-gram): triplets, etc.
For our example:
- Unigrams in candidate: the, cat, sat, on, the, mat
- Unigrams in reference: the, cat, is, on, the, mat
Count how many unigrams in the candidate appear in the reference (with a cap on how many times each word can be counted, to avoid cheating by repetition).
2. Modified Precision
To avoid inflating the score by repeating words, BLEU uses “modified precision”: each n-gram in the candidate is only counted up to the maximum number of times it appears in any reference.
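As a rough sketch of how this clipped counting works for unigrams (hand-rolled; the helper name clipped_precision is ours, not a library function):
from collections import Counter
def clipped_precision(candidate_tokens, reference_tokens):
    # Count each candidate token, but clip its count at the number of times
    # it appears in the reference, so repetition cannot inflate the score.
    cand_counts = Counter(candidate_tokens)
    ref_counts = Counter(reference_tokens)
    clipped = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return clipped / len(candidate_tokens)
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(clipped_precision(candidate, reference))  # 5/6 ≈ 0.83 ("sat" has no match)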
3. Brevity Penalty
If the candidate is much shorter than the reference, it gets penalized. This prevents the model from just outputting a few common words.
4. BLEU Formula
The BLEU score is calculated as:
\[
\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^N w_n \log p_n \right)
\]
- \( p_n \): modified precision for n-grams
- \( w_n \): weight for each n-gram (often uniform, e.g., 0.25 for up to 4-grams)
- BP: brevity penalty
\[
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{(1 - r/c)} & \text{if } c \leq r
\end{cases}
\]
where \( c \) is the length of the candidate, \( r \) is the length of the reference.
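Putting the pieces together, here is a minimal from-scratch sketch of the formula above (single reference, uniform weights, no smoothing; the function names ngrams, modified_precision, and bleu are ours, not a library API):
import math
from collections import Counter
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
def modified_precision(candidate, reference, n):
    # Clipped n-gram counts, as described above.
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)
def bleu(candidate, reference, max_n=4):
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c).
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    weights = [1 / max_n] * max_n
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision makes the whole score zero
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(bleu(candidate, reference, max_n=2))  # geometric mean of 5/6 and 3/5, BP = 1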
Practical Code Example (Python, NLTK)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# BLEU-1 (unigram)
bleu1 = sentence_bleu(reference, candidate, weights=(1.0,), smoothing_function=SmoothingFunction().method1)
# BLEU-4 (default)
bleu4 = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU-1: {bleu1:.2f}") # Output: 0.83 (Why? "the", "cat", "on", "the", "mat" from candidate matches with words from reference, thus 5/6 = 0.83)
print(f"BLEU-4: {bleu4:.2f}") # Output: 0.25 (Why? let's discuss.)
For BLEU-4, we look at all possible 4-grams in both sentences:
Reference 4-grams:
the cat is on
cat is on the
is on the mat
Candidate 4-grams:
the cat sat on
cat sat on the
sat on the mat
None of the candidate’s 4-grams match any of the reference’s 4-grams. Without smoothing, that zero 4-gram precision would drag the whole score to zero, because BLEU-4 is a geometric mean of the unigram, bigram, trigram, and 4-gram precisions. The smoothing function we specified (SmoothingFunction().method1) replaces the zero count with a small value, so the 4-gram precision is tiny but nonzero, and the resulting BLEU-4 score is 0.25.
In summary, the BLEU-4 score is low (0.25) because the candidate shares no 4-grams with the reference, and it avoids being exactly zero only because of smoothing. This demonstrates how BLEU-4 is much stricter than BLEU-1, which only considers single-word matches.
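If you want to see the individual n-gram precisions, NLTK exposes a modified_precision helper in nltk.translate.bleu_score (the same numbers can also be computed by hand, as in the earlier sketch):
from nltk.translate.bleu_score import modified_precision
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
for n in range(1, 5):
    # modified_precision returns a Fraction; convert it for printing.
    p_n = modified_precision(reference, candidate, n)
    print(f"p_{n} = {float(p_n):.2f}")  # p_1=0.83, p_2=0.60, p_3=0.25, p_4=0.00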
To correctly compute BLEU-1 with sentence_bleu, you must set weights=(1.0,). Without this, the function does not measure unigram overlap; it falls back to the default BLEU-4 calculation, whose default weights are weights=(0.25, 0.25, 0.25, 0.25). Similarly, to compute BLEU-2 you would use weights=(0.5, 0.5), as in the snippet below.
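For example, a quick BLEU-2 illustration on the same sentences (repeating the setup so the snippet is self-contained):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# BLEU-2: equal weight on unigram and bigram precision
bleu2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5), smoothing_function=SmoothingFunction().method1)
print(f"BLEU-2: {bleu2:.2f}")  # ≈ 0.71: geometric mean of 5/6 and 3/5, brevity penalty 1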
ROUGE Score: The “Recall-Oriented” Lens
Intuitive Overview
If BLEU is about “how much of my output matches the reference,” ROUGE (Recall-Oriented Understudy for Gisting Evaluation) asks, “how much of the reference did I cover in my output?” It’s especially popular for summarization tasks.
- ROUGE measures how much of the reference’s content is captured by the candidate.
- It’s recall-focused: did you remember to include the important stuff?
Example
Reference summary:
“The cat is on the mat.”
Candidate summary:
“The cat sat on the mat.”
ROUGE will check how many words and phrases from the reference appear in the candidate.
Technical Details
There are several ROUGE variants, but the most common are:
- ROUGE-N: Overlap of n-grams (like BLEU, but recall-based)
- ROUGE-L: Longest Common Subsequence (LCS) between candidate and reference
- ROUGE-S: Skip-bigram based ROUGE
1. ROUGE-N (Recall)
\[
\text{ROUGE-N} = \frac{\text{Number of overlapping n-grams}}{\text{Total n-grams in reference}}
\]
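A minimal hand-rolled sketch of ROUGE-N recall (the function names are ours; the rouge-score package used below also reports precision and an F-measure):
from collections import Counter
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
def rouge_n_recall(candidate, reference, n):
    # Overlapping n-grams (clipped counts) divided by the reference n-gram total.
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / sum(ref.values())
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(rouge_n_recall(candidate, reference, 1))  # 5/6 ≈ 0.83
print(rouge_n_recall(candidate, reference, 2))  # 3/5 = 0.60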
2. ROUGE-L (LCS)
Measures the length of the longest common subsequence between candidate and reference.
\[
\text{ROUGE-L} = \frac{\text{LCS length}}{\text{Total tokens in reference}}
\]
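The formula above is the recall form; packages such as rouge-score report an F-measure built from LCS-based precision and recall. A minimal hand-rolled sketch of the recall form:
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
lcs = lcs_length(candidate, reference)
print(lcs / len(reference))  # LCS is "the cat on the mat" (5 tokens), so 5/6 ≈ 0.83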
3. ROUGE-S (Skip-Bigram)
Measures the overlap of skip-bigrams between candidate and reference.
\[
\text{ROUGE-S} = \frac{\text{Number of overlapping skip-bigrams}}{\text{Total skip-bigrams in reference}}
\]
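A skip-bigram is any ordered pair of words from a sentence, allowing gaps in between. A minimal hand-rolled sketch of skip-bigram recall (unlimited gap size; real implementations often cap the gap):
from collections import Counter
from itertools import combinations
def skip_bigrams(tokens):
    # All ordered word pairs, with any number of words allowed between them.
    return Counter(combinations(tokens, 2))
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
cand, ref = skip_bigrams(candidate), skip_bigrams(reference)
overlap = sum(min(c, cand[p]) for p, c in ref.items())
print(overlap / sum(ref.values()))  # skip-bigram recall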
Practical Code Example (Python, rouge-score)
from rouge_score import rouge_scorer
reference = "The cat is on the mat."
candidate = "The cat sat on the mat."
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2'], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.2f}") # Output: 0.83 (Why? "the", "cat", "on", "the", "mat" from candidate matches with words from reference, thus 5/6 = 0.83)
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.2f}") # Output: 0.60 (Why? "the cat", "on the", "the mat" from candidate matches with words from reference, thus 3/5 = 0.60)
BLEU vs. ROUGE: When to Use Which?
- BLEU is precision-oriented and widely used for machine translation.
- ROUGE is recall-oriented and popular for summarization.
In practice: For translation, you want to avoid adding extra, incorrect information (precision), so BLEU is preferred. For summarization, you want to make sure you cover all the important points (recall), so ROUGE is preferred.
Summary Table
Metric | Focus | Measures
---|---|---
BLEU | Precision | n-gram overlap
ROUGE | Recall | n-gram / LCS overlap
Both BLEU and ROUGE are imperfect but useful tools. They give us a quick, automated way to compare machine-generated text to human references. But remember: the best evaluation is still a human reader!