Label Smoothing: Intuition, Mathematics, Gradients, and Practical Use

Imagine a teacher grading a multiple-choice exam. If the teacher says, “Only this one answer has any value, and all others are worth exactly zero,” the student may learn to become extremely confident, even when some alternatives are semantically similar. Label smoothing changes that teaching signal slightly. It still says which answer is correct, but it also leaves a tiny amount of probability mass for the other classes. That small change often leads to models that generalize better and behave less overconfidently.

In classification, one-hot labels tell the model that the true class has probability $1$ and every other class has probability $0$. Label smoothing softens that target just a little. The correct class still carries most of the probability mass, but the remaining classes receive a small share too.

At first glance, label smoothing looks almost too simple to matter. It modifies only the target distribution, not the architecture, not the optimizer, and not the forward pass. Yet that one modification changes the geometry of the learning problem and the gradients seen during training.

This article builds the idea in stages. It starts with intuition, then formalizes the method mathematically, derives the gradient with respect to the logits, and finally discusses when label smoothing helps, when it can hurt, and how to reason about it in practice.

1. What Label Smoothing Is

In ordinary multi-class classification, the target for each example is a one-hot vector. For a 4-class problem where class 2 is correct, the target is:

$$
\mathbf{y} = [0, 1, 0, 0]
$$

That vector says two things very aggressively:

the correct class should receive probability $1$
every incorrect class should receive probability $0$

Label smoothing replaces that hard target with a softened target:

$$
y_k^{\text{smooth}} = (1 - \epsilon) y_k + \frac{\epsilon}{K}
$$

where:

$K$ is the number of classes
$\epsilon$ is the smoothing strength

If $K = 4$ and $\epsilon = 0.1$, then the smoothed target becomes:

$$
\mathbf{y}^{\text{smooth}} = [0.025, 0.925, 0.025, 0.025]
$$

The correct class still dominates, but the target is no longer an extreme point of the simplex. That is the whole method.
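As a quick check of the formula above, here is a minimal pure-Python sketch that builds the smoothed target from a one-hot list. The helper name `smooth_labels` is only illustrative and not part of any library.

```python
def smooth_labels(one_hot, epsilon):
    """Apply (1 - epsilon) * y_k + epsilon / K to every entry of a one-hot list."""
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


# 4-class problem, class at index 1 is correct, epsilon = 0.1
smoothed = smooth_labels([0, 1, 0, 0], epsilon=0.1)
print(smoothed)       # approximately [0.025, 0.925, 0.025, 0.025]
print(sum(smoothed))  # still sums to 1 (up to floating-point rounding)
```

The result is still a valid probability distribution, which is what later lets the gradient keep its simple form.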

1.1. A visual intuition

One-hot labels behave like a sharp spike: all probability mass sits on a single class. Label smoothing flattens that spike just a little. The model is still taught to prefer the correct class, but it is no longer rewarded for driving every alternative all the way to zero.

You can think of the picture this way: the one-hot target sits at a corner of the probability simplex, while the smoothed target is pulled slightly inward. Training is still directional, but it is no longer extreme.

1.2. Where label smoothing is commonly used

Label smoothing is most common in:

image classification
sequence classification
language modeling with categorical targets
speech recognition and speech classification
large-scale transformer training

It is primarily a technique for single-label multi-class classification with softmax outputs.

2. Why This Small Change Matters

The easiest way to understand label smoothing is to ask what ordinary cross-entropy is trying to do. With one-hot targets, the model is trained to make the correct class probability approach $1$ and every other class probability approach $0$. In practice, that often means pushing logit gaps to become much larger than necessary.

Label smoothing weakens that pressure. The model must still identify the winner, but it is not encouraged to become infinitely certain.

2.1. It reduces overconfidence

Modern neural networks can be very accurate and still badly calibrated. A model may output $0.999$ confidence for examples that are not nearly that certain. One-hot training tends to amplify this behavior because the target itself is absolute.

Label smoothing changes the message from “assign all mass to the winner” to “assign most mass to the winner.” That tends to reduce the drive toward extreme probabilities.

2.2. It acts as target-side regularization

A helpful way to think about label smoothing is that it regularizes the training signal, not the parameters directly.

Weight decay says, “Do not let the parameters grow too freely.” Dropout says, “Do not rely too much on any one internal pathway.” Label smoothing says, “Do not fit the target as if it were infinitely certain.”

That often helps with:

overfitting
poor probability calibration
noisy labels
unnecessarily large logit margins

2.3. It better matches real datasets

Real datasets are rarely perfectly crisp.

Some labels are wrong.
Some classes overlap semantically.
Some examples are genuinely ambiguous.
Some annotation schemes force a single class even when the underlying signal is softer.

One-hot supervision assumes the opposite: complete certainty. Label smoothing injects a small amount of humility into the target distribution.

2.4. It changes the geometry of optimization

For a $K$-class classifier, the output distribution lies on a probability simplex. In a 3-class problem, that simplex is a triangle.

one-hot targets lie at the corners
smoothed targets lie slightly inside the triangle

That inward shift matters. The optimizer is no longer asked to hit the most extreme corners of the simplex. Instead, it is asked to match distributions in the interior. Geometrically, this means smaller required logit separations and less pressure to form excessively sharp decision surfaces.

3. The Mathematical Formulation

Let:

$K$ be the number of classes
$\mathbf{z} \in \mathbb{R}^K$ be the logits for one example
$p_k$ be the softmax probability for class $k$
$\mathbf{y}$ be the one-hot target
$y^{\text{smooth}}$ be the smoothed target

The softmax probabilities are:

$$
p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
$$

The smoothed target is:

$$
y_k^{\text{smooth}} = (1 - \epsilon) y_k + \frac{\epsilon}{K}
$$

The cross-entropy loss for one example becomes:

$$
\mathcal{L} = -\sum_{k=1}^{K} y_k^{\text{smooth}} \log p_k
$$

3.1. What the target values become

Suppose class $c$ is the correct class.

For the correct class:

$$
y_c^{\text{smooth}} = 1 - \epsilon + \frac{\epsilon}{K}
$$

For any incorrect class $j \ne c$:

$$
y_j^{\text{smooth}} = \frac{\epsilon}{K}
$$

So label smoothing does two things simultaneously:

it lowers the target assigned to the correct class
it gives each incorrect class a small positive target
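To make the formulation concrete, the following sketch computes the smoothed cross-entropy for a single example using only the standard library. The function names and the example logits are illustrative assumptions; setting `epsilon=0.0` recovers the ordinary one-hot loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_cross_entropy(logits, correct_class, epsilon):
    """-sum_k y_k^smooth * log p_k with the all-classes convention."""
    num_classes = len(logits)
    probs = softmax(logits)
    loss = 0.0
    for k, p in enumerate(probs):
        # epsilon / K for every class, plus (1 - epsilon) on the correct one
        target = epsilon / num_classes
        if k == correct_class:
            target += 1.0 - epsilon
        loss -= target * math.log(p)
    return loss


logits = [2.0, 0.5, -1.0]
hard_loss = smoothed_cross_entropy(logits, correct_class=0, epsilon=0.0)
smooth_loss = smoothed_cross_entropy(logits, correct_class=0, epsilon=0.1)
print(hard_loss, smooth_loss)
```

For this confidently correct prediction the smoothed loss is larger than the hard-label loss, because the low-probability classes now carry some target mass.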

3.2. Alternative conventions

Some references use a slightly different definition:

$$
y_c^{\text{smooth}} = 1 - \epsilon, \qquad y_j^{\text{smooth}} = \frac{\epsilon}{K-1} \quad \text{for } j \ne c
$$

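The intuition is the same, but the bookkeeping differs. Here the full mass $\epsilon$ is taken from the correct class and distributed only among the $K-1$ incorrect classes, whereas the all-classes convention hands $\epsilon/K$ of that mass back to the correct class.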

This article uses the all-classes convention:

$$
y_k^{\text{smooth}} = (1 - \epsilon) y_k + \frac{\epsilon}{K}
$$

When comparing papers or libraries, always check which convention is being used. The numeric value of $\epsilon$ is not directly interchangeable across all definitions.
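The two conventions describe the same family of targets under a rescaled smoothing value. Matching the incorrect-class targets gives $\epsilon' = \epsilon (K-1)/K$ for the incorrect-classes-only convention. The sketch below checks that numerically; all function names are illustrative.

```python
def targets_all_classes(num_classes, epsilon):
    """(correct-class target, incorrect-class target) in the all-classes convention."""
    return 1.0 - epsilon + epsilon / num_classes, epsilon / num_classes


def targets_incorrect_only(num_classes, epsilon):
    """Same pair when epsilon is spread only over the K - 1 incorrect classes."""
    return 1.0 - epsilon, epsilon / (num_classes - 1)


def to_incorrect_only_epsilon(num_classes, epsilon):
    """Rescale an all-classes epsilon so both conventions give identical targets."""
    return epsilon * (num_classes - 1) / num_classes


K, eps = 10, 0.1
a = targets_all_classes(K, eps)
b = targets_incorrect_only(K, to_incorrect_only_epsilon(K, eps))
print(a, b)  # both approximately (0.91, 0.01)
```

For large $K$ the two values of $\epsilon$ are nearly identical; for small $K$ the difference is noticeable.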

4. How Label Smoothing Changes Cross-Entropy

With ordinary one-hot labels, the loss for the correct class $c$ is:

$$
\mathcal{L}_{\text{one-hot}} = -\log p_c
$$

This objective rewards the model for pushing $p_c \to 1$, which usually means driving $z_c - z_j \to \infty$ for all $j \ne c$. The loss does not explicitly say “be calibrated.” It says “make the winner absolute.” That pressure can contribute to overconfident predictions, especially in over-parameterized models.

With label smoothing, the loss becomes:

$$
\mathcal{L}_{\text{smooth}} = -\sum_{k=1}^{K} y_k^{\text{smooth}} \log p_k
$$

Now the loss no longer depends only on the correct class probability. Every class participates, because every class has some target mass.

4.1. Information-theoretic interpretation

Let $U$ denote the uniform distribution over the $K$ classes, so that $U_k = 1/K$. Since

$$
\mathbf{y}^{\text{smooth}} = (1 - \epsilon) \mathbf{y} + \epsilon U,
$$

the smoothed loss can be written as:

$$
\mathcal{L}_{\text{smooth}} = (1 - \epsilon) \, CE(\mathbf{y}, \mathbf{p}) + \epsilon \, CE(U, \mathbf{p})
$$

and because

$$
CE(U, \mathbf{p}) = H(U) + KL(U || \mathbf{p}),
$$

where $H(U)$ is the entropy of the uniform distribution and $KL(U || \mathbf{p})$ is the KL divergence from $U$ to $\mathbf{p}$, we can further expand to get:

$$
\mathcal{L}_{\text{smooth}} = (1 - \epsilon) \, CE(\mathbf{y}, \mathbf{p}) + \epsilon \, KL(U || \mathbf{p}) + \epsilon \, H(U)
$$

The term $\epsilon \, H(U)$ is constant with respect to the model, so for optimization purposes the smoothed loss reduces to:

$$
(1 - \epsilon) \, CE(\mathbf{y}, \mathbf{p}) + \epsilon \, KL(U || \mathbf{p})
$$

In other words, label smoothing is equivalent to combining:

the usual hard-label cross-entropy, scaled by $(1-\epsilon)$
a regularizer that discourages predictions from drifting too far away from the uniform distribution

This is the cleanest information-theoretic interpretation. Label smoothing does not simply replace cross-entropy with a heuristic softer loss. It adds a bias toward less concentrated output distributions.

What this means intuitively:

This decomposition explains why label smoothing often improves calibration. The model is still rewarded for assigning high probability to the correct class, but it is also softly penalized for making the distribution sharper than necessary.

This tends to reduce:

extremely peaked output distributions
oversized logit gaps
brittle confidence estimates on the training set

It is not a guarantee of perfect calibration, but it often moves the model in the right direction.
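The decomposition can be verified numerically. The sketch below, using only the standard library and illustrative helper names, checks that the smoothed loss equals $(1-\epsilon)\,CE(\mathbf{y}, \mathbf{p}) + \epsilon\,KL(U || \mathbf{p}) + \epsilon\,H(U)$ for an arbitrary prediction.

```python
import math


def cross_entropy(target, probs):
    """CE(target, probs) = -sum_k target_k * log probs_k (zero-target terms skipped)."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def kl_divergence(q, p):
    """KL(q || p) = sum_k q_k * log(q_k / p_k)."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)


probs = [0.05, 0.80, 0.05, 0.06, 0.04]
one_hot = [0, 1, 0, 0, 0]
epsilon = 0.1
K = len(probs)

uniform = [1.0 / K] * K
smoothed = [(1 - epsilon) * y + epsilon / K for y in one_hot]
entropy_uniform = math.log(K)  # H(U) for the uniform distribution

lhs = cross_entropy(smoothed, probs)
rhs = (
    (1 - epsilon) * cross_entropy(one_hot, probs)
    + epsilon * kl_divergence(uniform, probs)
    + epsilon * entropy_uniform
)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```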

4.2. Finite Optimal Logits

A crucial but often misunderstood property of label smoothing is that it makes the optimal logit gaps finite. With one-hot targets, the loss approaches its infimum only as $p_c \to 1$ and $p_j \to 0$, which requires the logit difference $z_c - z_j \to \infty$. The network is incentivized to keep enlarging the logit gaps without bound.

With label smoothing, the minimum loss is found exactly when the predicted probabilities match the softened targets: $p_k = y_k^{\text{smooth}}$.

Because the target probabilities are bounded away from $0$ and $1$, the optimal logit difference becomes finite:

$$
z_c^* - z_j^* = \log\left(\frac{p_c^*}{p_j^*}\right) = \log\left(\frac{1 - \epsilon + \epsilon/K}{\epsilon/K}\right)
$$

This is one concrete reason label smoothing tends to reduce overconfidence: the loss is minimized at a finite logit gap, so the optimizer has no incentive to push logit differences toward infinity.
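For a feel of the magnitudes involved, this small sketch evaluates the optimal gap for a given $K$ and $\epsilon$ and confirms that logits with exactly that gap reproduce the smoothed targets under softmax. The function name is illustrative.

```python
import math


def optimal_logit_gap(num_classes, epsilon):
    """log(p_c* / p_j*) for the all-classes smoothing convention."""
    p_correct = 1.0 - epsilon + epsilon / num_classes
    p_incorrect = epsilon / num_classes
    return math.log(p_correct / p_incorrect)


K, eps = 5, 0.1
gap = optimal_logit_gap(K, eps)  # log(0.92 / 0.02) = log(46), roughly 3.83

# Logits with that gap: correct class at `gap`, all others at 0
logits = [gap] + [0.0] * (K - 1)
total = sum(math.exp(z) for z in logits)
probs = [math.exp(z) / total for z in logits]
print(gap, probs)  # probs is approximately [0.92, 0.02, 0.02, 0.02, 0.02]
```

With $K = 5$ and $\epsilon = 0.1$, a gap of under 4 logit units is already optimal; anything larger increases the loss.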

5. Deriving the Gradient with Respect to the Logits

This is the most important mathematical section because it shows exactly how the training signal changes.

We derive the gradient for a single example.

5.1. Setup

Let the logits be:

$$
\mathbf{z} = [z_1, z_2, \dots, z_K]
$$

and the softmax probabilities:

$$
p_k = \frac{e^{z_k}}{\sum_{m=1}^{K} e^{z_m}}
$$

The smoothed cross-entropy loss is:

$$
\mathcal{L} = -\sum_{k=1}^{K} y_k^{\text{smooth}} \log p_k
$$

We want:

$$
\frac{\partial \mathcal{L}}{\partial z_j}
$$

for any class $j$.

5.2. The key softmax identity

For softmax, a standard identity is:

$$
\frac{\partial \log p_k}{\partial z_j} = \delta_{kj} - p_j
$$

where $\delta_{kj}$ is the Kronecker delta:

$$
\delta_{kj} =
\begin{cases}
1 & \text{if } k = j \\
0 & \text{if } k \ne j
\end{cases}
$$

5.3. Differentiate the loss

Start from:

$$
\mathcal{L} = -\sum_{k=1}^{K} y_k^{\text{smooth}} \log p_k
$$

Differentiate with respect to $z_j$:

$$
\frac{\partial \mathcal{L}}{\partial z_j} = -\sum_{k=1}^{K} y_k^{\text{smooth}} \frac{\partial \log p_k}{\partial z_j}
$$

Substitute the softmax identity:

$$
\frac{\partial \mathcal{L}}{\partial z_j}=-\sum_{k=1}^{K} y_k^{\text{smooth}} (\delta_{kj} - p_j)
$$

Distribute the sum:

$$
\frac{\partial \mathcal{L}}{\partial z_j}=-\sum_{k=1}^{K} y_k^{\text{smooth}} \delta_{kj} + \sum_{k=1}^{K} y_k^{\text{smooth}} p_j
$$

Since $p_j$ does not depend on $k$, pull it out of the second sum:

$$
\frac{\partial \mathcal{L}}{\partial z_j}=-y_j^{\text{smooth}} + p_j \sum_{k=1}^{K} y_k^{\text{smooth}}
$$

Because the smoothed labels still form a probability distribution,

$$
\sum_{k=1}^{K} y_k^{\text{smooth}} = 1
$$

so the gradient simplifies to:

$$
\boxed{
\frac{\partial \mathcal{L}}{\partial z_j} = p_j - y_j^{\text{smooth}}
}
$$

Structurally, this is the same gradient form as ordinary cross-entropy. The only difference is that the target vector has changed.
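A finite-difference check is a simple way to convince yourself of this result. The sketch below compares the analytic gradient $p_j - y_j^{\text{smooth}}$ against a central-difference estimate; the logits and helper names are arbitrary choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss(logits, target):
    """Cross-entropy between a (possibly smoothed) target and softmax(logits)."""
    return -sum(t * math.log(p) for t, p in zip(target, softmax(logits)))


def numerical_gradient(logits, target, h=1e-6):
    """Central-difference estimate of dL/dz_j for every logit."""
    grads = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += h
        down[j] -= h
        grads.append((loss(up, target) - loss(down, target)) / (2 * h))
    return grads


logits = [0.3, 1.5, -0.7, 0.1]
target = [0.025, 0.925, 0.025, 0.025]  # K = 4, epsilon = 0.1, correct index 1

analytic = [p - t for p, t in zip(softmax(logits), target)]
numeric = numerical_gradient(logits, target)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_error)  # tiny, limited only by floating-point precision
```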

5.4. Correct class versus incorrect classes

If $j = c$, then:

$$
\frac{\partial \mathcal{L}}{\partial z_c} = p_c - \left(1 - \epsilon + \frac{\epsilon}{K}\right)
$$

Without smoothing, the gradient would be:

$$
\frac{\partial \mathcal{L}}{\partial z_c} = p_c - 1
$$

So the correct-class gradient becomes less negative. Under gradient descent, that means the optimizer pushes the correct logit upward less aggressively.

If $j \ne c$, then:

$$
\frac{\partial \mathcal{L}}{\partial z_j} = p_j - \frac{\epsilon}{K}
$$

Without smoothing, the gradient would simply be:

$$
\frac{\partial \mathcal{L}}{\partial z_j} = p_j
$$

So each incorrect-class gradient is shifted downward by $\epsilon/K$.

This creates two regimes:

If $p_j > \epsilon/K$, the gradient remains positive, so gradient descent pushes that incorrect logit downward.
If $p_j < \epsilon/K$, the gradient becomes negative, so gradient descent nudges that incorrect logit upward slightly.

That second regime is the subtle but important one. Label smoothing does not train the model to annihilate all non-target probabilities. More precisely, it does not impose a hard floor on predicted probabilities at inference time; it changes the training target so that pushing every alternative all the way toward zero is no longer the preferred solution.
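The two regimes are easy to see with concrete numbers. In this sketch, with $K = 5$ and $\epsilon = 0.1$, the threshold $\epsilon/K$ is $0.02$; the probabilities are made up for illustration.

```python
def incorrect_class_gradient(p_j, epsilon, num_classes):
    """dL/dz_j for an incorrect class j under label smoothing."""
    return p_j - epsilon / num_classes


eps, K = 0.1, 5  # threshold epsilon / K = 0.02

g_above = incorrect_class_gradient(0.06, eps, K)   # p_j above the threshold
g_below = incorrect_class_gradient(0.005, eps, K)  # p_j below the threshold

# Gradient descent moves a logit opposite to the sign of its gradient.
print(g_above > 0)  # True: this logit is pushed down
print(g_below < 0)  # True: this logit is nudged up
```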

5.5. Side-by-side summary

For the correct class:

one-hot: $p_c - 1$
smoothed: $p_c - \left(1 - \epsilon + \epsilon/K\right)$

For an incorrect class:

one-hot: $p_j$
smoothed: $p_j - \epsilon/K$

One-hot training says, “make the winner absolute.” Label smoothing says, “make the winner clear, but not infinitely sharp.”

5.6. A Tiny Numerical Example

Suppose:

$K = 5$
the correct class is $c = 1$ (using 0-based indexing)
$\epsilon = 0.1$
the predicted probabilities are

$$
\mathbf{p} = [0.05, 0.80, 0.05, 0.06, 0.04]
$$

Then the smoothed target is:

$$
\mathbf{y}^{\text{smooth}} = [0.02, 0.92, 0.02, 0.02, 0.02]
$$

So the gradient is:

$$
\nabla_{\mathbf{z}} \mathcal{L} = \mathbf{p} - \mathbf{y}^{\text{smooth}} = [0.03, -0.12, 0.03, 0.04, 0.02]
$$

Without label smoothing, the one-hot target would be:

$$
\mathbf{y} = [0, 1, 0, 0, 0]
$$

and the gradient would be:

$$
\mathbf{p} - \mathbf{y} = [0.05, -0.20, 0.05, 0.06, 0.04]
$$

The difference is easy to read:

the correct logit receives a weaker upward push
the incorrect logits receive weaker downward pushes
the entire update is softer and less extreme

That is the optimization effect of label smoothing in one line.
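The same numbers can be reproduced in a few lines of plain Python. Values are rounded to four decimals only to hide floating-point noise.

```python
p = [0.05, 0.80, 0.05, 0.06, 0.04]
one_hot = [0, 1, 0, 0, 0]
epsilon = 0.1
K = len(p)

# Smoothed target: (1 - epsilon) * y_k + epsilon / K
y_smooth = [(1 - epsilon) * y + epsilon / K for y in one_hot]

# Gradient with respect to the logits is prediction minus target
grad_smooth = [round(pk - yk, 4) for pk, yk in zip(p, y_smooth)]
grad_one_hot = [round(pk - yk, 4) for pk, yk in zip(p, one_hot)]

print(grad_smooth)   # [0.03, -0.12, 0.03, 0.04, 0.02]
print(grad_one_hot)  # [0.05, -0.2, 0.05, 0.06, 0.04]
```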

6. Why It Often Improves Generalization and Calibration

Label smoothing does not magically make a model smarter. What it often does is prevent the model from learning an unnecessarily brittle version of the task.

6.1. Better calibrated probabilities

Many modern classifiers are accurate but poorly calibrated. They may predict 0.999 even when they should be less certain.

Label smoothing often improves calibration because it trains the model against a less extreme target distribution.

6.2. Less brittle internal representations

When a model is pushed toward absolute certainty on the training set, it may learn features that are sharper than necessary. Those representations can be sensitive to nuisance variation and small perturbations.

Label smoothing can encourage more compact class clusters and more moderate margins in representation space.

[Figure: 2D projection of penultimate-layer representations, with and without label smoothing]

The above image shows the clustering of penultimate layer representations in a 2D projection (check this paper for details). The main intuition is that one-hot training can produce broader or less organized class regions, while label smoothing can encourage tighter within-class clustering and clearer separation across categories. That is one plausible route to better generalization.

6.3. Reduced pressure to memorize noisy labels

If a portion of the training set is mislabeled, one-hot cross-entropy can fit those examples very aggressively. Label smoothing lowers that pressure because even the target distribution itself admits some uncertainty.

This is not a full solution to label noise, but it can be a useful buffer.

6.4. An important caveat

Label smoothing often improves calibration, but not always. In some settings it can make the model under-confident, especially when combined with other strong regularizers or heavy target-softening techniques such as mixup. The correct mental model is not “label smoothing is always better,” but “label smoothing shifts the confidence profile, often in a helpful direction.”

7. When to Use It and When to Avoid It

Label smoothing is usually a good candidate when:

the task is single-label multi-class classification
the model is visibly overconfident
the dataset is large enough that regularization is helpful
probability calibration matters alongside accuracy
the labels are somewhat noisy or ambiguously defined

Be careful or skip it when:

the task is multi-label classification with independent sigmoids
the target distribution is already soft and meaningful
the task requires preserving very sharp probabilities
the model is already under-confident
several other techniques are already softening the targets or outputs
a large $\epsilon$ would blur distinctions that the task genuinely needs

The guiding question is simple: is the target distribution supposed to encode certainty, or only a supervised preference? Label smoothing helps more in the second case than in the first.

8. How to Choose the Smoothing Value

The most common starting values are:

$\epsilon = 0.05$
$\epsilon = 0.1$

Sometimes larger values such as $0.2$ appear in very large-scale or noisy settings, but they are not safe defaults.

The trade-off is straightforward:

too small, and the effect may be negligible
too large, and the model may become under-confident or lose accuracy

In practice, $0.1$ is a strong default for a first experiment, but it should still be validated rather than assumed.

One useful diagnostic is to monitor both accuracy and calibration. If top-1 accuracy changes little while negative log-likelihood or expected calibration error improves, the smoothing strength may be doing exactly what it should.
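Expected calibration error is straightforward to compute from per-example confidences and correctness flags. The sketch below uses equal-width confidence bins, which is a common choice but still an assumption here; the function name and the toy inputs are illustrative.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Overconfident toy model: always says 1.0 but is right half the time
overconfident = expected_calibration_error([1.0] * 4, [True, True, False, False])

# Well-calibrated toy model: says 0.9 and is right 9 times out of 10
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])

print(overconfident, calibrated)  # 0.5 and a value very close to 0
```

Running this on the validation set for each candidate $\epsilon$, alongside accuracy, gives a compact picture of the trade-off.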

9. Implementation

In modern libraries, the built-in implementation is usually the best choice because it is stable, simple, and easy to audit.

9.1. Built-in loss

Python

import torch
import torch.nn as nn

batch_size = 8
num_classes = 5

logits = torch.randn(batch_size, num_classes, requires_grad=True)
targets = torch.randint(0, num_classes, size=(batch_size,))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, targets)
loss.backward()

print("loss:", float(loss))
print("gradient shape:", logits.grad.shape)
# The exact loss value will differ from run to run unless you set a random seed.
# The gradient shape will be: torch.Size([8, 5])

Two practical reminders:

pass raw logits, not softmax probabilities
evaluate accuracy against the original hard labels

9.2. Implementing label smoothing from scratch

Writing the loss manually is useful for learning, debugging, or custom research code.

Python

import torch
import torch.nn.functional as F

def cross_entropy_with_label_smoothing(logits, targets, epsilon=0.1):
    """
    logits:  [batch_size, num_classes]
    targets: [batch_size] with integer class ids
    """
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)

    with torch.no_grad():
        true_dist = torch.full_like(log_probs, epsilon / num_classes)
        true_dist.scatter_(
            1,
            targets.unsqueeze(1),
            1.0 - epsilon + epsilon / num_classes,
        )

    loss = -(true_dist * log_probs).sum(dim=1).mean()
    return loss


batch_size = 4
num_classes = 3

logits = torch.tensor(
    [
        [2.0, 0.5, -1.0],
        [0.1, 1.8, 0.2],
        [1.2, 0.7, 0.3],
        [0.0, -0.2, 2.4],
    ],
    requires_grad=True,
)

targets = torch.tensor([0, 1, 0, 2])

loss = cross_entropy_with_label_smoothing(logits, targets, epsilon=0.1)
loss.backward()

print("loss:", float(loss))
print("gradients:\n", logits.grad)
# Expected output:
#     loss: 0.47310787439346313
#     gradients:
#     tensor([[-0.0369,  0.0355,  0.0014],
#             [ 0.0247, -0.0528,  0.0281],
#             [-0.1091,  0.0670,  0.0422],
#             [ 0.0111,  0.0076, -0.0187]])

The manual implementation is especially helpful because you can inspect true_dist directly and verify that the targets are being constructed exactly as intended.

10. Practical Workflow and Common Mistakes

The easiest way to use label smoothing well is to treat it like any other regularizer: change one thing, keep the comparison clean, and measure what actually improved.

10.1. A simple workflow

Train a baseline with ordinary cross-entropy.
Add label smoothing with $\epsilon = 0.05$ or $0.1$.
Keep the optimizer, learning-rate schedule, augmentation, and architecture fixed.
Compare validation accuracy, negative log-likelihood, and calibration.
Tune $\epsilon$ only if the initial comparison justifies it.

10.2. What to monitor

Beyond top-1 accuracy, inspect:

validation loss
negative log-likelihood
expected calibration error
reliability diagrams
confidence histograms

If accuracy stays similar while calibration improves, label smoothing may still be a net win.

10.3. Common mistakes

using too large an $\epsilon$
applying it to the wrong task type
passing probabilities instead of logits into cross-entropy
comparing smoothed and unsmoothed training losses without context
forgetting that different papers and libraries use different smoothing conventions
applying smoothing on top of already soft targets without a clear reason

10.4. Interactions with other techniques

Label smoothing often works well with:

weight decay
dropout
data augmentation

It can also be combined with mixup or CutMix, but that combination needs extra care because those methods already soften the supervision signal.

Regularizers accumulate. If several components already soften the training signal, extra smoothing can become redundant or make the model too cautious.

10.5. Sequence models need an extra check

In NLP or speech pipelines, some positions correspond to padding or ignored tokens. Smoothing should be applied only to valid targets, and ignored positions should remain excluded from the loss. Otherwise the training signal becomes subtly wrong.
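In PyTorch, `nn.CrossEntropyLoss` accepts an `ignore_index` argument alongside `label_smoothing`, which is usually the simplest way to exclude padded positions. To show what that masking amounts to, here is a minimal pure-Python sketch; the sentinel value `PAD_ID` and the function names are assumptions made for illustration.

```python
import math

PAD_ID = -100  # hypothetical sentinel marking padded / ignored positions


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]


def masked_smoothed_loss(position_logits, targets, epsilon, pad_id=PAD_ID):
    """Mean smoothed cross-entropy over valid positions only."""
    total, valid = 0.0, 0
    for logits, target in zip(position_logits, targets):
        if target == pad_id:
            continue  # padded position: no smoothing, no loss contribution
        num_classes = len(logits)
        for k, log_p in enumerate(log_softmax(logits)):
            y = epsilon / num_classes
            if k == target:
                y += 1.0 - epsilon
            total -= y * log_p
        valid += 1
    return total / valid if valid else 0.0


position_logits = [[2.0, 0.5, -1.0], [0.1, 1.8, 0.2], [5.0, 5.0, 5.0]]
targets = [0, 1, PAD_ID]  # the last position is padding

full = masked_smoothed_loss(position_logits, targets, epsilon=0.1)
valid_only = masked_smoothed_loss(position_logits[:2], targets[:2], epsilon=0.1)
print(full == valid_only)  # True: the padded position has no effect
```

Note that the average is taken over valid positions, not over the padded sequence length.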

11. Relationship to Nearby Ideas

Label smoothing is close to several other techniques, but it is not interchangeable with them.

11.1. Confidence penalty

Instead of smoothing labels, a confidence penalty adds an entropy-based regularizer to the loss:

$$
\mathcal{L} = CE - \lambda H(\mathbf{p})
$$

where $H(\mathbf{p})$ is the entropy of the predicted distribution.

Conceptually:

label smoothing changes the target distribution
confidence penalty changes the objective directly by discouraging low-entropy predictions

Both aim to reduce excessive certainty, but they do so from different angles.
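A minimal sketch of the confidence-penalty objective for a single example, written directly from the formula above. The value of `lam` and the two example distributions are arbitrary illustrations.

```python
import math


def entropy(probs):
    """H(p) = -sum_k p_k * log p_k."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence_penalty_loss(probs, correct_class, lam):
    """Hard-label cross-entropy minus lam times the prediction entropy."""
    return -math.log(probs[correct_class]) - lam * entropy(probs)


sharp = [0.98, 0.01, 0.01]
soft = [0.80, 0.10, 0.10]

print(entropy(sharp), entropy(soft))  # the softer prediction has higher entropy
print(confidence_penalty_loss(sharp, 0, lam=0.1))
print(confidence_penalty_loss(soft, 0, lam=0.1))
```

The entropy term gives the softer prediction a larger discount, which is how the penalty pushes back against very peaked outputs.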

11.2. Knowledge distillation

Distillation uses a teacher model to provide example-specific soft targets.

[Figure: label smoothing vs. distillation. Image source: Distillation vs. Label Smoothing]

Label smoothing, by contrast, usually uses a fixed prior-like target adjustment, often uniform across classes. It does not encode the teacher’s example-specific beliefs about which alternatives are plausible.

11.3. Mixup

Mixup creates soft labels by interpolating two training examples and their labels. In that case, the softness comes from the data construction itself.

Label smoothing changes only the targets. The input remains unchanged.
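The contrast is easy to see in code. The sketch below interpolates one pair of examples by hand; in practice the mixing coefficient is typically sampled from a Beta distribution, but here `lam` is fixed for clarity, and the feature vectors are made up.

```python
def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Interpolate both the inputs and the one-hot labels with the same weight."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


x_a, y_a = [1.0, 0.0], [1, 0, 0]  # example of class 0
x_b, y_b = [0.0, 2.0], [0, 1, 0]  # example of class 1

x_mix, y_mix = mixup_pair(x_a, y_a, x_b, y_b, lam=0.7)
print(x_mix)  # approximately [0.7, 0.6]: the input itself changes
print(y_mix)  # approximately [0.7, 0.3, 0.0]: soft label tied to that input
```

Mixup puts mass only on the classes that were actually mixed, whereas uniform label smoothing spreads a small amount over every class and leaves the input untouched.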

11.4. Why not just use label smoothing for a teacher during knowledge distillation?

A teacher trained with label smoothing can achieve better standalone accuracy and still transfer less useful information to a student than a teacher trained on hard targets.

The intuition follows from the gradient derived earlier. Label smoothing encourages the teacher to preserve the main class distinction, but it pulls the predicted probabilities of all incorrect classes toward the same value ($\epsilon/K$). This flattens the distribution over the incorrect classes and suppresses the variation between them.

In standard one-hot training, a network might assign slightly higher probability to “dog” than “airplane” when shown a “cat”, because dogs are visually and semantically closer. This is often called “dark knowledge.” Label smoothing penalizes this natural similarity, aggressively pulling both “dog” and “airplane” predictions to $\epsilon/K$. Because it actively erases these fine-grained relative similarities, a teacher trained with label smoothing becomes a less informative instructor for distillation, even if its own generalization improves.

[Figure: experimental results for distillation from teachers trained with and without label smoothing]

As shown in the above figure from this paper, a teacher trained with hard targets can transfer more knowledge to a student than a teacher trained with label smoothing, even if the smoothed teacher has better standalone accuracy. The student learns more from the richer, less smoothed output distribution of the hard-target teacher, where the natural “dark knowledge” remains intact.

12. Variants of Label Smoothing

Standard uniform label smoothing is the most common version, but it is not the only one.

Class-dependent smoothing:
Not every wrong class is equally plausible. In some domains, domain knowledge or a confusion matrix can justify assigning more mass to classes that are semantically close to the target.

For example, “cat” may reasonably share more smoothing mass with “dog” than with “airplane.”
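One possible way to sketch class-dependent smoothing is to split the mass $\epsilon$ among the incorrect classes in proportion to a similarity weight. The weights below are invented for illustration, the function name is hypothetical, and this sketch uses the incorrect-classes-only convention, so the correct class keeps exactly $1 - \epsilon$.

```python
def class_dependent_targets(correct_class, similarity, epsilon):
    """Distribute epsilon over incorrect classes in proportion to `similarity`.

    `similarity` holds one non-negative weight per class (a hypothetical,
    hand-chosen input); the entry for the correct class is ignored.
    """
    weights = [
        0.0 if k == correct_class else w for k, w in enumerate(similarity)
    ]
    total = sum(weights)
    return [
        1.0 - epsilon if k == correct_class else epsilon * w / total
        for k, w in enumerate(weights)
    ]


classes = ["cat", "dog", "airplane"]
# Made-up weights: "dog" is treated as three times closer to "cat" than "airplane"
targets = class_dependent_targets(0, similarity=[0.0, 3.0, 1.0], epsilon=0.1)
print(dict(zip(classes, targets)))  # roughly cat 0.9, dog 0.075, airplane 0.025
```

In a real project the weights might come from a validation confusion matrix or a class hierarchy rather than being set by hand.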

Adaptive smoothing:
Instead of using a fixed $\epsilon$, adaptive methods vary the smoothing strength based on the training step, example difficulty, or model confidence.

These methods can be more flexible, but they are also harder to tune and explain.

Soft targets from other sources:
If the target distribution already comes from a teacher model, weak supervision, or annotation aggregation, standard label smoothing may not be appropriate. In that case, you already have a soft target, and additional smoothing can distort useful information.

13. Summary

Label smoothing is a small change with a large conceptual payoff. It replaces a one-hot target with a slightly softened target, so the model is asked to be correct without becoming maximally certain.

The core equation is:

$$
y_k^{\text{smooth}} = (1 - \epsilon) y_k + \frac{\epsilon}{K}
$$

and the key gradient result is:

$$
\boxed{\frac{\partial \mathcal{L}}{\partial z_j} = p_j - y_j^{\text{smooth}}}
$$

That single gradient formula explains the main behavior:

the correct class is pushed upward less aggressively
the incorrect classes are pushed downward less aggressively
the optimizer is discouraged from creating unnecessarily sharp distributions

That is why label smoothing often improves calibration and can improve generalization.

If you remember only one sentence, remember this one: label smoothing tells a classifier, “Be confident enough to decide, but not so confident that you stop respecting uncertainty.”

S L Happy

Machine Learning Engineer at HP

Happy is a seasoned ML professional with over 15 years of experience. His expertise spans various domains, including Computer Vision, Natural Language Processing (NLP), and Time Series analysis. He holds a PhD in Machine Learning from IIT Kharagpur and has furthered his research with postdoctoral experience at INRIA-Sophia Antipolis, France. Happy has a proven track record of delivering impactful ML solutions to clients.
