Target encoding, also known as mean encoding or impact encoding, is a powerful feature engineering technique used to transform high-cardinality categorical features into numerical representations by leveraging the information contained in the target variable. This method is particularly useful when standard techniques like One-Hot Encoding would create too many sparse features.
What is Target Encoding?
Many algorithms (linear models, SVMs, neural nets) prefer or require numeric inputs and cannot operate on raw category strings (like ‘USA’, ‘Canada’, ‘Mexico’). Target encoding replaces each category with a single number that captures its relationship with the target outcome, compressing the predictive signal into one scalar.
Unlike unsupervised encodings, target encoding directly leverages the relationship between the category and the target variable to create a meaningful numerical representation: for a given categorical feature, each category is replaced with the average of the target values observed for that category.
How Target Encoding Works
The core idea is to replace each category with the mean of the target variable for all observations belonging to that category.
- For Regression: The category is replaced by the average target value for that category.
- For Classification: The category is replaced by the proportion (or probability) of the positive class for that category.
This is considered a supervised technique because the encoding depends directly on the target variable ($y$).
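As a minimal sketch in pandas (the column names and values are illustrative, not from a real dataset), the unsmoothed version is a groupby-mean followed by a map; with a binary target, the same mean is simply the positive-class proportion:

```python
import pandas as pd

# Toy data: 'country' is the categorical feature (names are illustrative).
df = pd.DataFrame({
    "country": ["USA", "Canada", "USA", "Mexico", "Canada", "USA"],
    "price":   [300, 250, 320, 180, 260, 310],   # regression target
    "sold":    [1, 0, 1, 0, 1, 1],               # binary classification target
})

# Regression: each category -> mean target value for that category.
df["country_enc_reg"] = df["country"].map(df.groupby("country")["price"].mean())

# Classification: each category -> proportion of the positive class.
df["country_enc_clf"] = df["country"].map(df.groupby("country")["sold"].mean())
print(df)
```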
The Danger: Data Leakage and Overfitting
The simple method shown above has a critical flaw: Data Leakage. Because each row’s own target value contributes to the mean used to encode it, target information seeps directly into the feature.
The problem is worst for categories that appear only a few times in the training data. Their means are computed from a handful of observations, so the model learns an overly strong, category-specific relationship that doesn’t generalize to new data. This is overfitting.
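To see the failure mode concretely, here is a toy case (hypothetical data) where a category that occurs exactly once is encoded as its own target value:

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["N1", "N1", "N2"],  # "N2" occurs exactly once
    "price":        [300, 320, 900],
})

df["enc"] = df["neighborhood"].map(df.groupby("neighborhood")["price"].mean())
print(df)
#   neighborhood  price    enc
# 0           N1    300  310.0
# 1           N1    320  310.0
# 2           N2    900  900.0  <- the row's own target leaked into its feature
```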
To prevent this, robust target encoding techniques introduce regularization or smoothing.
Robust Target Encoding: Introducing Smoothing (The M-Estimate)
To mitigate the impact of low-frequency categories, we blend the category-specific mean with the global target mean (the average target value across the entire dataset). This “pulls” the encoded value of rare categories toward the center, making them less extreme.
The most common smoothing technique is the M-Estimate:
$$\text{Encoded Value} = \frac{n}{n + m} \times (\text{Category Mean}) + \frac{m}{n + m} \times (\text{Global Mean})$$
Where:
- \(n\): The number of times the category appears in the training data.
- \(m\): The smoothing factor (a chosen hyperparameter, often set to a value like 5 or 10). It controls how much weight is given to the global mean. A larger $m$ means more smoothing, pulling the encoding for rare categories closer to the overall average.
- Category Mean: The average target value for that specific category.
- Global Mean: The average target value across the entire training set.
How Smoothing Works (a code sketch follows these bullets):
- High \(n\) (Frequent Category): If \(n\) is large (e.g., 1000) and \(m\) is small (e.g., 10), the fraction \(\frac{n}{n+m}\) is close to 1. The encoded value is dominated by the Category Mean.
- Low \(n\) (Rare Category): If \(n\) is small (e.g., 1) and \(m\) is 10, the fraction \(\frac{n}{n+m}\) is small (\(\frac{1}{11} \approx 0.09\)). The encoded value is mostly determined by the Global Mean, preventing the outlier influence of that single data point.
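A minimal pandas sketch of the M-Estimate (the helper name and the default `m` are assumptions for illustration, not a standard API):

```python
import pandas as pd

def m_estimate_encode(df, col, target, m=10.0):
    """Encode `col` with the M-Estimate: blend of category mean and global mean."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    # Equivalent to n/(n+m) * category_mean + m/(n+m) * global_mean.
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return df[col].map(smoothed)
```

Categories with a small count are pulled toward `global_mean`; frequent categories keep roughly their own mean.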
Proper Implementation Strategy (Avoiding Leakage)
The most critical step in using target encoding is preventing the target information from “leaking” into the feature during training. You must calculate the encoding statistics without using the data point you are currently encoding. [scikit-learn]
This is achieved through a cross-validation (out-of-fold) approach; a code sketch follows these steps:
- Split Data: Divide your training data into \(K\) equal folds (e.g., \(K=5\)).
- Iterate Folds (\(k=1\) to \(K\)):
  - Training Set: Use all folds except fold \(k\) to calculate the Category Means and the Global Mean.
  - Encoding Set: Apply the calculated means (using the M-Estimate formula) to encode the categories only in fold \(k\).
- Combine: After iterating through all \(K\) folds, the resulting encoded column contains values that were never exposed to the target value of the row they represent, effectively preventing leakage.
- Test Set Encoding: Calculate the final Global Mean and Category Means using the entire original training set. Apply these final, stable statistics to encode the test set.
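A sketch of this scheme using scikit-learn’s KFold (the function and parameter names are illustrative; recent scikit-learn versions also ship a TargetEncoder that performs similar cross-fitting internally):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(train, test, col, target, m=10.0, n_splits=5, seed=0):
    """Out-of-fold M-Estimate encoding for train; full-train statistics for test."""
    def smooth(frame):
        g = frame[target].mean()
        s = frame.groupby(col)[target].agg(["mean", "count"])
        return (s["count"] * s["mean"] + m * g) / (s["count"] + m), g

    train_enc = pd.Series(np.nan, index=train.index, dtype=float)
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(train):
        # Statistics come only from the other K-1 folds...
        smoothed, global_mean = smooth(train.iloc[fit_idx])
        # ...and are applied only to fold k; unseen categories get the global mean.
        train_enc.iloc[enc_idx] = (
            train[col].iloc[enc_idx].map(smoothed).fillna(global_mean).to_numpy()
        )
    # Test set: final, stable statistics computed on the entire training set.
    smoothed, global_mean = smooth(train)
    return train_enc, test[col].map(smoothed).fillna(global_mean)
```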
Example: Mixed Categorical + Numerical Features
Predict house price with:
- Categorical: City (medium cardinality), Neighborhood (high cardinality), HouseStyle (low cardinality)
- Numerical: SqFt, Age
Raw Sample (Before Encoding)
| Row | City | Neighborhood | HouseStyle | SqFt | Age | Price |
|---|---|---|---|---|---|---|
| 1 | A | N1 | Ranch | 1800 | 12 | 310000 |
| 2 | B | N2 | Colonial | 2500 | 5 | 425000 |
| 3 | A | N3 | Ranch | 1600 | 20 | 295000 |
| 4 | C | N4 | Modern | 3000 | 3 | 760000 |
| 5 | B | N2 | Colonial | 2550 | 6 | 440000 |
| 6 | B | N5 | Ranch | 1900 | 15 | 365000 |
| 7 | A | N1 | Modern | 2100 | 10 | 330000 |
| 8 | D | N6 | Ranch | 1400 | 30 | 220000 |
| 9 | C | N4 | Modern | 3100 | 4 | 770000 |
| 10 | A | N3 | Colonial | 1650 | 19 | 305000 |
Suppose we target-encode Neighborhood (high cardinality) and leave low-cardinality HouseStyle as OHE. The global mean price of the 10 rows above is 422k; use smoothing factor m = 5.
| Neighborhood | Count n | Mean Price | Encoded (M-Estimate) |
|---|---|---|---|
| N1 | 2 | 320k | (2/(2+5))320k + (5/(2+5))422k ≈ 392.9k |
| N2 | 2 | 432.5k | (2/(2+5))432.5k + (5/(2+5))422k = 425.0k |
| N3 | 2 | 300k | (2/(2+5))300k + (5/(2+5))422k ≈ 387.1k |
| N4 | 2 | 765k | (2/(2+5))765k + (5/(2+5))422k = 520.0k |
| N5 | 1 | 365k | (1/(1+5))365k + (5/(1+5))422k = 412.5k |
| N6 | 1 | 220k | (1/(1+5))220k + (5/(1+5))422k ≈ 388.3k |
After Encoding (Neighborhood replaced)
| Row | City | Neighborhood_enc | HouseStyle (OHE example) | SqFt | Age | Price |
|---|---|---|---|---|---|---|
| 1 | A | 392900 | Ranch=1,Colonial=0,Modern=0 | 1800 | 12 | 310000 |
| 2 | B | 425000 | Ranch=0,Colonial=1,Modern=0 | 2500 | 5 | 425000 |
| 3 | A | 387100 | Ranch=1,Colonial=0,Modern=0 | 1600 | 20 | 295000 |
| 4 | C | 520000 | Ranch=0,Colonial=0,Modern=1 | 3000 | 3 | 760000 |
| 5 | B | 425000 | Ranch=0,Colonial=1,Modern=0 | 2550 | 6 | 440000 |
| 6 | B | 412500 | Ranch=1,Colonial=0,Modern=0 | 1900 | 15 | 365000 |
| 7 | A | 392900 | Ranch=0,Colonial=0,Modern=1 | 2100 | 10 | 330000 |
| 8 | D | 388300 | Ranch=1,Colonial=0,Modern=0 | 1400 | 30 | 220000 |
| 9 | C | 520000 | Ranch=0,Colonial=0,Modern=1 | 3100 | 4 | 770000 |
| 10 | A | 387100 | Ranch=0,Colonial=1,Modern=0 | 1650 | 19 | 305000 |
Similarly, you can encode the City column using the same procedure.
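For reference, the Neighborhood table above can be reproduced in a few lines (self-contained; the values match the table after rounding):

```python
import pandas as pd

houses = pd.DataFrame({
    "Neighborhood": ["N1", "N2", "N3", "N4", "N2", "N5", "N1", "N6", "N4", "N3"],
    "Price": [310_000, 425_000, 295_000, 760_000, 440_000,
              365_000, 330_000, 220_000, 770_000, 305_000],
})

m = 5
global_mean = houses["Price"].mean()  # 422,000
stats = houses.groupby("Neighborhood")["Price"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
houses["Neighborhood_enc"] = houses["Neighborhood"].map(smoothed)
print(smoothed.round(1))
# N1 392857.1  N2 425000.0  N3 387142.9  N4 520000.0  N5 412500.0  N6 388333.3
```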
Target Encoding vs. Other Methods
| Encoding Technique | Good For | Considerations for Tree-Based Models |
|---|---|---|
| One-Hot Encoding (OHE) | Low-to-moderate cardinality, nominal data. | Creates many features; trees are less efficient at splitting on many binary features. |
| Label/Ordinal Encoding | Features with a true rank/order. | Assigns integers based on rank; implies a false order for nominal data, which can mislead trees. |
| Target/Mean Encoding | High cardinality, nominal data. | Excellent fit as it captures predictive power in a single dimension. Requires careful regularization to avoid leakage. |
| Frequency Encoding | High cardinality, nominal data. | Encodes categories by their frequency; can help trees but may not capture target relationship. |
| Binary Encoding | High cardinality, nominal data. | Reduces dimensionality compared to OHE; splits categories into binary digits, which trees can handle efficiently. |
| Hash Encoding | Very high cardinality, nominal data. | Maps categories to a fixed number of columns using a hash function; risk of collisions, but useful for scalability. |
| Leave-One-Out Encoding | High cardinality, nominal data. | Similar to target encoding but excludes the current row from the mean calculation; helps reduce leakage. |
| Sum Coding | Nominal data, especially for regression. | Encodes categories as deviations from the overall mean; can help interpret effects but less common for trees. |
| CatBoost Encoding | High cardinality, nominal data. | Uses ordered statistics and permutations to avoid leakage; highly effective for tree-based models like CatBoost. |
| Gray Encoding | High cardinality, nominal data. | Encodes categories using Gray code (binary sequence where only one bit changes at a time); can help trees find splits efficiently. |
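Several of the encoders in the table above are available off the shelf in the category_encoders package. A sketch of its sklearn-style API follows; the class names and constructor parameters are assumptions based on the package’s documented interface and may vary by version:

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

X = pd.DataFrame({"Neighborhood": ["N1", "N2", "N1", "N3", "N2"]})
y = pd.Series([310_000, 425_000, 330_000, 295_000, 440_000])

encoders = {
    "m_estimate":    ce.MEstimateEncoder(m=5.0),   # the smoothing shown above
    "target":        ce.TargetEncoder(),           # smoothed target encoding
    "leave_one_out": ce.LeaveOneOutEncoder(),      # drops the current row's target
    "catboost":      ce.CatBoostEncoder(),         # ordered (permutation) statistics
}
for name, enc in encoders.items():
    print(name, enc.fit_transform(X, y)["Neighborhood"].round(0).tolist())
```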
Takeaways
- Target encoding is a powerful technique for handling high-cardinality categorical features by leveraging the target variable.
- Proper implementation with smoothing and leakage prevention is crucial to avoid overfitting.
- Different encoding methods have their own strengths and weaknesses; choose based on data characteristics and model requirements.

