Scaling and Encoding
Feature Scaling
- Process of normalizing or standardizing features to a common scale.
- Many ML algorithms rely on distances between samples, so scaling is needed to keep large-valued features from dominating.
- In the table below, salary dominates any distance calculation if it is not scaled.
- Types: MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler
Unscaled Features Example:
| Age | Experience | Salary |
|---|---|---|
| 23 | 3 | 3000 |
| 20 | 1 | 3500 |
| 30 | 7 | 5000 |
Unscaled features can lead to dominance of one feature over others in distance-based algorithms.
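To make the dominance concrete, here is a minimal sketch in plain Python (standard library only) that compares the Euclidean distance between the first two rows of the table with the distance computed from salary alone. The variable names are illustrative.

```python
import math

# First two rows of the table above: (age, experience, salary)
row_a = (23, 3, 3000)
row_b = (20, 1, 3500)

# Euclidean distance using all three raw features
full_distance = math.dist(row_a, row_b)

# Euclidean distance using only the salary column
salary_distance = math.dist(row_a[2:], row_b[2:])

print(round(full_distance, 2))    # 500.01
print(round(salary_distance, 2))  # 500.0
```

Age and experience change the distance by about 0.01, so on unscaled data the two rows are effectively compared by salary only.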
MinMaxScaler
- Scales features to a specified range, typically [0, 1].
- Preserves the shape of the original distribution, because the transformation is linear.
- Sensitive to outliers.
- Formula:
X_scaled = (X - X_min) / (X_max - X_min)
MinMax Scaled Features Example:
| Age | Experience | Salary | Age_Scaled | Experience_Scaled | Salary_Scaled |
|---|---|---|---|---|---|
| 23 | 3 | 3000 | (23-20)/(30-20) = 0.3 | (3-1)/(7-1) = 0.33 | (3000-3000)/(5000-3000)= 0 |
| 20 | 1 | 3500 | (20-20)/(30-20) = 0 | (1-1)/(7-1) = 0 | (3500-3000)/(5000-3000)= 0.25 |
| 30 | 7 | 5000 | (30-20)/(30-20) = 1 | (7-1)/(7-1) = 1 | (5000-3000)/(5000-3000)= 1 |
MinMaxScaler transforms features to a common scale but can be affected by outliers.
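The formula can be sketched in a few lines of plain Python. The helper `min_max_scale` is a hypothetical name, and mapping a constant column to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

age = [23, 20, 30]
salary = [3000, 3500, 5000]

age_scaled = min_max_scale(age)
salary_scaled = min_max_scale(salary)

print(age_scaled)     # [0.3, 0.0, 1.0]
print(salary_scaled)  # [0.0, 0.25, 1.0]
```

These values match the Age_Scaled and Salary_Scaled columns above. scikit-learn's `MinMaxScaler` applies the same formula to each column, with the target range set through its `feature_range` parameter.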
StandardScaler / Z-score Scaling
- Scales features to have zero mean and unit variance.
- Less sensitive to outliers than MinMaxScaler, although the mean and standard deviation are still affected by them.
- Widely used in ML.
- Values below the mean become negative and values above the mean become positive.
- Formula:
X_scaled = (X - μ) / σ
Standard Scaled Features Example:
| Age | Experience | Salary | Age_Scaled | Experience_Scaled | Salary_Scaled |
|---|---|---|---|---|---|
| 23 | 3 | 3000 | (23-24.33)/4.19 = -0.32 | (3-3.67)/2.49 = -0.27 | (3000-3833.33)/849.84 = -0.98 |
| 20 | 1 | 3500 | (20-24.33)/4.19 = -1.03 | (1-3.67)/2.49 = -1.07 | (3500-3833.33)/849.84 = -0.39 |
| 30 | 7 | 5000 | (30-24.33)/4.19 = 1.35 | (7-3.67)/2.49 = 1.34 | (5000-3833.33)/849.84 = 1.37 |
StandardScaler transforms features to have zero mean and unit variance, making it less sensitive to outliers than MinMaxScaler. The values above use the population standard deviation (dividing by n), as scikit-learn does.
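A minimal sketch of the z-score formula on the Age column, assuming the population standard deviation (division by n, i.e. ddof = 0), which is the estimator scikit-learn's `StandardScaler` uses. The helper name `standardize` is illustrative.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std (divides by n)
    return [(v - mu) / sigma for v in values]

age = [23, 20, 30]
age_z = standardize(age)

print([round(z, 2) for z in age_z])  # [-0.32, -1.03, 1.35]
```

Using the sample standard deviation (`statistics.stdev`, dividing by n - 1) would give slightly smaller magnitudes, so it matters which convention a worked example follows.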
RobustScaler
- Scales features using statistics that are robust to outliers (median and IQR).
- Less sensitive to outliers than both MinMaxScaler and StandardScaler.
- Useful when data contains many outliers.
- Formula:
X_scaled = (X - median) / IQR, where IQR = Q3 - Q1
Robust Scaled Features Example:
| Age | Experience | Salary | Age_Scaled | Experience_Scaled | Salary_Scaled |
|---|---|---|---|---|---|
| 23 | 3 | 3000 | (23-23)/5 = 0 | (3-3)/3 = 0 | (3000-3500)/1000 = -0.5 |
| 20 | 1 | 3500 | (20-23)/5 = -0.6 | (1-3)/3 = -0.67 | (3500-3500)/1000 = 0 |
| 30 | 7 | 5000 | (30-23)/5 = 1.4 | (7-3)/3 = 1.33 | (5000-3500)/1000 = 1.5 |
RobustScaler uses median and IQR to scale features, making it effective for datasets with many outliers. Quartiles here are computed with linear interpolation (for Age: Q1 = 21.5, Q3 = 26.5, so IQR = 5).
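The following sketch applies the median/IQR formula in plain Python. It assumes quartiles are computed by linear interpolation, which is the default convention of NumPy percentiles; other quartile definitions give different IQR values on such a small sample. The helper name `robust_scale` is illustrative.

```python
import statistics

def robust_scale(values):
    """Return (x - median) / IQR with linearly interpolated quartiles."""
    med = statistics.median(values)
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - med) / iqr for v in values]

age = [23, 20, 30]
salary = [3000, 3500, 5000]

age_robust = robust_scale(age)
salary_robust = robust_scale(salary)

print(age_robust)     # [0.0, -0.6, 1.4]
print(salary_robust)  # [-0.5, 0.0, 1.5]
```

Unlike min-max scaling, the output is not confined to a fixed range: the median maps to 0 and the other values are expressed in units of the IQR.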
MaxAbsScaler
- Scales features by their maximum absolute value, preserving sparsity.
- Useful for data that is already centered at zero and sparse.
- Scales data to the range [-1, 1].
- Formula:
X_scaled = X / max(|X|)
MaxAbs Scaled Features Example:
| Age | Experience | Salary | Age_Scaled | Experience_Scaled | Salary_Scaled |
|---|---|---|---|---|---|
| 23 | 3 | 3000 | 23/30 = 0.77 | 3/7 = 0.43 | 3000/5000= 0.6 |
| 20 | 1 | 3500 | 20/30 = 0.67 | 1/7 = 0.14 | 3500/5000= 0.7 |
| 30 | 7 | 5000 | 30/30 = 1 | 7/7 = 1 | 5000/5000= 1 |
MaxAbsScaler scales features by their maximum absolute value, preserving sparsity and scaling data to the range [-1, 1].
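A short sketch of the formula in plain Python. The helper `max_abs_scale` is a hypothetical name, and returning zeros for an all-zero column is an assumption made here to avoid dividing by zero.

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the column."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        # Assumption: an all-zero column stays all zeros
        return [0.0 for _ in values]
    return [v / peak for v in values]

age = [23, 20, 30]
age_scaled = max_abs_scale(age)
print([round(v, 2) for v in age_scaled])  # [0.77, 0.67, 1.0]

# A mostly-zero column with a negative value: zeros stay zero
sparse_column = [0, -4, 0, 2]
sparse_scaled = max_abs_scale(sparse_column)
print(sparse_scaled)  # [0.0, -1.0, 0.0, 0.5]
```

Because the data is only divided and never shifted, zero entries remain zero, which is why this scaler preserves sparsity.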
Scaler Comparison
| Scaling Method | Output Range | Robust to Outliers | Best For |
|---|---|---|---|
| Min-Max | 0 to 1 | No | Image data |
| Standard | Mean 0, Std 1 | No | Most ML models |
| Robust | No fixed range | Yes | Outlier-heavy data |
| MaxAbs | -1 to 1 | No | Sparse data |
Comparison of different scaling methods based on their properties and use cases.
Encoding
- Most scikit-learn estimators cannot work directly with categorical (non-numeric) features.
- They have to be encoded as numbers before training a model.
Common Encoding Techniques and Their Uses
| Name | Best For |
|---|---|
| Ordinal Encoding | Ordinal categorical features with clear order (e.g., low, medium, high) |
| Label Encoding | Target labels, or nominal features where an arbitrary integer code is acceptable |
| One-Hot Encoding | Nominal categorical features without inherent order |
| Binary Encoding | High-cardinality nominal categorical features |
| Frequency Encoding | Categorical features with varying frequencies |
| Target Mean Encoding | Categorical features in supervised problems, where target values are available for training |
Common encoding techniques and their best use cases for handling categorical features in machine learning.
Ordinal Encoding
- Assigns integer values to categories based on their order.
- Not suitable for nominal features as it implies a relationship between categories.
Ordinal Encoding Example:
| size | brand | warranty | color | price | size_encoded |
|---|---|---|---|---|---|
| small | adidas | 1 | red | 100 | 1 |
| large | nike | 2 | blue | 150 | 3 |
| medium | nike | 2 | red | 200 | 2 |
| medium | adidas | 1.5 | blue | 250 | 2 |
| small | adidas | 1 | green | 300 | 1 |
Ordinal encoding for the size feature (small < medium < large).
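A minimal sketch of ordinal encoding in plain Python: the category order is stated explicitly and each category is replaced by its rank. The 1-based codes follow the table above.

```python
# Explicit order of the categories, from lowest to highest
size_order = ["small", "medium", "large"]

# Map each category to its rank (1-based, as in the table above)
size_to_code = {name: rank for rank, name in enumerate(size_order, start=1)}

sizes = ["small", "large", "medium", "medium", "small"]
size_encoded = [size_to_code[s] for s in sizes]

print(size_encoded)  # [1, 3, 2, 2, 1]
```

scikit-learn's `OrdinalEncoder` accepts the same kind of explicit order through its `categories` parameter, but numbers the categories from 0 rather than 1.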
Label Encoding
- Assigns a unique integer to each category.
- Numbers are assigned in alphabetical order, which can imply a misleading ordering between categories.
Label Encoding Example:
| size | brand | warranty | color | price | size_encoded |
|---|---|---|---|---|---|
| small | adidas | 1 | red | 100 | 3 |
| large | nike | 2 | blue | 150 | 1 |
| medium | nike | 2 | red | 200 | 2 |
| medium | adidas | 1.5 | blue | 250 | 2 |
| small | adidas | 1 | green | 300 | 3 |
Label encoding for the size feature (codes follow alphabetical order: large, medium, small).
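The alphabetical assignment can be sketched in plain Python as follows. The 1-based codes are chosen to match the table above.

```python
sizes = ["small", "large", "medium", "medium", "small"]

# Unique categories in alphabetical order
classes = sorted(set(sizes))  # ['large', 'medium', 'small']

# Assign 1-based integer codes in that order
label_of = {name: code for code, name in enumerate(classes, start=1)}
size_encoded = [label_of[s] for s in sizes]

print(size_encoded)  # [3, 1, 2, 2, 3]
```

Here large < medium < small, which does not reflect the real size order; this is the misleading relationship mentioned above. scikit-learn's `LabelEncoder` also sorts the classes but numbers them from 0, so it would return 2, 0, 1, 1, 2 for the same column; its documentation describes it as a tool for encoding target labels rather than input features.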
One-Hot Encoding
- Creates one binary (0/1) column per category.
- Prevents misinterpretation of relationships between categories.
- Can lead to high dimensionality with many categories.
One-Hot Encoding Example:
| size | brand | warranty | color | price | brand_adidas | brand_nike |
|---|---|---|---|---|---|---|
| small | adidas | 1 | red | 100 | 1 | 0 |
| large | nike | 2 | blue | 150 | 0 | 1 |
| medium | nike | 2 | red | 200 | 0 | 1 |
| medium | adidas | 1.5 | blue | 250 | 1 | 0 |
| small | adidas | 1 | green | 300 | 1 | 0 |
One-hot encoding for the brand feature.
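A minimal sketch of one-hot encoding in plain Python, building one 0/1 column per brand. Column order is alphabetical here, which is an illustrative choice.

```python
brands = ["adidas", "nike", "nike", "adidas", "adidas"]

# One output column per distinct category, in alphabetical order
categories = sorted(set(brands))  # ['adidas', 'nike']

# Each row gets a 1 in the column of its own category and 0 elsewhere
one_hot = [[1 if brand == cat else 0 for cat in categories] for brand in brands]

for brand, row in zip(brands, one_hot):
    print(brand, row)
```

Every row contains exactly one 1, so no ordering between brands is implied. In practice, `pandas.get_dummies` or scikit-learn's `OneHotEncoder` do the same job on whole tables.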
Binary Encoding
- Converts each category to an integer code, writes that code in binary, and creates one column per binary digit.
- Reduces dimensionality compared to one-hot encoding for high-cardinality features.
- Can be less interpretable than one-hot encoding.
Binary Encoding Example:
| size | brand | warranty | color | price | color_1 | color_2 |
|---|---|---|---|---|---|---|
| small | adidas | 1 | red | 100 | 0 | 1 |
| large | nike | 2 | blue | 150 | 1 | 0 |
| medium | nike | 2 | red | 200 | 0 | 1 |
| medium | adidas | 1.5 | blue | 250 | 1 | 0 |
| small | adidas | 1 | green | 300 | 1 | 1 |
Binary encoding for the color feature (red = 01, blue = 10, green = 11).
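The two steps (integer code, then binary digits) can be sketched in plain Python. The integer codes red = 1, blue = 2, green = 3 are an assumption chosen to match the table above, and `to_bits` is a hypothetical helper.

```python
colors = ["red", "blue", "red", "blue", "green"]

# Assumption: integer codes picked by hand to match the table above
color_code = {"red": 1, "blue": 2, "green": 3}

# Number of binary digits needed for the largest code
n_bits = max(color_code.values()).bit_length()  # 2

def to_bits(code, width):
    """Return the binary digits of code, most significant digit first."""
    return [(code >> shift) & 1 for shift in range(width - 1, -1, -1)]

color_binary = [to_bits(color_code[c], n_bits) for c in colors]

print(color_binary)  # [[0, 1], [1, 0], [0, 1], [1, 0], [1, 1]]
```

Three categories need only two columns here, and the saving grows with cardinality: 100 categories fit in 7 binary columns instead of 100 one-hot columns. Third-party packages such as `category_encoders` provide a ready-made `BinaryEncoder`.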
Frequency Encoding
- Replaces categories with their frequency in the dataset.
- Can capture the importance of categories based on their occurrence.
- Categories that occur equally often receive the same value and become indistinguishable (red and blue in the example below).
Frequency Encoding Example:
| size | brand | warranty | color | price | color_freq |
|---|---|---|---|---|---|
| small | adidas | 1 | red | 100 | 2 |
| large | nike | 2 | blue | 150 | 2 |
| medium | nike | 2 | red | 200 | 2 |
| medium | adidas | 1.5 | blue | 250 | 2 |
| small | adidas | 1 | green | 300 | 1 |
Frequency encoding for the color feature.
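A minimal sketch of frequency encoding using the standard library's `Counter`. Raw counts are used, as in the table above; dividing by the number of rows would give relative frequencies instead.

```python
from collections import Counter

colors = ["red", "blue", "red", "blue", "green"]

# Count how often each category occurs
color_counts = Counter(colors)

# Replace each category with its count
color_freq = [color_counts[c] for c in colors]

print(color_freq)  # [2, 2, 2, 2, 1]
```

Red and blue both map to 2, so after encoding the model can no longer tell them apart.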
Target Mean Encoding
- Replaces categories with the mean of the target variable for that category.
- Can capture relationship between categorical feature and target variable.
Target Mean Encoding Example:
| size | brand | warranty | color | price | color_target_mean |
|---|---|---|---|---|---|
| small | adidas | 1 | red | 100 | 150 |
| large | nike | 2 | blue | 150 | 200 |
| medium | nike | 2 | red | 200 | 150 |
| medium | adidas | 1.5 | blue | 250 | 200 |
| small | adidas | 1 | green | 300 | 300 |
Target mean encoding for the color feature, with price as the target.
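A minimal sketch of target mean encoding in plain Python, assuming price is the target variable (the values in the table above are consistent with that). Variable names are illustrative.

```python
from collections import defaultdict

colors = ["red", "blue", "red", "blue", "green"]
prices = [100, 150, 200, 250, 300]  # assumed target variable

# Collect the target values observed for each category
prices_by_color = defaultdict(list)
for color, price in zip(colors, prices):
    prices_by_color[color].append(price)

# Mean target per category
color_mean = {c: sum(v) / len(v) for c, v in prices_by_color.items()}

# Replace each category with its mean target
color_target_mean = [color_mean[c] for c in colors]

print(color_target_mean)  # [150.0, 200.0, 150.0, 200.0, 300.0]
```

Green appears only once, so its encoded value is simply that row's own price. For this reason the means are normally computed on the training split only and then applied to validation and test data.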
