Scaling and Encoding

Feature Scaling

  • Feature scaling normalizes or standardizes features so that they share a common scale.
  • Many algorithms rely on distances between samples (e.g., k-nearest neighbors, k-means), so scaling keeps features with large values from dominating.
  • In the table below, salary dominates a distance-based algorithm if the features are not scaled.
  • Types: MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler
Unscaled Features Example:
Age   Experience   Salary
23    3            3000
20    1            3500
30    7            5000
Unscaled features can lead to dominance of one feature over others in distance-based algorithms.
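To make the dominance concrete, the sketch below computes the Euclidean distance between the first two rows of the unscaled table, once with all three features and once with salary alone. The helper name euclidean is only illustrative and not part of any library.

```python
import math

# First two rows of the unscaled table: (age, experience, salary)
a = (23, 3, 3000)
b = (20, 1, 3500)

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

full = euclidean(a, b)                 # uses age, experience and salary
salary_only = euclidean(a[2:], b[2:])  # uses salary alone

print(round(full, 2), round(salary_only, 2))  # 500.01 500.0
```

The two distances are almost identical, so age and experience contribute next to nothing until the features are scaled.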

MinMaxScaler

  • Scales features to a specified range, typically [0, 1].
  • Preserves the shape of the original distribution.
  • Sensitive to outliers.
  • Formula: X_scaled = (X - X_min) / (X_max - X_min)
MinMax Scaled Features Example:
Age   Experience   Salary   Age_Scaled                Experience_Scaled     Salary_Scaled
23    3            3000     (23-20)/(30-20) = 0.3     (3-1)/(7-1) = 0.33    (3000-3000)/(5000-3000) = 0
20    1            3500     (20-20)/(30-20) = 0       (1-1)/(7-1) = 0       (3500-3000)/(5000-3000) = 0.25
30    7            5000     (30-20)/(30-20) = 1       (7-1)/(7-1) = 1       (5000-3000)/(5000-3000) = 1
MinMaxScaler transforms features to a common scale but can be affected by outliers.
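The min-max formula can be sketched in plain Python as follows. The function name min_max_scale and its feature_range argument are illustrative choices for this sketch, and mapping a constant column to the lower bound is an assumption made here to avoid division by zero.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale one feature column linearly into feature_range."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    if hi == lo:
        # Constant column: assumption of this sketch, map everything to a.
        return [a for _ in values]
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]

ages = [23, 20, 30]
salaries = [3000, 3500, 5000]

age_scaled = min_max_scale(ages)
salary_scaled = min_max_scale(salaries)

print([round(v, 2) for v in age_scaled])     # [0.3, 0.0, 1.0]
print([round(v, 2) for v in salary_scaled])  # [0.0, 0.25, 1.0]
```

Because the minimum and maximum define the output range, a single extreme value would squeeze all other values into a narrow band, which is why this scaler is sensitive to outliers.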

StandardScaler / Z-score Scaling

  • Scales features to have zero mean and unit variance.
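  • Less sensitive to outliers than MinMaxScaler, although the mean and standard deviation are still affected by them.
  • Widely used in machine learning.
  • Values below the mean become negative and values above the mean become positive.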
  • Formula: X_scaled = (X - μ) / σ
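Standard Scaled Features Example (σ is the population standard deviation, as used by scikit-learn):
Age   Experience   Salary   Age_Scaled                  Experience_Scaled         Salary_Scaled
23    3            3000     (23-24.33)/4.19 = -0.32     (3-3.67)/2.49 = -0.27     (3000-3833.33)/849.84 = -0.98
20    1            3500     (20-24.33)/4.19 = -1.03     (1-3.67)/2.49 = -1.07     (3500-3833.33)/849.84 = -0.39
30    7            5000     (30-24.33)/4.19 = 1.35      (7-3.67)/2.49 = 1.34      (5000-3833.33)/849.84 = 1.37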
StandardScaler transforms features to have zero mean and unit variance, making it less sensitive to outliers than MinMaxScaler.
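A minimal sketch of z-score scaling with the standard library is shown below. It assumes the population standard deviation (statistics.pstdev), which is the convention scikit-learn's StandardScaler follows; using the sample standard deviation instead would give slightly smaller magnitudes. The function name standardize is illustrative.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant column: assumption of this sketch, return all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [23, 20, 30]
age_z = standardize(ages)

print([round(z, 2) for z in age_z])  # [-0.32, -1.03, 1.35]
```

After scaling, the column has mean 0 and standard deviation 1, so values below the original mean are negative and values above it are positive.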

RobustScaler

  • Scales features using statistics that are robust to outliers (median and IQR).
  • Less sensitive to outliers than both MinMaxScaler and StandardScaler.
  • Useful when data contains many outliers.
  • Formula: X_scaled = (X - median) / IQR
Robust Scaled Features Example (quartiles computed by linear interpolation, so IQR = 5, 3, and 1000):
Age   Experience   Salary   Age_Scaled           Experience_Scaled    Salary_Scaled
23    3            3000     (23-23)/5 = 0        (3-3)/3 = 0          (3000-3500)/1000 = -0.5
20    1            3500     (20-23)/5 = -0.6     (1-3)/3 = -0.67      (3500-3500)/1000 = 0
30    7            5000     (30-23)/5 = 1.4      (7-3)/3 = 1.33       (5000-3500)/1000 = 1.5
RobustScaler uses median and IQR to scale features, making it effective for datasets with many outliers.
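The median and IQR formula can be sketched with the standard library as below. The quartile convention is an assumption: this sketch uses linear interpolation (statistics.quantiles with method="inclusive"), and other conventions give different IQR values on very small samples. The function name robust_scale is illustrative.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Return (x - median) / IQR for one feature column."""
    med = median(values)
    # Quartiles by linear interpolation between the sorted data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate spread: assumption of this sketch, only center the data.
        return [float(v - med) for v in values]
    return [(v - med) / iqr for v in values]

ages = [23, 20, 30]
age_robust = robust_scale(ages)

print([round(v, 2) for v in age_robust])  # [0.0, -0.6, 1.4]
```

Since the median and quartiles depend on the middle of the data, adding one extreme value changes them far less than it changes a minimum, maximum, or mean.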

MaxAbsScaler

  • Scales features by their maximum absolute value, preserving sparsity.
  • Useful for data that is already centered at zero and sparse.
  • Scales data to the range [-1, 1].
  • Formula: X_scaled = X / max(|X|)
MaxAbs Scaled Features Example:
Age   Experience   Salary   Age_Scaled        Experience_Scaled   Salary_Scaled
23    3            3000     23/30 = 0.77      3/7 = 0.43          3000/5000 = 0.6
20    1            3500     20/30 = 0.67      1/7 = 0.14          3500/5000 = 0.7
30    7            5000     30/30 = 1         7/7 = 1             5000/5000 = 1
MaxAbsScaler scales features by their maximum absolute value, preserving sparsity and scaling data to the range [-1, 1].
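The following sketch applies the max-abs formula to a small column that contains zeros and a negative value, to show why sparsity is preserved. The function name max_abs_scale and the sample column are illustrative and not taken from the tables above.

```python
def max_abs_scale(values):
    """Divide each value by the largest absolute value in the column."""
    m = max(abs(v) for v in values)
    if m == 0:
        # All-zero column: assumption of this sketch, leave it unchanged.
        return [float(v) for v in values]
    return [v / m for v in values]

# Hypothetical sparse column: mostly zeros, with one negative entry.
sparse_column = [0, 0, -4, 0, 2]
scaled = max_abs_scale(sparse_column)

print(scaled)  # [0.0, 0.0, -1.0, 0.0, 0.5]
```

No value is shifted, only divided, so every zero stays zero and the result falls within [-1, 1].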

Scaler Comparison

Scaler Comparison:
Scaling Method   Range                          Robust to Outliers   Best For
Min-Max          0 to 1                         No                   Image data
Standard         Unbounded (mean 0, std 1)      No                   Most ML models
Robust           No fixed range                 Yes                  Outlier-heavy data
MaxAbs           -1 to 1                        No                   Sparse data
Comparison of different scaling methods based on their properties and use cases.

Encoding

  • Most scikit-learn estimators expect numeric input and cannot use raw categorical (string) features directly.
  • Categorical features have to be encoded as numbers before training a model.
Common Encoding Techniques and Their Uses:
Name                   Best For
Ordinal Encoding       Ordinal categorical features with a clear order (e.g., low, medium, high)
Label Encoding         Target labels, or nominal features when the implied order is harmless (e.g., tree-based models)
One-Hot Encoding       Nominal categorical features without inherent order
Binary Encoding        High-cardinality nominal categorical features
Frequency Encoding     Categorical features whose frequency of occurrence is informative
Target Mean Encoding   Categorical features in supervised tasks, where the target variable is known
Common encoding techniques and their best use cases for handling categorical features in machine learning.

Ordinal Encoding

  • Assigns integer values to categories based on their order.
  • Not suitable for nominal features as it implies a relationship between categories.
Ordinal Encoding Example:
size     brand    warranty   color   price   size_encoded
small    adidas   1          red     100     1
large    nike     2          blue    150     3
medium   nike     2          red     200     2
medium   adidas   1.5        blue    250     2
small    adidas   1          green   300     1
Ordinal encoding for size feature.
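Ordinal encoding amounts to an explicit, ordered mapping. The sketch below assumes the order small < medium < large with codes starting at 1, as in the example table; the names SIZE_ORDER and size_to_code are illustrative.

```python
# Assumed category order, lowest to highest.
SIZE_ORDER = ["small", "medium", "large"]

# Map each category to its rank, starting at 1 to match the table.
size_to_code = {size: rank for rank, size in enumerate(SIZE_ORDER, start=1)}

sizes = ["small", "large", "medium", "medium", "small"]
size_encoded = [size_to_code[s] for s in sizes]

print(size_encoded)  # [1, 3, 2, 2, 1]
```

Listing the order by hand is the key design choice: the codes carry meaning only because the order was stated, not inferred from the strings.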

Label Encoding

  • Assigns a unique integer to each category.
  • Numbers are typically assigned in alphabetical order, which can suggest a misleading relationship between categories.
Label Encoding Example:
size     brand    warranty   color   price   size_encoded
small    adidas   1          red     100     3
large    nike     2          blue    150     1
medium   nike     2          red     200     2
medium   adidas   1.5        blue    250     2
small    adidas   1          green   300     3
Label encoding for size feature.
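The alphabetical assignment can be sketched as below. Codes start at 1 here only to match the example table; scikit-learn's LabelEncoder sorts the classes the same way but numbers them from 0. The variable names are illustrative.

```python
sizes = ["small", "large", "medium", "medium", "small"]

# Sort the distinct categories alphabetically and number them from 1.
classes = sorted(set(sizes))  # ['large', 'medium', 'small']
label_of = {name: code for code, name in enumerate(classes, start=1)}

size_encoded = [label_of[s] for s in sizes]

print(size_encoded)  # [3, 1, 2, 2, 3]
```

Note that "large" receives the smallest code and "small" the largest, which is exactly the kind of accidental ordering that can mislead a model.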

One-Hot Encoding

  • Creates one binary (0/1) column per category.
  • Prevents misinterpretation of relationships between categories.
  • Can lead to high dimensionality with many categories.
One-Hot Encoding Example:
size     brand    warranty   color   price   brand_adidas   brand_nike
small    adidas   1          red     100     1              0
large    nike     2          blue    150     0              1
medium   nike     2          red     200     0              1
medium   adidas   1.5        blue    250     1              0
small    adidas   1          green   300     1              0
One-hot encoding for brand feature.
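A minimal sketch of one-hot encoding for the brand column, using only built-in types, might look like this. The column naming scheme brand_<category> follows the example table; the variable names are illustrative.

```python
brands = ["adidas", "nike", "nike", "adidas", "adidas"]

# One output column per distinct category, in sorted order.
categories = sorted(set(brands))  # ['adidas', 'nike']
column_names = [f"brand_{c}" for c in categories]

# Each row has a 1 in the column of its own category and 0 elsewhere.
one_hot = [[int(b == c) for c in categories] for b in brands]

print(column_names)  # ['brand_adidas', 'brand_nike']
print(one_hot)       # [[1, 0], [0, 1], [0, 1], [1, 0], [1, 0]]
```

The number of new columns equals the number of distinct categories, which is why high-cardinality features inflate the dimensionality.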

Binary Encoding

  • Converts categories to binary digits and creates new columns for each digit.
  • Reduces dimensionality compared to one-hot encoding for high-cardinality features.
  • Can be less interpretable than one-hot encoding.
Binary Encoding Example:
size     brand    warranty   color   price   color_1   color_2
small    adidas   1          red     100     0         1
large    nike     2          blue    150     1         0
medium   nike     2          red     200     0         1
medium   adidas   1.5        blue    250     1         0
small    adidas   1          green   300     1         1
Binary encoding for color feature.
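The two steps (integer code, then binary digits) can be sketched as below. The sketch assumes codes are assigned from 1 in order of first appearance, which reproduces the example table (red = 01, blue = 10, green = 11); other implementations may number the categories differently.

```python
colors = ["red", "blue", "red", "blue", "green"]

# Step 1: give each category an integer code, starting at 1,
# in order of first appearance (assumption of this sketch).
codes = {}
for c in colors:
    codes.setdefault(c, len(codes) + 1)

# Step 2: write each code as fixed-width binary, one column per bit.
width = max(codes.values()).bit_length()
binary_cols = [[int(bit) for bit in format(codes[c], f"0{width}b")]
               for c in colors]

print(codes)        # {'red': 1, 'blue': 2, 'green': 3}
print(binary_cols)  # [[0, 1], [1, 0], [0, 1], [1, 0], [1, 1]]
```

Three categories need only two columns here, and in general the column count grows with the logarithm of the number of categories rather than linearly.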

Frequency Encoding

  • Replaces categories with their frequency in the dataset.
  • Can capture the importance of categories based on their occurrence.
  • May not be suitable for all types of categorical features.
Frequency Encoding Example:
size     brand    warranty   color   price   color_freq
small    adidas   1          red     100     2
large    nike     2          blue    150     2
medium   nike     2          red     200     2
medium   adidas   1.5        blue    250     2
small    adidas   1          green   300     1
Frequency encoding for color feature.
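Frequency encoding can be sketched with collections.Counter as follows. The sketch uses raw counts, as in the example table; dividing by the number of rows to get proportions is a common variant.

```python
from collections import Counter

colors = ["red", "blue", "red", "blue", "green"]

# Count how often each category occurs in the column.
counts = Counter(colors)

# Replace each category with its count.
color_freq = [counts[c] for c in colors]

print(color_freq)  # [2, 2, 2, 2, 1]
```

Red and blue end up with the same code because they occur equally often, which illustrates why this encoding does not suit every feature: distinct categories can become indistinguishable.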

Target Mean Encoding

  • Replaces categories with the mean of the target variable for that category.
  • Can capture the relationship between a categorical feature and the target variable.
Target Mean Encoding Example:
size     brand    warranty   color   price   color_target_mean
small    adidas   1          red     100     150
large    nike     2          blue    150     200
medium   nike     2          red     200     150
medium   adidas   1.5        blue    250     200
small    adidas   1          green   300     300
Target mean encoding for the color feature, with price as the target.
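A minimal sketch of target mean encoding, assuming price is the target, groups the target values by category and averages them. The variable names are illustrative. In practice the means are usually computed on the training data only, so that target information does not leak into validation or test rows.

```python
from collections import defaultdict
from statistics import fmean

colors = ["red", "blue", "red", "blue", "green"]
prices = [100, 150, 200, 250, 300]  # target variable (assumed)

# Collect the target values observed for each category.
targets_by_color = defaultdict(list)
for color, price in zip(colors, prices):
    targets_by_color[color].append(price)

# Mean target per category, then replace each category with its mean.
mean_by_color = {c: fmean(v) for c, v in targets_by_color.items()}
color_target_mean = [mean_by_color[c] for c in colors]

print(color_target_mean)  # [150.0, 200.0, 150.0, 200.0, 300.0]
```

Green appears only once, so its encoded value is just that row's own price, which shows how rare categories can carry the target almost directly into the feature.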