Scaling and Encoding

Feature Scaling

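Feature scaling is the process of normalizing or standardizing features to a common scale.
Many algorithms rely on distances between samples (for example k-nearest neighbors, k-means, and SVMs), so scaling prevents features with large magnitudes from dominating.
In the table below, salary dominates the distance calculation if the features are not scaled.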
Types: MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler

Unscaled Features Example:

Age	Experience	Salary
23	3	3000
20	1	3500
30	7	5000

Unscaled features can lead to dominance of one feature over others in distance-based algorithms.
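
To make the dominance concrete, here is a minimal sketch that computes the Euclidean distance between the first two rows of the table and the share of the squared distance contributed by each feature. The variable names are illustrative, not part of any library.

```python
import numpy as np

# Rows from the unscaled table: [age, experience, salary]
X = np.array([
    [23, 3, 3000],
    [20, 1, 3500],
    [30, 7, 5000],
], dtype=float)

# Euclidean distance between the first two rows
diff = X[0] - X[1]
distance = np.sqrt(np.sum(diff ** 2))

# Share of the squared distance contributed by each feature
contribution = diff ** 2 / np.sum(diff ** 2)

print(f"distance = {distance:.2f}")
for name, share in zip(["age", "experience", "salary"], contribution):
    print(f"{name}: {share:.5f}")
```

Salary accounts for more than 99.9% of the squared distance here, so age and experience have almost no influence on which rows count as "close".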

MinMaxScaler

Scales features to a specified range, typically [0, 1].
Preserves the shape of the original distribution.
Sensitive to outliers.
Formula: X_scaled = (X - X_min) / (X_max - X_min)

MinMax Scaled Features Example:

Age	Experience	Salary	Age_Scaled	Experience_Scaled	Salary_Scaled
23	3	3000	`(23-20)/(30-20)` = 0.3	`(3-1)/(7-1)` = 0.33	`(3000-3000)/(5000-3000)` = 0
20	1	3500	`(20-20)/(30-20)` = 0	`(1-1)/(7-1)` = 0	`(3500-3000)/(5000-3000)` = 0.25
30	7	5000	`(30-20)/(30-20)` = 1	`(7-1)/(7-1)` = 1	`(5000-3000)/(5000-3000)` = 1

MinMaxScaler transforms features to a common scale but can be affected by outliers.
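
The min-max formula can be checked with scikit-learn's `MinMaxScaler`. This is a small sketch on the three example rows; the array name `X` is just for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: age, experience, salary
X = np.array([
    [23, 3, 3000],
    [20, 1, 3500],
    [30, 7, 5000],
], dtype=float)

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_minmax = scaler.fit_transform(X)

# Each column now has minimum 0 and maximum 1
print(X_minmax.round(2))
print("per-feature min:", scaler.data_min_)
print("per-feature max:", scaler.data_max_)
```

In a real pipeline the scaler is normally fit on the training split only and then applied to the test split with `transform`, so that test-set statistics do not leak into training.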

StandardScaler / Z-score Scaling

Scales features to have zero mean and unit variance.
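Less sensitive to outliers than MinMaxScaler, although the mean and standard deviation are still affected by them.
Widely used in machine learning.
Values below the mean become negative and values above the mean become positive.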
Formula: X_scaled = (X - μ) / σ

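Standard Scaled Features Example (σ is the population standard deviation, as used by scikit-learn's StandardScaler):

Age	Experience	Salary	Age_Scaled	Experience_Scaled	Salary_Scaled
23	3	3000	`(23-24.33)/4.19` = -0.32	`(3-3.67)/2.49` = -0.27	`(3000-3833.33)/849.84` = -0.98
20	1	3500	`(20-24.33)/4.19` = -1.03	`(1-3.67)/2.49` = -1.07	`(3500-3833.33)/849.84` = -0.39
30	7	5000	`(30-24.33)/4.19` = 1.35	`(7-3.67)/2.49` = 1.34	`(5000-3833.33)/849.84` = 1.37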

StandardScaler transforms features to have zero mean and unit variance, making it less sensitive to outliers than MinMaxScaler.
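
As a quick check of the z-score formula, the sketch below applies scikit-learn's `StandardScaler` to the same three rows. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), so results differ from hand calculations that use the sample standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age, experience, salary
X = np.array([
    [23, 3, 3000],
    [20, 1, 3500],
    [30, 7, 5000],
], dtype=float)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print("means:", scaler.mean_.round(2))    # per-feature mean
print("stds: ", scaler.scale_.round(2))   # per-feature population std
print(X_std.round(2))

# After scaling, every column has mean 0 and standard deviation 1
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```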

RobustScaler

Scales features using statistics that are robust to outliers (median and IQR).
Less sensitive to outliers than both MinMaxScaler and StandardScaler.
Useful when data contains many outliers.
Formula: X_scaled = (X - median) / IQR

Robust Scaled Features Example (quartiles computed with linear interpolation, as in scikit-learn's RobustScaler):

Age	Experience	Salary	Age_Scaled	Experience_Scaled	Salary_Scaled
23	3	3000	`(23-23)/5` = 0	`(3-3)/3` = 0	`(3000-3500)/1000` = -0.5
20	1	3500	`(20-23)/5` = -0.6	`(1-3)/3` = -0.67	`(3500-3500)/1000` = 0
30	7	5000	`(30-23)/5` = 1.4	`(7-3)/3` = 1.33	`(5000-3500)/1000` = 1.5

RobustScaler uses median and IQR to scale features, making it effective for datasets with many outliers.
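
The following sketch runs scikit-learn's `RobustScaler` on the three example rows. With its default settings it centers on the median and divides by the range between the 25th and 75th percentiles, computed with linear interpolation; for this data that gives medians of 23, 3, and 3500 and IQRs of 5, 3, and 1000.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Columns: age, experience, salary
X = np.array([
    [23, 3, 3000],
    [20, 1, 3500],
    [30, 7, 5000],
], dtype=float)

scaler = RobustScaler()             # default quantile_range is (25.0, 75.0)
X_robust = scaler.fit_transform(X)

print("medians:", scaler.center_)   # per-feature median
print("IQRs:   ", scaler.scale_)    # per-feature interquartile range
print(X_robust.round(2))
```

With only three rows the quartiles depend heavily on the interpolation method, so other quartile conventions would give different IQRs.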

MaxAbsScaler

Scales features by their maximum absolute value, preserving sparsity.
Useful for data that is already centered at zero and sparse.
Scales data to the range [-1, 1].
Formula: X_scaled = X / max(|X|)

MaxAbs Scaled Features Example:

Age	Experience	Salary	Age_Scaled	Experience_Scaled	Salary_Scaled
23	3	3000	`23/30` = 0.77	`3/7` = 0.43	`3000/5000` = 0.6
20	1	3500	`20/30` = 0.67	`1/7` = 0.14	`3500/5000` = 0.7
30	7	5000	`30/30` = 1	`7/7` = 1	`5000/5000` = 1

MaxAbsScaler scales features by their maximum absolute value, preserving sparsity and scaling data to the range [-1, 1].
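
A minimal sketch with scikit-learn's `MaxAbsScaler` on the example rows. Because every value here is non-negative, the result falls in [0, 1]; negative inputs would map into [-1, 0).

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Columns: age, experience, salary
X = np.array([
    [23, 3, 3000],
    [20, 1, 3500],
    [30, 7, 5000],
], dtype=float)

scaler = MaxAbsScaler()
X_maxabs = scaler.fit_transform(X)

print("max |x| per feature:", scaler.max_abs_)
print(X_maxabs.round(2))
```

Since the data is only divided and never shifted, zeros stay zeros, which is why this scaler preserves sparsity.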

Scaler Comparison

Scaler Comparison:

Scaling Method	Range	Robust to Outliers	Best For
Min-Max	0 to 1	No	Image data
Standard	Mean 0, Std 1	No	Most ML models
Robust	No fixed	Yes	Outlier-heavy data
MaxAbs	-1 to 1	No	Sparse data

Comparison of different scaling methods based on their properties and use cases.

Encoding

Most scikit-learn estimators expect numeric input and do not accept categorical (string) features directly.
Such features have to be encoded as numbers before training a model.

Common Encoding Techniques and Their Uses

Name	Best For
Ordinal Encoding	Ordinal categorical features with clear order (e.g., low, medium, high)
Label Encoding	Target labels, or nominal features for models that do not assume an order (e.g., tree-based models)
One-Hot Encoding	Nominal categorical features without inherent order
Binary Encoding	High-cardinality nominal categorical features
Frequency Encoding	Categorical features with varying frequencies
Target Mean Encoding	Categorical features in supervised problems, where the target is available at training time

Common encoding techniques and their best use cases for handling categorical features in machine learning.

Ordinal Encoding

Assigns integer values to categories based on their order.
Not suitable for nominal features as it implies a relationship between categories.

Ordinal Encoding Example:

size	brand	warranty	color	price	size_encoded
small	adidas	1	red	100	1
large	nike	2	blue	150	3
medium	nike	2	red	200	2
medium	adidas	1.5	blue	250	2
small	adidas	1	green	300	1

Ordinal encoding for size feature.
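
One way to sketch this is with scikit-learn's `OrdinalEncoder`, passing the category order explicitly. The encoder produces zero-based codes (small=0, medium=1, large=2), so the example adds 1 purely to match the 1-based values in the table; that offset is an illustrative choice, not something the encoder requires.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# The size column from the example table, as a single-feature 2D array
sizes = np.array(
    [["small"], ["large"], ["medium"], ["medium"], ["small"]],
    dtype=object,
)

# The order must be supplied explicitly; otherwise categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = encoder.fit_transform(sizes).ravel().astype(int)

# Shift to 1-based codes to match the table (small=1, medium=2, large=3)
size_encoded = codes + 1
print(size_encoded.tolist())
```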

Label Encoding

Assigns a unique integer to each category.
Assigns the integers in alphabetical (sorted) order, which can imply a misleading ordinal relationship between categories.

Label Encoding Example:

size	brand	warranty	color	price	size_encoded
small	adidas	1	red	100	3
large	nike	2	blue	150	1
medium	nike	2	red	200	2
medium	adidas	1.5	blue	250	2
small	adidas	1	green	300	3

Label encoding for size feature.
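
The alphabetical assignment can be seen with scikit-learn's `LabelEncoder`, sketched below on the size column. Its codes are zero-based (large=0, medium=1, small=2), so they are one less than the 1-based values shown in the table. scikit-learn intends `LabelEncoder` for target labels rather than input features; it is used here only to illustrate the numbering.

```python
from sklearn.preprocessing import LabelEncoder

# The size column from the example table
sizes = ["small", "large", "medium", "medium", "small"]

encoder = LabelEncoder()
size_encoded = encoder.fit_transform(sizes)

# classes_ holds the categories in sorted order; the index is the assigned code
print(list(encoder.classes_))
print(size_encoded.tolist())
```

The resulting order (large < medium < small) has nothing to do with actual size, which is the misleading relationship described above.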

One-Hot Encoding

Creates binary columns for each category.
Prevents misinterpretation of relationships between categories.
Can lead to high dimensionality with many categories.

One-Hot Encoding Example:

size	brand	warranty	color	price	brand_adidas	brand_nike
small	adidas	1	red	100	1	0
large	nike	2	blue	150	0	1
medium	nike	2	red	200	0	1
medium	adidas	1.5	blue	250	1	0
small	adidas	1	green	300	1	0

One-hot encoding for brand feature.
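
A short sketch of one-hot encoding the brand column with `pandas.get_dummies`. The small DataFrame below reproduces only the brand and price columns of the example; scikit-learn's `OneHotEncoder` is an alternative when the encoding has to be reused inside a pipeline.

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["adidas", "nike", "nike", "adidas", "adidas"],
    "price": [100, 150, 200, 250, 300],
})

# One new 0/1 column per brand; the original brand column is dropped
encoded = pd.get_dummies(df, columns=["brand"], dtype=int)
print(encoded)
```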

Binary Encoding

Converts categories to binary digits and creates new columns for each digit.
Reduces dimensionality compared to one-hot encoding for high-cardinality features.
Can be less interpretable than one-hot encoding.

Binary Encoding Example:

size	brand	warranty	color	price	color_1	color_2
small	adidas	1	red	100	0	1
large	nike	2	blue	150	1	0
medium	nike	2	red	200	0	1
medium	adidas	1.5	blue	250	1	0
small	adidas	1	green	300	1	1

Binary encoding for color feature.
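
Binary encoding is not part of scikit-learn, so the sketch below implements it by hand in plain Python. It assumes integer codes are assigned in order of first appearance starting at 1 (red=1, blue=2, green=3), which reproduces the table; the helper `to_bits` is a hypothetical name introduced for this example.

```python
# The color column from the example table
colors = ["red", "blue", "red", "blue", "green"]

# Assumption: codes follow order of first appearance, starting at 1
codes = {color: i for i, color in enumerate(dict.fromkeys(colors), start=1)}

# Number of binary digits needed for the largest code
n_bits = max(codes.values()).bit_length()

def to_bits(code, width):
    """Return the binary digits of `code`, most significant bit first."""
    return [(code >> shift) & 1 for shift in reversed(range(width))]

encoded = [to_bits(codes[color], n_bits) for color in colors]

for color, bits in zip(colors, encoded):
    print(color, bits)
```

Three categories need only 2 columns here, and in general about log2(n) columns are enough for n categories, compared with n columns for one-hot encoding.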

Frequency Encoding

Replaces categories with their frequency in the dataset.
Can capture the importance of categories based on their occurrence.
May not be suitable for all types of categorical features.

Frequency Encoding Example:

size	brand	warranty	color	price	color_freq
small	adidas	1	red	100	2
large	nike	2	blue	150	2
medium	nike	2	red	200	2
medium	adidas	1.5	blue	250	2
small	adidas	1	green	300	1

Frequency encoding for color feature.
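
Frequency encoding can be sketched in two lines of pandas: count each category, then map the counts back onto the column. The DataFrame reproduces only the color column of the example.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "blue", "green"]})

# Count how often each color occurs, then replace each color by its count
counts = df["color"].value_counts()
df["color_freq"] = df["color"].map(counts)

print(df)
```

Note that red and blue receive the same value, so the model can no longer tell them apart. This is one reason the technique does not suit every feature.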

Target Mean Encoding

Replaces categories with the mean of the target variable for that category.
Can capture the relationship between a categorical feature and the target variable.

Target Mean Encoding Example:

size	brand	warranty	color	price	color_target_mean
small	adidas	1	red	100	150
large	nike	2	blue	150	200
medium	nike	2	red	200	150
medium	adidas	1.5	blue	250	200
small	adidas	1	green	300	300

Target mean encoding for color feature.
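
A minimal pandas sketch of target mean encoding, treating price as the target as in the table. The DataFrame reproduces only the color and price columns.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "blue", "green"],
    "price": [100, 150, 200, 250, 300],
})

# Replace each color with the mean price of all rows sharing that color
df["color_target_mean"] = df.groupby("color")["price"].transform("mean")

print(df)
```

Because the encoding is computed from the target, the means should be learned from the training split only and then mapped onto validation and test data; otherwise the target leaks into the features. The green row shows the risk: with a single sample, the encoded value is simply that row's own price.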