Model Evaluation Techniques
Evaluating an ML Model
- It is critical to evaluate how a model performs on unseen data.
- Metrics should be chosen based on the problem type and business goals.
Evaluation Steps:
- Split data into training and testing sets.
- Train the model on the training set.
- Evaluate the model on the testing set.
- Select appropriate metrics based on problem type and business goals.
- Compare results across different models.
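The evaluation steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a prescribed workflow: the `train_test_split` helper, the `ConstantBaseline` class, and the toy target values are hypothetical stand-ins invented for this example. A real project would typically use a library splitter and real models.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


class ConstantBaseline:
    """Hypothetical stand-in model: always predicts one summary value."""

    def __init__(self, summary):
        self.summary = summary  # e.g. statistics.mean or statistics.median

    def fit(self, targets):
        self.value_ = self.summary(targets)
        return self

    def predict(self, n_rows):
        return [self.value_] * n_rows


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy target values, invented for illustration only.
targets = [10.0, 12.0, 9.0, 11.0, 13.0, 8.0, 10.5, 11.5]

# Split the data into training and testing sets.
train, test = train_test_split(targets, test_fraction=0.25, seed=42)

# Train each candidate on the training set, then score it on the testing set.
candidates = {
    "mean": ConstantBaseline(statistics.mean),
    "median": ConstantBaseline(statistics.median),
}
results = {}
for name, model in candidates.items():
    model.fit(train)
    results[name] = mean_absolute_error(test, model.predict(len(test)))

# Compare results across models using the chosen metric (lower MAE is better).
best = min(results, key=results.get)
print(results, "best:", best)
```

The test set is never passed to `fit`, which is what lets the score stand in for performance on unseen data.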
Regression Metrics
Common Regression Metrics:
| Metric | Description | Formula |
|---|---|---|
| MAE: Mean Absolute Error | Average of absolute errors. | Σ\|ya - yp\| / n |
| MSE: Mean Squared Error | Average of squared errors. | Σ(ya - yp)² / n |
| RMSE: Root Mean Squared Error | Square root of MSE. | √MSE |
| R² (R-squared) | Proportion of the variation in the dependent variable that is explained by the independent variables. | 1 - (Σ(ya - yp)² / Σ(ya - y_mean)²) |
Common regression metrics used to evaluate model performance (ya = actual value, yp = predicted value, y_mean = mean of actual values, n = number of samples).
Calculating Regression Metrics
- MAE = (100 + 200 + 50 + 100)/4 = 112.5
- Median Absolute Error = 100
- MSE = (10000 + 40000 + 2500 + 10000)/4 = 15625
Calculating Regression Metrics Example:
| Years | Company | Position | Salary | Predicted | \|err\| | err² |
|---|---|---|---|---|---|---|
| 5 | | Developer | 1000 | 1100 | 100 | 10000 |
| 8 | Microsoft | Data Engineer | 1500 | 1300 | 200 | 40000 |
| 1 | Microsoft | Data Engineer | 1100 | 1150 | 50 | 2500 |
| 2 | | Developer | 800 | 900 | 100 | 10000 |
Dataset for regression task: Predicting Salary based on features
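The hand calculations for this dataset can be reproduced with a short script. This is a minimal sketch using only the Python standard library. The variable names are chosen for illustration, and only the Salary and Predicted columns are used. RMSE and R² are not worked out by hand in the notes, so they are computed here from the formulas in the metrics table.

```python
import math
import statistics

# Salary (actual) and Predicted columns from the dataset.
actual = [1000, 1500, 1100, 800]
predicted = [1100, 1300, 1150, 900]

abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

mae = sum(abs_errors) / len(actual)        # mean absolute error
median_ae = statistics.median(abs_errors)  # median absolute error
mse = sum(sq_errors) / len(actual)         # mean squared error
rmse = math.sqrt(mse)                      # root mean squared error

# R² = 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / len(actual)
ss_res = sum(sq_errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae}, MedianAE={median_ae}, MSE={mse}, RMSE={rmse}, R2={r2:.4f}")
```

MAE and RMSE are in the same units as the salary, which makes them easier to interpret than MSE. RMSE penalizes the single 200-unit miss more heavily than MAE does.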
Confusion Matrix
- Summarizes classification performance with counts of:
  - TP (true positive): model correctly predicts positive.
  - TN (true negative): model correctly predicts negative.
  - FP (false positive): model predicts positive but the actual class was negative (Type I error).
  - FN (false negative): model predicts negative but the actual class was positive (Type II error).
- From these counts we can calculate accuracy, precision, recall, and F1-score.
Confusion Matrix Example:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP: 50 | FN: 10 |
| Actual Negative | FP: 5 | TN: 100 |
Confusion matrix for a binary classification problem.
Classification Metrics
Common Classification Metrics:
| Metric | Description | Formula |
|---|---|---|
| Accuracy | Proportion of correct predictions. | (TP + TN) / (TP + TN + FP + FN) |
| Precision | Out of predicted positives, how many were correct? | TP / (TP + FP) |
| Recall (Sensitivity) | Out of all actual positives, how many did the model catch? | TP / (TP + FN) |
| F1-Score | Harmonic mean of precision and recall. | 2 * Precision * Recall / (Precision + Recall) |
Common classification metrics used to evaluate model performance.
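The formulas in the table translate directly into code. The sketch below applies them to the counts from the earlier confusion matrix example (TP = 50, FN = 10, FP = 5, TN = 100). The function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something stated in the notes.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(tp=50, fn=10, fp=5, tn=100)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts, precision (50/55) is higher than recall (50/60) because the model produces fewer false positives than false negatives.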
Calculating Classification Metrics
Calculating Classification Metrics Example:
| Temperature | Wind | Humidity | rain_actual | rain_predicted | predict_type |
|---|---|---|---|---|---|
| Hot | Strong | High | No | No | TN |
| Mild | Weak | High | Yes | Yes | TP |
| Cool | Weak | Normal | Yes | No | FN |
| Mild | Strong | High | No | Yes | FP |
| Cool | Strong | Normal | Yes | Yes | TP |
| Hot | Weak | High | No | No | TN |
Dataset for classification task: Predicting Rain based on weather features
Confusion Matrix for Classification Example:
| | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP: 2 | FN: 1 |
| Actual No | FP: 1 | TN: 2 |
Confusion matrix derived from the classification dataset.
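As a rough illustration of how these counts are derived, the sketch below labels each row of the rain dataset as TP, TN, FP or FN and tallies the results. It assumes "Yes" is the positive class. The helper name `predict_type` simply mirrors the dataset's column name and is not part of any library.

```python
from collections import Counter

# rain_actual and rain_predicted columns from the dataset, in row order.
rain_actual = ["No", "Yes", "Yes", "No", "Yes", "No"]
rain_predicted = ["No", "Yes", "No", "Yes", "Yes", "No"]


def predict_type(actual, predicted, positive="Yes"):
    """Label one prediction as TP, FP, FN or TN (assumes 'Yes' is positive)."""
    if predicted == positive:
        return "TP" if actual == positive else "FP"
    return "FN" if actual == positive else "TN"


labels = [predict_type(a, p) for a, p in zip(rain_actual, rain_predicted)]
counts = Counter(labels)

accuracy = (counts["TP"] + counts["TN"]) / len(labels)
print(labels)
print(dict(counts), f"accuracy={accuracy:.2f}")
```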
Calculated Classification Metrics:
| Metric | Value |
|---|---|
| Accuracy | (2 + 2) / (2 + 2 + 1 + 1) = 0.67 |
| Precision | 2 / (2 + 1) = 0.67 |
| Recall (Sensitivity) | 2 / (2 + 1) = 0.67 |
| F1-Score | 2 * 0.67 * 0.67 / (0.67 + 0.67) = 0.67 |
Calculated classification metrics based on the confusion matrix.
How to Select Evaluation Metric
- Accuracy can be misleading for imbalanced datasets.
- Example: On a blood-test dataset with 99% healthy and 1% sick patients, a model that predicts "healthy" for everyone reaches 99% accuracy yet is useless, because it never detects a sick patient.
- Choice of metric should align with business goals and problem context.
- The same task can have different optimal metrics depending on domain context.
Example on Metric Selection:
- Medical Diagnostics: FN critical, i.e. don't miss sick patients. [Recall]
- Spam Detection: FP critical, i.e. don't block genuine emails. [Precision]
- Fraud Detection: Both FP and FN are costly. [F1-Score]
- General Classification: Balanced importance. [Accuracy]
- Terrorist Detection: FN critical, i.e. don't miss potential threats. [Recall]
