Model Tuning & Pipeline
Cross-Validation
- Model performance depends on how data is split into training and testing.
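- Cross-validation (CV) is a technique to evaluate a model on unseen data.
- K-Fold: Split the data into K subsets (folds), train on K-1 of them, and test on the remaining one. Repeat K times so that each fold is used for testing once.
- Using more folds can be computationally expensive, because the model is retrained once per fold.
- The scoring metric for CV can be chosen based on the problem type.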
- Example:
    from sklearn.model_selection import cross_val_score

    # model, X and y are assumed to be defined already
    cv_result = cross_val_score(model, X, y, cv=5, scoring="recall")
    print("Average CV Recall:", cv_result.mean())
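To make the K-Fold procedure concrete, the sketch below runs the split/train/test loop by hand and compares it with cross_val_score on the same folds. The synthetic dataset and the LogisticRegression model are assumptions chosen only for illustration; they are not part of the original notes. Note that with cv=5 and a classifier, cross_val_score uses stratified folds by default, so the KFold object is passed explicitly here to keep both runs on identical splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (an assumption, stands in for X and y)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# Manual K-Fold: train on K-1 folds, test on the remaining fold, K times
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_scores.append(recall_score(y[test_idx], y_pred))

# The same folds handed to cross_val_score
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="recall")

print("Manual fold recalls:", np.round(fold_scores, 3))
print("cross_val_score recalls:", np.round(cv_scores, 3))
print("Average CV Recall:", np.mean(fold_scores))
```

Each entry of fold_scores comes from a model that never saw the corresponding test fold during training, which is what makes the average a fairer estimate than a single train/test split.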
Tuning Hyperparameters
- Hyperparameters are settings that control model training and complexity.
- Examples: k in KNN; C and kernel in SVM; fit_intercept in Linear Regression.
- Tuning hyperparameters can significantly improve model performance.
- The model is trained on different hyperparameter combinations and evaluated on a validation set to find the best configuration.
- Approaches: GridSearchCV and RandomizedSearchCV.
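The idea of training on several hyperparameter values and comparing them on a validation set can be sketched with a plain loop before reaching for the search classes. The synthetic data, the single hold-out split, and the candidate values of k are all assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and a single hold-out validation split (assumptions)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train one KNN model per candidate k and record its validation accuracy
val_scores = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    val_scores[k] = knn.score(X_val, y_val)

# Pick the k with the highest validation accuracy
best_k = max(val_scores, key=val_scores.get)
print("Validation accuracy per k:", val_scores)
print("Best k:", best_k)
```

GridSearchCV and RandomizedSearchCV automate this loop and replace the single validation split with cross-validation.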
Grid Search vs Randomized Search
Grid Search
- Systematically tests all combinations of hyperparameters.
- Guarantees finding the best configuration within the specified grid.
- Can be computationally expensive.
Randomized Search
- Samples hyperparameter combinations at random.
- Faster than Grid Search for large parameter spaces.
- May not find the absolute best configuration.
Example
    from sklearn.model_selection import RandomizedSearchCV

    # model is assumed to be a KNN estimator, e.g. KNeighborsClassifier()
    param_dist = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}

    # n_iter defaults to 10; this space only has 4 combinations,
    # so n_iter is set to 4 to avoid a warning
    random_search = RandomizedSearchCV(model, param_dist, n_iter=4, cv=5)
    random_search.fit(X, y)

    print("Best Randomized Search Params:", random_search.best_params_)
    print("Best Randomized Search Score:", random_search.best_score_)
    best_model = random_search.best_estimator_
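For comparison with the Randomized Search example, here is a minimal Grid Search sketch over the same parameter space. The synthetic dataset and the choice of KNeighborsClassifier are assumptions; the notes do not say which data or model object they use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption, stands in for X and y)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Grid search evaluates every combination: 2 x 2 = 4 candidates
param_grid = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X, y)

print("Best Grid Search Params:", grid_search.best_params_)
print("Best Grid Search Score:", grid_search.best_score_)
print("Combinations tried:", len(grid_search.cv_results_["params"]))
best_model = grid_search.best_estimator_
```

With 4 candidates and cv=5 this fits 20 models plus one final refit; the cost grows multiplicatively with every parameter added to the grid, which is where Randomized Search becomes attractive.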
Pipeline
- A Pipeline allows us to chain multiple steps (e.g. transformation, modeling) together.
- It ensures that all steps are applied consistently during training and testing.
- ColumnTransformer can be used to apply different transformations to different features.
- Example:
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OrdinalEncoder, StandardScaler
    from sklearn.svm import SVC

    numeric_features = ["age", "income"]
    categorical_features = ["gender", "occupation"]

    numeric_transformer = StandardScaler()
    # OrdinalEncoder does not accept handle_unknown="ignore";
    # unseen categories are mapped to -1 instead
    categorical_transformer = OrdinalEncoder(
        handle_unknown="use_encoded_value", unknown_value=-1
    )

    # Apply a different transformer to each group of columns
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

    pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", SVC())])
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
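A pipeline can also be passed to a search object, so that preprocessing is refit inside every CV fold and tuned together with the model. The sketch below assumes synthetic numeric data and a simple scaler-plus-SVC pipeline; the step names "scaler" and "model" are illustrative. Hyperparameters of a step are addressed as step name, two underscores, then the parameter name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic numeric data (an assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", SVC())])

# "model__C" targets the C parameter of the step named "model"
param_grid = {"model__C": [0.1, 1, 10], "model__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, its mean and variance are learned only from the training portion of each fold, which avoids leaking test-fold statistics into training.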
Saving ML Model
- After tuning hyperparameters and finding the best model, we need to save it for future use.
- joblib or pickle can be used to save the trained model to disk.
- The fitted transformers (e.g. StandardScaler, OneHotEncoder) must be saved along with the model.
- We can also create a pipeline that includes both preprocessing and modeling steps, and save the entire pipeline.
- Example:
    import joblib

    joblib.dump(pipeline, "knn_pipeline.pkl")

    # To load the model later
    loaded_pipeline = joblib.load("knn_pipeline.pkl")
    predictions = loaded_pipeline.predict(X_test)
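A quick way to check that a saved pipeline carries its fitted transformers with it is a save/load round trip. This sketch assumes synthetic data and a scaler-plus-KNN pipeline, and writes to a temporary directory so it leaves no files behind; none of these choices come from the original notes.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small pipeline (assumptions)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipeline = Pipeline(
    steps=[("scaler", StandardScaler()), ("model", KNeighborsClassifier())]
)
pipeline.fit(X, y)

# Save the whole fitted pipeline, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_pipeline.pkl")
    joblib.dump(pipeline, path)
    loaded_pipeline = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.array_equal(
    pipeline.predict(X), loaded_pipeline.predict(X)
)
print("Predictions match after reload:", same_predictions)
```

Files written by joblib or pickle should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.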
