Model Tuning & Pipeline

Cross-Validation

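Measured model performance depends on how the data is split into training and test sets.
Cross-validation (CV) is a technique for estimating how a model performs on unseen data.
K-Fold: split the data into K subsets (folds), train on K-1 of them, and test on the remaining one. This is repeated K times so that each fold serves as the test set once.
Using more folds is more computationally expensive, because the model is trained once per fold.
The scoring metric for CV can be chosen based on the problem type.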
Example:

    from sklearn.model_selection import cross_val_score

    # model, X, and y are assumed to be defined already
    cv_result = cross_val_score(model, X, y, cv=5, scoring="recall")
    print("Average CV Recall:", cv_result.mean())
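
To make the K-Fold procedure concrete, the following is a minimal sketch of the loop written out by hand. The synthetic dataset and the choice of LogisticRegression are assumptions made only for illustration; they are not part of the original notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

# Synthetic binary classification data (assumption: stands in for X, y)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_recalls = []
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)  # fresh model for every fold
    model.fit(X[train_idx], y[train_idx])      # train on K-1 folds
    y_pred = model.predict(X[test_idx])        # test on the held-out fold
    fold_recalls.append(recall_score(y[test_idx], y_pred))

print("Recall per fold:", np.round(fold_recalls, 3))
print("Average CV Recall:", np.mean(fold_recalls))
```

Note that for classifiers, cross_val_score with an integer cv uses stratified folds, so its scores will not necessarily match this plain KFold loop exactly.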

Hyperparameters are settings that control model training and complexity; they are set before training rather than learned from the data.
Examples: k in KNN; C and kernel in SVM; fit_intercept in Linear Regression.
Tuning hyperparameters can significantly improve model performance.
The model is trained on different hyperparameter combinations and evaluated on a validation set to find the best configuration.
Approaches: GridSearchCV and RandomizedSearchCV.

Grid Search
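
Grid search evaluates every combination in a parameter grid with cross-validation and keeps the best one. A minimal sketch with GridSearchCV follows; the synthetic data, the KNN estimator, and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption: stands in for X, y)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 3 values x 2 values = 6 combinations, each evaluated with 5-fold CV
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid_search.fit(X, y)

n_candidates = len(grid_search.cv_results_["params"])
print("Combinations tried:", n_candidates)
print("Best Grid Search Params:", grid_search.best_params_)
print("Best Grid Search Score:", grid_search.best_score_)
```

The cost grows with the product of the list lengths in the grid, which is why grid search is usually kept to a few values per hyperparameter.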

Randomized Search

Example

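    from sklearn.model_selection import RandomizedSearchCV

    # model, X, and y are assumed to be defined already (e.g. a KNN classifier)
    param_dist = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}

    # The space above has only 4 combinations, so n_iter is set to 4;
    # the default n_iter=10 would exceed it and trigger a warning.
    random_search = RandomizedSearchCV(model, param_dist, n_iter=4, cv=5, random_state=42)
    random_search.fit(X, y)

    print("Best Randomized Search Params:", random_search.best_params_)
    print("Best Randomized Search Score:", random_search.best_score_)
    best_model = random_search.best_estimator_  # refit on all of X, y with the best parameters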
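
Randomized search is most useful when the search space is too large to enumerate, because it can sample a fixed number of candidates from distributions instead of lists. The sketch below assumes a KNN classifier, synthetic data, and a scipy.stats.randint distribution for n_neighbors; all of these are illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption: stands in for X, y)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_dist = {
    "n_neighbors": randint(1, 30),        # integers from 1 to 29, sampled uniformly
    "weights": ["uniform", "distance"],   # lists are sampled uniformly as well
}
random_search = RandomizedSearchCV(
    KNeighborsClassifier(), param_dist, n_iter=10, cv=5, random_state=42
)
random_search.fit(X, y)

n_sampled = len(random_search.cv_results_["params"])
print("Candidates sampled:", n_sampled)
print("Best Randomized Search Params:", random_search.best_params_)
```

Here n_iter fixes the budget at 10 candidates regardless of how large the space is, and random_state makes the sampled candidates reproducible.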

A Pipeline allows us to chain multiple steps (e.g. transformation, modeling) together.
It ensures that all steps are applied consistently during training and testing.
ColumnTransformer can be used to apply different transformations to different features.
Example:

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.svm import SVC

    numeric_features = ["age", "income"]
    categorical_features = ["gender", "occupation"]

    numeric_transformer = StandardScaler()
    # handle_unknown="ignore" is a OneHotEncoder option; OrdinalEncoder would need
    # handle_unknown="use_encoded_value" together with unknown_value instead.
    categorical_transformer = OneHotEncoder(handle_unknown="ignore")

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

    # X_train, y_train, and X_test are assumed to be defined already
    pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", SVC())])
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
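
Because a Pipeline is itself an estimator, it can be passed to a search object so that preprocessing is refit on the training folds only. The sketch below is one possible way to combine the two; the toy DataFrame, the made-up target, the use of OneHotEncoder, and the grid values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data reusing the column names from the notes
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "occupation": rng.choice(["engineer", "teacher", "nurse"], size=n),
})
y = (X["income"] > 50_000).astype(int)  # made-up binary target

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "occupation"]),
])
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"model__C": [0.1, 1, 10], "model__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("Best params:", search.best_params_)
predictions = search.best_estimator_.predict(X)
```

Tuning the whole pipeline this way keeps the scaler and encoder from seeing the validation fold during fitting, which is the consistency benefit described above.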

After tuning hyperparameters and finding the best model, we need to save it for future use.
joblib or pickle can be used to save the trained model to disk.
We need to save the fitted transformers (e.g. StandardScaler, OneHotEncoder) along with the model, so that new data is preprocessed exactly as the training data was.
We can also create a pipeline that includes both preprocessing and modeling steps, and save the entire pipeline.
Example:

    import joblib

    joblib.dump(pipeline, "knn_pipeline.pkl")

    # To load the model later
    loaded_pipeline = joblib.load("knn_pipeline.pkl")
    predictions = loaded_pipeline.predict(X_test)
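
The same round trip can be done with pickle from the standard library. The following sketch saves a small pipeline to a temporary directory and checks that the reloaded copy predicts identically; the synthetic data, the scaler plus KNN pipeline, and the file name are hypothetical choices for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small fitted pipeline (assumptions for illustration)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", KNeighborsClassifier())])
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)         # scaler and model are saved together
    with open(path, "rb") as f:
        loaded_pipeline = pickle.load(f)

same_predictions = np.array_equal(pipeline.predict(X), loaded_pipeline.predict(X))
print("Predictions match after reload:", same_predictions)
```

Pickle and joblib files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that was used to save them.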