Model Tuning & Pipeline

Cross-Validation

  • Model performance depends on how data is split into training and testing.
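  • Cross-validation (CV) is a technique for evaluating a model on unseen data.
  • K-Fold: Split the data into K subsets (folds), train on K-1 folds, and test on the remaining one. Repeat K times so that each fold is used for testing once, then average the scores.
  • More folds mean more model fits, so a large K can be computationally expensive.
  • The scoring metric for CV can be chosen based on the problem type.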
  • Example:

      from sklearn.model_selection import cross_val_score

      # model, X, and y are assumed to be defined earlier
      cv_result = cross_val_score(model, X, y, cv=5, scoring="recall")
      print("Average CV Recall:", cv_result.mean())
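The K-Fold procedure described above can also be written out by hand, which makes the "train on K-1, test on 1" loop explicit. This is a minimal sketch, not the notes' own code: the synthetic dataset and the choice of LogisticRegression are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_recalls = []

for train_idx, test_idx in kfold.split(X):
    # Fit a fresh model on the K-1 training folds.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])

    # Score it on the single held-out fold.
    y_pred = model.predict(X[test_idx])
    fold_recalls.append(recall_score(y[test_idx], y_pred))

print("Recall per fold:", np.round(fold_recalls, 3))
print("Average CV Recall:", np.mean(fold_recalls))
```

For classifiers, cross_val_score with an integer cv uses stratified folds, so its scores may differ slightly from this plain KFold loop.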

Tuning Hyperparameters

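  • Hyperparameters are settings chosen before training that control how the model is trained and how complex it can become; they are not learned from the data.
  • Examples: k in KNN; C and kernel in SVM; fit_intercept in Linear Regression.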
  • Tuning hyperparameters can significantly improve model performance.
  • The model is trained on different hyperparameter combinations and evaluated on a validation set (or with cross-validation) to find the best configuration.
  • Approaches: GridSearchCV and RandomizedSearchCV.
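As an illustration of the first approach, the sketch below runs GridSearchCV over a small KNN grid. The iris dataset, the grid values, and the accuracy metric are assumptions picked for the example, not choices made in these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used only so the sketch is self-contained.
X, y = load_iris(return_X_y=True)

# 3 values x 2 values = 6 combinations, each evaluated with 5-fold CV.
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid_search.fit(X, y)

print("Best Grid Search Params:", grid_search.best_params_)
print("Best Grid Search Score:", grid_search.best_score_)
print("Combinations tried:", len(grid_search.cv_results_["params"]))
```

With the default refit=True, grid_search.best_estimator_ is the best configuration retrained on all of X and y.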

Grid Search vs Randomized Search

    Grid Search
    1. Systematically tests all combinations of hyperparameters.
    2. Guarantees finding the best configuration among the values listed in the grid, as measured by the CV score.
    3. Can be computationally expensive.
    Randomized Search
    1. Samples hyperparameter combinations at random.
    2. Faster than Grid Search for large parameter spaces.
    3. May not find the absolute best configuration, since only a sample of the combinations is tried.
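    Example

        from sklearn.model_selection import RandomizedSearchCV

        # model is assumed to be a KNeighborsClassifier defined earlier,
        # since the parameters below are KNN settings
        param_dist = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}

        # Only 4 combinations exist here; the default n_iter=10 would emit
        # a warning and simply try all 4, so sample 3 of them instead.
        random_search = RandomizedSearchCV(model, param_dist, n_iter=3, cv=5, random_state=42)
        random_search.fit(X, y)

        print("Best Randomized Search Params:", random_search.best_params_)
        print("Best Randomized Search Score:", random_search.best_score_)
        best_model = random_search.best_estimator_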
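To make the trade-off between the two approaches concrete, the sketch below runs both searches over the same KNN parameter space and compares how many combinations each one evaluates. The iris dataset, the range of n_neighbors, and n_iter=8 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 20 values x 2 values = 40 combinations in total.
param_space = {"n_neighbors": list(range(1, 21)), "weights": ["uniform", "distance"]}

# Grid search evaluates every combination.
grid = GridSearchCV(KNeighborsClassifier(), param_space, cv=5)
grid.fit(X, y)

# Randomized search evaluates only n_iter sampled combinations.
rand = RandomizedSearchCV(
    KNeighborsClassifier(), param_space, n_iter=8, cv=5, random_state=0
)
rand.fit(X, y)

n_grid = len(grid.cv_results_["params"])
n_rand = len(rand.cv_results_["params"])

print("Grid Search:", n_grid, "combinations, best score", grid.best_score_)
print("Randomized Search:", n_rand, "combinations, best score", rand.best_score_)
```

Because both searches draw from the same space and use the same folds here, the randomized score can match the grid score but cannot exceed it; the saving is in the number of model fits.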

Pipeline

  • Pipeline allows us to chain multiple steps (e.g. transformation, modeling) together.
  • Ensures that all steps are applied consistently during training and testing.
  • ColumnTransformer can be used to apply different transformations to different features.
  • Example:

      from sklearn.compose import ColumnTransformer
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import OneHotEncoder, StandardScaler
      from sklearn.svm import SVC

      numeric_features = ["age", "income"]
      categorical_features = ["gender", "occupation"]

      numeric_transformer = StandardScaler()
      # handle_unknown="ignore" is a OneHotEncoder option; OrdinalEncoder
      # accepts only "error" or "use_encoded_value".
      categorical_transformer = OneHotEncoder(handle_unknown="ignore")

      preprocessor = ColumnTransformer(
          transformers=[
              ("num", numeric_transformer, numeric_features),
              ("cat", categorical_transformer, categorical_features),
          ]
      )

      pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", SVC())])

      # X_train and X_test are assumed to be DataFrames with the columns above
      pipeline.fit(X_train, y_train)
      predictions = pipeline.predict(X_test)
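Because a pipeline behaves like a single estimator, it can be passed directly to cross_val_score or GridSearchCV, so the preprocessing is re-fitted on the training folds only. The sketch below assumes a small synthetic DataFrame, a made-up target, and OneHotEncoder for the categorical columns; none of these come from the notes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic data with the column names used above (illustrative only).
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "occupation": rng.choice(["engineer", "teacher", "nurse"], size=n),
})
y = (X["income"] > 50_000).astype(int)  # hypothetical target

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "occupation"]),
])
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", SVC())])

# The scaler and encoder are fitted inside each training fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print("CV accuracy per fold:", scores.round(3))

# Hyperparameters of a step are addressed as <step name>__<parameter>.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print("Best C:", search.best_params_["model__C"])
```

Fitting the transformers inside each fold keeps statistics from the held-out fold, such as its mean and variance, out of training.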

Saving ML Model

  • After tuning hyperparameters and finding the best model, we need to save it for future use.
  • joblib or pickle can be used to save the trained model to disk.
  • The fitted transformers (e.g. StandardScaler, OneHotEncoder) must be saved along with the model so that new data is preprocessed the same way as the training data.
  • We can also create a pipeline that includes both preprocessing and modeling steps, and save the entire pipeline.
  • Example:

      import joblib

      joblib.dump(pipeline, "knn_pipeline.pkl")

      # To load the model later
      loaded_pipeline = joblib.load("knn_pipeline.pkl")
      predictions = loaded_pipeline.predict(X_test)
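A quick way to check a saved pipeline is to reload it and confirm that it reproduces the original predictions. The sketch below assumes a KNN pipeline trained on the iris dataset and writes to a temporary directory so that it leaves no files behind; both choices are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing and model live in one object, so one file stores both.
pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", KNeighborsClassifier())])
pipeline.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_pipeline.pkl")
    joblib.dump(pipeline, path)
    loaded_pipeline = joblib.load(path)

# The reloaded pipeline should give exactly the same predictions.
same_predictions = np.array_equal(
    pipeline.predict(X_test), loaded_pipeline.predict(X_test)
)
print("Loaded pipeline reproduces predictions:", same_predictions)
```

Files written by joblib or pickle should only be loaded from trusted sources, and preferably with the same scikit-learn version that was used to save them.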