Linear Regression

Linear Regression Introduction

Predicts continuous target variable based on linear relationship with features.
Model is represented as y = β0 + β1x1 + β2x2 + ... + βnxn.
Coefficients (β) are estimated using least squares method to minimize the sum of squared residuals.

For simple linear regression with one feature (x) and one target variable (y). The model is y = β0 + β1x.
Assume we have n data points (x1, y1), (x2, y2), ..., (xn, yn).
The goal is to find β0 and β1 that minimize the sum of squared residuals.
For x1, the predicted value by model is y_pred = β0 + β1x1.
The residual for x1 is e1 = y1 - y_pred = y1 - (β0 + β1x1) = y1 - β0 - β1x1.
The sum of squared residuals is RSS = Σ(ei²) = Σ(yi - β0 - β1xi)².
To minimize RSS, we take partial derivatives w.r.t β0 and β1 and set them to zero.

Partial Derivatives w.r.t. β0:

∂RSS/∂β0 = 2*Σ(yi - β0 - β1xi) * (-1) = 0 or Σ(yi - β0 - β1xi) = 0 or Σyi - n*β0 - β1Σxi = 0 or nβ0 = (Σyi - β1Σxi) or β0 = (Σyi - β1Σxi) / n => β0 = Ȳ - β1X̄

Partial Derivatives w.r.t. β1:

∂RSS/∂β1 = 2*Σ(yi - β0 - β1xi) * (-xi) = 0 or Σ(yi - β0 - β1xi) * xi = 0 or Σ(xi*yi) - β0Σxi - β1Σ(xi)² = 0 => β1Σ(xi)² = Σ(xi*yi) - β0Σxi

On solving these two equations, we get: β1 = COV(X, Y) / VAR(X) β0 = Ȳ - β1X̄

Linear Regression Hyperparameters and their Effects:

Hyperparameter	Description	Effect on Model
fit_intercept	Whether to calculate the intercept for this model.	`True` => model includes intercept (β0), `False` => model does not include intercept.
positive	When set to True, forces the coefficients to be positive.	`True` => coefficients constrained to be positive (better interpretability), `False` => coefficients can take any value (allows for more flexibility).

Linear Regression has no hyperparameters to tune, but regularization techniques can be applied to prevent overfitting.

Regularization Techniques for Linear Regression:

Technique	Description	Effect on Model
Lasso Regression (L1)	Adds a penalty equal to the absolute value of the magnitude of coefficients.	Encourages sparsity, can perform feature selection by shrinking some coefficients to zero.
Ridge Regression (L2)	Adds a penalty equal to the square of the magnitude of coefficients.	Encourages smaller coefficients, reduces model complexity, can handle multicollinearity.
Elastic Net	Combines L1 and L2 penalties.	Balances between Ridge and Lasso, can handle correlated features and perform feature selection.

Common regularization techniques for linear regression include Ridge (L2), Lasso (L1), and Elastic Net.