Regression Analysis: Ridge vs Lasso on the Ames Housing Dataset

Linear regression is one of the most misused tools in data science — not because it’s wrong, but because people skip the steps that make it trustworthy. This post walks through building a proper regression model for house price prediction: preprocessing, feature engineering, regularisation, and residual validation.

The dataset is the Ames, Iowa housing data: 2,919 sales with 80 features. It is a richer alternative to the classic Boston Housing dataset.

Four Assumptions That Actually Matter

Before fitting any model, it’s worth being explicit about what we’re assuming:

No correlation between the errors and the independent variables: the residuals shouldn't correlate with the predictors
Normally distributed errors: residuals should follow a bell curve
Independent error terms: no autocorrelation (checked with the Durbin-Watson test)
Constant error variance: homoscedasticity, not fan-shaped residuals

Most tutorials skip these checks. They matter when you’re deploying a model in a regulated environment.

Preprocessing

The 80-feature dataset needed significant cleaning before modelling:

Uninformative feature removal: columns with >70% missing values or near-zero variance were dropped
Missing value imputation: median for numerical, mode for categorical
Outlier removal: using percentile thresholds (1st and 99th)
Multicollinearity check: feature correlation analysis, removing one feature from each highly correlated pair
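
The four cleaning steps above could be sketched in pandas roughly as follows. This is an illustration, not the notebook's actual code: the `preprocess` helper, the 0.99 dominance cutoff used to define "near-zero variance", the 0.90 correlation cutoff, and the choice to apply the percentile filter to `SalePrice` are all assumptions. Only the 70% missing-value threshold and the 1st/99th percentile levels come from the text.

```python
import numpy as np
import pandas as pd


def preprocess(df, target="SalePrice", missing_thresh=0.70,
               dominance_thresh=0.99, corr_thresh=0.90):
    """Hypothetical cleaning pass; cutoffs other than 70% and 1st/99th are assumed."""
    df = df.copy()

    # 1. Drop columns with more than 70% missing values.
    df = df.loc[:, df.isna().mean() <= missing_thresh]

    # Near-zero variance (assumed definition): one value covers >= 99% of rows.
    top_share = df.apply(
        lambda s: s.value_counts(normalize=True, dropna=False).iloc[0]
    )
    df = df.loc[:, top_share < dominance_thresh].copy()

    num_cols = df.select_dtypes(include="number").columns.drop(target, errors="ignore")
    cat_cols = df.select_dtypes(exclude="number").columns

    # 2. Impute: median for numerical columns, mode for categorical ones.
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 3. Outlier removal: keep rows whose target lies between the 1st and
    #    99th percentiles (applying this to the target is an assumption).
    low, high = df[target].quantile([0.01, 0.99])
    df = df[df[target].between(low, high)]

    # 4. Multicollinearity: for each highly correlated numeric pair,
    #    drop the feature that appears later in the column order.
    corr = df[num_cols].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > corr_thresh).any()]
    return df.drop(columns=redundant)
```

Calling `preprocess(raw_df)` on the raw frame returns a cleaned copy. In a real project the imputation values and percentile bounds should be learned on the training split only, to avoid leaking information from the test set.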

Feature Engineering

Raw features rarely tell the full story. A few derived variables that improved signal:

Garage condition status — binary flag from the garage quality field
Property age — years since build and years since last remodel
Total living area — combined above-ground and basement finished area

After deriving new features, the originals are dropped to reduce multicollinearity. Keeping both the derived and original features would introduce redundant information that inflates the variance of the coefficient estimates.
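
A sketch of how these three derived variables might be built with pandas. The column names (`YrSold`, `YearBuilt`, `YearRemodAdd`, `GrLivArea`, `BsmtFinSF1`, `BsmtFinSF2`, `GarageQual`) follow the public Ames data dictionary, but which columns the original notebook used is an assumption. So are the new feature names, measuring age relative to the sale year, and treating `TA`, `Gd` and `Ex` as the acceptable garage grades.

```python
import pandas as pd


def engineer_features(df):
    """Hypothetical helper: derive the three features, then drop their sources."""
    out = df.copy()

    # Property age: years since build and since last remodel, at time of sale.
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["RemodAge"] = out["YrSold"] - out["YearRemodAdd"]

    # Total living area: above-ground plus finished basement area.
    out["TotalLivArea"] = out["GrLivArea"] + out["BsmtFinSF1"] + out["BsmtFinSF2"]

    # Garage condition status: 1 if garage quality is typical or better (assumed rule).
    out["GarageOK"] = out["GarageQual"].isin(["TA", "Gd", "Ex"]).astype(int)

    # Drop the originals so the derived columns do not duplicate their information.
    sources = ["YrSold", "YearBuilt", "YearRemodAdd", "GrLivArea",
               "BsmtFinSF1", "BsmtFinSF2", "GarageQual"]
    return out.drop(columns=sources)


# Tiny made-up sample, only to show the shape of the output.
sample = pd.DataFrame({
    "YrSold": [2008, 2010], "YearBuilt": [1998, 1960], "YearRemodAdd": [2003, 1960],
    "GrLivArea": [1500, 900], "BsmtFinSF1": [400, 0], "BsmtFinSF2": [0, 100],
    "GarageQual": ["TA", "Po"], "SalePrice": [200000, 95000],
})
features = engineer_features(sample)
print(features)
```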

Ridge vs Lasso

Both are regularised regression — they add a penalty term to the loss function to prevent overfitting. The difference is in how they penalise:

Ridge (L2): Penalises the sum of the squared coefficients. All coefficients shrink toward zero but in general none reach it exactly. Every feature stays in the model.

Lasso (L1): Penalises the sum of the absolute values of the coefficients. Coefficients can shrink to exactly zero, so Lasso performs automatic feature selection.
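
Written out, the two objectives differ only in the penalty term. The notation below is the generic textbook form, with \alpha as the regularisation strength and p the number of features. Libraries scale the squared-error term differently, so a given \alpha is not directly comparable across implementations.

```latex
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2

\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The gradient of the squared penalty vanishes as a coefficient approaches zero, so Ridge shrinks smoothly without eliminating features. The absolute-value penalty keeps a constant pull of size \alpha all the way down to zero, which is what lets Lasso set coefficients to exactly zero.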

Results

Grid search with cross-validation selected an optimal alpha of 0.0001 for both models.
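
A minimal sketch of such a search with scikit-learn's `GridSearchCV`. Synthetic data from `make_regression` stands in for the prepared Ames features, and the alpha grid, 5-fold split, RMSE scoring and standardisation step are assumptions, not the settings used in the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; replace with the cleaned feature matrix and target.
X, y = make_regression(n_samples=300, n_features=40, n_informative=8,
                       noise=10.0, random_state=0)

# Assumed grid: 13 log-spaced values from 0.0001 to 100.
alphas = np.logspace(-4, 2, 13)

best_alpha = {}
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=50_000))]:
    # Scaling inside the pipeline keeps each CV fold free of leakage.
    pipe = make_pipeline(StandardScaler(), model)
    grid = GridSearchCV(pipe, {f"{name}__alpha": alphas}, cv=5,
                        scoring="neg_root_mean_squared_error")
    grid.fit(X, y)
    best_alpha[name] = grid.best_params_[f"{name}__alpha"]

print(best_alpha)
```

If the selected alpha sits at the edge of the grid, it is usually worth extending the grid in that direction before trusting the result.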

Most influential feature: MSZoning_RL (Residential Low Density zoning) — the strongest predictor of sale price in this dataset.

Residual validation: Residual analysis showed no visible patterns, which suggests the models are not systematically over- or under-predicting in any region of the feature space.

Why Lasso Wins Here

Lasso performs slightly better than Ridge on this dataset — not because L1 regularisation is inherently superior, but because the dataset has genuine sparsity. With 80 features, many are low-signal. Lasso identifies and zeros them out; Ridge keeps them and pays a small variance cost.

The rule of thumb: if you believe most features contribute something, use Ridge. If you believe only a subset of features actually matter, use Lasso.

For house pricing, a handful of features (size, location, quality) dominate. Lasso is the right prior.
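
The sparsity argument can be illustrated on synthetic data where only a few features carry signal. This sketch assumes 80 features with 10 informative ones and an arbitrary `alpha=1.0` for both models; it is not a reproduction of the Ames results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic sparse problem: 80 features, only 10 of which affect the target.
X, y = make_regression(n_samples=300, n_features=80, n_informative=10,
                       noise=15.0, random_state=42)
X = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0, max_iter=50_000).fit(X, y)

# Count coefficients that are exactly zero in each fitted model.
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print(f"Ridge coefficients at exactly zero: {ridge_zeros} of {X.shape[1]}")
print(f"Lasso coefficients at exactly zero: {lasso_zeros} of {X.shape[1]}")
```

Ridge keeps a small non-zero weight on every noise feature, while Lasso removes some of them from the model entirely.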

Residual Diagnostics

The final step — and the one most tutorials skip — is verifying the regression assumptions:

Normality: Residual histogram follows a bell curve ✓
Homoscedasticity: Residuals vs fitted values shows no fan shape ✓
Autocorrelation: Durbin-Watson statistic ≈ 2.03 — no autocorrelation ✓
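
The three checks can be approximated numerically with NumPy alone. The helper names below are hypothetical, and simulated residuals stand in for those of a fitted model; with real data you would pass the difference between actual and predicted values, along with the fitted values. statsmodels also provides a `durbin_watson` function in `statsmodels.stats.stattools` if a library implementation is preferred.

```python
import numpy as np


def durbin_watson(residuals):
    """Squared successive differences over sum of squares; near 2 means no autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


def skewness(residuals):
    """Sample skewness; close to 0 for a symmetric, bell-shaped distribution."""
    e = np.asarray(residuals, dtype=float)
    z = (e - e.mean()) / e.std()
    return float(np.mean(z ** 3))


def spread_vs_fitted(fitted, residuals):
    """Correlation of |residual| with fitted value; far from 0 hints at a fan shape."""
    return float(np.corrcoef(fitted, np.abs(residuals))[0, 1])


# Simulated stand-ins: independent, normal, constant-variance residuals.
rng = np.random.default_rng(0)
fitted = rng.uniform(11.0, 13.0, size=1000)   # assumed scale, e.g. log prices
residuals = rng.normal(0.0, 0.1, size=1000)

dw = durbin_watson(residuals)
skew = skewness(residuals)
fan = spread_vs_fitted(fitted, residuals)
print(f"Durbin-Watson: {dw:.2f}, skewness: {skew:.2f}, |resid| vs fitted corr: {fan:.2f}")
```

A Durbin-Watson value near 2 and values near 0 for the other two statistics are consistent with the assumptions. Plots remain the better tool for spotting nonlinear patterns that a single number can miss.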

These checks matter. A model that fits well in-sample but violates assumptions will fail in production in predictable ways.

Full code and notebook available at github.com/mauryasameer/Regression_analysis.
Original analysis published on Medium.