Regression Analysis: Ridge vs Lasso on the Ames Housing Dataset
A practical walkthrough of building regularised regression models for house price prediction — from raw data preprocessing to comparing Ridge and Lasso, with residual diagnostics to validate the result.
Linear regression is one of the most misused tools in data science — not because it’s wrong, but because people skip the steps that make it trustworthy. This post walks through building a proper regression model for house price prediction: preprocessing, feature engineering, regularisation, and residual validation.
The dataset is the Ames, Iowa housing data: 2,919 sales with 80 features. It is often used as a richer alternative to the classic Boston Housing dataset.
Four Assumptions That Actually Matter
Before fitting any model, it’s worth being explicit about what we’re assuming:
- No correlation between the errors and the independent variables: the residuals shouldn’t correlate with the predictors
- Normally distributed errors: residuals should follow a bell curve
- Independent error terms: no autocorrelation (Durbin-Watson test)
- Constant error variance: homoscedasticity, meaning no fan shape in the residuals
Most tutorials skip these checks. They matter when you’re deploying a model in a regulated environment.
Preprocessing
The 80-feature dataset needed significant cleaning before modelling:
- Dropped useless features — columns with >70% missing values or near-zero variance
- Missing value imputation — median for numerical, mode for categorical
- Outlier removal — using percentile thresholds (1st and 99th)
- Multicollinearity check — feature correlation analysis, removing highly correlated pairs
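The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the `preprocess` helper, the `SalePrice` target name, the 99% dominant-value cutoff for near-zero variance, the 0.85 correlation threshold, and the choice to apply the percentile filter to the target only are all assumptions.

```python
import numpy as np
import pandas as pd


def preprocess(df, target="SalePrice", missing_thresh=0.70, corr_thresh=0.85):
    """Illustrative cleaning pipeline (hypothetical helper, assumed thresholds)."""
    df = df.copy()

    # 1. Drop columns with more than 70% missing values.
    df = df.loc[:, df.isna().mean() <= missing_thresh].copy()

    # 2. Drop near-zero-variance columns: one value covers more than 99% of rows
    #    (the 99% cutoff is an assumption).
    dominant_share = df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )
    df = df.loc[:, dominant_share <= 0.99].copy()

    # 3. Impute: median for numeric columns, mode for categorical ones.
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 4. Keep rows whose target lies within its 1st to 99th percentile range
    #    (filtering on the target only is an assumption).
    lo, hi = df[target].quantile([0.01, 0.99])
    df = df[df[target].between(lo, hi)]

    # 5. For each highly correlated numeric pair, drop the second feature.
    features = df.select_dtypes(include="number").drop(columns=[target])
    corr = features.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_thresh).any()]
    return df.drop(columns=to_drop)
```

Masking the lower triangle of the correlation matrix means each pair is inspected once, so only one feature of a correlated pair is removed.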
Feature Engineering
Raw features rarely tell the full story. A few derived variables that improved signal:
- Garage condition status — binary flag from the garage quality field
- Property age — years since build and years since last remodel
- Total living area — combined above-ground and basement finished area
After deriving new features, the originals are dropped to reduce multicollinearity: keeping both the derived and original features would introduce redundant information that inflates the variance of the coefficient estimates.
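As a rough sketch, the three derived features could be built like this. The column names (`GarageQual`, `YearBuilt`, `YearRemodAdd`, `YrSold`, `GrLivArea`, `BsmtFinSF1`, `BsmtFinSF2`) follow the commonly distributed version of the Ames data and are assumed here, as are the `engineer_features` helper, the derived column names, and the "typical or better" cutoff for the garage flag.

```python
import pandas as pd


def engineer_features(df):
    """Derive the three features described above (hypothetical helper, assumed column names)."""
    df = df.copy()

    # Binary flag: 1 if garage quality is typical or better (assumed cutoff).
    df["GarageOK"] = df["GarageQual"].isin(["TA", "Gd", "Ex"]).astype(int)

    # Property age at sale, and years since the last remodel.
    df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
    df["RemodAge"] = df["YrSold"] - df["YearRemodAdd"]

    # Total living area: above-ground plus finished basement area.
    df["TotalLivArea"] = df["GrLivArea"] + df["BsmtFinSF1"] + df["BsmtFinSF2"]

    # Drop the source columns so they don't duplicate the derived ones.
    originals = ["GarageQual", "YearBuilt", "YearRemodAdd", "YrSold",
                 "GrLivArea", "BsmtFinSF1", "BsmtFinSF2"]
    return df.drop(columns=originals)
```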
Ridge vs Lasso
Both are regularised regression methods: they add a penalty term to the loss function to prevent overfitting. The difference is in how they penalise:
Ridge (L2): Penalises the sum of squared coefficients. All coefficients shrink toward zero but, in general, none reach it exactly. Every feature stays in the model.
Lasso (L1): Penalises the sum of absolute coefficient values. Coefficients can shrink to exactly zero, so Lasso performs automatic feature selection.
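The difference is easy to see on synthetic data. The sketch below uses scikit-learn's `Ridge` and `Lasso` on made-up data in which only three of ten features carry signal; the data, the alpha values, and the variable names are illustrative assumptions, not the Ames setup.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Only the first three features carry signal; the other seven are pure noise.
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("Ridge coefficients set to zero:", ridge_zeros)
print("Lasso coefficients set to zero:", lasso_zeros)
```

Ridge should leave small non-zero weights on the noise features, while Lasso is expected to remove most of them outright.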
Results
Both models found their optimal alpha at 0.0001 via grid search cross-validation.
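A grid search of this kind might look like the sketch below, using scikit-learn's `GridSearchCV`. The synthetic data stands in for the prepared Ames features, and the alpha grid, the 5-fold split, the scoring metric, and the scaling step are assumptions; the original notebook may differ on all of them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and target.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

alphas = np.logspace(-4, 2, 7)  # assumed grid: 0.0001, 0.001, ..., 100

best_alpha = {}
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10_000))]:
    # Scale inside the pipeline so each CV fold is standardised on its own training split.
    pipe = make_pipeline(StandardScaler(), model)
    search = GridSearchCV(pipe, {f"{name}__alpha": alphas}, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    best_alpha[name] = search.best_params_[f"{name}__alpha"]
    print(name, "best alpha:", best_alpha[name])
```

Standardising inside the pipeline matters because both penalties depend on coefficient size, which in turn depends on feature scale.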
Most influential feature: MSZoning_RL (Residential Low Density zoning) — the strongest predictor of sale price in this dataset.
Residual validation: Residual analysis showed no visible patterns in the residuals, suggesting the models are not systematically over- or under-predicting in any region of the feature space.
Why Lasso Wins Here
Lasso performs slightly better than Ridge on this dataset — not because L1 regularisation is inherently superior, but because the dataset has genuine sparsity. With 80 features, many are low-signal. Lasso identifies and zeros them out; Ridge keeps them and pays a small variance cost.
The rule of thumb: if you believe most features contribute something, use Ridge. If you believe only a subset of features actually matter, use Lasso.
For house pricing, a handful of features (size, location, quality) dominate. Lasso is the right prior.
Residual Diagnostics
The final step — and the one most tutorials skip — is verifying the regression assumptions:
- Normality: Residual histogram follows a bell curve ✓
- Homoscedasticity: Residuals vs fitted values shows no fan shape ✓
- Autocorrelation: Durbin-Watson statistic ≈ 2.03, close to the ideal value of 2, indicating no first-order autocorrelation ✓
These checks matter. A model that fits well in-sample but violates assumptions will fail in production in predictable ways.
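For readers who want numbers alongside the plots, the three checks can be approximated with NumPy alone. This is a minimal sketch on synthetic residuals: the helper names are hypothetical, and skewness and a spread correlation are simple stand-ins for visual inspection, not formal tests. The statsmodels library also ships a `durbin_watson` function in `statsmodels.stats.stattools`.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation."""
    r = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)


def skewness(residuals):
    """Sample skewness: roughly 0 for a symmetric, bell-shaped distribution."""
    r = np.asarray(residuals, dtype=float)
    centred = r - r.mean()
    return np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5


def spread_trend(fitted, residuals):
    """Correlation of fitted values with |residuals|; far from 0 hints at a fan shape."""
    return np.corrcoef(fitted, np.abs(residuals))[0, 1]


# Synthetic stand-in: well-behaved residuals from a hypothetical fitted model.
rng = np.random.default_rng(42)
fitted = rng.uniform(11, 13, size=1000)
residuals = rng.normal(scale=0.1, size=1000)

print("Durbin-Watson:", round(durbin_watson(residuals), 2))
print("Skewness:", round(skewness(residuals), 2))
print("Spread trend:", round(spread_trend(fitted, residuals), 2))
```

With real model output, `fitted` and `residuals` would come from the held-out predictions rather than a random generator.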
Full code and notebook available at github.com/mauryasameer/Regression_analysis.
Original analysis published on Medium.