Underfitting vs. Overfitting — Regularization Spectrum
How much should the model constrain itself to avoid memorizing noise? L1, L2, dropout, batch normalization, early stopping, and data augmentation as regularization levers.
Intent & Description
🎯 Intent
Balance model complexity for good generalization: apply enough regularization to prevent overfitting (model too complex) without constraining the model so much that it underfits (model too simple).
📋 Context
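Models that are too simple underfit: they miss real patterns in the data. Models that are too complex overfit: they memorize noise and fail to generalize. Regularization techniques constrain the model so that it lands between the two. The techniques work in different ways: L1 penalizes the absolute values of the weights and drives many of them to exactly zero (sparsity); L2 penalizes the squared weights and shrinks them toward zero without eliminating them; dropout randomly disables units during training, which behaves roughly like averaging an ensemble of sub-networks; batch normalization normalizes layer activations, which is generally credited with smoothing the loss landscape and adds a mild regularizing effect.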
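To make the L1 versus L2 contrast concrete, here is a minimal NumPy sketch. It assumes the special case of an orthonormal design matrix, where both penalized least-squares problems have closed-form solutions: L1 soft-thresholds each least-squares coefficient and L2 rescales it. The coefficient values and the penalty strength `lam` are made up for illustration.

```python
import numpy as np

# Hypothetical least-squares coefficients (illustrative values only).
ols_coef = np.array([3.0, -0.4, 0.2, -2.5, 0.05])
lam = 0.5  # assumed penalty strength


def l1_shrink(coef, lam):
    """Soft-thresholding: minimizer of 0.5*||y - Xb||^2 + lam*||b||_1 when X^T X = I."""
    return np.sign(coef) * np.maximum(np.abs(coef) - lam, 0.0)


def l2_shrink(coef, lam):
    """Rescaling: minimizer of 0.5*||y - Xb||^2 + 0.5*lam*||b||^2 when X^T X = I."""
    return coef / (1.0 + lam)


l1_coef = l1_shrink(ols_coef, lam)
l2_coef = l2_shrink(ols_coef, lam)

print("L1:", l1_coef)
print("L2:", l2_coef)
print("zeros under L1:", int(np.sum(l1_coef == 0.0)))
print("zeros under L2:", int(np.sum(l2_coef == 0.0)))
```

Every coefficient whose magnitude is below `lam` is removed entirely under L1, while L2 keeps all of them and only scales them down. Real problems rarely have orthonormal features, so this is an intuition aid rather than a solver.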
💡 Solution
Use L1 for feature selection in high-dimensional spaces. Use L2 as the default, especially when features are correlated. Use dropout for deep networks. Use batch normalization for very deep networks. Use early stopping for any model trained iteratively with gradient-based methods. Use data augmentation for vision, audio, and NLP tasks. When unsure whether L1 or L2 fits better, combine the two (elastic net).
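Early stopping is the one lever above that needs no change to the model itself, only to the training loop. The following is a minimal, framework-independent sketch of the usual patience rule. The function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the validation losses are all hypothetical and chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never ran out of patience: training used every epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation curve: improves, then drifts upward as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 3 6
```

In a real training loop the weights from `best_epoch` would be checkpointed and restored after stopping, so the model returned is the one with the lowest validation loss rather than the last one trained.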
Real-world Use Case
📌 TL;DR
Regularization spectrum: L1 (sparsity, feature selection), L2 (small weights), dropout (ensemble averaging), batch norm (smooth loss landscape), early stopping (implicit regularization). Combine techniques as needed and tune their strength against validation data.
Advantages
- Systematic approach to controlling model complexity
- Each technique has specific strengths for different scenarios
- Well-tuned regularization improves generalization performance
- Combinations such as elastic net (L1 + L2) provide a balanced approach
Disadvantages
- Adds hyperparameters to tune (regularization strength, dropout rate)
- Over-regularization can cause underfitting
- Different techniques work better for different model architectures
- Requires a validation set to find the optimal regularization level
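The tuning cost noted in the list above can be sketched as a simple sweep over regularization strengths scored on a held-out validation split. This example assumes scikit-learn is available and uses synthetic data from `make_regression` in place of a real dataset; the candidate `alphas` grid is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]  # candidate strengths
val_errors = {}
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))

# Pick the strength with the lowest validation error.
best_alpha = min(val_errors, key=val_errors.get)
print("best alpha:", best_alpha)
```

Very small values of `alpha` leave the model close to unregularized, and very large values push it toward underfitting, so the validation error is typically lowest somewhere in between. scikit-learn also ships cross-validated estimators such as `RidgeCV` and `LassoCV` that automate this kind of search.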
# Regularization Spectrum
from sklearn.linear_model import Lasso, Ridge, ElasticNet
import torch
import torch.nn as nn
# X_train and y_train are assumed to be an existing training feature matrix and target vector
# L1 Regularization (Lasso) - Feature selection
l1_model = Lasso(alpha=0.01) # Higher alpha = more regularization
l1_model.fit(X_train, y_train)
# Many coefficients become exactly zero
# L2 Regularization (Ridge) - Shrinkage
l2_model = Ridge(alpha=1.0)
l2_model.fit(X_train, y_train)
# Coefficients are shrunk toward zero but generally stay non-zero
# Elastic Net - Combined L1 + L2
elastic_model = ElasticNet(alpha=0.1, l1_ratio=0.5) # 50% L1, 50% L2
elastic_model.fit(X_train, y_train)
# Dropout for Neural Networks
class DropoutNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.dropout = nn.Dropout(0.5)  # Zero each activation with probability 0.5
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout(x)  # Active in training mode only; a no-op after model.eval()
        x = self.fc2(x)
        return x
# Batch Normalization
class BatchNormNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.bn = nn.BatchNorm1d(256)  # Normalize layer activations
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn(x)  # Normalize activations
        x = torch.relu(x)
        x = self.fc2(x)
        return x
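Both dropout and batch normalization behave differently in training and in evaluation, which is easy to overlook when using networks like the ones above. The sketch below assumes PyTorch is installed and uses a small stand-in network with arbitrary layer sizes and random inputs. It also shows the `weight_decay` optimizer argument, which with plain SGD acts as an L2 penalty on the weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in network (layer sizes are arbitrary for this sketch).
model = nn.Sequential(
    nn.Linear(20, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 2),
)

# weight_decay adds an L2-style penalty on the parameters inside the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(8, 20)  # random batch standing in for real inputs

model.train()  # dropout active, batch norm uses per-batch statistics
train_out = model(x)
loss = train_out.pow(2).mean()  # placeholder loss, for illustration only
optimizer.zero_grad()
loss.backward()
optimizer.step()

model.eval()  # dropout disabled, batch norm uses running statistics
with torch.no_grad():
    eval_out_a = model(x)
    eval_out_b = model(x)

print("eval passes identical:", torch.equal(eval_out_a, eval_out_b))
```

Forgetting to call `model.eval()` before validation or inference leaves dropout switched on, which makes predictions noisy and validation metrics misleading.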