Batch vs. Online Learning
Train on a fixed historical dataset (batch) or update continuously as new data arrives (online). The trade-offs are adaptation latency, compute cost, stability, and infrastructure complexity.
Intent & Description
🎯 Intent
Choose between periodic retraining on historical data versus continuous model updates based on latency requirements, data volatility, and infrastructure constraints.
📋 Context
Batch learning retrains periodically on a fixed historical dataset: it is stable but slow to adapt to new patterns. Online learning updates the model as each new sample arrives: it adapts quickly but is sensitive to noisy data, and unmanaged concept drift can degrade it. The trade-off involves adaptation speed, compute cost, stability, and infrastructure complexity.
💡 Solution
Use mini-batch SGD as a pragmatic middle ground. For online learning, implement concept drift detection (ADWIN, Page-Hinkley test). Shadow-deploy new model versions before switching traffic to them. Monitor for distribution shift. Use a hybrid approach: online updates for rapid adaptation, batch retraining for periodic stabilization.
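To make the shadow-deployment step concrete, here is a minimal sketch in plain Python. The class name `ShadowDeployment`, the `min_samples` and `margin` parameters, and the accuracy-based promotion rule are all assumptions made for illustration; a real system would likely log predictions asynchronously and compare more than accuracy.

```python
from dataclasses import dataclass
from typing import Any, Callable

Predictor = Callable[[Any], Any]


@dataclass
class ShadowDeployment:
    """Serve the live model while scoring a candidate on the same traffic."""
    live: Predictor
    shadow: Predictor
    live_correct: int = 0
    shadow_correct: int = 0
    seen: int = 0

    def predict(self, x):
        # Callers only ever receive the live model's answer.
        return self.live(x)

    def record_outcome(self, x, y_true):
        # Once the true label arrives, score both models on the same input.
        self.seen += 1
        self.live_correct += int(self.live(x) == y_true)
        self.shadow_correct += int(self.shadow(x) == y_true)

    def should_promote(self, min_samples=100, margin=0.01):
        # Hypothetical rule: promote only after enough labelled traffic,
        # and only if the shadow model beats the live one by `margin`.
        if self.seen < min_samples:
            return False
        live_acc = self.live_correct / self.seen
        shadow_acc = self.shadow_correct / self.seen
        return shadow_acc >= live_acc + margin


# Toy usage with threshold rules standing in for real models.
deploy = ShadowDeployment(live=lambda x: x > 0.5, shadow=lambda x: x > 0.3)
for x, y in [(0.4, True), (0.6, True), (0.2, False), (0.35, True)]:
    deploy.record_outcome(x, y)
promote = deploy.should_promote(min_samples=4)
```

The point of the design is that the candidate never affects user-facing output until it has been measured on the same live inputs as the current model.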
Real-world Use Case
📌 TL;DR
Batch learning: stable, periodic, high compute. Online learning: fast adaptation, continuous, vulnerable to drift. Mini-batch: balanced approach. Use concept drift detection to trigger retraining.
Advantages
- Batch: stable, simpler infrastructure, well-understood
- Online: fast adaptation, low per-update compute
- Mini-batch: balanced approach
- Concept drift detection enables smart retraining triggers
Disadvantages
- Batch: slow adaptation, periodic compute spikes
- Online: vulnerable to noisy/adversarial data, complex infrastructure
- Mini-batch: still requires tuning batch size and learning rate
- All approaches require monitoring for distribution shift
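One way to monitor for distribution shift on a single numeric feature is the Population Stability Index (PSI). PSI is not named in this pattern, so treat it as one possible choice among several. The sketch below assumes equal-width bins derived from the reference sample; the function name, `n_bins`, and `eps` are illustrative choices.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger means more shift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # Degenerate case: all reference values are identical

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # Clamp out-of-range values
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm below is defined
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Toy data: the "current" sample is concentrated in the upper half.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold should be calibrated per feature rather than taken as given.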
# Batch vs. Online Learning
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier
from river import linear_model, drift

# Batch Learning: Train on full dataset periodically
def batch_learning(X_train, y_train, X_test, y_test):
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # Report held-out accuracy so each retrain can be compared with the last
    print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
    return model

# Online Learning: Update continuously with new data
def online_learning(data_stream):
    # data_stream yields (features_dict, label) pairs, the format river expects
    model = linear_model.LogisticRegression()
    drift_detector = drift.ADWIN()
    for X_new, y_new in data_stream:
        # Predict before learning so the error reflects unseen data
        y_pred = model.predict_one(X_new)
        # Update model with new sample
        model.learn_one(X_new, y_new)
        # Check for concept drift on the stream of 0/1 prediction errors
        drift_detector.update(int(y_pred != y_new))
        if drift_detector.drift_detected:
            print("Concept drift detected!")
            model = linear_model.LogisticRegression()  # Reset
    return model

# Mini-batch Learning: Pragmatic middle ground
def mini_batch_learning(X_train, y_train, batch_size=32):
    # learning_rate='adaptive' needs a positive initial rate eta0
    model = SGDClassifier(loss='log_loss', learning_rate='adaptive', eta0=0.01)
    # partial_fit must be told every class label on the first call
    classes = np.unique(y_train)
    for i in range(0, len(X_train), batch_size):
        X_batch = X_train[i:i+batch_size]
        y_batch = y_train[i:i+batch_size]
        model.partial_fit(X_batch, y_batch, classes=classes)
    return model

# Concept Drift Detection
def detect_concept_drift(predictions, true_values, window_size=100):
    # window_size is used as the warm-up length before drift can be flagged
    detector = drift.PageHinkley(min_instances=window_size)
    for pred, true in zip(predictions, true_values):
        error = 1 if pred != true else 0
        detector.update(error)
        if detector.drift_detected:
            return True  # Retraining needed
    return False
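
To show what a Page-Hinkley detector computes on the error stream, here is a from-scratch sketch of the one-sided test, which flags an increase in the mean. It is not river's implementation. The class name and the default values for `delta`, `threshold`, and `min_instances` are assumptions chosen so the toy example triggers quickly.

```python
class PageHinkleySketch:
    """One-sided Page-Hinkley test: flags an increase in the stream's mean."""

    def __init__(self, delta=0.005, threshold=5.0, min_instances=30):
        self.delta = delta                  # Tolerated magnitude of change
        self.threshold = threshold          # Alarm level (lambda)
        self.min_instances = min_instances  # Warm-up before alarms are allowed
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0               # m_t: cumulative deviation from the mean
        self.minimum = 0.0                  # M_t: smallest m_t seen so far
        self.drift_detected = False

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n          # Running mean
        self.cumulative += x - self.mean - self.delta  # Accumulate deviation
        self.minimum = min(self.minimum, self.cumulative)
        # Drift when the cumulative deviation rises far above its minimum
        self.drift_detected = (
            self.n >= self.min_instances
            and self.cumulative - self.minimum > self.threshold
        )
        return self.drift_detected


def first_drift_index(errors, **kwargs):
    """Return the index of the first alarm, or None if no drift is flagged."""
    detector = PageHinkleySketch(**kwargs)
    for i, error in enumerate(errors):
        if detector.update(error):
            return i
    return None


# Toy error stream: the model is always right, then always wrong.
errors = [0] * 200 + [1] * 200
drift_at = first_drift_index(errors)
no_drift = first_drift_index([0] * 400)
```

A lower `threshold` reacts faster but raises more false alarms on noisy streams, so it is worth tuning against historical error data before using the detector as a retraining trigger.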