Exploration vs. Exploitation (Reinforcement Learning)
Balance trying new actions (exploration) against selecting known-good actions (exploitation). ε-Greedy, UCB, Thompson Sampling, and Boltzmann exploration strategies.
Intent & Description
🎯 Intent
Balance discovering new, potentially better actions versus maximizing reward using current knowledge in sequential decision-making scenarios.
📋 Context
In reinforcement learning and bandit problems, agents must explore to discover better actions but must exploit known good actions to maximize cumulative reward. Pure exploration wastes reward on suboptimal actions. Pure exploitation gets stuck in local optima. The trade-off is central to A/B testing, recommender systems, and RL.
💡 Solution
Use ε-greedy for simplicity — decay ε over time. Use UCB (Upper Confidence Bound) for principled uncertainty exploration. Use Thompson Sampling for Bayesian natural uncertainty handling. Use Boltzmann exploration for smooth temperature-controlled exploration. Separate exploration from learning in deep RL with replay buffers and target networks.
Real-world Use Case
📌 TL;DR
Exploration-exploitation: ε-greedy (simple, random exploration), UCB (principled uncertainty), Thompson sampling (Bayesian), Boltzmann (smooth temperature). Use Thompson sampling for production bandits, decay ε over time.
Advantages
- Systematic framework for sequential decision-making
- Different strategies for different use cases
- Theoretical regret bounds for some strategies (UCB)
- Natural handling of uncertainty in Bayesian methods
Disadvantages
- Exploration wastes reward on suboptimal actions
- Optimal strategy depends on environment characteristics
- Hyperparameter tuning required (ε, temperature, etc.)
- Complex to implement in high-dimensional action spaces
# Exploration vs. Exploitation Strategies
import numpy as np
from scipy.stats import beta
class BanditAgent:
def __init__(self, n_arms, strategy='epsilon_greedy'):
self.n_arms = n_arms
self.strategy = strategy
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
self.epsilon = 0.1 # For epsilon-greedy
self.alpha = np.ones(n_arms) # For Thompson sampling
self.beta = np.ones(n_arms) # For Thompson sampling
def select_arm(self):
if self.strategy == 'epsilon_greedy':
if np.random.random() < self.epsilon:
return np.random.randint(self.n_arms)
else:
return np.argmax(self.values)
elif self.strategy == 'ucb':
# Upper Confidence Bound
ucb_values = self.values + np.sqrt(2 * np.log(sum(self.counts) + 1) / (self.counts + 1e-5))
return np.argmax(ucb_values)
elif self.strategy == 'thompson_sampling':
# Sample from Beta distribution
samples = [np.random.beta(self.alpha[i], self.beta[i]) for i in range(self.n_arms)]
return np.argmax(samples)
elif self.strategy == 'boltzmann':
# Softmax exploration with temperature
temperature = 0.1
exp_values = np.exp(self.values / temperature)
probs = exp_values / np.sum(exp_values)
return np.random.choice(self.n_arms, p=probs)
def update(self, arm, reward):
self.counts[arm] += 1
self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
if self.strategy == 'thompson_sampling':
self.alpha[arm] += reward
self.beta[arm] += (1 - reward)