Agentic AI
Planning & Control Flow
Exploration vs Exploitation
Intent & Description
🎯 Intent
Balance taking the best-known action (exploit) with trying alternatives that might be better (explore).
📋 Context
A team runs a long-lived agent that repeatedly chooses among a set of options — which tool to call, which prompt template to use, which strategy to try — and can observe an outcome signal after each choice (success, reward, user thumbs-up). Over time the agent should get better at the choice, not just freeze the first decent option in place.
💡 Solution
- Pick an exploration strategy: epsilon-greedy (exploit with probability 1 - ε, explore randomly otherwise), upper-confidence-bound (favour under-explored options with a UCB bonus), or Thompson sampling (sample from the posterior over option quality).
- Apply the chosen strategy across tools, strategies, or prompt templates at the agent’s decision points.
- Track outcomes and adjust posteriors or bandit statistics after each run.
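The three steps above can be sketched with epsilon-greedy, the simplest of the named strategies. This is a minimal illustration, not a reference implementation: the class name `EpsilonGreedy`, the option names, and the default `epsilon=0.1` are assumptions chosen for the example, and rewards are assumed to be numeric (for instance 1.0 for a thumbs-up, 0.0 otherwise).

```python
import random


class EpsilonGreedy:
    """Hypothetical epsilon-greedy chooser over a fixed set of named options."""

    def __init__(self, options, epsilon=0.1, rng=None):
        self.options = list(options)
        self.epsilon = epsilon              # probability of exploring
        self.rng = rng or random.Random()
        self.counts = {o: 0 for o in self.options}    # times each option was tried
        self.means = {o: 0.0 for o in self.options}   # running mean reward per option

    def choose(self):
        # Explore: with probability epsilon, pick any option uniformly at random.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.options)
        # Exploit: otherwise pick the option with the best running mean
        # (ties go to the option listed first).
        return max(self.options, key=lambda o: self.means[o])

    def update(self, option, reward):
        # Incremental mean update, so no reward history has to be stored.
        self.counts[option] += 1
        self.means[option] += (reward - self.means[option]) / self.counts[option]
```

At a decision point the agent would call `choose()` to select, say, a prompt template, run it, and then pass the observed outcome to `update()`. Because ties resolve to the first option, a chooser with `epsilon=0` never leaves its initial pick until some reward separates the options, which is the freezing behaviour the pattern is meant to avoid.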
Real-world Use Case
- The agent chooses repeatedly among options (tools, strategies, prompts) and outcomes can be tracked.
- Pure exploitation is locking the agent into local optima.
- A strategy (epsilon-greedy, UCB, Thompson sampling) can be picked and tuned.
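To make the "picked and tuned" condition concrete, here is a rough sketch of the selection step for UCB1, one common upper-confidence-bound rule. The function name `ucb1_choose`, the dictionary layout, and the parameter `c` are assumptions made for this example; `c=2.0` reproduces the textbook UCB1 bonus, and changing it is one way to tune how strongly under-explored options are favoured.

```python
import math


def ucb1_choose(counts, means, c=2.0):
    """Return the option with the highest UCB1 score (hypothetical helper).

    counts: dict mapping option name -> number of times it was tried
    means:  dict mapping option name -> running mean reward
    c:      exploration weight; 2.0 gives the standard UCB1 bonus
    """
    # Any option that has never been tried is chosen first,
    # which also avoids dividing by zero below.
    for option, n in counts.items():
        if n == 0:
            return option

    total = sum(counts.values())

    def score(option):
        # The bonus shrinks as an option is tried more often and
        # grows slowly with the total number of decisions made.
        bonus = math.sqrt(c * math.log(total) / counts[option])
        return means[option] + bonus

    return max(counts, key=score)
```

With equal trial counts the bonus is identical for every option, so the choice reduces to pure exploitation; the bonus only changes the decision when some option has been tried much less often than the others.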
Advantages
- Avoids local optima that pure exploitation would lock in.
- Improves with experience as the posterior (or the bandit statistics) sharpens around each option's true quality.
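What "the posterior sharpens" means can be illustrated with Thompson sampling over success/failure outcomes. The sketch below assumes binary rewards and a uniform Beta(1, 1) prior per option; the class name `BetaThompson` and its methods are hypothetical, and `posterior_variance` is included only to show the uncertainty shrinking as outcomes accumulate.

```python
import random


class BetaThompson:
    """Hypothetical Thompson sampler with a Beta posterior per option."""

    def __init__(self, options, rng=None):
        self.rng = rng or random.Random()
        # [alpha, beta] per option; Beta(1, 1) is a uniform prior on the success rate.
        self.params = {o: [1.0, 1.0] for o in options}

    def choose(self):
        # Draw one plausible success rate per option from its posterior
        # and act greedily on the draws.
        samples = {o: self.rng.betavariate(a, b) for o, (a, b) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, option, success):
        # A success increments alpha, a failure increments beta.
        self.params[option][0 if success else 1] += 1.0

    def posterior_variance(self, option):
        # Variance of Beta(a, b); it shrinks as evidence accumulates.
        a, b = self.params[option]
        return a * b / ((a + b) ** 2 * (a + b + 1))
```

Options with few observations have wide posteriors, so their draws occasionally come out on top and they get explored; as the variance shrinks, the draws concentrate near the observed success rate and the sampler shifts toward exploitation without a separate exploration parameter.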
Disadvantages
- Requires a reward signal; without one, exploration is noise.
- Strategy choice and hyper-parameter tuning are empirical.