Exploration & Discovery
Should you order your usual, or try something new?
You have a favourite dish at a restaurant. Ordering it is safe: you know it's good. But the menu has 20 other dishes; one of them might be even better, and you'll never know unless you try. That's the whole tension of Exploration & Discovery: an agent can exploit the best option it already knows, or explore a new one that might be better. Always playing safe means you can get stuck on "pretty good" and miss "great".
Key points
- Exploit = use the known-best option (safe, no surprises).
- Explore = try something new (risky, but may find better).
- Too much exploit = stuck on okay. Too much explore = never settle.
What is Exploration & Discovery?
Exploration & Discovery is the pattern where an agent deliberately tries untested actions or paths to find better solutions, instead of always repeating whatever worked before. It balances two urges: exploitation (cash in on the current best) and exploration (gamble on something new that could be even better).
Note: Explore to learn what's possible; exploit to cash in on what you know.
Explore vs Exploit (the fork in every decision)
            ┌──────────────────────┐
AGENT ─────►│ Pick an action...    │
            └───────────┬──────────┘
                        │
         ┌──────────────┴───────────────┐
         ▼                              ▼
┌─────────────────┐            ┌─────────────────┐
│   EXPLOIT 🛡️    │            │   EXPLORE 🎲    │
│ use known-best  │            │ try a new option│
│ safe, reliable  │            │ risky, may win  │
└────────┬────────┘            └────────┬────────┘
         │                              │
         ▼                              ▼
steady, 'pretty good'          sometimes worse, but
never improves                 sometimes DISCOVERS
                               a better path ✨
Epsilon-greedy: a simple dial between the two
Set epsilon = 0.2 (explore 20% of the time)
For each decision, draw a random number between 0.0 and 1.0 (the "roll"):
               roll < 0.2 ?
     ┌───────────────┬──────────────────┐
     │   YES (20%)   │     NO (80%)     │
     ▼                                  ▼
┌──────────┐                     ┌──────────────┐
│ EXPLORE  │                     │   EXPLOIT    │
│ random   │                     │  best known  │
│ option🎲 │                     │  option 🛡️   │
└──────────┘                     └──────────────┘

Big epsilon ► explores more (adventurous)
Small epsilon ► exploits more (conservative)
Common trick: start big, shrink over time as you learn
The pieces of explore/exploit
- Exploitation — Choose the option with the best track record so far. Example: Always send the email subject line that got the most opens.
- Exploration — Occasionally pick a random or untried option to gather new info. Example: Test a brand-new subject line on a small slice of users.
- Epsilon (the dial) — A number from 0 to 1 setting how often you explore. Example: epsilon=0.1 means explore 10% of the time, exploit 90%.
- Decay — Shrink epsilon as you learn, so you explore early and exploit later. Example: Start at 0.5, drop toward 0.05 once you trust your data.
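The decay idea in the last bullet can be sketched as a small schedule function. This is a minimal illustration, not a prescribed formula: the function name `decayed_epsilon`, the exponential shape, and the `decay=0.99` rate are all assumptions chosen for the example; only the 0.5 start and 0.05 floor come from the text above.

```python
def decayed_epsilon(step, start=0.5, floor=0.05, decay=0.99):
    """Shrink epsilon exponentially from `start`, never going below `floor`.

    `decay` (an assumed value) is the factor applied once per decision.
    """
    return max(floor, start * decay ** step)


# Early decisions explore a lot; later ones mostly exploit.
for step in (0, 50, 100, 300):
    print(f"step {step:3}: epsilon = {decayed_epsilon(step):.3f}")
```

The floor keeps a little exploration alive forever, which matters if the best option can change later. A linear schedule or a 1/step schedule would work along the same lines.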
Epsilon-greedy in a few lines (read it like English)
With probability epsilon the agent picks a random option (explore); otherwise it picks the best-known option (exploit). That single if is the entire trick — a simple way to keep discovering while still mostly using what works.
import random

wins = {"A": 8, "B": 5, "C": 1}   # rewards seen so far
epsilon = 0.2                     # explore 20% of the time

def pick():
    if random.random() < epsilon:
        return random.choice(list(wins))   # EXPLORE
    return max(wins, key=wins.get)         # EXPLOIT best

print("Chosen option:", pick())
▶ Try it: epsilon-greedy finds the best hidden option
Set epsilon = 0.0 (pure exploit) and watch the agent get stuck on button A: every average starts at 0, so it picks the first option and never tries another. Then raise epsilon and watch it find B.
import random

# Three buttons. Their TRUE win-rates are hidden from the agent.
true_rate = {"A": 0.3, "B": 0.8, "C": 0.5}   # B is secretly best
wins = {"A": 0, "B": 0, "C": 0}
tries = {"A": 0, "B": 0, "C": 0}
epsilon = 0.2

def avg(name):
    return wins[name] / tries[name] if tries[name] else 0

def pick():
    if random.random() < epsilon:
        return random.choice(list(wins))   # EXPLORE
    return max(wins, key=avg)              # EXPLOIT best avg

random.seed(7)
for _ in range(300):
    choice = pick()
    reward = 1 if random.random() < true_rate[choice] else 0
    tries[choice] += 1
    wins[choice] += reward

for name in wins:
    print(f"{name}: tried {tries[name]:3}x  win-rate {avg(name):.2f}")
print("\nAgent learned the best button is:", max(wins, key=avg))
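A single seeded run can be lucky or unlucky, so it may help to compare epsilon values over many runs. The sketch below reuses the same three hidden win-rates; the helper `run`, the choice of 100 seeds, and the per-run `random.Random` generator are assumptions made for this illustration.

```python
import random

TRUE_RATE = {"A": 0.3, "B": 0.8, "C": 0.5}   # same hidden rates as above


def run(epsilon, steps=300, seed=0):
    """Play one epsilon-greedy session and return the average reward per step."""
    rng = random.Random(seed)                 # private generator per run
    wins = {name: 0 for name in TRUE_RATE}
    tries = {name: 0 for name in TRUE_RATE}

    def avg(name):
        return wins[name] / tries[name] if tries[name] else 0

    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            choice = rng.choice(list(TRUE_RATE))   # explore
        else:
            choice = max(TRUE_RATE, key=avg)       # exploit best average
        reward = 1 if rng.random() < TRUE_RATE[choice] else 0
        tries[choice] += 1
        wins[choice] += reward
        total += reward
    return total / steps


for eps in (0.0, 0.2, 1.0):
    scores = [run(eps, seed=s) for s in range(100)]
    print(f"epsilon={eps}: mean reward per step = {sum(scores) / len(scores):.2f}")
```

With epsilon = 0.0 the agent never leaves A, so its score should sit near A's 0.30 rate. With epsilon = 1.0 it picks at random, so it should land near the average of the three rates (about 0.53). A small epsilon should score higher than both, because the agent finds B and then mostly stays on it.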
When should an agent explore?
| Scenario | Recommendation | Why |
|---|---|---|
| You're unsure which option is truly best | ✅ Explore | Trying alternatives reveals better options you'd otherwise miss. |
| The world changes over time (tastes, prices, data) | ✅ Explore | Yesterday's best may not be today's; keep checking. |
| You already know the clear winner and stakes are high | ❌ Mostly exploit | Random gambles waste resources when the answer is known. |
| A mistake is dangerous or irreversible | ⚠️ Explore carefully | Limit exploration to safe, low-stakes choices. |
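One way to read "explore carefully" from the last row is to let the random branch draw only from options marked as safe. The sketch below is an assumption-heavy illustration: the function `pick_carefully`, the `safe_to_try` set, and the option names are all hypothetical and not part of any library.

```python
import random


def pick_carefully(avg_reward, safe_to_try, epsilon=0.1, rng=random):
    """Epsilon-greedy where random exploration is limited to `safe_to_try`."""
    if safe_to_try and rng.random() < epsilon:
        return rng.choice(sorted(safe_to_try))       # explore, safe options only
    return max(avg_reward, key=avg_reward.get)       # exploit best known option


# Hypothetical options: the third one is irreversible, so it is never explored.
avg_reward = {"small_change": 0.6, "medium_change": 0.7, "irreversible_change": 0.0}
safe = {"small_change", "medium_change"}

picks = [pick_carefully(avg_reward, safe, epsilon=0.3) for _ in range(1000)]
print("irreversible_change chosen:", picks.count("irreversible_change"), "times")
```

The trade-off is that an option outside the safe set is never tested, so its estimate never improves. Deciding what counts as safe is a judgment made outside the algorithm.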
Explore/exploit mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Always exploiting (epsilon = 0). | The agent locks onto the first decent option and never finds better. | Keep a small exploration rate so new options still get a chance. |
| Always exploring (epsilon = 1). | The agent acts randomly forever and never cashes in on what it learned. | Lower epsilon over time so it settles on winners once it has data. |
| Judging an option after one try. | Bad luck on a great option makes the agent abandon it too soon. | Track an average over many tries, not a single result. |
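The fix in the last row, tracking an average instead of a single result, can be sketched with an incremental mean. The class name `RunningAverage` and the list of outcomes are made up for the example.

```python
class RunningAverage:
    """Tracks the mean reward of one option without storing every result."""

    def __init__(self):
        self.tries = 0
        self.mean = 0.0

    def update(self, reward):
        self.tries += 1
        # Incremental mean: move the old mean toward the new reward by 1/tries.
        self.mean += (reward - self.mean) / self.tries


# Made-up outcomes for a good option whose first try happens to be a loss.
results = [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]

stat = RunningAverage()
for reward in results:
    stat.update(reward)
    print(f"after {stat.tries:2} tries: average = {stat.mean:.2f}")
```

After the first try the average is 0.00, which would make the option look worthless. As more results arrive it climbs toward the option's real win-rate.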
Remember these lines
- Exploit = known-best; Explore = try new. You need both.
- Epsilon-greedy: explore with probability epsilon, else exploit.
- Explore a lot early, exploit more later — decay epsilon over time.
Key takeaways
- Exploration & Discovery balances exploiting the known-best against exploring new options.
- Pure exploit gets stuck on 'okay'; pure explore never settles — you need a mix.
- Epsilon-greedy is a one-line way to control the balance with a single probability.
- Explore more when uncertain or when the world changes; exploit when the winner is clear.