Two independent agents learn to collect food at fixed locations via a shared cooperative reward.
Independent Q-learning with ε-greedy exploration: each agent learns its own Q-table and treats the other as part of the environment.
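To make the update concrete, here is a minimal tabular sketch of what one agent runs. The table layout, action set, and helper names are assumptions for illustration, not the demo's actual source; the α and γ values are the slider defaults listed below.

```python
import random
from collections import defaultdict

# Minimal sketch of one agent's learner; names and structure are
# illustrative, not taken from the demo's source.
N_ACTIONS = 4                     # up, down, left, right (assumed action set)
ALPHA = 0.15                      # learning rate (demo default)
GAMMA = 0.90                      # discount factor (demo default)

# Each agent keeps its own table mapping state -> per-action values.
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)

def choose_action(state, epsilon):
    """ε-greedy: explore with probability ε, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    values = q_table[state]
    return max(range(N_ACTIONS), key=lambda a: values[a])

def update(state, action, reward, next_state, done):
    """One-step backup: Q(s,a) += α * (r + γ * max_a' Q(s',a') − Q(s,a))."""
    target = reward if done else reward + GAMMA * max(q_table[next_state])
    q_table[state][action] += ALPHA * (target - q_table[state][action])
```

Both agents run this loop independently against the same environment; the only coupling between them is the shared reward signal. The demo also decays ε by a factor (default 0.9975); whether the decay is applied per step or per episode is not stated above, so the sketch leaves the schedule out.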
Interface ·
Status readouts: Episode, Step, Explore ε (starts at 1.000), Last reward, and Avg reward over the last 20 episodes ·
Sliders: α (learning rate, default 0.15), γ (discount factor, default 0.90), and ε-decay (default 0.9975, takes effect on Reset) ·
Panels: the Environment grid showing food (randomised on Reset), Agent A, and Agent B; the Q-value heatmap of max_a Q(s,a) on a low→high colour scale (□ food location, ○ agent position); and the Learning curve of episode reward ·
Playback Speed control.
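The heatmap's intensity at each cell is the greedy value of that cell's state. With the illustrative q_table from the learner sketch above, computing it amounts to:

```python
# One heatmap cell per visited state: the greedy value max_a Q(s, a).
heatmap = {state: max(action_values)
           for state, action_values in q_table.items()}
```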
Reward structure ·
+10 when either agent steps on an active food pellet (shared by both agents) ·
−0.05 per step (movement cost) ·
Episode ends when all 4 pellets are collected or after 120 steps.
Food positions are fixed within a training run and only re-randomised when Reset is pressed, so Q-value maps remain spatially meaningful during learning; the sketch below illustrates these rules.
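The following is a sketch of an environment implementing the rules above. Only the reward constants and termination conditions come from the text; the grid size, action encoding, start positions, and whether the step cost is charged once per joint step or once per agent are assumptions.

```python
import random

GRID = 8                  # grid side length (assumed; not stated above)
STEP_COST = -0.05         # movement cost per step
FOOD_REWARD = 10.0        # shared reward when either agent collects a pellet
N_PELLETS = 4
MAX_STEPS = 120

MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]   # assumed action encoding

class FoodEnv:
    def reset(self):
        self.agents = [(0, 0), (GRID - 1, GRID - 1)]   # assumed start corners
        free = [(x, y) for x in range(GRID) for y in range(GRID)
                if (x, y) not in self.agents]
        # Food is re-randomised only here, never mid-run.
        self.food = set(random.sample(free, N_PELLETS))
        self.steps = 0

    def step(self, actions):
        """Advance one step; both agents receive the same scalar reward."""
        self.steps += 1
        reward = STEP_COST    # "−0.05 per step"; per-agent vs joint is assumed
        for i, action in enumerate(actions):
            dx, dy = MOVES[action]
            x = min(max(self.agents[i][0] + dx, 0), GRID - 1)
            y = min(max(self.agents[i][1] + dy, 0), GRID - 1)
            self.agents[i] = (x, y)
            if (x, y) in self.food:      # either agent triggers the pellet
                self.food.discard((x, y))
                reward += FOOD_REWARD    # shared by both agents
        done = not self.food or self.steps >= MAX_STEPS
        return reward, done
```

Returning a single scalar that both agents train on is what makes the task cooperative: neither agent can improve its own return except by helping collect pellets faster.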
Parameters ·
α and γ slider changes take effect on the next Reset; adjusting them mid-run does not disrupt ongoing learning. A sketch of this staging pattern follows.
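One way to get this deferred behaviour, shown purely as an illustrative pattern with hypothetical names: slider callbacks write to pending fields, and only reset() copies them into the live learner.

```python
class Trainer:
    """Illustrative staging of hyperparameters until the next Reset."""
    def __init__(self, alpha=0.15, gamma=0.90):
        self.alpha, self.gamma = alpha, gamma       # live values used mid-run
        self.pending = {"alpha": alpha, "gamma": gamma}

    def on_slider_change(self, name, value):
        self.pending[name] = value                  # staged, not yet active

    def reset(self):
        # Only now do slider values become live, alongside the usual reset
        # work (re-randomise food, clear Q-tables, restore ε, ...).
        self.alpha = self.pending["alpha"]
        self.gamma = self.pending["gamma"]
```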