MARL — Cooperative Grid World

Two independent agents learn to collect food at fixed locations via a shared cooperative reward. Each agent runs independent Q-learning with ε-greedy exploration. Food positions are fixed within each training run and only re-randomised on Reset, so the Q-value maps stay spatially meaningful as learning progresses.
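The demo's source isn't shown, so here is a minimal Python sketch of one independent Q-learner with ε-greedy exploration. The class name, the 5-action move set, and applying the ε decay once per episode are assumptions, not the demo's exact code.

```python
import random
from collections import defaultdict

# Assumed action set for a grid world: right, left, down, up, stay.
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

class QAgent:
    """One agent's tabular Q-learner; the other agent is treated as
    part of the environment (this is what makes the learning 'independent')."""

    def __init__(self, alpha=0.15, gamma=0.90, epsilon=1.0, eps_decay=0.9975):
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))  # state -> action values
        self.alpha, self.gamma = alpha, gamma
        self.epsilon, self.eps_decay = epsilon, eps_decay

    def act(self, state):
        # ε-greedy: random action with probability ε, otherwise greedy.
        if random.random() < self.epsilon:
            return random.randrange(len(ACTIONS))
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state, done):
        # One-step Q-learning target: r + γ max_a' Q(s', a'), or r on termination.
        target = reward if done else reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])

    def end_episode(self):
        self.epsilon *= self.eps_decay  # decay exploration between episodes
```

With two `QAgent` instances, each agent observes its own state, picks its own action, and both receive the same shared reward from the environment.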

Live stats  ·  Episode  ·  Explore ε (starts at 1.000)  ·  Step  ·  Last reward  ·  Avg reward (last 20 episodes)
Sliders  ·  α (learning rate) = 0.15  ·  γ (discount factor) = 0.90  ·  ε-decay = 0.9975  ·  slider changes take effect on Reset ↺
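A quick sanity check on the exploration schedule: if ε starts at 1.0 and is multiplied by 0.9975 after each episode (the per-episode decay point is an assumption), how long until exploration falls below a given threshold?

```python
import math

def episodes_until(eps0, decay, threshold):
    # Solve eps0 * decay**n < threshold for the smallest integer n.
    return math.ceil(math.log(threshold / eps0) / math.log(decay))

print(episodes_until(1.0, 0.9975, 0.05))  # → 1197
```

So with these defaults, the agents remain mostly greedy only after roughly 1,200 episodes, which suits a long-running visual demo.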
Panels  ·  Environment: food (randomised on Reset), Agent A, Agent B  ·  Q-value heatmap: max_a Q(s,a), shaded low→high (□ food location, ○ agent position)  ·  Learning curve: episode reward  ·  Speed control
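The heatmap values can be derived from a Q-table directly by taking, for each grid cell, the maximum action value. A sketch, assuming the table is a dict keyed by (row, col) cells (one such map per agent):

```python
def heatmap(q_table, rows, cols, n_actions):
    # Build a rows x cols grid of max_a Q(s, a); unvisited cells default to 0.
    grid = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            grid[r][c] = max(q_table.get((r, c), [0.0] * n_actions))
    return grid
```

Because food is fixed within a run, these per-cell maxima converge toward a gradient pointing at the pellet locations, which is why the map reads as spatially meaningful.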
Reward structure  ·  +10 when either agent steps on an active food pellet (the reward is shared by both agents)  ·  −0.05 per step (movement cost)  ·  An episode ends when all 4 pellets are collected or after 120 steps.
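The reward rules above can be sketched as a single environment step. Grid details, whether the −0.05 cost is charged once per joint step or per agent, and the helper names are assumptions:

```python
STEP_COST = -0.05     # movement cost (assumed charged once per joint step)
PELLET_REWARD = 10.0  # shared by both agents
MAX_STEPS = 120
NUM_PELLETS = 4

def env_step(agent_positions, food, steps):
    """Compute the shared reward after both agents have moved.

    agent_positions: list of (row, col) for Agent A and Agent B.
    food: set of (row, col) pellets still active (starts with NUM_PELLETS).
    Returns (shared_reward, done).
    """
    reward = STEP_COST
    for pos in agent_positions:
        if pos in food:
            food.discard(pos)        # a pellet is consumed at most once
            reward += PELLET_REWARD  # both agents receive the same reward
    done = not food or steps + 1 >= MAX_STEPS
    return reward, done
```

The shared signal means each agent is rewarded even when only its partner reaches a pellet, which is what drives the cooperative division of labour.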
Parameters  ·  The α and γ sliders take effect on the next Reset; adjusting them mid-run does not disrupt ongoing learning.