MARL — Cooperative Grid World

Two independent agents learn to collect food at fixed locations via a shared cooperative reward. Each agent runs independent Q-learning with ε-greedy exploration. Food positions are fixed within each training run and only re-randomised on Reset, so the Q-value maps stay spatially meaningful as learning progresses.
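The demo's source isn't shown, so here is a minimal Python sketch of one independent Q-learner with ε-greedy exploration. The class name, the 5-action move set, and applying the ε decay once per episode are assumptions, not the demo's exact code.

```python
import random
from collections import defaultdict

# Assumed action set for a grid world: right, left, down, up, stay.
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

class QAgent:
    """One agent's tabular Q-learner; the other agent is treated as
    part of the environment (this is what makes the learning 'independent')."""

    def __init__(self, alpha=0.15, gamma=0.90, epsilon=1.0, eps_decay=0.9975):
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))  # state -> action values
        self.alpha, self.gamma = alpha, gamma
        self.epsilon, self.eps_decay = epsilon, eps_decay

    def act(self, state):
        # ε-greedy: random action with probability ε, otherwise greedy.
        if random.random() < self.epsilon:
            return random.randrange(len(ACTIONS))
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state, done):
        # One-step Q-learning target: r + γ max_a' Q(s', a'), or r on termination.
        target = reward if done else reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])

    def end_episode(self):
        self.epsilon *= self.eps_decay  # decay exploration between episodes
```

With two `QAgent` instances, each agent observes its own state, picks its own action, and both receive the same shared reward from the environment.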

Live stats  ·  Episode  ·  Explore ε (starts at 1.000)  ·  Step  ·  Last reward  ·  Avg reward (last 20 episodes)
Sliders  ·  α (learning rate) = 0.15  ·  γ (discount factor) = 0.90  ·  ε-decay = 0.9975  ·  slider changes take effect on Reset ↺
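A quick sanity check on the exploration schedule: if ε starts at 1.0 and is multiplied by 0.9975 after each episode (the per-episode decay point is an assumption), how long until exploration falls below a given threshold?

```python
import math

def episodes_until(eps0, decay, threshold):
    # Solve eps0 * decay**n < threshold for the smallest integer n.
    return math.ceil(math.log(threshold / eps0) / math.log(decay))

print(episodes_until(1.0, 0.9975, 0.05))  # → 1197
```

So with these defaults, the agents remain mostly greedy only after roughly 1,200 episodes, which suits a long-running visual demo.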
Panels  ·  Environment: food (randomised on Reset), Agent A, Agent B  ·  Q-value heatmap: max_a Q(s,a), shaded low→high (□ food location, ○ agent position)  ·  Learning curve: episode reward  ·  Speed control
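The heatmap values can be derived from a Q-table directly by taking, for each grid cell, the maximum action value. A sketch, assuming the table is a dict keyed by (row, col) cells (one such map per agent):

```python
def heatmap(q_table, rows, cols, n_actions):
    # Build a rows x cols grid of max_a Q(s, a); unvisited cells default to 0.
    grid = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            grid[r][c] = max(q_table.get((r, c), [0.0] * n_actions))
    return grid
```

Because food is fixed within a run, these per-cell maxima converge toward a gradient pointing at the pellet locations, which is why the map reads as spatially meaningful.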
Reward structure  ·  +10 when either agent steps on an active food pellet (the reward is shared by both agents)  ·  −0.05 per step (movement cost)  ·  An episode ends when all 4 pellets are collected or after 120 steps.
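The reward rules above can be sketched as a single environment step. Grid details, whether the −0.05 cost is charged once per joint step or per agent, and the helper names are assumptions:

```python
STEP_COST = -0.05     # movement cost (assumed charged once per joint step)
PELLET_REWARD = 10.0  # shared by both agents
MAX_STEPS = 120
NUM_PELLETS = 4

def env_step(agent_positions, food, steps):
    """Compute the shared reward after both agents have moved.

    agent_positions: list of (row, col) for Agent A and Agent B.
    food: set of (row, col) pellets still active (starts with NUM_PELLETS).
    Returns (shared_reward, done).
    """
    reward = STEP_COST
    for pos in agent_positions:
        if pos in food:
            food.discard(pos)        # a pellet is consumed at most once
            reward += PELLET_REWARD  # both agents receive the same reward
    done = not food or steps + 1 >= MAX_STEPS
    return reward, done
```

The shared signal means each agent is rewarded even when only its partner reaches a pellet, which is what drives the cooperative division of labour.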
Parameters  ·  The α and γ sliders take effect on the next Reset; adjusting them mid-run does not disrupt ongoing learning.