Example 5: MAPPO Training#

This example demonstrates production-ready Multi-Agent PPO (MAPPO) training on a cooperative multi-agent microgrid environment using RLlib.

What You’ll Learn#

  • Setting up RLlib for multi-agent training

  • MAPPO vs IPPO for cooperative tasks

  • Shared rewards for cooperation

  • Experiment tracking and checkpointing

Architecture#

RLlib (PPO Algorithm)
└── MultiAgentMicrogrids (PowerGrid)
        ├── GridAgent MG1
        ├── GridAgent MG2
        └── GridAgent MG3
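
The environment follows RLlib's multi-agent convention: observations, rewards, and termination flags are dicts keyed by agent ID. A minimal sketch of interacting with it directly, assuming an empty env_config (the real keys are environment-specific) and the gymnasium-style reset API that Ray 2.9 expects:

from powergrid.envs.multi_agent_microgrids import MultiAgentMicrogrids

# env_config keys are environment-specific; empty here for illustration
env = MultiAgentMicrogrids({})

# Multi-agent envs return per-agent dicts keyed by agent ID
obs, infos = env.reset()
print(list(obs.keys()))  # expected: ["MG1", "MG2", "MG3"]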

Quick Start#

# Install dependencies
pip install "ray[rllib]==2.9.0"

# Quick test run
cd case_studies/power
python examples/05_mappo_training.py --test

# Full training
python examples/05_mappo_training.py --iterations 100 --num-workers 4

Code#

from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from powergrid.envs.multi_agent_microgrids import MultiAgentMicrogrids

def env_creator(env_config):
    return MultiAgentMicrogrids(env_config)

# Register the env under the name referenced in .environment() below
tune.register_env("multi_agent_microgrids", env_creator)

# Per-agent spaces for the shared policy (all agents are homogeneous)
env_config = {}
sample_env = env_creator(env_config)
obs_space = sample_env.observation_space
act_space = sample_env.action_space

# Configure PPO with one policy shared by all agents (MAPPO-style)
config = (
    PPOConfig()
    .environment(env="multi_agent_microgrids", env_config=env_config)
    .training(lr=5e-5, train_batch_size=4000)
    .multi_agent(
        policies={"shared_policy": (None, obs_space, act_space, {})},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared_policy",
    )
)

algo = config.build()
for i in range(100):
    result = algo.train()
    print(f"Iteration {i}: reward={result['episode_reward_mean']:.2f}")

MAPPO vs IPPO#

Aspect      | MAPPO (Shared Policy)      | IPPO (Independent Policies)
------------|----------------------------|----------------------------
Policy      | Single shared network      | Separate network per agent
Learning    | Faster (shared params)     | Slower (more params)
Cooperation | Better (implicit sharing)  | Harder to coordinate
Best for    | Homogeneous agents         | Heterogeneous agents

# MAPPO (default)
python examples/05_mappo_training.py --iterations 100

# IPPO
python examples/05_mappo_training.py --iterations 100 --independent-policies
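
In RLlib terms, the --independent-policies flag plausibly swaps the single shared policy for one policy per agent, along the lines of the following sketch (policy names are illustrative; obs_space and act_space are reused from the Code section above):

# One policy per agent (IPPO-style); names are illustrative
agent_ids = ["MG1", "MG2", "MG3"]
policies = {f"policy_{aid}": (None, obs_space, act_space, {}) for aid in agent_ids}

config = config.multi_agent(
    policies=policies,
    policy_mapping_fn=lambda agent_id, *args, **kwargs: f"policy_{agent_id}",
)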

Shared Rewards#

Shared rewards give every agent the same team-level signal, so each agent is rewarded for reducing total system cost rather than only its own:

# Without shared rewards
rewards = {"MG1": -cost_mg1, "MG2": -cost_mg2, "MG3": -cost_mg3}

# With shared rewards (cooperation)
total_cost = cost_mg1 + cost_mg2 + cost_mg3
shared_reward = -total_cost / 3
rewards = {"MG1": shared_reward, "MG2": shared_reward, "MG3": shared_reward}

# Enable shared rewards
python examples/05_mappo_training.py --share-reward

# Independent rewards
python examples/05_mappo_training.py --no-share-reward
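
If an environment does not expose such a flag, the same averaging can be applied in a thin wrapper; a sketch (the class name and gymnasium-style multi-agent step signature are assumptions):

class SharedRewardWrapper:
    """Illustrative wrapper that replaces per-agent rewards with the team mean."""

    def __init__(self, env):
        self.env = env

    def reset(self, *args, **kwargs):
        return self.env.reset(*args, **kwargs)

    def step(self, actions):
        obs, rewards, terminateds, truncateds, infos = self.env.step(actions)
        # Every agent receives the mean of all agents' rewards
        mean_reward = sum(rewards.values()) / len(rewards)
        shared = {agent_id: mean_reward for agent_id in rewards}
        return obs, shared, terminateds, truncateds, infos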

Command-Line Options#

python examples/05_mappo_training.py \
    --iterations 200 \
    --lr 5e-5 \
    --hidden-dim 256 \
    --num-workers 8 \
    --share-reward \
    --checkpoint-freq 10 \
    --wandb \
    --wandb-project my-exp

--iterations       Training iterations
--lr               Learning rate
--hidden-dim       Network hidden size
--num-workers      Parallel rollout workers
--share-reward     Use cooperative (team-averaged) rewards
--checkpoint-freq  Checkpoint save frequency (iterations)
--wandb            Enable W&B logging
--wandb-project    W&B project name

Experiment Tracking#

Enable Weights & Biases logging:

pip install wandb
wandb login

python examples/05_mappo_training.py --wandb --wandb-project powergrid-coop
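
At its core, per-iteration logging only needs wandb.init and wandb.log; a minimal sketch of what the script plausibly does (run name and metric keys are illustrative):

import wandb

wandb.init(project="powergrid-coop", name="mappo_shared_mg3")
for i in range(100):
    result = algo.train()
    # Log the per-iteration mean episode reward
    wandb.log({"iteration": i, "episode_reward_mean": result["episode_reward_mean"]})
wandb.finish()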

Checkpointing#

# Save every 10 iterations
python examples/05_mappo_training.py --checkpoint-freq 10 --checkpoint-dir ./checkpoints

# Resume from checkpoint
python examples/05_mappo_training.py --resume ./checkpoints/mappo_shared_mg3_*/checkpoint_000050
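
These flags map onto RLlib's checkpoint API; a sketch using Algorithm.save and Algorithm.from_checkpoint (directory layout is illustrative; use the path your run actually produces):

from ray.rllib.algorithms.algorithm import Algorithm

# Save a checkpoint every 10 iterations
for i in range(100):
    result = algo.train()
    if (i + 1) % 10 == 0:
        checkpoint = algo.save("./checkpoints")
        print("Saved:", checkpoint)

# Restore later from a saved checkpoint directory
restored = Algorithm.from_checkpoint("./checkpoints/checkpoint_000050")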

Expected Output#

================================================================
Cooperative Multi-Agent Microgrid Training with RLlib
================================================================
Experiment:        mappo_shared_mg3_20240115_143022
Policy type:       MAPPO (Shared Policy)
Shared reward:     True (encourages cooperation)
Iterations:        100
================================================================

 Iter |     Reward |       Cost | Episodes |     Steps |     Time
----------------------------------------------------------------------
    1 |    -450.25 |     450.25 |       12 |     12000 |    15.2s
    2 |    -380.10 |     380.10 |       14 |     26000 |    28.5s
  ...
  100 |    -120.50 |     120.50 |       18 |   1800000 |  2450.0s

✓ Training complete!
  Best reward achieved: -115.30

Next Steps#