Example 5: MAPPO Training
This example trains Multi-Agent PPO (MAPPO) agents on cooperative multi-agent microgrids with RLlib, including experiment tracking and checkpointing.
What You’ll Learn
Setting up RLlib for multi-agent training
MAPPO vs IPPO for cooperative tasks
Shared rewards for cooperation
Experiment tracking and checkpointing
Architecture
RLlib (PPO Algorithm)
└── MultiAgentMicrogrids (PowerGrid)
    ├── GridAgent MG1
    ├── GridAgent MG2
    └── GridAgent MG3
Quick Start
# Install dependencies
pip install "ray[rllib]==2.9.0"
# Quick test run
cd case_studies/power
python examples/05_mappo_training.py --test
# Full training
python examples/05_mappo_training.py --iterations 100 --num-workers 4
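Code
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env

from powergrid.envs.multi_agent_microgrids import MultiAgentMicrogrids


def env_creator(env_config):
    """Build the multi-agent microgrid environment from an RLlib env_config dict."""
    return MultiAgentMicrogrids(env_config)


# Register the creator under the name passed to .environment() below
register_env("multi_agent_microgrids", env_creator)

# env_config, obs_space, and act_space must be defined before this point
# (they are not shown in this excerpt)
config = (
    PPOConfig()
    .environment(env="multi_agent_microgrids", env_config=env_config)
    .training(lr=5e-5, train_batch_size=4000)
    .multi_agent(
        # A single policy shared by every agent
        policies={"shared_policy": (None, obs_space, act_space, {})},
        # RLlib may pass extra positional and keyword arguments; accept both
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared_policy",
    )
)

algo = config.build()
for i in range(100):
    result = algo.train()
    print(f"Iteration {i}: reward={result['episode_reward_mean']:.2f}")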
MAPPO vs IPPO
| Aspect | MAPPO (Shared Policy) | IPPO (Independent Policies) |
|---|---|---|
| Policy | Single shared network | Separate network per agent |
| Learning | Faster (shared params) | Slower (more params) |
| Cooperation | Better (implicit sharing) | Harder to coordinate |
| Best for | Homogeneous agents | Heterogeneous agents |
# MAPPO (default)
python examples/05_mappo_training.py --iterations 100
# IPPO
python examples/05_mappo_training.py --iterations 100 --independent-policies
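In RLlib terms, the difference between the two modes comes down to what goes into `policies` and `policy_mapping_fn`. The sketch below builds both for either mode in plain Python. The helper name `build_policy_setup`, the agent IDs `MG1` to `MG3`, and the `policy_<agent>` naming are illustrative assumptions and are not taken from the example script; the `None` spaces are placeholders for the real observation and action spaces.

```python
from typing import Callable, Dict, List, Tuple

# Policy spec in the (policy_class, obs_space, act_space, config) tuple form
# used in the Code section above. Spaces are left as None placeholders here.
PolicySpecTuple = Tuple[None, object, object, dict]


def build_policy_setup(
    agent_ids: List[str],
    independent: bool,
    obs_space: object = None,
    act_space: object = None,
) -> Tuple[Dict[str, PolicySpecTuple], Callable[..., str]]:
    """Return (policies, policy_mapping_fn) for shared or independent policies."""
    if independent:
        # Independent policies: one entry per agent, each agent maps to its own.
        policies = {
            f"policy_{aid}": (None, obs_space, act_space, {}) for aid in agent_ids
        }

        def mapping_fn(agent_id, *args, **kwargs):
            return f"policy_{agent_id}"
    else:
        # Shared policy: a single entry that every agent maps to.
        policies = {"shared_policy": (None, obs_space, act_space, {})}

        def mapping_fn(agent_id, *args, **kwargs):
            return "shared_policy"

    return policies, mapping_fn


agents = ["MG1", "MG2", "MG3"]  # hypothetical agent IDs
shared_policies, shared_map = build_policy_setup(agents, independent=False)
indep_policies, indep_map = build_policy_setup(agents, independent=True)
print(sorted(shared_policies), shared_map("MG2"))
print(sorted(indep_policies), indep_map("MG2"))
```

The two return values would be passed to `.multi_agent(policies=..., policy_mapping_fn=...)` in place of the hard-coded shared policy.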
Command-Line Options
# --iterations       Training iterations
# --lr               Learning rate
# --hidden-dim       Network hidden size
# --num-workers      Parallel workers
# --share-reward     Cooperative rewards
# --checkpoint-freq  Save frequency
# --wandb            Enable W&B logging
# --wandb-project    W&B project name
python examples/05_mappo_training.py \
    --iterations 200 \
    --lr 5e-5 \
    --hidden-dim 256 \
    --num-workers 8 \
    --share-reward \
    --checkpoint-freq 10 \
    --wandb \
    --wandb-project my-exp
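How `--share-reward` is implemented is not shown here. One common way to make rewards cooperative is to replace every agent's reward with the team average, so that each agent is credited for the group outcome. The following is a minimal sketch of that idea; the function name and the choice of the mean (rather than, say, the sum) are assumptions.

```python
from typing import Dict


def share_rewards(rewards: Dict[str, float]) -> Dict[str, float]:
    """Give every agent the mean of all agents' rewards."""
    if not rewards:
        return {}
    team_reward = sum(rewards.values()) / len(rewards)
    return {agent_id: team_reward for agent_id in rewards}


# Hypothetical per-agent rewards for a single environment step
individual = {"MG1": -10.0, "MG2": -20.0, "MG3": -30.0}
shared = share_rewards(individual)
print(shared)
```

With a shared signal like this, an agent that lowers another microgrid's cost is rewarded as much as one that lowers its own.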
Experiment Tracking
Enable Weights & Biases logging:
pip install wandb
wandb login
python examples/05_mappo_training.py --wandb --wandb-project powergrid-coop
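The result returned by RLlib's `train()` is a nested dictionary, while loggers such as W&B are usually given flat name/value pairs. As a rough sketch of a pre-processing step one might place before a `wandb.log(...)` call, the helper below flattens nested numeric entries into slash-separated keys. The helper name and the sample result dict are illustrative and are not taken from the example script.

```python
from numbers import Number
from typing import Any, Dict


def flatten_metrics(result: Dict[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested dicts into {'a/b': value}, keeping only numeric leaves."""
    flat: Dict[str, float] = {}
    for key, value in result.items():
        name = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, name))
        elif isinstance(value, Number) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


# Hypothetical, trimmed-down result dict (not the full RLlib structure)
sample_result = {
    "episode_reward_mean": -380.1,
    "training_iteration": 2,
    "info": {"learner": {"shared_policy": {"policy_loss": 0.12}}},
    "done": False,
    "experiment": "demo",
}
metrics = flatten_metrics(sample_result)
print(metrics)
# With wandb installed and initialised, these could be sent via wandb.log(metrics).
```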
Checkpointing
# Save every 10 iterations
python examples/05_mappo_training.py --checkpoint-freq 10 --checkpoint-dir ./checkpoints
# Resume from checkpoint
python examples/05_mappo_training.py --resume ./checkpoints/mappo_shared_mg3_*/checkpoint_000050
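The periodic-save behaviour can be sketched as a small loop around any object exposing `train()` and `save(checkpoint_dir)`, which RLlib's `Algorithm` does. The function name, the directory layout, and the `FakeAlgo` stand-in (used so the snippet runs without Ray) are assumptions for illustration; the actual script's logic may differ.

```python
import os
from typing import List


def train_with_checkpoints(algo, iterations: int, checkpoint_freq: int,
                           checkpoint_dir: str) -> List[str]:
    """Train for `iterations` steps, saving every `checkpoint_freq` iterations."""
    saved: List[str] = []
    for i in range(1, iterations + 1):
        algo.train()
        if checkpoint_freq > 0 and i % checkpoint_freq == 0:
            # Directory naming mirrors the checkpoint_000050 pattern shown above.
            path = os.path.join(checkpoint_dir, f"checkpoint_{i:06d}")
            algo.save(path)
            saved.append(path)
    return saved


class FakeAlgo:
    """Stand-in with the same two methods, so the sketch runs without Ray."""

    def __init__(self):
        self.iterations_trained = 0

    def train(self):
        self.iterations_trained += 1
        return {"training_iteration": self.iterations_trained}

    def save(self, checkpoint_dir):
        # A real Algorithm would write its state to checkpoint_dir here.
        return checkpoint_dir


paths = train_with_checkpoints(FakeAlgo(), iterations=30, checkpoint_freq=10,
                               checkpoint_dir="checkpoints")
print(paths)
```

With Ray installed, a directory saved this way could be restored with `Algorithm.from_checkpoint(path)` before continuing training.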
Expected Output
================================================================
Cooperative Multi-Agent Microgrid Training with RLlib
================================================================
Experiment: mappo_shared_mg3_20240115_143022
Policy type: MAPPO (Shared Policy)
Shared reward: True (encourages cooperation)
Iterations: 100
================================================================
Iter | Reward | Cost | Episodes | Steps | Time
----------------------------------------------------------------------
1 | -450.25 | 450.25 | 12 | 12000 | 15.2s
2 | -380.10 | 380.10 | 14 | 26000 | 28.5s
...
100 | -120.50 | 120.50 | 18 | 1800000 | 2450.0s
✓ Training complete!
Best reward achieved: -115.30
Next Steps
Try Distributed Mode for realistic deployment
Explore different Coordination Protocols