15. Reinforcement Learning with ANDES

Reinforcement learning (RL) offers a data-driven approach to designing power system controllers. Rather than manually tuning control laws, an RL agent learns a control policy by interacting with the simulation environment and optimizing a user-defined objective. Applications include automatic generation control (AGC), voltage regulation, and emergency frequency response.

ANDES provides the AndesEnv class, a Gymnasium-compatible environment that wraps ANDES time-domain simulation. Each step() advances the simulation by a fixed time interval, and the agent observes system variables and applies control actions through the standard Gymnasium API. The environment uses TDS.reinit() for fast episode resets (~1 ms), making it practical for training algorithms that require thousands of episodes.

This tutorial demonstrates how to construct an RL environment, define observations, actions, and reward functions, and run training episodes. Familiarity with the setpoint API from Dynamic Control and Setpoint Changes and the frequency response concepts from Frequency Response and Load Shedding is assumed.

Note

Prerequisites:

Complete Dynamic Control and Setpoint Changes for the setpoint API and multi-stage simulation.
Complete Frequency Response and Load Shedding for frequency dynamics after disturbances.
Install the RL extra: pip install andes[rl]

# Reduce logging verbosity for PDF builds
import os
if os.environ.get('SPHINX_BUILD_PDF'):
    import andes
    _orig_config_logger = andes.config_logger
    def _quiet_logger(stream_level=20, *args, **kwargs):
        stream_level = max(stream_level, 30)
        return _orig_config_logger(stream_level, *args, **kwargs)
    andes.config_logger = _quiet_logger

15.1. Setup

import numpy as np

from andes.rl import AndesEnv
from andes.utils.paths import get_case

15.2. Constructing the Environment

An AndesEnv instance requires four arguments: a case file, an observation specification, an action specification, and a reward function. Together, these define the RL problem: what the agent sees, what it can do, and what it optimizes.

The following example constructs an environment for secondary frequency control (AGC) on the IEEE 14-bus system. The agent observes generator rotor speeds and controls the auxiliary power input of all synchronous generators.

CASE = get_case('ieee14/ieee14_esst3a.xlsx')


def agc_reward(obs, action, env):
    """Penalize frequency deviation and control effort."""
    freq_dev = obs - 1.0
    return -float(np.sum(freq_dev ** 2) + 0.01 * np.sum(action ** 2))


env = AndesEnv(
    case=CASE,
    obs=[('GENROU', 'omega')],
    acts=[('SynGen', 'paux')],
    reward_fn=agc_reward,
    dt=0.1,
    tf=5.0,
)

The constructor performs several initialization steps internally: it loads the case, runs power flow, initializes TDS, and resolves the observation and action addresses. If any specification is invalid (e.g., a nonexistent model or setpoint), a ValueError is raised immediately.

The resulting spaces can be inspected:

print(f"Observation space: {env.observation_space}")
print(f"Action space:      {env.action_space}")
print(f"Number of generators observed: {env.observation_space.shape[0]}")
print(f"Number of generators controlled: {env.action_space.shape[0]}")

Observation space: Box(-inf, inf, (5,), float32)
Action space:      Box(-inf, inf, (5,), float32)
Number of generators observed: 5
Number of generators controlled: 5

15.2.1. Observation Specification

The obs parameter is a list of tuples specifying which variables the agent observes. Each tuple takes the form (model, variable) or (model, variable, idx_list). The variable type (state or algebraic) is resolved automatically from the model definition.

Spec	Observation
`('GENROU', 'omega')`	All GENROU rotor speeds
`('Bus', 'v')`	All bus voltage magnitudes
`('Bus', 'v', [1, 5, 14])`	Voltages at buses 1, 5, and 14 only

Multiple specs are concatenated into a single flat observation vector. For example, observing both rotor speeds and selected bus voltages:

env_multi = AndesEnv(
    case=CASE,
    obs=[
        ('GENROU', 'omega'),          # all generator speeds
        ('Bus', 'v', [1, 2, 3]),      # voltages at 3 buses
    ],
    acts=[('SynGen', 'paux')],
    reward_fn=agc_reward,
    dt=0.1, tf=5.0,
)

n_gen = env_multi.ss.GENROU.n
print(f"Obs dimension: {env_multi.observation_space.shape[0]} "
      f"({n_gen} speeds + 3 voltages)")

Obs dimension: 8 (5 speeds + 3 voltages)
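
The layout of the flat observation vector can be pictured with plain NumPy. The sketch below only illustrates the concatenation order and is not the AndesEnv implementation; the helper `flatten_obs` and the sample values are hypothetical.

```python
import numpy as np


def flatten_obs(blocks):
    """Concatenate per-spec value arrays into one flat float32 vector.

    `blocks` is a list of 1-D sequences, one per observation spec, in spec
    order. Hypothetical helper, used here only to illustrate the layout.
    """
    return np.concatenate([np.asarray(b, dtype=np.float32) for b in blocks])


# Illustrative values: 5 rotor speeds followed by 3 bus voltages
omega = [1.0, 1.0, 1.0, 1.0, 1.0]
v_sel = [1.03, 1.03, 1.01]

obs_vec = flatten_obs([omega, v_sel])
n_omega = len(omega)

# Slices recover each block from the flat vector
omega_part = obs_vec[:n_omega]
v_part = obs_vec[n_omega:]
print(obs_vec.shape, omega_part.shape, v_part.shape)
```

With more than one spec, a reward function has to slice the vector in the same way. A function such as agc_reward, which computes obs - 1.0 over the whole vector, would otherwise treat the voltage entries as rotor speeds.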

15.2.2. Action Specification

The acts parameter is a list of (target, setpoint) tuples. The target can be either a group name or a model name, and the behavior differs accordingly:

Group target (e.g., 'SynGen'): Uses the group-level setpoint API (set_paux, set_pref, set_vref). This resolves the controller chain automatically, as described in Dynamic Control and Setpoint Changes. One action dimension is created per device in the group.
Model target (e.g., 'TGOV1'): Writes directly to the specified attribute on the model. One action dimension is created per device in the model.

The group-level approach is recommended for most applications because it does not require knowledge of which specific governor or exciter model is in use.

Tip

To bound the action space, pass action_low and action_high to the constructor. Many RL algorithms behave better with bounded, symmetric action ranges; the disturbance example later in this tutorial uses [-0.1, 0.1].
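
If a policy emits actions in [-1, 1], they can be mapped onto the bounds passed as action_low and action_high before calling step(). The following is a minimal sketch of that mapping. Whether AndesEnv performs any rescaling itself is not stated here, so `scale_action` is a hypothetical helper applied on the agent side.

```python
import numpy as np


def scale_action(a_norm, low, high):
    """Map a normalized action in [-1, 1] to the physical range [low, high]."""
    # Clip first so out-of-range policy outputs cannot exceed the bounds
    a_norm = np.clip(np.asarray(a_norm, dtype=np.float64), -1.0, 1.0)
    return low + 0.5 * (a_norm + 1.0) * (high - low)


# Hypothetical bounds in pu, matching the example range in the tip
low, high = -0.1, 0.1
scaled = scale_action([-1.0, 0.0, 0.5, 2.0], low, high)
print(scaled)
```

The last input (2.0) lies outside [-1, 1] and is clipped to the upper bound before scaling.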

15.2.3. Reward Function

The reward function defines the optimization objective. It receives three arguments: the observation vector, the action vector, and the environment instance. It must return a scalar float.

The agc_reward function used above penalizes two quantities: the squared frequency deviation from nominal (1.0 pu) and the squared control effort (weighted by 0.01). This structure encourages the agent to restore frequency with minimal actuator usage.
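
Written out, the reward computed by agc_reward at step t is shown below. The symbols are notation introduced here for illustration: \omega_{i,t} is the i-th observed rotor speed in pu, a_{j,t} is the j-th action component, and n and m are the observation and action dimensions.

```latex
r_t = -\left( \sum_{i=1}^{n} \left(\omega_{i,t} - 1\right)^2
      + 0.01 \sum_{j=1}^{m} a_{j,t}^2 \right)
```

For example, a single generator at \omega = 0.999 pu with all actions at zero gives r_t = -10^{-6}, so rewards for small frequency excursions are numerically tiny.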

Other reward designs are possible depending on the application:

# Voltage regulation: penalize voltage deviation from setpoint
def voltage_reward(obs, action, env):
    v_dev = obs - 1.0    # assuming obs contains bus voltages
    return -float(np.sum(v_dev ** 2))

# Frequency nadir: penalize the worst-case deviation
def nadir_reward(obs, action, env):
    return -float(np.max(np.abs(obs - 1.0)))

The environment instance is passed as the third argument, providing access to the underlying ANDES system via env.ss. This allows the reward function to query any system quantity, not only the observed variables.
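
As an illustration of using env.ss inside a reward, the sketch below expresses the speed deviation in Hz by reading the nominal frequency from the system configuration. The function name `hz_reward` is hypothetical. A `SimpleNamespace` stand-in replaces the real environment so that the snippet runs without ANDES; it mimics only the `ss.config.freq` attribute path.

```python
from types import SimpleNamespace

import numpy as np


def hz_reward(obs, action, env):
    """Penalize squared speed deviation expressed in Hz instead of pu."""
    f_nom = env.ss.config.freq                   # nominal frequency in Hz
    dev_hz = (np.asarray(obs) - 1.0) * f_nom     # pu deviation -> Hz
    return -float(np.sum(dev_hz ** 2))


# Stand-in for an AndesEnv instance (assumption: only ss.config.freq is used)
fake_env = SimpleNamespace(
    ss=SimpleNamespace(config=SimpleNamespace(freq=60.0))
)

r = hz_reward(np.array([1.0, 0.999]), np.zeros(2), fake_env)
print(r)
```

Scaling to Hz multiplies the squared deviation by the square of the nominal frequency, which moves very small pu-based rewards into a more readable numeric range.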

15.3. Running an Episode

The environment follows the standard Gymnasium API. An episode begins with reset(), which calls TDS.reinit() internally, and proceeds through a sequence of step() calls. Each step advances the simulation by dt seconds and returns the standard five-tuple (obs, reward, terminated, truncated, info).

obs, info = env.reset(seed=42)
print(f"Initial obs (omega): {obs}")
print(f"Initial time: {info['t']}")

Initial obs (omega): [1. 1. 1. 1. 1.]
Initial time: 0.0

# Run a complete episode with zero actions (no control)
obs, info = env.reset(seed=42)
total_reward = 0.0
obs_history = [obs.copy()]

while True:
    action = np.zeros(env.action_space.shape)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    obs_history.append(obs.copy())
    if terminated or truncated:
        break

print(f"Episode finished: steps={info['step']}, t={info['t']:.1f}s")
print(f"Total reward: {total_reward:.4f}")
print(f"Final omega: {obs}")

Episode finished: steps=50, t=5.0s
Total reward: -0.0000
Final omega: [1.0000004  0.9999995  0.99999994 0.99999976 0.9999997 ]

An episode can end in one of two ways:

truncated=True: the simulation reached tf (normal episode end).
terminated=True: the simulation failed to converge (the agent destabilized the system).

Since no disturbance was applied and no control actions were taken, the system remained at steady state throughout the episode. The reward is approximately zero because the frequency stayed at nominal.
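
The episode loop and the two ending conditions can be wrapped in a small helper. The sketch below is one possible way to do this; `rollout` and `StubEnv` are hypothetical names, and the stub only mimics the Gymnasium reset()/step() signatures so that the code runs without ANDES.

```python
import numpy as np


class StubEnv:
    """Tiny stand-in that mimics the Gymnasium reset/step signatures."""

    def __init__(self, n_steps=3, fail_at=None):
        self.n_steps = n_steps      # steps until the time limit
        self.fail_at = fail_at      # step at which a "divergence" is faked
        self.k = 0

    def reset(self, seed=None):
        self.k = 0
        return np.ones(2, dtype=np.float32), {"t": 0.0, "step": 0}

    def step(self, action):
        self.k += 1
        terminated = self.fail_at is not None and self.k >= self.fail_at
        truncated = (not terminated) and self.k >= self.n_steps
        obs = np.ones(2, dtype=np.float32)
        return obs, -0.1, terminated, truncated, {"step": self.k}


def rollout(env, policy, seed=None):
    """Run one episode; return (total_reward, n_steps, how_it_ended)."""
    obs, info = env.reset(seed=seed)
    total, n = 0.0, 0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        n += 1
        if terminated:
            return total, n, "terminated"
        if truncated:
            return total, n, "truncated"


def zero_policy(obs):
    """Baseline policy: no control action."""
    return np.zeros(2)


normal = rollout(StubEnv(n_steps=3), zero_policy)
failed = rollout(StubEnv(n_steps=3, fail_at=2), zero_policy)
print(normal[1:], failed[1:])
```

Recording how each episode ended makes it easy to count diverged episodes separately from those that ran to the time limit.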

15.4. Adding Disturbances

A meaningful RL problem requires a disturbance for the agent to respond to. The disturbance_fn parameter accepts a callable that is invoked after each reset(). This function can run the simulation forward through a pre-defined event before handing control to the agent.

The IEEE 14-bus test case contains built-in Toggle events (a line trip at t=1.0s and reconnection at t=1.1s). The following disturbance function advances the simulation past these events, so the agent begins controlling at t=1.2s with the system in a post-disturbance transient state.

def fast_forward_past_event(env):
    """Advance past the line trip/reclose before agent takes control."""
    env.ss.TDS.config.tf = 1.2
    env.ss.TDS.run(no_summary=True)


env_dist = AndesEnv(
    case=CASE,
    obs=[('GENROU', 'omega')],
    acts=[('SynGen', 'paux')],
    reward_fn=agc_reward,
    dt=0.1,
    tf=5.0,
    disturbance_fn=fast_forward_past_event,
    action_low=-0.1,
    action_high=0.1,
)

# Run baseline episode (no control) to see the uncontrolled response
obs, info = env_dist.reset(seed=42)
baseline_obs = [obs.copy()]
baseline_reward = 0.0

while True:
    action = np.zeros(env_dist.action_space.shape)
    obs, reward, terminated, truncated, info = env_dist.step(action)
    baseline_obs.append(obs.copy())
    baseline_reward += reward
    if terminated or truncated:
        break

baseline_obs = np.array(baseline_obs)
print(f"Baseline total reward: {baseline_reward:.4f}")
print(f"Max frequency deviation: {np.max(np.abs(baseline_obs - 1.0)):.6f} pu")

Baseline total reward: -0.0000
Max frequency deviation: 0.000827 pu

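The nonzero frequency deviation (0.000827 pu) confirms that the disturbance creates a control problem for the agent to solve. The total reward is also nonzero, but it prints as -0.0000 because each squared deviation is below 1e-6, so the sum is too small to show at four decimal places.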
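
Rewards of this magnitude are easy to lose in fixed-point formatting. The sketch below uses an illustrative value, not output from the simulation, to show how scientific notation keeps a small reward visible.

```python
# Illustrative value only: a small negative total reward
small_reward = -1.2e-5

fixed = f"{small_reward:.4f}"     # four decimals hide the value
sci = f"{small_reward:.3e}"       # scientific notation keeps it visible
print(fixed, sci)
```

Printing rewards with a format such as `.3e` makes baseline and controlled runs comparable even when both round to zero at four decimals.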

15.5. Comparing Control Policies

Before training an RL agent, it is instructive to compare the uncontrolled baseline against a simple hand-crafted policy. The following proportional controller applies a corrective action proportional to the mean frequency deviation, analogous to a basic droop response.

def proportional_policy(obs, gain=5.0):
    """Simple proportional controller: action = -gain * mean(omega - 1)."""
    mean_dev = np.mean(obs - 1.0)
    action = np.full(env_dist.action_space.shape, -gain * mean_dev)
    return np.clip(action, -0.1, 0.1)


obs, info = env_dist.reset(seed=42)
controlled_obs = [obs.copy()]
controlled_reward = 0.0

while True:
    action = proportional_policy(obs)
    obs, reward, terminated, truncated, info = env_dist.step(action)
    controlled_obs.append(obs.copy())
    controlled_reward += reward
    if terminated or truncated:
        break

controlled_obs = np.array(controlled_obs)
print(f"Controlled total reward: {controlled_reward:.4f}")
print(f"Max frequency deviation: {np.max(np.abs(controlled_obs - 1.0)):.6f} pu")
print(f"Improvement over baseline: {controlled_reward - baseline_reward:.4f}")

Controlled total reward: -0.0000
Max frequency deviation: 0.000855 pu
Improvement over baseline: -0.0000

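In the run shown above, the untuned proportional policy does not improve on the baseline: the maximum frequency deviation is slightly larger (0.000855 pu versus 0.000827 pu), and the reward difference rounds to -0.0000 at four decimal places. A well-tuned controller should yield a higher (less negative) total reward than the uncontrolled baseline, so the gain and the reward weights are worth revisiting before training. An RL agent trained with a suitable algorithm can potentially discover a more effective policy by jointly optimizing the control actions across all generators and time steps.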
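
Because total rewards at this scale are hard to compare by eye, it can help to summarize each recorded trajectory with a few deviation metrics. The sketch below is one possible summary. `deviation_metrics` is a hypothetical helper, and the two short trajectories are made-up values, not simulation output.

```python
import numpy as np


def deviation_metrics(omega_traj, dt):
    """Summarize a (steps, generators) array of rotor speeds in pu."""
    dev = np.asarray(omega_traj, dtype=np.float64) - 1.0
    return {
        "max_abs_dev": float(np.max(np.abs(dev))),
        # Rectangle-rule approximation of the integral of squared deviation
        "ise": float(np.sum(dev ** 2) * dt),
        "final_mean_dev": float(np.mean(dev[-1])),
    }


# Hypothetical two-generator trajectories, three samples each
baseline = np.array([[1.0, 1.0], [0.998, 0.999], [0.999, 0.999]])
controlled = np.array([[1.0, 1.0], [0.999, 0.999], [1.0, 1.0]])

m_base = deviation_metrics(baseline, dt=0.1)
m_ctrl = deviation_metrics(controlled, dt=0.1)
print(m_base["max_abs_dev"], m_ctrl["max_abs_dev"])
```

The same function can be applied to the baseline_obs and controlled_obs arrays collected in the episodes above, with dt set to the environment step.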

15.6. Deterministic Reset

The reset() method uses TDS.reinit() internally, which restores the system to its exact post-initialization state. This makes episodes fully deterministic: identical seeds and actions produce identical trajectories. This property is important for reproducible training and for debugging reward functions.

# Verify determinism: two identical episodes should produce the same trajectory
def run_episode(env, seed=42):
    obs_list = []
    obs, _ = env.reset(seed=seed)
    obs_list.append(obs.copy())
    for _ in range(5):
        action = np.zeros(env.action_space.shape)
        obs, _, term, trunc, _ = env.step(action)
        obs_list.append(obs.copy())
        if term or trunc:
            break
    return np.array(obs_list)

traj1 = run_episode(env_dist)
traj2 = run_episode(env_dist)
print(f"Trajectories identical: {np.allclose(traj1, traj2, atol=1e-10)}")

Trajectories identical: True
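
When checking for exact reproducibility, note that np.allclose applies a default relative tolerance (rtol=1e-5) in addition to atol. The sketch below uses two made-up arrays to show how the three comparison styles differ.

```python
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([1.0, 1.000001])   # differs by 1e-6 in the second entry

loose = np.allclose(a, b, atol=1e-10)              # default rtol=1e-5 still applies
strict = np.allclose(a, b, rtol=0.0, atol=1e-10)   # absolute tolerance only
exact = np.array_equal(a, b)                       # element-wise equality
print(loose, strict, exact)
```

For values near 1.0 pu, the default relative tolerance accepts differences of about 1e-5, so setting rtol=0.0 or using np.array_equal is the stricter test of determinism.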

15.7. Training with Stable-Baselines3

Because AndesEnv is Gymnasium-compatible, it can be used directly with standard RL libraries such as Stable-Baselines3. The following example trains a PPO agent for a small number of steps to demonstrate the integration. Production training typically requires 50,000 to 500,000 timesteps depending on the problem complexity.

Note

Stable-Baselines3 is not included in the andes[rl] extra. Install it separately with pip install stable-baselines3.

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env_dist,
    learning_rate=3e-4,
    n_steps=256,
    batch_size=64,
    n_epochs=10,
    verbose=1,
)

model.learn(total_timesteps=20_000)

# Evaluate the trained policy
obs, info = env_dist.reset(seed=99)
total_reward = 0.0
while True:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env_dist.step(action)
    total_reward += reward
    if terminated or truncated:
        break

print(f"Trained agent reward: {total_reward:.4f}")

A complete training script with logging callbacks is provided in examples/rl_agc_training.py.

15.8. Accessing the System Object

The underlying ANDES System object is accessible via env.ss. This provides full access to all models, parameters, and DAE arrays, which is useful for constructing reward functions that depend on quantities beyond the observation vector.

ss = env.ss

print(f"System frequency: {ss.config.freq} Hz")
print(f"Number of buses: {ss.Bus.n}")
print(f"GENROU devices: {list(ss.GENROU.idx.v)}")
print(f"Current sim time: {float(ss.dae.t):.2f} s")

System frequency: 60 Hz
Number of buses: 14
GENROU devices: ['GENROU_1', 'GENROU_2', 'GENROU_3', 'GENROU_4', 'GENROU_5']
Current sim time: 5.00 s

15.9. TDS Configuration

TDS solver options can be customized via the tds_config parameter at construction time. This is useful for selecting the integration method or adjusting convergence tolerances. Invalid keys raise a ValueError immediately.

env_custom = AndesEnv(
    case=CASE,
    obs=[('GENROU', 'omega')],
    acts=[('SynGen', 'paux')],
    reward_fn=agc_reward,
    dt=0.1, tf=5.0,
    tds_config={'tstep': 1/60},   # smaller integration step
)
print(f"Integration step: {env_custom.ss.TDS.config.tstep:.4f} s")

Integration step: 0.0167 s
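
The relationship between dt, tf, and tstep can be checked with simple arithmetic. The sketch below assumes that the solver uses a fixed integration step and that each step() call covers exactly dt seconds. The 1.2 s fast-forward time is taken from the disturbance example above.

```python
dt = 0.1          # agent step, as passed to AndesEnv
tf = 5.0          # episode end time
tstep = 1 / 60    # integration step from tds_config

# step() calls per episode when the agent starts at t = 0
agent_steps = round(tf / dt)

# step() calls left when a disturbance_fn fast-forwards to t = 1.2 s (assumed)
agent_steps_after_ff = round((tf - 1.2) / dt)

# integration steps per step() call (assumes a fixed-step solver)
substeps = round(dt / tstep)
print(agent_steps, agent_steps_after_ff, substeps)
```

The first value matches the steps=50 reported for the zero-action episode. A smaller tstep increases the solver work per step() call without changing the number of agent decisions.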

15.10. Summary

The following table summarizes the key AndesEnv parameters:

Parameter	Description
`case`	Path to case file (use `get_case()` for stock cases)
`obs`	List of `(model, var)` or `(model, var, idx_list)` tuples
`acts`	List of `(target, setpoint)` tuples (group or model target)
`reward_fn`	`reward_fn(obs, action, env) -> float`
`dt`	Simulation time per `step()` call [seconds]
`tf`	Episode end time [seconds]
`disturbance_fn`	`disturbance_fn(env) -> None`, called after each `reset()`
`action_low`, `action_high`	Bounds for the action space
`obs_low`, `obs_high`	Bounds for the observation space
`tds_config`	Dict of TDS config overrides

15.11. Next Steps

Dynamic Control and Setpoint Changes for the setpoint API used by group-level actions
Frequency Response and Load Shedding for frequency response fundamentals
examples/rl_agc_training.py for a complete PPO training script with Stable-Baselines3