Reinforcement Learning for Physics Control Problems: Complete Guide

Physics is full of sequential decision problems where an agent must learn to act in a dynamic environment. How do you control the plasma in a fusion reactor in real time? How do you design a laser pulse that drives a quantum system to a target state? How do you optimise a particle accelerator while it’s running? Reinforcement learning offers a proven answer to all three. This cluster builds from the foundations of RL all the way to state-of-the-art applications in experimental physics — with full working code.

⚡ Plasma Fusion Control ⚛ Quantum Gate Design 🧪 Lab Optimisation 🐍 DQN, PPO & SAC


Cluster 7 showed how generative models can discover physics — recovering laws from data and designing new materials. Cluster 8 asks a different question: once you know the physics, how do you control it? This is the domain of reinforcement learning — sequential decision-making under uncertainty — and it maps onto some of the hardest experimental challenges in modern physics.

📋 What You Will Learn
  1. RL as a Physics Problem: MDP Framework
  2. Q-Learning and DQN from Scratch
  3. Policy Gradient Methods (REINFORCE)
  4. PPO: The Workhorse Algorithm
  5. SAC for Continuous Physics Control
  6. Plasma Fusion: DeepMind’s Tokamak
  7. Quantum Gate & Laser Pulse Optimisation
  8. Reward Shaping with Physical Constraints

Section 1 — Reinforcement Learning as a Physics Problem

Every physicist who has ever tried to tune a laser, stabilise a feedback loop, or optimise an experimental parameter while an experiment is running has solved a reinforcement learning problem by hand. RL formalises this intuition and scales it massively.

The Markov Decision Process (MDP) is the mathematical framework underlying RL, and it maps directly onto controlled physical systems. An agent (your controller) interacts with an environment (the physical system) at each time step: it observes a state s (sensor readings, system parameters), takes an action a (a control signal, a voltage, a magnetic field strength), receives a reward r (a scalar measuring how well the system is performing), and transitions to a new state s’. The goal is to learn a policy π(a|s) — a mapping from states to actions — that maximises the total discounted return.

🧪 MDP in Physics: Mapping the Framework
RL Concept    | Physics Equivalent           | Fusion Reactor Example
State s       | System observables           | Plasma temperature, density, shape measurements
Action a      | Control signal               | Magnetic coil currents (19 control variables)
Reward r      | Performance metric           | Plasma stability + confinement time − disruptions
Policy π      | Controller                   | Neural network: plasma state → coil currents
Environment   | Physical system + dynamics   | Tokamak plasma (simulated via MHD equations)

The key property that makes RL so powerful for physics: you don’t need to know the equations of motion of the system. You just need to be able to simulate it (or run it in the real world) and measure a reward. The RL agent learns the optimal control policy by trial and error — exactly how a skilled experimentalist learns to tune an apparatus, but at machine speed and scale.
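The agent-environment loop can be sketched in a few lines with no RL library at all. The one-dimensional "thermostat" system below is a hypothetical illustration (not one of this cluster's benchmark environments): observe a state, choose an action, receive a reward, transition. That is the entire MDP cycle.

```python
import numpy as np

def step(temperature, action, target=20.0, dt=0.1):
    """One MDP transition: state is a temperature, action a heating rate in [-1, 1]."""
    # Toy first-order dynamics: heating input plus relaxation toward 15 degree ambient
    temperature = temperature + dt * (5.0 * action - 0.5 * (temperature - 15.0))
    reward = -(temperature - target) ** 2          # scalar performance metric
    return temperature, reward

def rollout(policy, T=200, s0=10.0):
    """Run one episode and return the total (undiscounted) return."""
    s, total = s0, 0.0
    for _ in range(T):
        a = policy(s)
        s, r = step(s, a)
        total += r
    return total

rng = np.random.default_rng(0)
random_policy       = lambda s: rng.uniform(-1.0, 1.0)            # ignores the state
proportional_policy = lambda s: np.clip(0.5 * (20.0 - s), -1, 1)  # feedback on state

print(f"random: {rollout(random_policy):8.1f}   "
      f"proportional: {rollout(proportional_policy):8.1f}")
```

Even this crude proportional feedback vastly outperforms state-blind random actions; the RL agent's job is to discover such feedback laws automatically, from the reward signal alone.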


Section 2 — Q-Learning and Deep Q-Networks: Value-Based RL

The foundational concept in RL is the Q-function (or action-value function): Q(s,a) is the expected total discounted return when you take action a in state s and then follow the optimal policy thereafter. If you know Q*, you can act optimally by always choosing argmax_a Q*(s,a).

Bellman optimality equation: Q*(s,a) = E[ r + γ max_{a'} Q*(s', a') | s, a ]

This is the Bellman equation — a recursive definition of the optimal Q-function. It says: the value of taking action a in state s equals the immediate reward r plus the discounted value of the best action in the next state s’. This is essentially a self-consistency condition, and we can solve it iteratively. Deep Q-Networks (DQN) replace the tabular Q-function with a neural network, making Q-learning applicable to continuous, high-dimensional state spaces.
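The iterative solution is easiest to see in the tabular setting before going deep. The sketch below (a made-up five-state chain, not an environment used elsewhere in this cluster) repeatedly applies the Bellman backup until Q stops changing; the greedy policy argmax_a Q(s,a) then heads straight for the goal.

```python
import numpy as np

n_states, gamma = 5, 0.9

def transition(s, a):
    """Deterministic chain: action 0 moves left, action 1 moves right.
    Reward 1 whenever the move lands on the goal state (state 4)."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

Q = np.zeros((n_states, 2))
for sweep in range(300):
    Q_new = np.zeros_like(Q)
    for s in range(n_states):
        for a in (0, 1):
            s2, r = transition(s, a)
            Q_new[s, a] = r + gamma * Q[s2].max()   # Bellman backup
    if np.abs(Q_new - Q).max() < 1e-6:              # self-consistency reached
        break
    Q = Q_new

print("greedy policy:", Q.argmax(axis=1))   # 1 = "move right" in every state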

DQN for a Physical Control Task (Inverted Pendulum)

Python — DQN from scratch: replay buffer, target network, epsilon-greedy, CartPole-v1
# pip install gymnasium torch
import gymnasium as gym
import torch, torch.nn as nn
import numpy as np
from collections import deque
import random

# ── Q-Network: state → Q(s, a) for all actions ─────────────────
class QNetwork(nn.Module):
    def __init__(self, n_obs=4, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden),   nn.ReLU(),
            nn.Linear(hidden, hidden),   nn.ReLU(),
            nn.Linear(hidden, n_actions)
        )
    def forward(self, x): return self.net(x)

# ── Replay buffer: breaks temporal correlations ─────────────────
class ReplayBuffer:
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)
    def push(self, *transition): self.buffer.append(transition)
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return [torch.tensor(np.array(x), dtype=torch.float32) for x in zip(*batch)]
    def __len__(self): return len(self.buffer)

# ── DQN training ────────────────────────────────────────────────
env        = gym.make('CartPole-v1')
q_net      = QNetwork(n_obs=4, n_actions=2)
target_net = QNetwork(n_obs=4, n_actions=2)
target_net.load_state_dict(q_net.state_dict())
optim      = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer     = ReplayBuffer(50000)

gamma, eps, eps_min, eps_decay = 0.99, 1.0, 0.01, 0.995
BATCH, TARGET_UPDATE = 64, 10     # minibatch size; target sync every 10 episodes

for episode in range(500):
    state, _  = env.reset()
    total_r   = 0
    while True:
        # Epsilon-greedy exploration
        if random.random() < eps:
            action = env.action_space.sample()   # explore
        else:
            with torch.no_grad():
                action = q_net(torch.tensor(state).unsqueeze(0)).argmax().item()

        next_state, reward, done, trunc, _ = env.step(action)
        buffer.push(state, action, reward, next_state, float(done))
        state = next_state; total_r += reward
        if done or trunc: break

        if len(buffer) >= BATCH:
            s, a, r, s2, d = buffer.sample(BATCH)
            a = a.long().unsqueeze(1)
            # Current Q values
            q_vals = q_net(s).gather(1, a).squeeze(1)
            # Target Q values: r + gamma * max Q(s', .)  [no grad]
            with torch.no_grad():
                q_next  = target_net(s2).max(1).values
                q_target= r + gamma * (1 - d) * q_next
            loss = nn.MSELoss()(q_vals, q_target)
            optim.zero_grad(); loss.backward(); optim.step()

    eps = max(eps_min, eps * eps_decay)   # decay exploration
    if episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(q_net.state_dict())   # sync target
    if episode % 50 == 0:
        print(f"Episode {episode}: return={total_r:.0f}  eps={eps:.3f}")
🧠 Concept: Why the Target Network and Replay Buffer Matter
DQN has two stabilisation tricks that are essential to understand. The replay buffer stores past (s, a, r, s’) transitions and samples random minibatches for training — this breaks the temporal correlations that would otherwise cause divergence. The target network is a periodically-updated frozen copy of the Q-network used to compute Q-targets — without it, you’re chasing a moving target with the same network you’re updating, causing instability. Both tricks were introduced in the original DQN paper (Mnih et al. 2015) and remain standard practice.

Section 3 — Policy Gradient Methods: Directly Optimising the Policy

Q-learning is indirect: learn Q*, then derive the policy. Policy gradient methods optimise the policy πθ(a|s) directly by computing the gradient of the expected return. This is more natural for continuous action spaces — which is most of experimental physics.

Policy gradient theorem: ∇_θ J(θ) = E[ ∇_θ log π_θ(a_t|s_t) · G_t ]

The intuition is elegant: if an action led to high return G_t, increase the probability of that action in that state. If it led to low return, decrease it. The ∇_θ log π_θ(a|s) term is the score function: it tells us how to adjust θ to change the action probabilities. This is the REINFORCE algorithm, and while simple to derive, it has high variance. All modern policy gradient algorithms (PPO, SAC, A3C) build on this foundation with variance reduction techniques.

REINFORCE for a Physics Reward Function

Python — REINFORCE: stochastic policy, discounted returns, Pendulum continuous control
# REINFORCE: simplest policy gradient — direct policy optimisation
import torch, torch.nn as nn
import torch.distributions as dist
import gymnasium as gym
import numpy as np

# ── Stochastic policy network ───────────────────────────────────
# For discrete actions: outputs logits → categorical distribution
# For continuous: output mean + log_std → Gaussian distribution
class PolicyNetwork(nn.Module):
    def __init__(self, n_obs=3, n_actions=1, hidden=128, continuous=True):
        super().__init__()
        self.continuous = continuous
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        if continuous:
            self.mean_head    = nn.Linear(hidden, n_actions)
            self.log_std_head = nn.Parameter(torch.zeros(n_actions))
        else:
            self.logit_head = nn.Linear(hidden, n_actions)

    def get_dist(self, state):
        h = self.net(state)
        if self.continuous:
            mean    = self.mean_head(h)
            std     = self.log_std_head.exp().expand_as(mean)
            return dist.Normal(mean, std)
        else:
            return dist.Categorical(logits=self.logit_head(h))

# ── REINFORCE training loop ──────────────────────────────────────
env    = gym.make('Pendulum-v1')         # continuous control physics task
policy = PolicyNetwork(n_obs=3, n_actions=1, continuous=True)
optim  = torch.optim.Adam(policy.parameters(), lr=3e-4)
gamma  = 0.99

for episode in range(2000):
    state, _     = env.reset()
    log_probs, rewards = [], []

    while True:
        state_t    = torch.tensor(state, dtype=torch.float32)
        action_dist= policy.get_dist(state_t.unsqueeze(0))
        action     = action_dist.sample()                # stochastic action
        log_prob   = action_dist.log_prob(action).sum()
        
        next_state, reward, done, trunc, _ = env.step(action.numpy().flatten())
        log_probs.append(log_prob)
        rewards.append(reward)
        state = next_state
        if done or trunc: break

    # Compute discounted returns G_t
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalise returns — critical variance reduction
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient loss: -E[log pi * G]
    loss = -torch.stack(log_probs).dot(returns)
    optim.zero_grad(); loss.backward(); optim.step()

    if (episode+1) % 100 == 0:
        print(f"Episode {episode+1}: return={sum(rewards):.1f}")

Section 4 — Proximal Policy Optimization (PPO): The Workhorse

If you need to pick one RL algorithm to learn for physics applications, it’s PPO. Developed at OpenAI (Schulman et al. 2017), PPO has become the dominant algorithm for continuous control because it is stable, sample-efficient, and parallelisable. DeepMind used it for plasma control in the tokamak. It is used for laser pulse optimisation, accelerator tuning, and quantum state preparation. Understanding PPO deeply means understanding modern RL.

PPO clipped objective: L^CLIP(θ) = E[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]

The key innovation is the clipped surrogate objective. The ratio r_t(θ) = πθ(a|s) / πθ_old(a|s) measures how much the policy has changed. The clip operation prevents the policy from updating too aggressively in any single step — avoiding the catastrophic policy collapse that plagues naive policy gradient methods. The advantage estimate Â_t measures how much better action a was than average in state s.
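The effect of the clip is easy to verify numerically. The toy function below (a pedagogical sketch, separate from the full implementation later in this section) evaluates the per-sample surrogate for a few ratio values: once the ratio leaves [1−ε, 1+ε] in the direction the advantage favours, the objective plateaus and the incentive to push the policy further disappears.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO per-sample objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

for r in [0.9, 1.1, 1.2, 1.5, 2.0]:
    print(f"ratio={r:.1f}  A=+1 -> {clipped_surrogate(r,  1.0):+.2f}   "
          f"A=-1 -> {clipped_surrogate(r, -1.0):+.2f}")
```

Note the asymmetry: with a positive advantage the objective saturates at (1+ε)Â, but with a negative advantage large ratios are penalised without bound. The min makes the surrogate a pessimistic lower bound on the true objective, which is exactly what prevents over-aggressive updates.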

GAE advantage: Â_t = Σ_k (γλ)^k δ_{t+k},  where δ_t = r_t + γ V(s_{t+1}) − V(s_t)

Full PPO Implementation for Continuous Physics Control

Python — Full PPO: Actor-Critic, GAE advantage, clipped surrogate, gradient clipping
# PPO implementation: Actor-Critic with GAE advantage estimation
# Works for continuous action spaces — essential for physics control
import torch, torch.nn as nn
import torch.distributions as dist
import numpy as np
import gymnasium as gym

# ── Actor-Critic network ────────────────────────────────────────
class ActorCritic(nn.Module):
    def __init__(self, n_obs, n_actions):
        super().__init__()
        # Shared feature extractor
        self.shared = nn.Sequential(
            nn.Linear(n_obs, 256), nn.Tanh(),
            nn.Linear(256, 256),   nn.Tanh(),
        )
        # Actor: outputs Gaussian policy mean and log_std
        self.actor_mean    = nn.Linear(256, n_actions)
        self.actor_log_std = nn.Parameter(torch.zeros(n_actions))
        # Critic: estimates state value V(s)
        self.critic = nn.Linear(256, 1)

    def forward(self, state):
        h     = self.shared(state)
        mean  = self.actor_mean(h)
        std   = self.actor_log_std.exp().expand_as(mean)
        value = self.critic(h).squeeze(-1)
        return dist.Normal(mean, std), value

# ── GAE advantage estimation ────────────────────────────────────
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages, last_adv = [], 0.0
    values_np = values.detach().numpy()
    for t in reversed(range(len(rewards))):
        # Bootstrap with V(s_{t+1}); 0 at rollout end (pass V(s_T) if truncated)
        next_val   = values_np[t+1] if t+1 < len(values_np) else 0.0
        delta      = rewards[t] + gamma * next_val * (1-dones[t]) - values_np[t]
        last_adv   = delta + gamma * lam * (1-dones[t]) * last_adv
        advantages.insert(0, last_adv)
    return torch.tensor(advantages, dtype=torch.float32)

# ── PPO update ──────────────────────────────────────────────────
def ppo_update(model, optim, states, actions, old_log_probs, advantages, returns,
              clip_eps=0.2, value_coef=0.5, entropy_coef=0.01, n_epochs=10):
    for _ in range(n_epochs):
        dist_new, values = model(states)
        log_probs_new    = dist_new.log_prob(actions).sum(dim=-1)
        entropy          = dist_new.entropy().sum(dim=-1).mean()
        # PPO clipped surrogate objective
        ratio    = torch.exp(log_probs_new - old_log_probs)
        surr1    = ratio * advantages
        surr2    = torch.clamp(ratio, 1-clip_eps, 1+clip_eps) * advantages
        actor_loss  = -torch.min(surr1, surr2).mean()
        # Value function loss
        critic_loss = nn.MSELoss()(values, returns)
        # Combined loss: actor + critic + entropy bonus (encourages exploration)
        loss = actor_loss + value_coef * critic_loss - entropy_coef * entropy
        optim.zero_grad(); loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)   # gradient clip
        optim.step()

# ── Training: collect rollouts, compute GAE, PPO update ──────────
env   = gym.make('HalfCheetah-v4')      # complex locomotion physics (pip install "gymnasium[mujoco]")
model = ActorCritic(n_obs=17, n_actions=6)
optim = torch.optim.Adam(model.parameters(), lr=3e-4, eps=1e-5)
ROLLOUT = 2048                              # steps before each update

# Training loop: collect ROLLOUT steps, then update for n_epochs
# Typical convergence: 1-5M environment steps for locomotion tasks
# For many physics control tasks: far fewer steps (simpler reward landscapes)

Section 5 — Soft Actor-Critic (SAC): Maximum Entropy RL

For physics applications with stochastic dynamics — turbulent flows, noisy quantum systems, thermal fluctuations — Soft Actor-Critic (SAC) is often the best algorithm. SAC maximises a modified objective that includes an entropy bonus: it learns policies that are both high-reward and as stochastic as possible, preventing premature convergence to narrow local optima. This is called the maximum entropy RL framework.

SAC is off-policy (it learns from a replay buffer like DQN) and actor-critic (it learns both a policy and a value function), giving it excellent sample efficiency. For physics problems where simulator evaluations are expensive, this is crucial.

Python — SAC on a custom physics environment: magnetic pendulum, stable-baselines3
# SAC: off-policy, max-entropy, continuous action spaces
# pip install stable-baselines3  (recommended for production physics RL)
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
import gymnasium as gym
import numpy as np

# ── Example: Magnetic pendulum stabilisation ────────────────────
# A physics-inspired custom environment
class MagneticPendulumEnv(gym.Env):
    """Inverted pendulum driven by a magnetic coil — physics-based reward."""
    metadata = {'render_modes': []}

    def __init__(self):
        super().__init__()
        # State: [cos(theta), sin(theta), theta_dot, coil_current]
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (4,))
        # Action: coil current change [-1, 1] in normalised units
        self.action_space      = gym.spaces.Box(-1.0, 1.0, (1,))
        self.dt = 0.02; self.g = 9.81; self.l = 1.0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)               # seeds self.np_random (gymnasium API)
        self.theta     = self.np_random.uniform(-np.pi, np.pi)
        self.theta_dot = self.np_random.uniform(-1.0, 1.0)
        self.current   = 0.0
        return self._get_obs(), {}

    def _get_obs(self):
        return np.array([np.cos(self.theta), np.sin(self.theta),
                             self.theta_dot, self.current], dtype=np.float32)

    def step(self, action):
        u              = float(action[0]) * 5.0          # scale to Amps
        self.current  += u * self.dt
        self.current   = np.clip(self.current, -10, 10)
        # Physics: pendulum equation with magnetic torque
        torque         = -self.g/self.l * np.sin(self.theta) + 0.5 * self.current
        self.theta_dot += torque * self.dt
        self.theta    += self.theta_dot * self.dt
        # Reward: penalise angle from vertical + energy cost
        angle_cost     = self.theta**2 + 0.1 * self.theta_dot**2
        control_cost   = 0.01 * u**2
        reward         = -angle_cost - control_cost
        done           = abs(self.theta_dot) > 20.0     # instability check
        return self._get_obs(), reward, done, False, {}

# ── Train SAC on the physics environment ────────────────────────
env   = MagneticPendulumEnv()
model = SAC(
    'MlpPolicy', env,
    learning_rate   = 3e-4,
    buffer_size     = 100000,     # replay buffer size
    learning_starts = 1000,       # steps before first update
    batch_size      = 256,
    tau             = 0.005,      # soft target update rate
    gamma           = 0.99,
    train_freq      = 1,
    ent_coef        = 'auto',     # automatic entropy tuning
    verbose         = 1
)
model.learn(total_timesteps=500000)
model.save('magnetic_pendulum_sac')

# Evaluate: stable-baselines3 makes evaluation clean
from stable_baselines3.common.evaluation import evaluate_policy
mean_r, std_r = evaluate_policy(model, env, n_eval_episodes=20)
print(f"Mean return: {mean_r:.2f} +/- {std_r:.2f}")

Section 6 — Real Application: Plasma Control in Fusion Reactors

In February 2022, DeepMind and the Swiss Plasma Center published a landmark paper in Nature: they had trained a reinforcement learning agent to control the plasma shape and position in a tokamak fusion reactor in real time. The agent simultaneously controlled 19 magnetic coils and maintained plasma configurations — including a novel “snowflake” shape — that had never been achieved before. This is arguably the most impactful single application of RL in experimental physics to date.

What makes plasma control so hard — and why RL works so well for it — is the combination of: (1) continuous high-dimensional action space (19 coil currents, each continuous), (2) fast dynamics (millisecond timescales, faster than human reaction), (3) complex constraints (plasma must not touch the vessel walls), and (4) incomplete observability (not all plasma quantities are directly measurable). RL handles all of these naturally.

🧪 The DeepMind Tokamak Pipeline
1. Simulator training: The RL agent was first trained entirely in a high-fidelity MHD (magnetohydrodynamics) plasma simulator, not the real tokamak. This allowed safe exploration of control strategies that might cause plasma disruptions on the real machine.
2. Reward shaping: The reward function combined target plasma shape error, plasma current targets, distance from vessel walls, and disruption penalties. Crucially, the reward was designed by plasma physicists with deep domain knowledge — not a generic performance metric.
3. Architecture: The policy was a two-layer MLP with 256 neurons per layer. Input: 92 state variables (coil currents, flux measurements, plasma estimates). Output: 19 coil current targets. The policy runs at 10 kHz on standard hardware.
4. Real-world transfer: The simulator-trained policy was deployed directly on the TCV tokamak with minimal fine-tuning. It successfully maintained plasma shapes and demonstrated configurations — including the first simultaneous control of two separate plasma columns — beyond what previous controllers could achieve.
✅ Read the Paper: Degrave et al. (2022), Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602:414. doi.org/10.1038/s41586-021-04301-9. The full paper includes the reward function design, simulator architecture, and comprehensive ablation studies. It is an outstanding example of how physics knowledge must be embedded into the RL problem formulation.

Section 7 — Quantum Gate Optimization and Laser Pulse Shaping

Two more frontier applications where RL has delivered state-of-the-art results:

Quantum Gate Optimization with RL

Implementing a quantum gate on real hardware requires a precisely shaped pulse sequence applied to the qubits. Traditional pulse optimisation uses GRAPE (Gradient Ascent Pulse Engineering), which is efficient but gets stuck in local optima and struggles with hardware constraints. RL formulates gate optimisation as: state = current quantum state, action = pulse parameter update, reward = gate fidelity − energy penalty.

Python — Qubit gate optimisation environment: shaped pulse RL with quantum dynamics
# Quantum gate optimisation as an RL problem
# Environment: qubit driven by a control Hamiltonian
# Goal: apply a target unitary (e.g. CNOT) using shaped pulses
import numpy as np

# ── Simplified qubit environment ──────────────────────────────
class QubitControlEnv:
    """Single qubit: state is on Bloch sphere, action is pulse amplitude."""
    def __init__(self, target_state=None, T=20, dt=0.05):
        self.T = T; self.dt = dt
        self.target = target_state if target_state is not None else np.array([0, 1], dtype=complex)

    def reset(self):
        self.state = np.array([1, 0], dtype=complex)   # start in |0>
        self.t     = 0
        return np.array([self.state.real, self.state.imag]).flatten(), {}   # obs, info

    def step(self, action):
        omega_x, omega_y = action[0] * 5.0, action[1] * 5.0  # control fields
        # Schrödinger evolution: psi -> exp(-i H dt) psi
        H = 0.5 * (omega_x * np.array([[0,1],[1,0]]) +
                   omega_y * np.array([[0,-1j],[1j,0]]))
        # Matrix exponential for exact evolution
        vals, vecs = np.linalg.eigh(H)
        U = vecs @ np.diag(np.exp(-1j * vals * self.dt)) @ vecs.conj().T
        self.state = U @ self.state
        self.t    += 1
        # Reward: fidelity with target state
        fidelity   = abs(self.target.conj() @ self.state)**2
        done       = self.t >= self.T
        # Dense reward: fidelity at every step + bonus at end
        reward     = fidelity * 0.01 + (fidelity * 10.0 if done else 0)
        obs        = np.array([self.state.real, self.state.imag]).flatten()
        return obs, reward, done, False, {}

# Train with PPO (stable-baselines3). Note: this class must first be wrapped
# as a gymnasium.Env with observation_space and action_space defined:
# from stable_baselines3 import PPO
# env   = QubitControlEnv(target_state=np.array([0,1], dtype=complex))
# model = PPO('MlpPolicy', env, verbose=1)
# model.learn(total_timesteps=500000)
# After training: inspect model.predict(obs) to extract the optimal pulse shape

# RL finds pulse sequences with fidelity >0.999 that:
# - Respect hardware constraints (max amplitude, bandwidth)
# - Are robust to calibration errors (with domain randomisation)
# - Outperform GRAPE on constrained optimisation problems

Section 8 — Reward Shaping with Physical Constraints

The hardest part of applying RL to physics is not the algorithm — it’s the reward function. A poorly designed reward leads to reward hacking: the agent finds unexpected ways to maximise the reward metric that violate physical intuition or experimental constraints. This section covers the principled approaches physicists use to design reward functions that are both learnable and physically meaningful.

The Four Rules of Physics Reward Design

1. Dense rewards beat sparse rewards: If your reward is only non-zero when the experiment succeeds (e.g., plasma confinement exceeds 1 second), the agent gets almost no learning signal. Always provide intermediate rewards proportional to progress toward the goal. In plasma control: reward fidelity at every timestep, not just at the end.
2. Penalise constraint violations explicitly: Physical systems have hard constraints — power limits, temperature ceilings, stability conditions. Add explicit penalty terms: reward = task_reward − λ × constraint_violation. The weight λ must be large enough that violating the constraint is never worth the reward benefit, but not so large that the agent learns to do nothing.
3. Reward components must be on the same scale: If your task reward is ±1 and your constraint penalty is ±1000, the agent will optimise exclusively for constraint satisfaction and ignore the task. Normalise all reward components to the same order of magnitude, or use careful weighting. This is the most common failure mode in physics RL.
4. Use domain randomisation for robustness: Real physical systems have parameter uncertainty — you never know the exact mass, viscosity, or coupling constant. During training, randomly sample these parameters from a range. A policy trained under randomisation is robust to real-world uncertainty, enabling direct sim-to-real transfer without fine-tuning.
Python — Physics-aware reward function with constraint penalties and domain randomisation
# Physics-aware reward function with constraint penalties
# Example: laser cavity optimisation
import numpy as np

class PhysicsReward:
    def __init__(self, target_power=100.0, max_temp=350.0, max_power_draw=500.0):
        self.target_power   = target_power
        self.max_temp       = max_temp
        self.max_power_draw = max_power_draw

    def compute(self, state, action):
        output_power, temperature, power_draw = state

        # ── Task reward: match target output power ──────────────
        power_error  = (output_power - self.target_power)**2 / self.target_power**2
        task_reward  = np.exp(-5 * power_error) - 1     # range: [-1, 0], 0=perfect

        # ── Constraint penalties (physical safety limits) ───────
        # Soft penalty: smoothly increases as constraint is approached
        temp_margin    = self.max_temp - temperature
        temp_penalty   = -np.exp(-temp_margin / 10.0)   # severe if temp > max

        power_margin   = self.max_power_draw - power_draw
        power_penalty  = -np.exp(-power_margin / 50.0)

        # ── Control cost: penalise large, rapid actuator changes ─
        control_cost   = -0.001 * np.sum(action**2)

        # ── Combine: all terms are O(1) in magnitude ─────────────
        reward = task_reward + 0.3 * temp_penalty + 0.3 * power_penalty + control_cost
        return float(reward), {
            'task': task_reward, 'temp_penalty': temp_penalty,
            'power_penalty': power_penalty, 'control': control_cost
        }

# ── Domain randomisation for sim-to-real transfer ─────────────
def randomise_physics(env, rng=None):
    """Randomise physical parameters within realistic uncertainty bounds."""
    rng = np.random.default_rng(rng)
    # Sample parameters from distributions around nominal values
    env.thermal_mass  *= rng.uniform(0.85, 1.15)   # ±15% uncertainty
    env.heat_transfer *= rng.uniform(0.80, 1.20)   # ±20% uncertainty
    env.gain          *= rng.uniform(0.90, 1.10)   # ±10% uncertainty
    return env

# At each episode reset: randomise_physics(env)
# This ensures the learned policy generalises to real hardware
# without needing fine-tuning on the physical system
⚠ Common Mistake: The single most common RL failure in physics is the agent finding a reward-hacking solution that maximises the metric but violates the physical intent. A plasma controller might learn to immediately terminate episodes (scoring zero penalty) rather than attempt to control the plasma. Always monitor individual reward components during training, not just the total reward — and validate the learned policy qualitatively, not just quantitatively.
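Monitoring components is straightforward when the reward function returns them separately, as the PhysicsReward class above does. Below is a minimal sketch of such a monitor (a hypothetical helper, not from any library):

```python
from collections import defaultdict

class RewardMonitor:
    """Accumulate per-component reward statistics across an episode or run."""
    def __init__(self):
        self.sums  = defaultdict(float)
        self.count = 0

    def log(self, components):
        for name, value in components.items():
            self.sums[name] += value
        self.count += 1

    def report(self):
        """Mean of each component; a collapsing or dominating term stands out."""
        return {name: s / self.count for name, s in self.sums.items()}

# Inside the training loop, log the components dict returned alongside the reward:
monitor = RewardMonitor()
for t in range(100):
    monitor.log({'task': -0.2, 'temp_penalty': -0.01, 'control': -0.001})
print(monitor.report())
```

If the task component stops improving while a penalty term goes to zero, the agent is likely satisfying constraints by doing nothing: exactly the failure mode the warning above describes.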

External References & Further Reading

  • Degrave et al. (2022). Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602:414. doi.org/10.1038/s41586-021-04301-9. The landmark plasma fusion RL paper.
  • Mnih et al. (2015). Human-level control through deep reinforcement learning (DQN). Nature 518:529. arXiv:1312.5602
  • Schulman et al. (2017). Proximal Policy Optimization Algorithms (PPO). arXiv:1707.06347
  • Haarnoja et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning (SAC). ICML. arXiv:1801.01290
  • Nguyen-Tang & Arenz (2023). Quantum optimal control with reinforcement learning. arXiv:2306.01854
  • Sutton & Barto. Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free at incompleteideas.net. The definitive RL textbook.
  • Gymnasium documentation: gymnasium.farama.org. Standard RL environment library with physics simulators (MuJoCo, Brax).
📋 Key Takeaways — Cluster 8
  • RL maps naturally onto physics control. Every physical experiment with sensors, actuators, and a performance objective is an MDP. The physicist’s intuition about control directly translates to reward function design.
  • Choose the right algorithm for your problem. DQN for discrete actions. REINFORCE for simple continuous tasks. PPO for complex continuous control — it’s the default. SAC for sample efficiency with stochastic dynamics.
  • Two stabilisation tricks are non-negotiable in DQN. Replay buffer (breaks temporal correlations) and target network (stabilises the learning target). Never skip these.
  • PPO’s clipped objective is the key innovation. It prevents catastrophic policy updates. Always use gradient clipping (max_norm=0.5) alongside it. Monitor the PPO clip fraction during training — if it’s always 0 or always 1, your learning rate is wrong.
  • DeepMind’s tokamak is the gold standard. Trained in simulation, deployed on real hardware, controlling 19 coils at millisecond timescales. The reward function took months of domain expert work to design correctly.
  • Reward shaping is where physics knowledge matters most. Dense rewards, explicit constraint penalties, reward component normalisation, and domain randomisation. Get these right and almost any RL algorithm will work. Get them wrong and no algorithm will save you.
