Physics is full of sequential decision problems where an agent must learn to act in a dynamic environment. How do you control the plasma in a fusion reactor in real time? How do you design a laser pulse that drives a quantum system to a target state? How do you optimise a particle accelerator while it’s running? Reinforcement learning has a proven answer. This cluster builds from the foundations of RL all the way to state-of-the-art applications in experimental physics — with full working code.
AI for Physics Students › Cluster 8: RL for Physics Control
Cluster 7 showed how generative models can discover physics — recovering laws from data and designing new materials. Cluster 8 asks a different question: once you know the physics, how do you control it? This is the domain of reinforcement learning — sequential decision-making under uncertainty — and it maps onto some of the hardest experimental challenges in modern physics.
- RL as a Physics Problem: MDP Framework
- Q-Learning and DQN from Scratch
- Policy Gradient Methods (REINFORCE)
- PPO: The Workhorse Algorithm
- SAC for Continuous Physics Control
- Plasma Fusion: DeepMind’s Tokamak
- Quantum Gate & Laser Pulse Optimisation
- Reward Shaping with Physical Constraints
Section 1 — Reinforcement Learning as a Physics Problem
Every physicist who has ever tried to tune a laser, stabilise a feedback loop, or optimise an experimental parameter while an experiment is running has solved a reinforcement learning problem by hand. RL formalises this intuition and scales it massively.
The Markov Decision Process (MDP) is the mathematical framework underlying RL, and it maps directly onto controlled physical systems. An agent (your controller) interacts with an environment (the physical system) at each time step: it observes a state s (sensor readings, system parameters), takes an action a (a control signal, a voltage, a magnetic field strength), receives a reward r (a scalar measuring how well the system is performing), and transitions to a new state s’. The goal is to learn a policy π(a|s) — a mapping from states to actions — that maximises the total discounted return.
| RL Concept | Physics Equivalent | Fusion Reactor Example |
|---|---|---|
| State s | System observables | Plasma temperature, density, shape measurements |
| Action a | Control signal | Magnetic coil currents (19 control variables) |
| Reward r | Performance metric | Plasma stability + confinement time − disruptions |
| Policy π | Controller | Neural network: plasma state → coil currents |
| Environment | Physical system + dynamics | Tokamak plasma (simulated via MHD equations) |
The key property that makes RL so powerful for physics: you don’t need to know the equations of motion of the system. You just need to be able to simulate it (or run it in the real world) and measure a reward. The RL agent learns the optimal control policy by trial and error — exactly how a skilled experimentalist learns to tune an apparatus, but at machine speed and scale.
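The observe → act → reward → transition cycle is worth seeing stripped to its essentials. Below is a minimal, self-contained sketch — a toy one-variable system with invented drift and noise, and a hand-coded proportional controller standing in for the learned policy π(a|s); none of the names or numbers come from a real experiment:

```python
import random

# Toy MDP: keep a system variable x near a setpoint by nudging it.
# state s = x, action a in {-1, 0, +1} (nudge down / hold / nudge up),
# reward r = -(x - setpoint)^2. All values here are illustrative.
SETPOINT, GAMMA = 0.0, 0.99

def env_step(x, a, rng):
    """Transition: drift + control + noise; reward penalises setpoint error."""
    x_next = x + 0.1 * a + rng.gauss(0, 0.05)   # dynamics
    reward = -(x_next - SETPOINT) ** 2           # performance metric
    return x_next, reward

def policy(x):
    """A hand-coded proportional controller standing in for pi(a|s)."""
    return -1 if x > 0.05 else (1 if x < -0.05 else 0)

rng = random.Random(0)
x, discounted_return = 1.0, 0.0
for t in range(200):                             # one episode
    a = policy(x)                                # act
    x, r = env_step(x, a, rng)                   # observe s', r
    discounted_return += (GAMMA ** t) * r        # G = sum_t gamma^t r_t

print(f"final state: {x:.3f}  discounted return: {discounted_return:.2f}")
```

An RL agent simply replaces the hand-coded `policy` with a trainable function and adjusts it to maximise the discounted return.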
Section 2 — Q-Learning and Deep Q-Networks: Value-Based RL
The foundational concept in RL is the Q-function (or action-value function): Q(s,a) is the expected total discounted return when you take action a in state s and then follow the optimal policy thereafter. If you know Q*, you can act optimally by always choosing argmax_a Q*(s,a).
The optimal Q-function satisfies the Bellman equation: Q*(s,a) = E[ r + γ max_a′ Q*(s′,a′) ]. This is a recursive definition of the optimal Q-function: the value of taking action a in state s equals the immediate reward r plus the discounted value of the best action in the next state s′. It is essentially a self-consistency condition, and we can solve it iteratively. Deep Q-Networks (DQN) replace the tabular Q-function with a neural network, making Q-learning applicable to continuous, high-dimensional state spaces.
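Before the deep version, it helps to watch the Bellman backup converge in its simplest, tabular form. The toy chain below (five states, an invented reward of 1 for reaching the rightmost state) is a minimal sketch of value iteration: sweeping the Bellman update until Q stops changing yields Q*, and acting greedily with respect to Q* is optimal.

```python
import numpy as np

# Toy deterministic chain: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 gives reward 1 and ends the episode. Illustrative only.
N_S, N_A, GAMMA = 5, 2, 0.9

def transition(s, a):
    s2 = min(s + 1, N_S - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == N_S - 1 else 0.0
    done = s2 == N_S - 1
    return s2, reward, done

Q = np.zeros((N_S, N_A))
for sweep in range(100):                      # iterate the Bellman backup
    for s in range(N_S - 1):                  # state 4 is terminal
        for a in range(N_A):
            s2, r, done = transition(s, a)
            # Bellman equation: Q(s,a) = r + gamma * max_a' Q(s',a')
            Q[s, a] = r + (0.0 if done else GAMMA * Q[s2].max())

print(np.round(Q, 3))
# Optimal policy: always move right; Q(s, right) = gamma^(3-s) for s <= 3
```

DQN performs exactly this backup, but with sampled transitions and a neural network in place of the table.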
DQN for a Physical Control Task (Inverted Pendulum)
# pip install gymnasium torch
import gymnasium as gym
import torch, torch.nn as nn
import numpy as np
from collections import deque
import random

# ── Q-Network: state → Q(s, a) for all actions ─────────────────
class QNetwork(nn.Module):
    def __init__(self, n_obs=4, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions)
        )

    def forward(self, x):
        return self.net(x)

# ── Replay buffer: breaks temporal correlations ─────────────────
class ReplayBuffer:
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return [torch.tensor(np.array(x), dtype=torch.float32)
                for x in zip(*batch)]

    def __len__(self):
        return len(self.buffer)

# ── DQN training ────────────────────────────────────────────────
env = gym.make('CartPole-v1')
q_net = QNetwork(n_obs=4, n_actions=2)
target_net = QNetwork(n_obs=4, n_actions=2)
target_net.load_state_dict(q_net.state_dict())
optim = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = ReplayBuffer(50000)
gamma, eps, eps_min, eps_decay = 0.99, 1.0, 0.01, 0.995
BATCH, TARGET_UPDATE = 64, 50

for episode in range(500):
    state, _ = env.reset()
    total_r = 0
    while True:
        # Epsilon-greedy exploration
        if random.random() < eps:
            action = env.action_space.sample()   # explore
        else:
            with torch.no_grad():
                action = q_net(torch.tensor(state).unsqueeze(0)).argmax().item()
        next_state, reward, done, trunc, _ = env.step(action)
        buffer.push(state, action, reward, next_state, float(done))
        state = next_state; total_r += reward
        if done or trunc:
            break
        if len(buffer) >= BATCH:
            s, a, r, s2, d = buffer.sample(BATCH)
            a = a.long().unsqueeze(1)
            # Current Q values
            q_vals = q_net(s).gather(1, a).squeeze(1)
            # Target Q values: r + gamma * max Q(s', .)
            with torch.no_grad():
                q_next = target_net(s2).max(1).values
                q_target = r + gamma * (1 - d) * q_next
            loss = nn.MSELoss()(q_vals, q_target)
            optim.zero_grad(); loss.backward(); optim.step()
    eps = max(eps_min, eps * eps_decay)   # decay exploration
    if episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(q_net.state_dict())   # sync target
        print(f"Episode {episode}: return={total_r:.0f} eps={eps:.3f}")
Section 3 — Policy Gradient Methods: Directly Optimising the Policy
Q-learning is indirect: learn Q*, then derive the policy. Policy gradient methods optimise the policy πθ(a|s) directly by computing the gradient of the expected return. This is more natural for continuous action spaces — which is most of experimental physics.
The intuition is elegant: if an action led to a high return G_t, increase the probability of that action in that state; if it led to a low return, decrease it. Formally, the policy gradient is ∇_θ J(θ) = E[ Σ_t G_t ∇_θ log π_θ(a_t|s_t) ]. The ∇_θ log π_θ(a|s) term is the score function — it tells us how to adjust θ to change the action probabilities. This is the REINFORCE algorithm, and while simple to derive, it has high variance. Modern algorithms (PPO, SAC, A3C) build on this foundation with variance-reduction techniques.
REINFORCE for a Physics Reward Function
# REINFORCE: simplest policy gradient — direct policy optimisation
import torch, torch.nn as nn
import torch.distributions as dist
import gymnasium as gym
import numpy as np

# ── Stochastic policy network ───────────────────────────────────
# For discrete actions: outputs logits → categorical distribution
# For continuous: output mean + log_std → Gaussian distribution
class PolicyNetwork(nn.Module):
    def __init__(self, n_obs=3, n_actions=1, hidden=128, continuous=True):
        super().__init__()
        self.continuous = continuous
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        if continuous:
            self.mean_head = nn.Linear(hidden, n_actions)
            self.log_std_head = nn.Parameter(torch.zeros(n_actions))
        else:
            self.logit_head = nn.Linear(hidden, n_actions)

    def get_dist(self, state):
        h = self.net(state)
        if self.continuous:
            mean = self.mean_head(h)
            std = self.log_std_head.exp().expand_as(mean)
            return dist.Normal(mean, std)
        else:
            return dist.Categorical(logits=self.logit_head(h))

# ── REINFORCE training loop ─────────────────────────────────────
env = gym.make('Pendulum-v1')   # continuous control physics task
policy = PolicyNetwork(n_obs=3, n_actions=1, continuous=True)
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)
gamma = 0.99

for episode in range(2000):
    state, _ = env.reset()
    log_probs, rewards = [], []
    while True:
        state_t = torch.tensor(state, dtype=torch.float32)
        action_dist = policy.get_dist(state_t.unsqueeze(0))
        action = action_dist.sample()              # stochastic action
        log_prob = action_dist.log_prob(action).sum()
        next_state, reward, done, trunc, _ = env.step(action.numpy().flatten())
        log_probs.append(log_prob)
        rewards.append(reward)
        state = next_state
        if done or trunc:
            break
    # Compute discounted returns G_t
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    # Normalise returns — critical variance reduction
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Policy gradient loss: -E[log pi * G]
    loss = -torch.stack(log_probs).dot(returns)
    optim.zero_grad(); loss.backward(); optim.step()
    if (episode + 1) % 100 == 0:
        print(f"Episode {episode+1}: return={sum(rewards):.1f}")
Section 4 — Proximal Policy Optimization (PPO): The Workhorse
If you need to pick one RL algorithm to learn for physics applications, it’s PPO. Developed at OpenAI (Schulman et al. 2017), PPO has become the dominant algorithm for continuous control because it is stable, simple to tune, and parallelisable (being on-policy, it trades some sample efficiency for that stability). It is used for laser pulse optimisation, accelerator tuning, and quantum state preparation; DeepMind’s tokamak controller was trained with MPO, a closely related actor-critic method. Understanding PPO deeply means understanding modern RL.
The key innovation is the clipped surrogate objective. The ratio r_t(θ) = πθ(a|s) / πθ_old(a|s) measures how much the policy has changed. The clip operation prevents the policy from updating too aggressively in any single step — avoiding the catastrophic policy collapse that plagues naive policy gradient methods. The advantage estimate Â_t measures how much better action a was than average in state s.
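The clip is easiest to understand in isolation, before the full implementation. A tiny numeric sketch (the ratio and advantage values are invented) with a positive advantage: the objective tracks the ratio inside [1−ε, 1+ε] but is capped beyond it, so the gradient vanishes for over-large policy changes.

```python
import torch

clip_eps = 0.2
ratio = torch.tensor([0.5, 0.9, 1.0, 1.1, 1.5])      # pi_new / pi_old
advantage = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0])  # positive advantage

surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
objective = torch.min(surr1, surr2)                   # per-sample PPO objective

print(objective)  # tensor([0.5000, 0.9000, 1.0000, 1.1000, 1.2000])
# The last entry is capped at 1 + eps = 1.2: pushing the ratio from 1.5
# even higher earns nothing, so there is no incentive for runaway updates.
```

Note the asymmetry of `torch.min`: when the advantage is negative, the unclipped term is kept whenever it is worse, so the objective is a pessimistic bound in both directions.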
Full PPO Implementation for Continuous Physics Control
# PPO implementation: Actor-Critic with GAE advantage estimation
# Works for continuous action spaces — essential for physics control
import torch, torch.nn as nn
import torch.distributions as dist
import numpy as np
import gymnasium as gym

# ── Actor-Critic network ────────────────────────────────────────
class ActorCritic(nn.Module):
    def __init__(self, n_obs, n_actions):
        super().__init__()
        # Shared feature extractor
        self.shared = nn.Sequential(
            nn.Linear(n_obs, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
        )
        # Actor: outputs Gaussian policy mean and log_std
        self.actor_mean = nn.Linear(256, n_actions)
        self.actor_log_std = nn.Parameter(torch.zeros(n_actions))
        # Critic: estimates state value V(s)
        self.critic = nn.Linear(256, 1)

    def forward(self, state):
        h = self.shared(state)
        mean = self.actor_mean(h)
        std = self.actor_log_std.exp().expand_as(mean)
        value = self.critic(h).squeeze(-1)
        return dist.Normal(mean, std), value

# ── GAE advantage estimation ────────────────────────────────────
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages, last_adv = [], 0.0
    values_np = values.detach().numpy()
    for t in reversed(range(len(rewards))):
        next_val = values_np[t+1] if t+1 < len(values_np) else 0.0
        delta = rewards[t] + gamma * next_val * (1 - dones[t]) - values_np[t]
        last_adv = delta + gamma * lam * (1 - dones[t]) * last_adv
        advantages.insert(0, last_adv)
    return torch.tensor(advantages, dtype=torch.float32)

# ── PPO update ──────────────────────────────────────────────────
def ppo_update(model, optim, states, actions, old_log_probs, advantages,
               returns, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01,
               n_epochs=10):
    for _ in range(n_epochs):
        dist_new, values = model(states)
        log_probs_new = dist_new.log_prob(actions).sum(dim=-1)
        entropy = dist_new.entropy().sum(dim=-1).mean()
        # PPO clipped surrogate objective
        ratio = torch.exp(log_probs_new - old_log_probs)
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        actor_loss = -torch.min(surr1, surr2).mean()
        # Value function loss
        critic_loss = nn.MSELoss()(values, returns)
        # Combined loss: actor + critic + entropy bonus (encourages exploration)
        loss = actor_loss + value_coef * critic_loss - entropy_coef * entropy
        optim.zero_grad(); loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)  # gradient clip
        optim.step()

# ── Training: collect rollouts, compute GAE, PPO update ─────────
env = gym.make('HalfCheetah-v4')   # complex locomotion physics
model = ActorCritic(n_obs=17, n_actions=6)
optim = torch.optim.Adam(model.parameters(), lr=3e-4, eps=1e-5)
ROLLOUT = 2048   # steps before each update
# Training loop: collect ROLLOUT steps, then update for n_epochs
# Typical convergence: 1-5M environment steps for locomotion tasks
# For physics control tasks: much fewer steps (simpler reward landscapes)
Section 5 — Soft Actor-Critic (SAC): Maximum Entropy RL
For physics applications with stochastic dynamics — turbulent flows, noisy quantum systems, thermal fluctuations — Soft Actor-Critic (SAC) is often the best algorithm. SAC maximises a modified objective that includes an entropy bonus, J(π) = Σ_t E[ r_t + α H(π(·|s_t)) ]: it learns policies that are both high-reward and as stochastic as possible, preventing premature convergence to narrow local optima. This is the maximum entropy RL framework; the temperature α sets the reward–entropy trade-off and can be tuned automatically.
SAC is off-policy (it learns from a replay buffer like DQN) and actor-critic (it learns both a policy and a value function), giving it excellent sample efficiency. For physics problems where simulator evaluations are expensive, this is crucial.
# SAC: off-policy, max-entropy, continuous action spaces
# pip install stable-baselines3 (recommended for production physics RL)
from stable_baselines3 import SAC
import gymnasium as gym
import numpy as np

# ── Example: Magnetic pendulum stabilisation ────────────────────
# A physics-inspired custom environment
class MagneticPendulumEnv(gym.Env):
    """Inverted pendulum driven by a magnetic coil — physics-based reward."""
    metadata = {'render_modes': []}

    def __init__(self):
        super().__init__()
        # State: [cos(theta), sin(theta), theta_dot, coil_current]
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (4,))
        # Action: coil current change [-1, 1] in normalised units
        self.action_space = gym.spaces.Box(-1.0, 1.0, (1,))
        self.dt = 0.02; self.g = 9.81; self.l = 1.0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.theta = np.random.uniform(-np.pi, np.pi)
        self.theta_dot = np.random.uniform(-1.0, 1.0)
        self.current = 0.0
        return self._get_obs(), {}

    def _get_obs(self):
        return np.array([np.cos(self.theta), np.sin(self.theta),
                         self.theta_dot, self.current], dtype=np.float32)

    def step(self, action):
        u = float(action[0]) * 5.0     # scale to Amps
        self.current += u * self.dt
        self.current = np.clip(self.current, -10, 10)
        # Physics: pendulum equation with magnetic torque
        torque = -self.g / self.l * np.sin(self.theta) + 0.5 * self.current
        self.theta_dot += torque * self.dt
        self.theta += self.theta_dot * self.dt
        # Reward: penalise angle from vertical + energy cost
        angle_cost = self.theta**2 + 0.1 * self.theta_dot**2
        control_cost = 0.01 * u**2
        reward = -angle_cost - control_cost
        done = abs(self.theta_dot) > 20.0   # instability check
        return self._get_obs(), reward, done, False, {}

# ── Train SAC on the physics environment ────────────────────────
env = MagneticPendulumEnv()
model = SAC(
    'MlpPolicy', env,
    learning_rate=3e-4,
    buffer_size=100000,      # replay buffer size
    learning_starts=1000,    # steps before first update
    batch_size=256,
    tau=0.005,               # soft target update rate
    gamma=0.99,
    train_freq=1,
    ent_coef='auto',         # automatic entropy tuning
    verbose=1
)
model.learn(total_timesteps=500000)
model.save('magnetic_pendulum_sac')

# Evaluate: stable-baselines3 makes evaluation clean
from stable_baselines3.common.evaluation import evaluate_policy
mean_r, std_r = evaluate_policy(model, env, n_eval_episodes=20)
print(f"Mean return: {mean_r:.2f} +/- {std_r:.2f}")
Section 6 — Real Application: Plasma Control in Fusion Reactors
In February 2022, DeepMind and the Swiss Plasma Center published a landmark paper in Nature: they had trained a reinforcement learning agent to control the plasma shape and position in a tokamak fusion reactor in real time. The agent simultaneously controlled 19 magnetic coils and maintained plasma configurations — including a novel “snowflake” shape — that had never been achieved before. This is arguably the most impactful single application of RL in experimental physics to date.
What makes plasma control so hard — and why RL works so well for it — is the combination of: (1) continuous high-dimensional action space (19 coil currents, each continuous), (2) fast dynamics (millisecond timescales, faster than human reaction), (3) complex constraints (plasma must not touch the vessel walls), and (4) incomplete observability (not all plasma quantities are directly measurable). RL handles all of these naturally.
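DeepMind’s actual controller is not public in this form, so the skeleton below is purely illustrative — toy linear dynamics with made-up coefficients, not a plasma model. It only shows how the four properties above surface in an environment interface: a 19-dimensional continuous action space, fixed-timestep dynamics, a hard wall-contact termination, and partial observability (the agent sees six noisy sensor channels, not the hidden state):

```python
import numpy as np

class ToyTokamakEnv:
    """Illustrative sketch only — toy linear dynamics, not a plasma model."""
    N_COILS = 19                      # one continuous action per coil

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.plasma = self.rng.normal(0, 0.1, size=8)   # hidden true state
        return self._observe()

    def _observe(self):
        # Partial observability: noisy sensor readings, not the full state
        return self.plasma[:6] + self.rng.normal(0, 0.01, size=6)

    def step(self, coil_currents):
        coil_currents = np.clip(coil_currents, -1.0, 1.0)   # actuator limits
        # Toy dynamics: slow drift plus a made-up linear coil coupling
        coupling = 0.01 * coil_currents[:8] - 0.005 * coil_currents[8:16]
        self.plasma = 0.99 * self.plasma + coupling + self.rng.normal(0, 0.005, 8)
        # Hard constraint: "wall contact" ends the episode with a big penalty
        wall_contact = np.abs(self.plasma).max() > 1.0
        shape_error = float(np.sum(self.plasma[:4] ** 2))   # track target shape
        reward = -shape_error - (100.0 if wall_contact else 0.0)
        return self._observe(), reward, wall_contact, {}

env = ToyTokamakEnv()
obs = env.reset()
obs, r, done, _ = env.step(np.zeros(ToyTokamakEnv.N_COILS))
print(obs.shape, round(r, 4), done)
```

In the real system, the policy runs at a fixed control rate against exactly this kind of interface; the hard part is the fidelity of the simulator and the design of the reward, not the loop itself.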
Section 7 — Quantum Gate Optimization and Laser Pulse Shaping
Two more frontier applications where RL has delivered state-of-the-art results:
Quantum Gate Optimization with RL
Implementing a quantum gate on real hardware requires a precisely shaped pulse sequence applied to the qubits. Traditional pulse optimisation uses GRAPE (Gradient Ascent Pulse Engineering), which is efficient but gets stuck in local optima and struggles with hardware constraints. RL formulates gate optimisation as: state = current quantum state, action = pulse parameter update, reward = gate fidelity − energy penalty.
# Quantum gate optimisation as an RL problem
# Environment: qubit driven by a control Hamiltonian
# Goal: drive the qubit to a target state (state-transfer version of
# gate design; full gates optimise a target unitary instead)
import numpy as np

# ── Simplified qubit environment ────────────────────────────────
class QubitControlEnv:
    """Single qubit: state is on Bloch sphere, action is pulse amplitude."""
    def __init__(self, target_state=None, T=20, dt=0.05):
        self.T = T; self.dt = dt
        self.target = (target_state if target_state is not None
                       else np.array([0, 1], dtype=complex))

    def reset(self):
        self.state = np.array([1, 0], dtype=complex)   # start in |0>
        self.t = 0
        return np.array([self.state.real, self.state.imag]).flatten()

    def step(self, action):
        omega_x, omega_y = action[0] * 5.0, action[1] * 5.0   # control fields
        # Schrödinger evolution: psi -> exp(-i H dt) psi
        H = 0.5 * (omega_x * np.array([[0, 1], [1, 0]]) +
                   omega_y * np.array([[0, -1j], [1j, 0]]))
        # Matrix exponential for exact evolution
        vals, vecs = np.linalg.eigh(H)
        U = vecs @ np.diag(np.exp(-1j * vals * self.dt)) @ vecs.conj().T
        self.state = U @ self.state
        self.t += 1
        # Reward: fidelity with target state
        fidelity = abs(self.target.conj() @ self.state)**2
        done = self.t >= self.T
        # Dense reward: fidelity at every step + bonus at end
        reward = fidelity * 0.01 + (fidelity * 10.0 if done else 0)
        obs = np.array([self.state.real, self.state.imag]).flatten()
        return obs, reward, done, False, {}

# Train with PPO (stable-baselines3) — note this class would first need to
# subclass gymnasium.Env and declare observation_space / action_space:
# from stable_baselines3 import PPO
# env = QubitControlEnv(target_state=np.array([0, 1], dtype=complex))
# model = PPO('MlpPolicy', env, verbose=1)
# model.learn(total_timesteps=500000)
# After training: inspect model.predict(obs) to extract the optimal pulse shape
# RL finds pulse sequences with fidelity >0.999 that:
# - Respect hardware constraints (max amplitude, bandwidth)
# - Are robust to calibration errors (with domain randomisation)
# - Outperform GRAPE on constrained optimisation problems
Section 8 — Reward Shaping with Physical Constraints
The hardest part of applying RL to physics is not the algorithm — it’s the reward function. A poorly designed reward leads to reward hacking: the agent finds unexpected ways to maximise the reward metric that violate physical intuition or experimental constraints. This section covers the principled approaches physicists use to design reward functions that are both learnable and physically meaningful.
The Four Rules of Physics Reward Design
1. Dense rewards beat sparse rewards. If your reward is only non-zero when the experiment succeeds (e.g., plasma confinement exceeds 1 second), the agent gets almost no learning signal. Always provide intermediate rewards proportional to progress toward the goal. In plasma control: reward fidelity at every timestep, not just at the end.
2. Encode constraints as penalties. Physical systems have hard constraints — power limits, temperature ceilings, stability conditions. Add explicit penalty terms: reward = task_reward − λ × constraint_violation. The weight λ must be large enough that violating the constraint is never worth the reward benefit, but not so large that the agent learns to do nothing.
3. Balance reward magnitudes. If your task reward is ±1 and your constraint penalty is ±1000, the agent will optimise exclusively for constraint satisfaction and ignore the task. Normalise all reward components to the same order of magnitude, or use careful weighting. This is the most common failure mode in physics RL.
4. Randomise the physics. Real physical systems have parameter uncertainty — you never know the exact mass, viscosity, or coupling constant. During training, randomly sample these parameters from a range. The policy learned under randomisation is robust to real-world uncertainty, enabling direct sim-to-real transfer without fine-tuning.
# Physics-aware reward function with constraint penalties
# Example: laser cavity optimisation
import numpy as np

class PhysicsReward:
    def __init__(self, target_power=100.0, max_temp=350.0,
                 max_power_draw=500.0):
        self.target_power = target_power
        self.max_temp = max_temp
        self.max_power_draw = max_power_draw

    def compute(self, state, action):
        output_power, temperature, power_draw = state
        # ── Task reward: match target output power ──────────────
        power_error = (output_power - self.target_power)**2 / self.target_power**2
        task_reward = np.exp(-5 * power_error) - 1   # range: [-1, 0], 0 = perfect
        # ── Constraint penalties (physical safety limits) ───────
        # Soft penalty: smoothly increases as constraint is approached
        temp_margin = self.max_temp - temperature
        temp_penalty = -np.exp(-temp_margin / 10.0)    # severe if temp > max
        power_margin = self.max_power_draw - power_draw
        power_penalty = -np.exp(-power_margin / 50.0)
        # ── Control cost: penalise large, rapid actuator changes ─
        control_cost = -0.001 * np.sum(action**2)
        # ── Combine: all terms are O(1) in magnitude ────────────
        reward = (task_reward + 0.3 * temp_penalty
                  + 0.3 * power_penalty + control_cost)
        return float(reward), {
            'task': task_reward, 'temp_penalty': temp_penalty,
            'power_penalty': power_penalty, 'control': control_cost
        }

# ── Domain randomisation for sim-to-real transfer ───────────────
def randomise_physics(env, rng=None):
    """Randomise physical parameters within realistic uncertainty bounds."""
    rng = np.random.default_rng(rng)
    # Sample parameters from distributions around nominal values
    env.thermal_mass *= rng.uniform(0.85, 1.15)    # ±15% uncertainty
    env.heat_transfer *= rng.uniform(0.80, 1.20)   # ±20% uncertainty
    env.gain *= rng.uniform(0.90, 1.10)            # ±10% uncertainty
    return env

# At each episode reset: randomise_physics(env)
# This ensures the learned policy generalises to real hardware
# without needing fine-tuning on the physical system
External References & Further Reading
- Degrave et al. (2022) — Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602:414. doi.org/10.1038/s41586-021-04301-9 — The landmark plasma fusion RL paper.
- Mnih et al. (2015) — Human-level control through deep reinforcement learning (DQN). Nature 518:529. (Preliminary version: arXiv:1312.5602)
- Schulman et al. (2017) — Proximal Policy Optimization Algorithms (PPO). arXiv:1707.06347
- Haarnoja et al. (2018) — Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning (SAC). ICML. arXiv:1801.01290
- Nguyen-Tang & Arenz (2023) — Quantum optimal control with reinforcement learning. arXiv:2306.01854
- Sutton & Barto — Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free at incompleteideas.net — The definitive RL textbook.
- Gymnasium documentation — gymnasium.farama.org — Standard RL environment library with physics simulators (MuJoCo, Brax).
- RL maps naturally onto physics control. Every physical experiment with sensors, actuators, and a performance objective is an MDP. The physicist’s intuition about control directly translates to reward function design.
- Choose the right algorithm for your problem. DQN for discrete actions. REINFORCE for simple continuous tasks. PPO for complex continuous control — it’s the default. SAC for sample efficiency with stochastic dynamics.
- Two stabilisation tricks are non-negotiable in DQN. Replay buffer (breaks temporal correlations) and target network (stabilises the learning target). Never skip these.
- PPO’s clipped objective is the key innovation. It prevents catastrophic policy updates. Always use gradient clipping (max_norm=0.5) alongside it. Monitor the PPO clip fraction during training — if it’s always 0 or always 1, your learning rate is wrong.
- DeepMind’s tokamak is the gold standard. Trained in simulation, deployed on real hardware, controlling 19 coils at millisecond timescales. The reward function took months of domain expert work to design correctly.
- Reward shaping is where physics knowledge matters most. Dense rewards, explicit constraint penalties, reward component normalisation, and domain randomisation. Get these right and almost any RL algorithm will work. Get them wrong and no algorithm will save you.
