Cluster 3 · AI for Physics Students
AI in Particle Physics & High-Energy Experiments
Particle physics was one of the first sciences to adopt deep learning at massive scale — and for good reason. The LHC produces 40 million collisions per second. Only machine learning can decide, in real time, which ones to keep. This guide covers the complete ML pipeline: from opening a ROOT file with Python to training Graph Neural Networks that outperform human-designed algorithms.
- The HEP Data Challenge
- Reading Physics Data with uproot
- Signal vs Background Classification
- Jet Tagging with Deep Learning
- Graph Neural Networks for Particle Tracks
- Anomaly Detection: Finding New Physics
- Real-Time Triggers & Fast Inference
- Complete Pipeline Walkthrough
Section 1 — The HEP Data Challenge: Why Particle Physics Needs ML
Every second, the Large Hadron Collider produces 40 million proton-proton collisions. That's 40,000,000 events per second, each generating roughly 1 MB of raw detector data — a raw data rate of 40 terabytes per second. No storage system can absorb that stream. Something has to make decisions, in real time, about which collisions are interesting enough to save.
That something is machine learning. The LHC's trigger system uses a hierarchy of algorithms — hardware-level FPGAs, then fast software filters, then full offline analysis — to reject 99.9975% of events, keeping only about 1,000 per second for permanent storage. Every step in this pipeline now relies on trained models.
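These rates are easy to sanity-check with a quick back-of-the-envelope calculation using the figures quoted above:

```python
# Sanity-check the trigger numbers quoted above.
collision_rate_hz = 40_000_000   # 40 MHz proton-proton collision rate
kept_per_second = 1_000          # events written to permanent storage
event_size_mb = 1.0              # ~1 MB of raw detector data per event

raw_rate_tb_s = collision_rate_hz * event_size_mb / 1e6   # MB/s → TB/s
rejection = 1 - kept_per_second / collision_rate_hz

print(f"Raw data rate: {raw_rate_tb_s:.0f} TB/s")     # 40 TB/s
print(f"Trigger rejection: {rejection:.4%}")          # 99.9975%
```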
But the challenge doesn't stop at data reduction. Even the events that survive the trigger contain hundreds of overlapping proton interactions (called pileup), complex detector responses, backgrounds from known processes that look similar to new physics, and systematic uncertainties in every measurement. ML has become essential for all of it.
The Standard Model as a Classification Problem
At its heart, most particle physics analysis is a classification problem: you have detector signals from particle collisions, and you want to separate signal (e.g. a Higgs boson decaying to two photons) from background (everything that looks similar but isn't). Before deep learning, physicists built these classifiers by hand — using their physical intuition to select a handful of discriminating variables and combine them with a Fisher discriminant or boosted decision tree.
Deep learning changed the game by learning discriminating features directly from raw or lightly processed detector data — capturing correlations among hundreds of variables that no human would think to combine manually. In benchmark studies, deep neural networks consistently outperform classifiers built from hand-crafted observables.
Section 2 — Reading Particle Physics Data with uproot
The standard data format in particle physics is ROOT, developed at CERN. Historically this meant using the C++ ROOT framework — not particularly Python-friendly. The uproot library changed everything: it reads ROOT files natively in Python, outputting NumPy arrays and Pandas DataFrames. No C++ installation required.
# pip install uproot awkward vector
import uproot
import awkward as ak
import numpy as np
import pandas as pd

# Open a ROOT file (can be local or remote HTTP)
file = uproot.open("https://opendata.cern.ch/record/12102/files/SMHiggsToZZTo4L.root")

# List available trees
print(file.keys())

# Load a TTree (like a table of events)
tree = file["events"]
print(f"Branches: {tree.keys()}")
print(f"Events: {tree.num_entries:,}")

# Read specific branches — awkward arrays handle variable-length
# lists naturally (different numbers of leptons per event)
pt = tree["lep_pt"].array()    # lepton transverse momenta
eta = tree["lep_eta"].array()
phi = tree["lep_phi"].array()

jets = tree["jet_pt"].array()  # awkward.Array: [[pt1, pt2, ...], ...]
print(f"Mean jet pT: {ak.mean(ak.flatten(jets)):.1f} GeV")

# Convert to a flat DataFrame for ML
df = tree.arrays(["lep_pt", "lep_eta", "lep_phi", "nJets"], library="pd")
print(df.head())
Section 3 — Signal vs Background Classification
The most common ML task in HEP is binary classification: is this event a signal process (rare, interesting) or a background process (common, boring)? The performance metric physicists care most about is not accuracy — it's significance, commonly approximated as

Z ≈ s / √(s + b)

where s is the expected number of signal events and b the expected number of background events after applying a selection. A classifier that selects a purer signal region (higher s / (s + b)) dramatically improves discovery potential. The ROC curve and its integral, the AUC, are also widely used.
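As a toy illustration of why significance, not accuracy, drives design choices, consider a cut that sacrifices 20% of the signal to remove 99% of the background (all numbers invented for illustration):

```python
import numpy as np

def significance(s, b):
    """Approximate discovery significance Z ≈ s / sqrt(s + b)."""
    return s / np.sqrt(s + b)

# Before any selection: 100 signal events on 10,000 background
print(f"No cut:   Z = {significance(100, 10_000):.2f}")   # ~1.0

# Classifier cut keeping 80% of signal but only 1% of background
print(f"With cut: Z = {significance(80, 100):.2f}")       # ~6.0
```

Losing a fifth of the signal still boosts the significance roughly sixfold, because the background under the peak shrinks by two orders of magnitude.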
A Complete Signal/Background Classifier
import torch
import torch.nn as nn
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# ── Simulate HEP-like features ───────────────────────────────
# In real life, load these from your ROOT file via uproot
np.random.seed(42)
N_sig, N_bkg = 50_000, 200_000

# Signal: Higgs-like — 4 high-pT leptons, low missing ET
sig_feats = np.column_stack([
    np.random.normal(90, 15, N_sig),    # lep1_pt [GeV]
    np.random.normal(60, 15, N_sig),    # lep2_pt
    np.random.normal(125, 8, N_sig),    # m_4l [GeV] — Higgs mass peak
    np.random.normal(0.5, 0.3, N_sig),  # MET [GeV]
])
# Background: ZZ continuum — similar but broader m_4l
bkg_feats = np.column_stack([
    np.random.exponential(40, N_bkg),
    np.random.exponential(30, N_bkg),
    np.random.uniform(80, 180, N_bkg),
    np.random.exponential(20, N_bkg),
])

X = np.vstack([sig_feats, bkg_feats]).astype(np.float32)
y = np.hstack([np.ones(N_sig), np.zeros(N_bkg)]).astype(np.float32)

# ── Preprocessing ────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# ── Model: deep classifier ───────────────────────────────────
# Outputs raw logits (no final Sigmoid) so we can use the
# numerically stable BCEWithLogitsLoss below.
class HEPClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 1)
        )
    def forward(self, x):
        return self.net(x)

# ── Training ─────────────────────────────────────────────────
model = HEPClassifier()
optim = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-5)

# Class-weighted loss — compensate for signal/background imbalance
w = torch.tensor([N_bkg / N_sig]).float()   # upweight the rare signal class
criterion = nn.BCEWithLogitsLoss(pos_weight=w)

X_tr_t = torch.tensor(X_train)
y_tr_t = torch.tensor(y_train).unsqueeze(1)

for epoch in range(30):
    model.train()
    logits = model(X_tr_t)
    loss = criterion(logits, y_tr_t)
    optim.zero_grad(); loss.backward(); optim.step()

# ── Evaluation ───────────────────────────────────────────────
model.eval()
with torch.no_grad():
    scores = torch.sigmoid(model(torch.tensor(X_test))).numpy().flatten()
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.4f} (random = 0.5, perfect = 1.0)")
Section 4 — Jet Tagging: Identifying Particles from Calorimeter Showers
When quarks or gluons are produced in a collision, they immediately hadronise — fragmenting into a spray of hundreds of particles called a jet. Identifying which type of particle initiated the jet (bottom quark, top quark, W boson, gluon, etc.) is one of the most important and challenging classification problems in HEP. This is called jet tagging.
Traditional taggers used a handful of hand-crafted variables (B-hadron displaced vertices, secondary vertex mass, track impact parameters). Deep learning approaches treat the jet as a point cloud or an image — either processing the constituent particles directly or projecting them onto a 2D (η, φ) calorimeter map.
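The image representation is straightforward to build: bin constituent pT into a 2D (η, φ) grid. A minimal sketch with invented constituent values:

```python
import numpy as np

# Toy jet: constituent (eta, phi, pt) values, made up for illustration
rng = np.random.default_rng(0)
n_const = 50
eta = rng.normal(0.0, 0.2, n_const)   # constituents cluster near the jet axis
phi = rng.normal(0.0, 0.2, n_const)
pt = rng.exponential(5.0, n_const)    # GeV

# Bin pT into a 32×32 (eta, phi) grid → a "jet image" for a CNN
image, _, _ = np.histogram2d(
    eta, phi, bins=32, range=[[-0.8, 0.8], [-0.8, 0.8]], weights=pt)

print(image.shape)                               # (32, 32)
print(f"Total pT in image: {image.sum():.1f} GeV")
```

Each pixel plays the role of a calorimeter cell; the resulting 32×32 array can be fed to a standard 2D CNN exactly like a greyscale photograph.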
Jet as a Sequence: ParticleNet / DeepJet Approach
# Jet tagging with a permutation-invariant deep network
# Input: constituent particles [pt, eta, phi, charge, ...] per jet
import torch
import torch.nn as nn

class JetTagger(nn.Module):
    def __init__(self, n_constituents=30, n_features=9):
        super().__init__()
        # Process each constituent independently (shared weights)
        self.constituent_net = nn.Sequential(
            nn.Conv1d(n_features, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU()
        )
        # Global max-pool: permutation-invariant aggregation
        self.classifier = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 5)   # 5 classes: b, c, uds, g, top
        )

    def forward(self, x):
        # x: [batch, n_features, n_constituents]
        x = self.constituent_net(x)   # [batch, 256, n_constituents]
        x, _ = torch.max(x, dim=2)    # global max pool → [batch, 256]
        return self.classifier(x)     # [batch, 5] class scores

tagger = JetTagger(n_constituents=30, n_features=9)
print(f"Parameters: {sum(p.numel() for p in tagger.parameters()):,}")

# Training uses categorical cross-entropy
criterion = nn.CrossEntropyLoss()
# Dataset: jets with labels [b=0, c=1, uds=2, g=3, top=4]
# Public benchmark: Top Quark Tagging Reference Dataset (Kasieczka et al. 2019)
Section 5 — Graph Neural Networks for Particle Tracking
Particle tracking — reconstructing the trajectories of particles through a detector from raw hits — is one of the most computationally expensive steps in HEP reconstruction. Traditionally done with Kalman filters, it scales poorly with pileup. Graph Neural Networks (GNNs) offer a natural solution: represent detector hits as graph nodes, potential track segments as edges, and use message-passing to classify which connections are real tracks.
The TrackML challenge (2018) demonstrated that GNN-based trackers could match classical reconstruction quality at a fraction of the computational cost. The ExaTrkX collaboration has since built GNN trackers that run efficiently on GPU hardware, reducing reconstruction time by orders of magnitude.
# pip install torch-geometric
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool
from torch_geometric.data import Data

# ── Build a graph from detector hits ─────────────────────────
# Each hit: node features = [r, phi, z] in cylindrical coords
# Edges: connect hits within dr < threshold (potential track segment)
def build_graph(hits_r, hits_phi, hits_z, dr_max=50.0):
    # Node features
    x = torch.stack([hits_r, hits_phi, hits_z], dim=1).float()
    # Build edges: connect pairs within spatial distance dr_max
    diffs = x.unsqueeze(1) - x.unsqueeze(0)   # [N, N, 3]
    dists = (diffs**2).sum(-1).sqrt()         # [N, N]
    edges = (dists < dr_max).nonzero(as_tuple=False).t().contiguous()
    return Data(x=x, edge_index=edges)

# ── GNN for graph-level classification of hit patterns ───────
class ParticleGNN(nn.Module):
    def __init__(self, in_feats=3, hidden=64, out_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_feats, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.conv3 = GCNConv(hidden, hidden)
        self.lin = nn.Linear(hidden, out_classes)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        # Message-passing: each node aggregates from its neighbours
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        x = self.conv3(x, edge_index).relu()
        # Pool all nodes into a single graph-level embedding
        x = global_mean_pool(x, batch)   # [batch_size, hidden]
        return self.lin(x)               # [batch_size, out_classes]

model = ParticleGNN(in_feats=3, hidden=128, out_classes=5)
print(f"GNN parameters: {sum(p.numel() for p in model.parameters()):,}")
Section 6 — Anomaly Detection: The Model-Independent Search for New Physics
The biggest limitation of traditional HEP analysis is that you can only find what you're looking for. You define signal regions based on a specific model of new physics, and you look there. If nature chose a different model, you miss it entirely.
Anomaly detection flips this around: train a model to learn what "normal" Standard Model events look like, then flag events that are unusually hard to compress or reconstruct. No specific new physics hypothesis is required. If something genuinely new is in your dataset, it will look anomalous to a model trained on background.
Autoencoder for Anomaly Detection
# Autoencoder: learns a compressed representation of normal events
# Anomaly score = reconstruction error (high = unusual = potentially new physics)
import torch
import torch.nn as nn

class JetAutoencoder(nn.Module):
    def __init__(self, input_dim=20, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, latent_dim)   # bottleneck: forces compression
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, input_dim)    # reconstruct original features
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

# ── Train ONLY on background (Standard Model) events ─────────
# X_background, X_test: float tensors [n_events, 20] from your data
ae = JetAutoencoder(input_dim=20, latent_dim=4)
optim = torch.optim.Adam(ae.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(100):
    x_hat, _ = ae(X_background)
    loss = criterion(x_hat, X_background)
    optim.zero_grad(); loss.backward(); optim.step()

# ── Score test events: high MSE = anomalous ──────────────────
with torch.no_grad():
    x_hat_test, _ = ae(X_test)
    anomaly_score = ((X_test - x_hat_test)**2).mean(dim=1)

# Events with the highest anomaly_score are candidates for new physics
top_anomalies = anomaly_score.argsort(descending=True)[:100]
print("Top 100 anomalous events selected for follow-up analysis")
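Turning anomaly scores into a selection requires a working point. One common convention (an assumption here, not the only choice) is to cut at a high percentile of the background-only score distribution:

```python
import numpy as np

# Toy anomaly scores: background is low, a few injected signal events are high
rng = np.random.default_rng(1)
bkg_scores = rng.exponential(0.1, 100_000)   # reconstruction errors, background
sig_scores = rng.exponential(1.0, 100)       # injected anomalous events
scores = np.concatenate([bkg_scores, sig_scores])

# Threshold at the 99.9th percentile of the *background-only* distribution,
# so only ~0.1% of Standard Model events are flagged by construction
threshold = np.percentile(bkg_scores, 99.9)
flagged = scores > threshold

print(f"Threshold: {threshold:.3f}")
print(f"Flagged {flagged.sum()} of {len(scores)} events")
```

Because the threshold is set on background alone, the flagged sample is strongly enriched in whatever does not resemble the Standard Model training data.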
Section 7 — Real-Time Triggers & Fast Inference with hls4ml
The LHC trigger system must make keep/reject decisions in microseconds. Standard neural networks running on CPUs are far too slow. The solution is hls4ml — a library that converts trained Keras/PyTorch models into firmware for FPGAs, achieving inference latencies of under 1 microsecond.
# pip install hls4ml
import hls4ml
from tensorflow import keras

# Train a small Keras model for the trigger
trigger_model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(16,)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
trigger_model.compile(optimizer='adam', loss='binary_crossentropy')
# ... train on signal/background events ...

# ── Convert to FPGA firmware via hls4ml ──────────────────────
config = hls4ml.utils.config_from_keras_model(
    trigger_model,
    granularity='name',
    default_precision='ap_fixed<16,6>'   # 16-bit fixed point for speed
)
hls_model = hls4ml.convert_from_keras_model(
    trigger_model,
    hls_config=config,
    output_dir='trigger_hls',
    backend='VivadoAccelerator'   # or 'Quartus' for Intel FPGAs
)

hls_model.compile()                 # builds the bit-accurate C simulation
y_sim = hls_model.predict(X_test)   # verify fixed-point model matches Keras
hls_model.build()                   # runs full HLS synthesis (requires Vivado)
# Typical latency: ~100–500 ns on a Xilinx UltraScale+ FPGA
Section 8 — The Complete HEP ML Pipeline
Let's put it all together. A realistic particle physics ML analysis follows this pipeline:
- Data loading — read ROOT files with uproot; the vector library provides 4-momentum arithmetic.
- Preprocessing — flatten or pad the jagged event structure, standardise features, and split into training and test sets.
- Model training — a dense classifier, jet tagger, or GNN as in Sections 3–5, with class-weighted loss for the rare signal.
- Statistical interpretation — fit with pyhf or zfit. The ML score becomes the discriminant variable in the fit. Compute expected and observed significance.
External References & Further Reading
- Radovic et al. (2018) — Machine learning at the energy and intensity frontiers of particle physics. Nature. doi.org/10.1038/s41586-018-0361-2 — The landmark review that put ML on every HEP physicist's radar.
- Guest, Cranmer & Whiteson (2018) — Deep Learning and Its Application to LHC Physics. Annual Review of Nuclear and Particle Science. arXiv:1806.11484
- Kasieczka et al. (2019) — The Machine Learning Landscape of Top Taggers. SciPost Physics. arXiv:1902.09914 — Standard benchmark for jet tagging algorithms, with public dataset.
- Moreno et al. (2020) — JEDI-net: a jet identification algorithm based on interaction networks. EPJC. arXiv:1908.05318 — Graph-based jet tagging.
- Govorkova et al. (2022) — Autoencoders for unsupervised anomaly detection in high energy physics. JHEP. arXiv:2112.09071 — Anomaly detection with autoencoders at the LHC.
- Duarte et al. (2018) — Fast inference of deep neural networks in FPGAs for particle physics. JINST. arXiv:1804.06913 — The hls4ml paper: μs inference on FPGAs.
- CERN Open Data Portal — opendata.cern.ch — Free access to real LHC collision data from CMS and ATLAS.
- HEP pioneered ML at scale. The LHC trigger system, jet taggers, and event classifiers have used deep learning in production for years — before most of industry caught on.
- uproot is your gateway. Read any ROOT file in pure Python, no C++ required. awkward-array handles the jagged structure of particle physics data naturally.
- Imbalanced datasets are the norm. Signal events are rare — always use class-weighted loss or oversampling techniques. Never evaluate with raw accuracy.
- Jets are point clouds or graphs. 1D-CNN on constituents with global max-pool gives permutation invariance. GNNs model particle interactions directly.
- Anomaly detection = model-agnostic new physics search. Train an autoencoder on Standard Model events; high reconstruction error = candidate for new physics.
- Systematics kill analyses. A model that is 5% better in AUC but 20% more sensitive to pile-up conditions is worse than a simpler model. Always evaluate systematic stability.
