AI in Particle Physics: Machine Learning at the LHC & Beyond

Cluster 3 · AI for Physics Students

AI in Particle Physics & High-Energy Experiments

Particle physics was one of the first sciences to adopt deep learning at massive scale — and for good reason. The LHC produces 40 million collisions per second. Only machine learning can decide, in real time, which ones to keep. This guide covers the complete ML pipeline: from opening a ROOT file with Python to training Graph Neural Networks that outperform human-designed algorithms.

🔬 Jet Tagging 🔍 Anomaly Detection 🕸️ Graph Neural Networks 🐍 uproot + PyTorch


📋 In This Article
  1. The HEP Data Challenge
  2. Reading Physics Data with uproot
  3. Signal vs Background Classification
  4. Jet Tagging with Deep Learning
  5. Graph Neural Networks for Particle Tracks
  6. Anomaly Detection: Finding New Physics
  7. Real-Time Triggers & Fast Inference
  8. Complete Pipeline Walkthrough

Section 1 — The HEP Data Challenge: Why Particle Physics Needs ML

Every second, the Large Hadron Collider produces 40 million proton-proton collisions. That's 40,000,000 events per second, each generating roughly 1 MB of raw detector data — a raw data rate of 40 terabytes per second. No storage system can absorb that stream continuously. Something has to make decisions, in real time, about which collisions are interesting enough to save.

That something is machine learning. The LHC's trigger system uses a hierarchy of algorithms — hardware-level FPGAs, then fast software filters, then full offline analysis — to reject 99.9975% of events, keeping only about 1,000 per second for permanent storage. Every step in this pipeline now relies on trained models.
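The rates above reduce to simple arithmetic. A minimal sketch, using representative trigger figures (the ~100 kHz Level-1 output is typical of ATLAS and CMS but varies by run period):

```python
# Back-of-envelope trigger budget (representative numbers, not exact specs)
collision_rate  = 40_000_000   # collisions per second delivered by the LHC
event_size_mb   = 1.0          # ~1 MB of raw detector data per event
l1_output_rate  = 100_000      # Level-1 hardware (FPGA) trigger output, ~100 kHz
hlt_output_rate = 1_000        # software High-Level Trigger output, ~1 kHz

raw_rate_tb_s = collision_rate * event_size_mb / 1e6   # MB/s → TB/s
l1_keep_frac  = l1_output_rate / collision_rate        # fraction surviving hardware
rejected_frac = 1 - hlt_output_rate / collision_rate   # overall rejection

print(f"Raw data rate:     {raw_rate_tb_s:.0f} TB/s")       # → 40 TB/s
print(f"L1 keeps:          {l1_keep_frac:.2%} of events")   # → 0.25%
print(f"Overall rejection: {rejected_frac:.4%}")            # → 99.9975%
```

Every factor of rejection must happen before the next bunch crossing arrives, which is why the first stage lives in FPGA firmware rather than software.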

But the challenge doesn't stop at data reduction. Even the events that survive the trigger contain hundreds of overlapping proton interactions (called pileup), complex detector responses, backgrounds from known processes that look similar to new physics, and systematic uncertainties in every measurement. ML has become essential for all of it.

The Standard Model as a Classification Problem

At its heart, most particle physics analysis is a classification problem: you have detector signals from particle collisions, and you want to separate signal (e.g. a Higgs boson decaying to two photons) from background (everything that looks similar but isn't). Before deep learning, physicists built these classifiers by hand — using their physical intuition to select a handful of discriminating variables and combine them with a Fisher discriminant or boosted decision tree.

Deep learning changed the game by learning discriminating features directly from raw or lightly processed detector data — capturing correlations between hundreds of variables that no human would think to combine manually. In benchmark studies, deep neural networks consistently outperform hand-crafted observables by statistically significant margins.
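For contrast, the pre-deep-learning baseline is easy to reproduce: a boosted decision tree on a couple of hand-picked variables. The sketch below uses synthetic features, with scikit-learn's GradientBoostingClassifier standing in for the TMVA-style BDTs used historically:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
# Two hand-picked discriminating variables: a peaked invariant mass and a pT
sig = np.column_stack([rng.normal(125, 8, n), rng.normal(80, 20, n)])
bkg = np.column_stack([rng.uniform(80, 180, n), rng.exponential(40, n)])
X = np.vstack([sig, bkg])
y = np.hstack([np.ones(n), np.zeros(n)])

# Shallow trees boosted sequentially — the classic HEP classifier
bdt = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X, y)
auc = roc_auc_score(y, bdt.predict_proba(X)[:, 1])
print(f"BDT AUC on hand-crafted features: {auc:.3f}")
```

With two well-chosen variables the BDT already performs well; the gain of deep networks comes from exploiting the many additional low-level variables a human would never enumerate.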

40M
LHC collisions per second — impossible to record without real-time ML
99.9975%
of events rejected by the trigger system in real time
15 PB
data archived per year after triggering — still needs ML for analysis

Section 2 — Reading Particle Physics Data with uproot

The standard data format in particle physics is ROOT, developed at CERN. Historically this meant using the C++ ROOT framework — not particularly Python-friendly. The uproot library changed everything: it reads ROOT files natively in Python, outputting NumPy arrays and Pandas DataFrames. No C++ installation required.

Python — uproot: reading ROOT files, awkward arrays, DataFrame export
# pip install uproot awkward vector
import uproot
import awkward as ak
import numpy as np
import pandas as pd

# Open a ROOT file (can be local or remote HTTP)
file = uproot.open("https://opendata.cern.ch/record/12102/files/SMHiggsToZZTo4L.root")

# List available trees
print(file.keys())

# Load a TTree (like a table of events)
tree = file["events"]
print(f"Branches: {tree.keys()}")
print(f"Events: {tree.num_entries:,}")

# Read specific branches as NumPy arrays
pt   = tree["lep_pt"].array(library="np")  # lepton transverse momenta
eta  = tree["lep_eta"].array(library="np")
phi  = tree["lep_phi"].array(library="np")

# awkward arrays handle variable-length lists (different n jets per event)
jets = tree["jet_pt"].array()    # awkward.Array: [[pt1, pt2, ...], ...]
print(f"Mean jet pT: {ak.mean(ak.flatten(jets)):.1f} GeV")

# Convert to flat DataFrame for ML
df = tree.arrays(["lep_pt", "lep_eta", "lep_phi", "nJets"], library="pd")
print(df.head())
💡 Open Data — Why uproot matters: CERN publishes real LHC collision data through the CERN Open Data Portal. With uproot, you can download actual CMS or ATLAS data and run the same ML pipelines used in published analyses — no CERN account needed.

Section 3 — Signal vs Background Classification

The most common ML task in HEP is binary classification: is this event a signal process (rare, interesting) or a background process (common, boring)? The performance metric physicists care most about is not accuracy — it's significance:

Signal significance: s / sqrt(s + b)

where s is the expected number of signal events and b is background, after applying a selection. A classifier that selects a purer signal region (high s/(s+b)) dramatically improves discovery potential. The ROC curve and AUC are also widely used:

AUC = area under the ROC curve (TPR plotted against FPR)
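In practice the working point is chosen by scanning the classifier threshold and maximising significance. A minimal sketch, with synthetic score distributions and illustrative expected yields:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic classifier scores: signal peaks near 1, background near 0
sig_scores = rng.beta(5, 2, 10_000)
bkg_scores = rng.beta(2, 5, 100_000)
S_expected, B_expected = 100.0, 10_000.0   # illustrative yields before any cut

def significance(t):
    """s / sqrt(s + b) for events passing a score cut at t."""
    s = np.mean(sig_scores > t) * S_expected
    b = np.mean(bkg_scores > t) * B_expected
    return s / np.sqrt(s + b + 1e-9)

best = max((significance(t), t) for t in np.linspace(0, 0.99, 100))
print(f"Best significance {best[0]:.2f} at score cut {best[1]:.2f}")
```

Note that the optimal cut is usually tight: losing some signal efficiency is worth it when background rejection grows much faster.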

A Complete Signal/Background Classifier

Python — Complete signal/background classifier with class-weight correction
import torch, torch.nn as nn
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# ── Simulate HEP-like features ───────────────────────────────
# In real life, load these from your ROOT file via uproot
np.random.seed(42)
N_sig, N_bkg = 50000, 200000

# Signal: Higgs-like — 4 high-pT leptons, low missing ET
sig_feats = np.column_stack([
    np.random.normal(90, 15, N_sig),    # lep1_pt [GeV]
    np.random.normal(60, 15, N_sig),    # lep2_pt
    np.random.normal(125, 8, N_sig),    # m_4l [GeV]  — Higgs mass peak
    np.random.normal(0.5, 0.3, N_sig),  # MET [GeV]
])

# Background: ZZ continuum — similar but broader m_4l
bkg_feats = np.column_stack([
    np.random.exponential(40, N_bkg),
    np.random.exponential(30, N_bkg),
    np.random.uniform(80, 180, N_bkg),
    np.random.exponential(20, N_bkg),
])

X = np.vstack([sig_feats, bkg_feats]).astype(np.float32)
y = np.hstack([np.ones(N_sig), np.zeros(N_bkg)]).astype(np.float32)

# ── Preprocessing ────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
scaler  = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)

# ── Model: deep classifier ───────────────────────────────────
class HEPClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 256),  nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 1)    # raw logits; BCEWithLogitsLoss applies the sigmoid
        )
    def forward(self, x): return self.net(x)

# ── Training ─────────────────────────────────────────────────
model  = HEPClassifier()
optim  = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-5)
# Class-weighted loss — compensate for signal/background imbalance
w = torch.tensor([N_bkg / N_sig]).float()  # upweight rare signal class
criterion = nn.BCEWithLogitsLoss(pos_weight=w)  # sigmoid applied inside the loss

X_tr_t = torch.tensor(X_train)
y_tr_t = torch.tensor(y_train).unsqueeze(1)

for epoch in range(30):
    model.train()
    pred  = model(X_tr_t)
    loss  = criterion(pred, y_tr_t)
    optim.zero_grad(); loss.backward(); optim.step()

# ── Evaluation ───────────────────────────────────────────────
model.eval()
with torch.no_grad():
    scores = torch.sigmoid(model(torch.tensor(X_test))).numpy().flatten()
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.4f}  (random=0.5, perfect=1.0)")

Section 4 — Jet Tagging: Identifying Particles from Calorimeter Showers

When quarks or gluons are produced in a collision, they immediately hadronise — fragmenting into a spray of hundreds of particles called a jet. Identifying which type of particle initiated the jet (bottom quark, top quark, W boson, gluon, etc.) is one of the most important and challenging classification problems in HEP. This is called jet tagging.

Traditional taggers used a handful of hand-crafted variables (B-hadron displaced vertices, secondary vertex mass, track impact parameters). Deep learning approaches treat the jet as a point cloud or an image — either processing the constituent particles directly or projecting them onto a 2D (η, φ) calorimeter map.

Jet as a Sequence: ParticleNet / DeepJet Approach

Python — Jet tagger: 1D-CNN on jet constituents, permutation-invariant via max-pool
# Jet tagging with a permutation-invariant deep network
# Input: constituent particles [pt, eta, phi, charge, ...] per jet

class JetTagger(nn.Module):
    def __init__(self, n_constituents=30, n_features=9):
        super().__init__()
        # Process each constituent independently (shared weights)
        self.constituent_net = nn.Sequential(
            nn.Conv1d(n_features, 64,  1), nn.BatchNorm1d(64),  nn.ReLU(),
            nn.Conv1d(64,  128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU()
        )
        # Global max-pool: permutation invariant aggregation
        self.classifier = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64),  nn.ReLU(),
            nn.Linear(64,  5)    # 5 classes: b, c, uds, g, top
        )

    def forward(self, x):
        # x: [batch, n_features, n_constituents]
        x = self.constituent_net(x)         # [batch, 256, n_constituents]
        x, _ = torch.max(x, dim=2)       # global max pool → [batch, 256]
        return self.classifier(x)              # [batch, 5] class scores

tagger = JetTagger(n_constituents=30, n_features=9)
print(f"Parameters: {sum(p.numel() for p in tagger.parameters()):,}")

# Training uses categorical cross-entropy
criterion = nn.CrossEntropyLoss()
# Dataset: jets with labels [b=0, c=1, uds=2, g=3, top=4]
# Public benchmark: Top Quark Tagging Reference Dataset (Kasieczka et al. 2019)
📖 Real Dataset — Public HEP datasets for practice: The Top Quark Tagging Reference Dataset (Kasieczka et al.) contains 2M jets with particle-level features, freely available on Zenodo. It's the standard benchmark for jet tagging algorithms. Also see the CERN Open Data Portal for full CMS collision datasets.

Section 5 — Graph Neural Networks for Particle Tracking

Particle tracking — reconstructing the trajectories of particles through a detector from raw hits — is one of the most computationally expensive steps in HEP reconstruction. Traditionally done with Kalman filters, it scales poorly with pileup. Graph Neural Networks (GNNs) offer a natural solution: represent detector hits as graph nodes, potential track segments as edges, and use message-passing to classify which connections are real tracks.

The TrackML challenge (2018) demonstrated that GNN-based trackers could match classical reconstruction quality at a fraction of the computational cost. The ExaTrkX collaboration has since built GNN trackers that run efficiently on GPU hardware, reducing reconstruction time by orders of magnitude.

Python — Graph Neural Network for particle reconstruction (torch-geometric)
# pip install torch-geometric
from torch_geometric.nn import GCNConv, global_mean_pool
from torch_geometric.data import Data

# ── Build a graph from detector hits ─────────────────────────
# Each hit: node features = [r, phi, z] in cylindrical coords
# Edges: connect hits within dr < threshold (potential track segment)
def build_graph(hits_r, hits_phi, hits_z, dr_max=50.0):
    # Node features
    x = torch.stack([hits_r, hits_phi, hits_z], dim=1).float()
    # Build edges: connect distinct pairs within spatial distance dr_max
    diffs = x.unsqueeze(1) - x.unsqueeze(0)           # [N, N, 3]
    dists = (diffs**2).sum(-1).sqrt()                 # [N, N]
    mask  = (dists < dr_max) & ~torch.eye(len(x), dtype=torch.bool)  # no self-loops
    edges = mask.nonzero(as_tuple=False).t().contiguous()
    return Data(x=x, edge_index=edges)

# ── GNN for graph-level classification of hit graphs ─────────
# (production trackers such as ExaTrkX classify edges rather than whole graphs)
class ParticleGNN(nn.Module):
    def __init__(self, in_feats=3, hidden=64, out_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_feats, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.conv3 = GCNConv(hidden, hidden)
        self.lin   = nn.Linear(hidden, out_classes)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        # Message-passing: each node aggregates from its neighbours
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        x = self.conv3(x, edge_index).relu()
        # Pool all nodes into a single graph-level embedding
        x = global_mean_pool(x, batch)   # [batch_size, hidden]
        return self.lin(x)                 # [batch_size, out_classes]

model = ParticleGNN(in_feats=3, hidden=128, out_classes=5)
print(f"GNN parameters: {sum(p.numel() for p in model.parameters()):,}")

Section 6 — Anomaly Detection: The Model-Independent Search for New Physics

The biggest limitation of traditional HEP analysis is that you can only find what you're looking for. You define signal regions based on a specific model of new physics, and you look there. If nature chose a different model, you miss it entirely.

Anomaly detection flips this around: train a model to learn what "normal" Standard Model events look like, then flag events that are unusually hard to compress or reconstruct. No specific new physics hypothesis is required. If something genuinely new is in your dataset, it will look anomalous to a model trained on background.

Autoencoder for Anomaly Detection

Python — Unsupervised anomaly detection: autoencoder trained on Standard Model only
# Autoencoder: learns compressed representation of normal events
# Anomaly score = reconstruction error (high = unusual = potentially new physics)

class JetAutoencoder(nn.Module):
    def __init__(self, input_dim=20, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 32),         nn.ReLU(),
            nn.Linear(32, latent_dim)   # bottleneck: forces compression
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, 64),         nn.ReLU(),
            nn.Linear(64, input_dim)   # reconstruct original features
        )

    def forward(self, x):
        z    = self.encoder(x)
        x_hat= self.decoder(z)
        return x_hat, z

# ── Train ONLY on background (Standard Model) events ─────────
# X_background: float tensor [N_events, 20] of SM event features (e.g. via uproot)
ae    = JetAutoencoder(input_dim=20, latent_dim=4)
optim = torch.optim.Adam(ae.parameters(), lr=1e-3)

for epoch in range(100):
    x_hat, _ = ae(X_background)
    loss = nn.MSELoss()(x_hat, X_background)
    optim.zero_grad(); loss.backward(); optim.step()

# ── Score test events: high MSE = anomalous ─────────────────
with torch.no_grad():
    x_hat_test, _ = ae(X_test)
    anomaly_score = ((X_test - x_hat_test)**2).mean(dim=1)

# Events with highest anomaly_score are candidates for new physics
top_anomalies = anomaly_score.argsort(descending=True)[:100]
print("Top 100 anomalous events selected for follow-up analysis")
🔬 Real Application — The CMS Collaboration uses autoencoders: The CMS experiment at the LHC has published results using autoencoder-based anomaly detection for model-agnostic new physics searches (CMS-EXO-22-026). The technique was also central to the LHC Olympics 2020 challenge — an open competition to find a hidden signal in an unlabelled dataset.

Section 7 — Real-Time Triggers & Fast Inference with hls4ml

The LHC trigger system must make keep/reject decisions in microseconds. Standard neural networks running on CPUs are far too slow. The solution is hls4ml — a library that converts trained Keras/PyTorch models into firmware for FPGAs, achieving inference latencies of under 1 microsecond.

Python — hls4ml: convert a trained neural network to FPGA firmware for real-time triggers
# pip install hls4ml
import hls4ml
from tensorflow import keras

# Train a small Keras model for the trigger
trigger_model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(16,)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1,  activation='sigmoid')
])
trigger_model.compile(optimizer='adam', loss='binary_crossentropy')
# ... train on signal/background events ...

# ── Convert to FPGA firmware via hls4ml ──────────────────────
config = hls4ml.utils.config_from_keras_model(
    trigger_model,
    granularity='name',
    default_precision='ap_fixed<16,6>'   # 16-bit fixed point for speed
)

hls_model = hls4ml.convert_from_keras_model(
    trigger_model,
    hls_config=config,
    output_dir='trigger_hls',
    backend='VivadoAccelerator'    # or 'Quartus' for Intel FPGAs
)

# Compile the C-simulation model; full firmware synthesis is hls_model.build()
hls_model.compile()        # builds the C++ emulation (no Vivado needed yet)
hls_model.predict(X_test)  # bit-accurate C-simulation of the quantised model

print("Latency: ~100–500 ns on Xilinx Ultrascale+ FPGA") 

Section 8 — The Complete HEP ML Pipeline

Let's put it all together. A realistic particle physics ML analysis follows this pipeline:

1. Data ingestion — uproot reads ROOT files from CERN storage. awkward arrays handle variable-length jet constituent lists. Pandas DataFrames for event-level variables.
2. Feature engineering — derive high-level observables (invariant masses, angular separations, jet substructure variables) using 4-vector algebra. The vector library provides 4-momentum arithmetic.
3. Model training — binary classifier (signal vs background), multi-class jet tagger, or GNN for track reconstruction. Class-weight correction for imbalanced datasets.
4. Evaluation — ROC/AUC, signal efficiency vs background rejection curves (the HEP convention for ROC), and expected significance, e.g. s / sqrt(s + b), after the ML selection.
5. Systematic uncertainties — evaluate classifier stability under detector calibration uncertainties, pile-up conditions, and MC generator variations. This is where many ML analyses fail: a model that is sensitive to systematics is worse than a simple cut-based analysis.
6. Statistical interpretation — profile likelihood fits using pyhf or zfit. The ML score becomes the discriminant variable in the fit. Compute expected and observed significance.

External References & Further Reading

  • Radovic et al. (2018) — Machine learning at the energy and intensity frontiers of particle physics. Nature. doi.org/10.1038/s41586-018-0361-2 — The landmark review that put ML on every HEP physicist's radar.
  • Guest, Cranmer & Whiteson (2018) — Deep Learning and Its Application to LHC Physics. Annual Review of Nuclear and Particle Science. arXiv:1806.11484
  • Kasieczka et al. (2019) — The Machine Learning Landscape of Top Taggers. SciPost Physics. arXiv:1902.09914 — Standard benchmark for jet tagging algorithms, with public dataset.
  • Moreno et al. (2020) — JEDI-net: a jet identification algorithm based on interaction networks. EPJC. arXiv:1908.05318 — Graph-based jet tagging.
  • Govorkova et al. (2022) — Autoencoders for unsupervised anomaly detection in high energy physics. JHEP. arXiv:2112.09071 — Anomaly detection with autoencoders at the LHC.
  • Duarte et al. (2018) — Fast inference of deep neural networks in FPGAs for particle physics. JINST. arXiv:1804.06913 — The hls4ml paper: μs inference on FPGAs.
  • CERN Open Data Portal — opendata.cern.ch — Free access to real LHC collision data from CMS and ATLAS.
📋 Key Takeaways — Cluster 3
  • HEP pioneered ML at scale. The LHC trigger system, jet taggers, and event classifiers have used deep learning in production for years — before most of industry caught on.
  • uproot is your gateway. Read any ROOT file in pure Python, no C++ required. awkward-array handles the jagged structure of particle physics data naturally.
  • Imbalanced datasets are the norm. Signal events are rare — always use class-weighted loss or oversampling techniques. Never evaluate with raw accuracy.
  • Jets are point clouds or graphs. 1D-CNN on constituents with global max-pool gives permutation invariance. GNNs model particle interactions directly.
  • Anomaly detection = model-agnostic new physics search. Train an autoencoder on Standard Model events; high reconstruction error = candidate for new physics.
  • Systematics kill analyses. A model that is 5% better in AUC but 20% more sensitive to pile-up conditions is worse than a simpler model. Always evaluate systematic stability.
