Every physicist drowns in literature. More than two million scientific papers are published annually, far more than any researcher can read, let alone synthesise. Large language models have fundamentally changed how researchers interact with this firehose of knowledge. This cluster covers every practical NLP tool in the physicist's arsenal: semantic search, automated summarisation, RAG pipelines, equation extraction, fine-tuning on domain text, and LLM-assisted code generation, with working Python code throughout.
AI for Physics Students › Cluster 9: NLP & LLMs for Physics
Clusters 1–8 focused on ML tools that help physicists do physics — fitting data, solving equations, simulating systems, discovering laws. Cluster 9 focuses on something different: using AI to help you read, understand, and build on the physics that has already been done. The literature is the accumulated knowledge of the field. LLMs are becoming the interface to that knowledge.
- How Transformers Work (Physics Intuition)
- Semantic Search of arXiv with SPECTER
- Building a RAG Pipeline for Literature Review
- Automated Paper Summarisation
- LaTeX Equation Extraction & Classification
- Fine-Tuning LLMs on Physics Text
- LLM-Assisted Physics Coding
- Responsible Use: Hallucination & Trust
Section 1 — How Transformers Work: A Physicist’s Intuition
Before using LLMs as tools, it helps to understand what they are doing — at least at the level of physical intuition. The transformer architecture, introduced by Vaswani et al. (2017), processes sequences by computing attention: a learned measure of relevance between every pair of positions in the input.
Think of it this way. Each token (a word, subword, or symbol) has three representations: a Query (what this token is looking for), a Key (what this token advertises about itself), and a Value (what this token contributes if attended to). The attention score between two tokens is the dot product Q·K, divided by √d_k so that scores do not grow with embedding dimension and saturate the softmax, which would produce vanishing gradients. The output is a weighted sum of Values, where the weights are the softmax-normalised attention scores.
For a physicist, this is a non-local, learned Green’s function. In a classical field theory, G(x, x′) couples field values at different spacetime points. Attention couples token representations at different sequence positions. The “field” is the sequence of token embeddings; the “coupling” is learned from data rather than derived from physics. The parallel is remarkably close — and it explains why transformers can handle long-range dependencies that RNNs cannot.
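As a minimal illustration, the whole attention computation fits in a few lines of NumPy. This is a sketch: a single attention head with random (rather than learned) projection weights, just to make the Q·K/√d_k mechanics concrete.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over token embeddings X [T, d]."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values [T, d_k]
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise relevance [T, T]
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # softmax over keys: rows sum to 1
    return w @ V                                  # weighted sum of values [T, d_k]

rng = np.random.default_rng(0)
T, d, d_k = 5, 8, 4                               # toy sequence of 5 tokens
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Every output row mixes information from every input position, weighted by learned relevance — the non-locality the Green's function analogy points at.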
Section 2 — Semantic Search of arXiv with Physics-Specific Embeddings
Standard keyword search of arXiv finds papers that contain your exact keywords. Semantic search finds papers that mean what you mean, even if they use different terminology. A query for “neural network potential energy surface” should also return papers about “machine learning interatomic potentials” and “deep learning force fields” — because they describe the same concept using different vocabulary. Keyword search fails here; semantic search succeeds.
The tool is sentence embeddings: a model that maps text to a dense vector in a semantic space where similar meanings cluster together. For physics, a state-of-the-art choice is SPECTER (Cohan et al. 2020), trained on citation graphs of scientific papers — if paper A cites paper B, their embeddings should be similar. SPECTER understands domain vocabulary: "Hamiltonian" and "energy operator" will have similar vectors.
```python
# pip install sentence-transformers arxiv faiss-cpu
import arxiv
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# ── Step 1: Download physics papers from arXiv ─────────────────
client = arxiv.Client()
search = arxiv.Search(
    query       = 'cat:cond-mat OR cat:hep-ph OR cat:astro-ph',
    max_results = 5000,
    sort_by     = arxiv.SortCriterion.SubmittedDate
)
papers = []
for result in client.results(search):
    papers.append({
        'id':       result.entry_id,
        'title':    result.title,
        'abstract': result.summary,
        'authors':  [a.name for a in result.authors[:3]],
        'date':     str(result.published.date()),
        'url':      result.pdf_url
    })
print(f"Downloaded {len(papers)} papers")

# ── Step 2: Embed abstracts with SPECTER (physics-trained model) ─
# allenai/specter2_base is the SPECTER2 checkpoint loadable directly with
# sentence-transformers (allenai/specter2 itself is an adapter on top of it)
model = SentenceTransformer('allenai/specter2_base')
texts = [f"{p['title']} {p['abstract']}" for p in papers]
embeddings = model.encode(
    texts,
    batch_size           = 64,
    show_progress_bar    = True,
    normalize_embeddings = True   # unit vectors for cosine similarity
)
print(f"Embeddings shape: {embeddings.shape}")  # [5000, 768]

# ── Step 3: Build FAISS index for fast nearest-neighbour search ─
# FAISS: Facebook AI Similarity Search — handles millions of vectors
d = embeddings.shape[1]        # embedding dimension: 768
index = faiss.IndexFlatIP(d)   # inner product (= cosine for unit vectors)
index.add(embeddings.astype(np.float32))
print(f"FAISS index: {index.ntotal} vectors")

# ── Step 4: Semantic search ─────────────────────────────────────
def semantic_search(query, top_k=10):
    q_emb = model.encode([query], normalize_embeddings=True)
    scores, indices = index.search(q_emb.astype(np.float32), top_k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        p = papers[idx]
        results.append({**p, 'similarity': float(score)})
    return results

# ── Example queries ─────────────────────────────────────────────
queries = [
    'machine learning interatomic potentials molecular dynamics',
    'transformer architecture attention quantum many-body systems',
    'normalizing flows posterior sampling particle physics',
]
for q in queries:
    results = semantic_search(q, top_k=5)
    print(f"\nQuery: {q}")
    for r in results:
        print(f"  [{r['similarity']:.3f}] {r['title'][:70]}")
```
💡 Model Choices for Physics Embeddings
SPECTER2 (allenai/specter2) is the best general-purpose scientific paper embedding model — trained on 75M citation pairs from Semantic Scholar. For pure physics text, SciBERT (allenai/scibert_scivocab_uncased) is a BERT model pre-trained on 1.14M scientific papers. For cross-modal tasks involving equations, consider MathBERT or the recently released LLEMMA (math-focused LLM). Start with SPECTER2 for literature search.
Section 3 — Building a RAG Pipeline for Physics Literature Review
Retrieval-Augmented Generation (RAG) is the most practical and reliable way to use LLMs for scientific literature review. The idea: instead of asking an LLM to answer from its training data (which may be outdated or hallucinated), you first retrieve relevant papers from your own database, then include them as context in the LLM prompt. The model answers based on retrieved documents, not from memory.
For physics research, this is transformative. You can build a private RAG system over your group’s preprints, your institution’s published work, a curated reading list, or the full arXiv corpus in your subfield. When you ask “what is the current experimental status of the muon g-2 anomaly?”, the system retrieves the five most relevant recent papers and synthesises them into a structured answer — with citations you can verify.
```python
# pip install langchain langchain-community langchain-openai chromadb
# RAG pipeline: retrieve relevant chunks, then generate with LLM
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# ── Step 1: Collect and chunk physics papers ───────────────────
splitter = RecursiveCharacterTextSplitter(
    chunk_size    = 800,   # characters per chunk
    chunk_overlap = 100,   # overlap to preserve context across chunks
    separators    = ['\n\n', '\n', '. ', ' ']  # respect paragraph structure
)
documents = []
for paper in papers[:200]:   # the papers downloaded in Section 2
    text = f"{paper['title']}\n\n{paper['abstract']}"
    chunks = splitter.create_documents(
        [text],
        metadatas=[{'title': paper['title'], 'url': paper['url'], 'date': paper['date']}]
    )
    documents.extend(chunks)
print(f"Total chunks: {len(documents)}")

# ── Step 2: Embed and store in vector database (Chroma) ────────
# Chroma: lightweight, local vector DB — no server needed
embedding_model = HuggingFaceEmbeddings(
    model_name    = 'allenai/specter2_base',  # SPECTER2 base checkpoint
    model_kwargs  = {'device': 'cpu'},
    encode_kwargs = {'normalize_embeddings': True}
)
vectorstore = Chroma.from_documents(
    documents, embedding_model, persist_directory='./physics_rag_db'
)
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})

# ── Step 3: RAG chain with physics-tuned prompt ────────────────
PHYSICS_RAG_PROMPT = """You are a physics research assistant with deep domain expertise.
Answer the question using ONLY the provided context from scientific papers.
Always cite the paper titles when making specific claims.
If the context does not contain enough information, say so explicitly.
Do NOT speculate beyond what the papers say.

Context from retrieved papers:
{context}

Question: {question}

Answer (cite paper titles inline):"""
prompt = ChatPromptTemplate.from_template(PHYSICS_RAG_PROMPT)

# ── Step 4: Chain retrieval + generation ───────────────────────
# Using OpenAI API — replace with Anthropic/local model as preferred
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.1)  # low temp for factual answers

def rag_answer(question):
    docs = retriever.invoke(question)
    context = '\n\n---\n\n'.join([d.page_content for d in docs])
    sources = list(set([d.metadata['title'] for d in docs]))
    chain = prompt | llm | StrOutputParser()
    answer = chain.invoke({'context': context, 'question': question})
    return answer, sources

# Example questions a physicist might ask:
questions = [
    'What ML methods are used for jet tagging at the LHC?',
    'How do neural network potentials compare to DFT in accuracy?',
    'What is the current state of gravitational wave detection with ML?',
]
for q in questions:
    answer, sources = rag_answer(q)
    print(f'Q: {q}')
    print(f'A: {answer[:300]}...')
    print(f'Sources: {sources}\n')
```
Section 4 — Automated Paper Summarisation at Scale
Reading a paper fully takes 30–90 minutes. Skimming the abstract, introduction, and conclusions takes 5–10 minutes. An LLM can produce a structured summary in 10 seconds. For a physicist keeping up with a fast-moving subfield, this is a genuine productivity multiplier — provided you trust the summary enough to decide whether the full paper is worth reading, and you read the full paper before citing anything.
```python
# pip install requests beautifulsoup4 anthropic
# Automated paper summarisation pipeline
# Works with arXiv papers via their HTML rendering, or any text
import requests
from bs4 import BeautifulSoup
from anthropic import Anthropic

# ── Fetch full paper text from arXiv HTML endpoint ─────────────
def fetch_arxiv_text(arxiv_id):
    """Fetch plain text of an arXiv paper via the ar5iv HTML endpoint."""
    url = f'https://ar5iv.labs.arxiv.org/html/{arxiv_id}'
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.text, 'html.parser')
    article = soup.find('article')   # main article text, skip page chrome
    if not article:
        return None
    # Remove reference section
    for refs in article.find_all(class_=['ltx_bibliography']):
        refs.decompose()
    return article.get_text(separator=' ', strip=True)[:12000]  # ~3k tokens

# ── Structured physics summary prompt ──────────────────────────
SUMMARY_PROMPT = '''You are a senior physicist. Summarise this paper concisely.
Structure your summary exactly as follows:

**Problem**: What specific problem does this paper address?
**Method**: What ML/computational approach do they use? (2-3 sentences)
**Key result**: What is the most important quantitative finding?
**Significance**: Why does this matter for the field?
**Limitations**: What does the paper not address or where might it fail?
**Recommended for**: What type of physicist should read this in full?

Paper text:
{text}

Summary:'''

# ── Summarise a paper ──────────────────────────────────────────
client = Anthropic()

def summarise_paper(arxiv_id):
    text = fetch_arxiv_text(arxiv_id)
    if not text:
        return 'Could not fetch paper text.'
    response = client.messages.create(
        model      = 'claude-sonnet-4-20250514',
        max_tokens = 1000,
        messages   = [{'role': 'user', 'content': SUMMARY_PROMPT.format(text=text)}]
    )
    return response.content[0].text

# ── Batch summarise a reading list ─────────────────────────────
reading_list = [
    '2310.06825',  # GNoME materials paper
    '2112.09071',  # autoencoder anomaly detection HEP
    '2203.07404',  # causal PINNs
]
for arxiv_id in reading_list:
    print('\n' + '=' * 60 + f'\narXiv:{arxiv_id}')
    print(summarise_paper(arxiv_id))
```
Section 5 — LaTeX Equation Extraction and Classification
Physics papers are unique in scientific literature: they are dense with mathematical expressions encoded as LaTeX. For many tasks — building equation databases, automatic knowledge graphs, training domain-specific models — you need to extract, parse, and classify equations from papers. This is a non-trivial NLP problem because equations are interspersed with natural language, span multiple lines, and can be nested arbitrarily.
```python
# pip install requests anthropic
# Extract and classify equations from LaTeX source files
# arXiv provides source .tar.gz files for most papers
import re, tarfile, io, requests

# ── Download LaTeX source from arXiv ───────────────────────────
def get_arxiv_latex(arxiv_id):
    url = f'https://arxiv.org/src/{arxiv_id}'
    resp = requests.get(url, timeout=30)
    try:
        with tarfile.open(fileobj=io.BytesIO(resp.content)) as tar:
            for member in tar.getmembers():
                if member.name.endswith('.tex'):
                    f = tar.extractfile(member)
                    if f:
                        return f.read().decode('utf-8', errors='ignore')
    except tarfile.TarError:   # some papers ship a single file, not a tarball
        pass
    return None

# ── Extract display equations ($$...$$, \[...\], equation env) ─
def extract_equations(latex_text):
    patterns = [
        r'\\begin\{equation\*?\}(.*?)\\end\{equation\*?\}',
        r'\\begin\{align\*?\}(.*?)\\end\{align\*?\}',
        r'\\begin\{eqnarray\*?\}(.*?)\\end\{eqnarray\*?\}',
        r'\$\$(.+?)\$\$',
        r'\\\[(.*?)\\\]',
    ]
    equations = []
    for pattern in patterns:
        matches = re.findall(pattern, latex_text, re.DOTALL)
        equations.extend([m.strip() for m in matches if len(m.strip()) > 5])
    return list(set(equations))   # deduplicate

# ── Classify equations by type using LLM ───────────────────────
from anthropic import Anthropic
client = Anthropic()

def classify_equation(eq_latex):
    prompt = f"""Classify this LaTeX equation into ONE of these categories:
definition | conservation_law | equation_of_motion | loss_function |
probability | wave_equation | thermodynamic | other

Equation: {eq_latex[:200]}

Respond with ONLY the category name, nothing else."""
    resp = client.messages.create(
        model='claude-haiku-4-5-20251001',   # fast + cheap for classification
        max_tokens=10,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return resp.content[0].text.strip().lower()

# ── Process a paper and build equation taxonomy ────────────────
latex = get_arxiv_latex('2101.03164')   # NequIP paper
if latex:
    eqs = extract_equations(latex)
    print(f"Found {len(eqs)} equations")
    taxonomy = {}
    for eq in eqs[:20]:   # classify first 20
        cat = classify_equation(eq)
        taxonomy.setdefault(cat, []).append(eq[:80] + '...')
    for cat, examples in taxonomy.items():
        print(f"\n{cat.upper()} ({len(examples)} equations)")
        print(f"  Example: {examples[0]}")
```
Section 6 — Fine-Tuning LLMs on Physics Text
General-purpose LLMs (GPT-4, Claude, Llama) are trained on broad internet text and have reasonable but imperfect physics knowledge. For specialised tasks — generating valid LaTeX equations, completing physics derivations, extracting structured data from specific paper formats — fine-tuning a smaller model on domain-specific data can significantly improve performance at much lower cost than using a large API-based model.
The standard approach is LoRA (Low-Rank Adaptation): instead of fine-tuning all model weights, you add small trainable rank-decomposition matrices to the attention layers. This cuts the number of trainable parameters by up to 10,000×, making fine-tuning feasible on a single GPU. Combined with 4-bit quantisation (QLoRA), you can fine-tune a 7B- or 13B-parameter model on a single A100 in hours.
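The parameter saving is simple arithmetic: fully fine-tuning one d_in × d_out weight matrix trains d_in·d_out numbers, while a rank-r LoRA update W + B·A trains only r·(d_in + d_out). A quick sketch with illustrative dimensions (a 4096 × 4096 projection, roughly Llama-scale; exact counts vary by model and layer):

```python
# LoRA parameter counting for one weight matrix W' = W + B @ A
# (illustrative dimensions; real models have many such layers)
d_in, d_out, r = 4096, 4096, 16

full_params = d_in * d_out           # fine-tune W directly
lora_params = r * d_in + d_out * r   # A is [r, d_in], B is [d_out, r]

print(full_params)                                      # 16777216
print(lora_params)                                      # 131072
print(f"{lora_params / full_params:.2%} of the layer")  # 0.78% of the layer
```

The same ratio holds per attention projection, which is why the trainable fraction of the whole model ends up well under one percent.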
```python
# pip install transformers peft datasets bitsandbytes trl
# QLoRA fine-tuning: 4-bit quantisation + LoRA
# Hardware: single A100 (80GB) or 2x A6000 GPUs
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import Dataset
import torch

# ── Step 1: Prepare physics fine-tuning dataset ────────────────
# Format: instruction-following pairs from physics papers
physics_examples = [
    {
        'instruction': 'Explain the physical meaning of the variational principle in quantum mechanics.',
        'output': 'The variational principle states that for any trial wavefunction |psi>, the expectation value <psi|H|psi>/<psi|psi> provides an upper bound on the true ground state energy E_0. This follows from expanding |psi> in the energy eigenbasis...'
    },
    {
        'instruction': 'Write the CGCNN message-passing update equation in LaTeX.',
        'output': 'The CGCNN update is: \\mathbf{h}_i^{(l+1)} = \\mathbf{h}_i^{(l)} + \\sum_{j \\in \\mathcal{N}(i)} \\sigma\\left(\\mathbf{z}_{ij}^{(l)} \\mathbf{W}_g\\right) \\odot g\\left(\\mathbf{z}_{ij}^{(l)} \\mathbf{W}_f\\right)'
    },
    # ... thousands more examples from papers and textbooks
]

# Format for instruction fine-tuning
def format_example(ex):
    return f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}<|endoftext|>"

dataset = Dataset.from_list([{'text': format_example(e)} for e in physics_examples])

# ── Step 2: Load base model with 4-bit quantisation (QLoRA) ────
bnb_config = BitsAndBytesConfig(
    load_in_4bit              = True,
    bnb_4bit_quant_type       = 'nf4',        # NF4: best for LLM weights
    bnb_4bit_compute_dtype    = torch.bfloat16,
    bnb_4bit_use_double_quant = True,         # double quant for extra compression
)
model_name = 'meta-llama/Llama-3.1-8B'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config = bnb_config,
    device_map          = 'auto',
    torch_dtype         = torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# ── Step 3: Configure LoRA ─────────────────────────────────────
# Only train rank-16 adapter matrices in the attention projections;
# a small fraction of a percent of the 8B total parameters
lora_config = LoraConfig(
    r              = 16,     # LoRA rank
    lora_alpha     = 32,     # scaling factor
    target_modules = ['q_proj', 'v_proj', 'k_proj', 'o_proj'],
    lora_dropout   = 0.05,
    bias           = 'none',
    task_type      = 'CAUSAL_LM'
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # prints trainable / total parameter counts

# ── Step 4: Train with SFTTrainer ──────────────────────────────
trainer = SFTTrainer(
    model              = model,
    train_dataset      = dataset,
    dataset_text_field = 'text',
    max_seq_length     = 2048,
    args = TrainingArguments(
        output_dir                  = './physics-llm',
        num_train_epochs            = 3,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,   # effective batch = 16
        learning_rate               = 2e-4,
        fp16                        = False,
        bf16                        = True,
        logging_steps               = 25,
        save_strategy               = 'epoch',
        warmup_ratio                = 0.03,
        lr_scheduler_type           = 'cosine',
    )
)
trainer.train()
model.save_pretrained('./physics-llm-lora')
```
Section 7 — LLM-Assisted Physics Coding
One of the highest-ROI applications of LLMs for physicists is code generation. Not replacing the physicist — but handling the boilerplate, suggesting implementations, catching bugs, and translating between mathematical formulations and code. A physicist who uses LLM-assisted coding effectively can implement in hours what used to take days.
The key is knowing how to prompt effectively for physics code. Vague prompts give vague results. Precise prompts that include the mathematical formulation, the expected input/output shapes, the physical constraints, and the test cases give production-quality code on the first attempt.
```python
# Effective prompting patterns for physics code generation
# The more physics context you provide, the better the output

# ── Pattern 1: Equation-to-code with explicit context ──────────
GOOD_PHYSICS_PROMPT = '''
Implement the following in Python using PyTorch:

The Ornstein-Uhlenbeck (OU) process for a particle in a harmonic trap:
    dx = -gamma * x * dt + sigma * sqrt(dt) * N(0,1)
where gamma is the restoring rate and sigma is the noise amplitude.

Requirements:
- Simulate N=1000 particles for T=200 time steps with dt=0.01
- gamma=1.0, sigma=0.5, initial positions drawn from N(0,1)
- Return tensor of shape [T, N] containing all trajectories
- Include the analytical stationary variance: Var_ss = sigma^2 / (2*gamma)
- Verify numerically that the simulated variance matches the analytical value
'''

# ── Pattern 2: Debug with physics context ──────────────────────
DEBUG_PROMPT = '''
This PINN is training but the physics residual loss stays > 0.1
even after 10,000 steps. Expected: < 0.001 for this ODE.
The ODE is: du/dt = -2u, u(0) = 1 (exact solution: u = exp(-2t))

[PASTE CODE HERE]

Diagnose the issue. Check: loss weights, collocation point density,
learning rate, architecture depth, and activation function choice.
Suggest specific fixes with physical justification.
'''

# ── Pattern 3: Validate a physics implementation ───────────────
VALIDATE_PROMPT = '''
Review this Monte Carlo integration of the 2D Ising partition function.
Check for:
1. Correct Boltzmann weight exp(-E/kT) in acceptance criterion
2. Proper periodic boundary conditions
3. Correct normalisation for energy per site
4. Any off-by-one errors in the spin update loop
5. Whether the magnetisation calculation is correct

[PASTE CODE HERE]

For each issue found, explain why it matters physically and provide a fix.
'''

# ── Pattern 4: Translate maths to code ─────────────────────────
TRANSLATE_PROMPT = '''
Convert this LaTeX equation to a numerically stable PyTorch implementation.

Equation (from the paper):
    \\hat{A}_t = \\sum_{k=0}^{T-t} (\\gamma\\lambda)^k \\delta_{t+k}
where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) (TD error)

Requirements:
- Input: rewards tensor [T], values tensor [T+1], gamma=0.99, lambda_gae=0.95
- Output: advantages tensor [T]
- Must be computed in O(T) not O(T^2) — use the recursive form
- Handle the terminal state correctly (V(s_T) = 0)
- Include docstring explaining each variable's physical meaning
'''

# These prompt patterns consistently outperform vague requests like
# 'implement ising model' or 'fix my PINN code': the physics context
# tells the LLM what constraints matter.
```
Section 8 — Responsible Use: Hallucination, Trust, and Scientific Integrity
The most important section in this cluster. LLMs are powerful tools, but using them irresponsibly in scientific research can damage your credibility, spread misinformation, and — in the worst case — produce published results that are wrong. This section lays out the principles that distinguish responsible from irresponsible use.
LLMs hallucinate plausible-sounding paper titles, authors, journal names, and page numbers. Even RAG systems can generate citations that slightly misrepresent the source. If you include a citation in your paper, you must have read it. This is a non-negotiable principle of academic integrity, and LLMs do not change it.
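One concrete defence, sketched here with the `arxiv` client used earlier (the helper names are my own): look up every LLM-quoted arXiv ID and compare the real title against the claimed one. A fabricated citation fails either at lookup or at the title comparison.

```python
def title_matches(title, expected_words):
    """Pure check: does every claimed keyword appear in the actual title?"""
    t = title.lower()
    return all(w.lower() in t for w in expected_words)

def verify_arxiv_citation(arxiv_id, expected_words):
    """Look up an LLM-quoted arXiv ID and compare the real title against the
    claim. Returns (matches, actual_title). Requires network access."""
    import arxiv  # deferred import: only needed for the live lookup
    results = list(arxiv.Client().results(arxiv.Search(id_list=[arxiv_id])))
    if not results:
        return False, 'no such arXiv ID'
    return title_matches(results[0].title, expected_words), results[0].title

# Offline demonstration of the matching logic:
print(title_matches('Attention Is All You Need', ['attention', 'need']))  # True
print(title_matches('Attention Is All You Need', ['BERT']))               # False
```

This catches outright fabrications automatically; verifying that the citation actually supports the claim still requires reading the paper.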
If an LLM says “this method achieves 0.95 AUC on the ATLAS dataset”, verify the number against the actual paper. LLMs interpolate between training examples and can produce plausible-but-wrong quantitative claims. Any specific number in scientific context must be verified at the source.
Use LLMs to get oriented in a new subfield quickly, understand how concepts relate, generate candidate papers to read, draft initial text for later careful revision, and debug code. These are legitimate, high-value uses that accelerate research without compromising integrity.
Most journals and conferences now have policies on LLM disclosure. If you used an LLM to help draft text, generate code, or process data, state this explicitly in your methods section. The standard is converging toward: disclose use, take full responsibility for accuracy, and do not list LLMs as authors.
LLM-generated physics code can be syntactically correct but physically wrong — wrong sign conventions, missing normalisations, incorrect boundary conditions. Always verify generated code against known analytic results before trusting it for research purposes. The prompting patterns in Section 7 are designed to produce verifiable code precisely for this reason.
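As a concrete instance of this practice, here is a check of the OU simulation from Section 7's Pattern 1 against its analytic stationary variance, written in NumPy for brevity (the particle count, burn-in length, and tolerance are illustrative choices):

```python
import numpy as np

# Verify a simulated OU process against the analytic stationary variance
# Var_ss = sigma^2 / (2 * gamma) — the check Pattern 1 explicitly asks for
rng = np.random.default_rng(42)
gamma, sigma, dt = 1.0, 0.5, 0.01
N, T = 10_000, 5_000            # many particles, long enough to equilibrate

x = rng.normal(size=N)          # initial positions from N(0, 1)
for _ in range(T):              # Euler-Maruyama integration
    x += -gamma * x * dt + sigma * np.sqrt(dt) * rng.normal(size=N)

var_analytic = sigma**2 / (2 * gamma)       # 0.125
var_sim = x.var()
print(f"simulated {var_sim:.4f} vs analytic {var_analytic:.4f}")
assert abs(var_sim - var_analytic) < 0.01   # agree to within Monte Carlo error
```

If the generated code had a wrong sign, a missing sqrt(dt), or a misplaced factor of 2, this assertion would fail immediately — which is exactly the point.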
External References & Further Reading
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS. arXiv:1706.03762 — The transformer paper. Still the most important ML paper of the 2010s.
- Cohan et al. (2020) — SPECTER: Document-level Representation Learning using Citation-informed Transformers. ACL. arXiv:2004.07180 — The scientific paper embedding model.
- Beltagy et al. (2019) — SciBERT: A Pretrained Language Model for Scientific Text. EMNLP. arXiv:1903.10676
- Azerbayev et al. (2023) — LLEMMA: An Open Language Model for Mathematics. arXiv:2310.10631 — LLM pre-trained on mathematical text and code.
- Lewis et al. (2020) — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401 — The original RAG paper.
- Hu et al. (2022) — LoRA: Low-Rank Adaptation of Large Language Models. ICLR. arXiv:2106.09685
- Semantic Scholar API — api.semanticscholar.org — Free API for 200M+ scientific papers with SPECTER embeddings pre-computed.
- Transformers are learned non-local correlation functions. Attention couples every token to every other token via Q·K inner products. Understanding this helps you write better prompts: specific vocabulary activates the right attention patterns.
- SPECTER2 is your embedding model for physics. Pre-trained on citation graphs, it understands domain vocabulary. Build your semantic search index with it, stored in FAISS for fast nearest-neighbour retrieval.
- RAG beats asking from memory for factual questions. Retrieve 5 relevant paper chunks, include them in context, generate the answer. The model answers from your documents, not from hallucinated training data.
- Equation-specific prompts beat vague ones by a large margin. Include: the mathematical formulation, expected input/output shapes, physical constraints, test cases. These four elements make the difference between useful and mediocre generated code.
- QLoRA makes fine-tuning accessible. 4-bit quantisation + LoRA (rank 16) trains well under 1% of the parameters. A 7B model fine-tuned on physics text can match much larger general-purpose models on narrow domain tasks at a fraction of the inference cost.
- Verify everything. Never cite unread papers. Always check numerical claims at the source. Disclose LLM use in your methods. The speed gain from LLMs is real; the responsibility for accuracy is still yours.
