DNA Basics for CRISPR: Genes, Chromosomes & Mutations

📋 In This Article

What DNA Actually Is
The Four Bases & Base Pairing
The Double Helix: Structure in 3D
Chromosomes: Packaging 2 Metres into a Nucleus
Genes: The Instructions Within the Sequence
Transcription: DNA → RNA
Translation: RNA → Protein
Mutations: When the Code Goes Wrong
A Real Example: The Sickle Cell Mutation
How This All Connects Back to CRISPR

Section 1 — What DNA Actually Is

Almost every cell in your body (mature red blood cells, which lose their nucleus, are the best-known exception) contains the complete instruction manual for building and running a human being. That instruction manual is DNA: deoxyribonucleic acid. It is a molecule. Not a concept, not a metaphor, but a physical, chemical molecule that sits in the nucleus of each cell and can be extracted, purified, observed under an electron microscope, and directly manipulated. CRISPR manipulates it.

DNA is a polymer — a long chain built from repeating smaller units called nucleotides. Each nucleotide has three parts: a sugar molecule (deoxyribose), a phosphate group, and one of four possible nitrogen-containing bases. The sugar and phosphate groups link together to form the backbone of the DNA chain. The bases hang off the backbone like teeth on a comb. It is the sequence of these bases that encodes all biological information.

The four bases are Adenine (A), Thymine (T), Guanine (G), and Cytosine (C). That’s it. Everything about every living organism on Earth (every protein, every cell type, every inherited trait) is ultimately encoded in sequences of these four letters. The human genome contains approximately 3.2 billion of them, and most human cells carry two copies, one inherited from each parent. Written out at standard font size, that sequence would fill a stack of paperback novels roughly 200 metres tall.

🧪 The Four DNA Bases at a Glance

Adenine

Pairs only with Thymine (T). Double hydrogen bond.

Thymine

Pairs only with Adenine (A). Double hydrogen bond.

Guanine

Pairs only with Cytosine (C). Triple hydrogen bond.

Cytosine

Pairs only with Guanine (G). Triple hydrogen bond.

💡 Analogy: The Alphabet of Life

Think of A, T, G, and C as a four-letter alphabet. Instead of 26 letters making words and sentences, these four letters make genes and genomes. Just as rearranging letters in different sequences gives you entirely different words — “cat,” “act,” “tac” — rearranging DNA bases in different sequences gives you different genes, different proteins, different organisms. The entire diversity of life — every bacterium, plant, animal, and human — is written in this same four-letter alphabet.

Section 2 — Base Pairing: The Rule That Makes Everything Work

The most important property of DNA bases is that they pair with each other in a very specific, exclusive way. Adenine always pairs with Thymine. Guanine always pairs with Cytosine. These are not preferences or tendencies — they are hard chemical constraints determined by the shape and charge of the molecules. A always pairs with T. G always pairs with C. Full stop.

This base-pairing rule is called Watson-Crick base pairing, after James Watson and Francis Crick, who proposed the double helix structure in 1953 (building on the X-ray crystallography data of Rosalind Franklin). The pairing is held together by hydrogen bonds, which are weak electrostatic attractions between certain atoms. A-T pairs have two hydrogen bonds; G-C pairs have three, making G-C pairs slightly stronger and G-C rich regions of DNA slightly more stable.

Why does this matter so much? Because base pairing means that if you know the sequence of one strand of DNA, you automatically know the sequence of the other. The two strands are complementary mirror images of each other. A sequence reading 5’-ATGCGT-3’ on one strand will always have 3’-TACGCA-5’ on the complementary strand. This complementarity is what makes DNA replication, transcription, and CRISPR targeting all possible.
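The complementarity rule is mechanical enough to write down in a few lines. The Python sketch below is illustrative only: the function names are made up for this example, and it assumes a strand is a plain string of A, T, G, and C.

```python
# Watson-Crick pairing rules from the text: A<->T, G<->C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the partner strand base by base.

    If the input is written 5'->3', the output reads 3'->5'.
    """
    return "".join(PAIRS[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the partner strand rewritten in the usual 5'->3' direction."""
    return complement(strand)[::-1]

top = "ATGCGT"                   # 5'-ATGCGT-3'
print(complement(top))           # TACGCA, i.e. 3'-TACGCA-5'
print(reverse_complement(top))   # ACGCAT, the same strand written 5'->3'
```

Both outputs describe the same physical strand. Sequence databases and design tools usually report the second form, because sequences are conventionally written 5’ to 3’.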

🧬 Key Concept: Why CRISPR Uses Base Pairing to Find Its Target

The guide RNA in CRISPR-Cas9 finds its target by base pairing. The 20-nucleotide spacer sequence in the guide RNA is complementary to the target DNA sequence. When the guide RNA encounters its matching DNA sequence, the RNA bases pair with the DNA bases one by one, forming a stable RNA-DNA hybrid. This is the recognition event that tells Cas9 “cut here.” If even a few bases don’t match, the hybrid is unstable and Cas9 is much less likely to cut. Base pairing is therefore both the source of CRISPR precision and the cause of off-target effects when near-matches exist elsewhere in the genome.
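To make the idea of near-matches concrete, here is a rough Python sketch that counts mismatches between a spacer and candidate sites. The sequences are invented for illustration, and the sketch makes no claim about how many mismatches Cas9 actually tolerates, which depends on their position and on the experimental system.

```python
def count_mismatches(spacer: str, site: str) -> int:
    """Count positions where a spacer and a same-length DNA site differ.

    Both sequences are written 5'->3' in DNA letters (T instead of U) so
    they can be compared directly.
    """
    if len(spacer) != len(site):
        raise ValueError("spacer and site must have the same length")
    return sum(a != b for a, b in zip(spacer.upper(), site.upper()))

# Hypothetical sequences, invented for illustration only.
spacer = "GACGTTAGCCTAGGATCCAA"
candidates = {
    "on_target":  "GACGTTAGCCTAGGATCCAA",
    "near_match": "GACGTTAGCCTAGGATCGAA",
    "poor_match": "GACCTTAGGCTAGCATCGAA",
}

for name, site in candidates.items():
    print(name, count_mismatches(spacer, site))
# on_target 0
# near_match 1
# poor_match 4
```

A site like near_match is the kind of sequence that off-target prediction tools flag for closer inspection.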

Section 3 — The Double Helix: DNA’s Three-Dimensional Structure

A single strand of DNA is already a remarkable molecule. But the DNA stored in cells is almost never single-stranded: the strands separate only briefly and locally, for example during replication and transcription. It exists as a double helix, two complementary strands wound around each other in a right-handed spiral. This is the iconic structure that Watson, Crick, and Franklin worked out in 1953, and it is the actual physical form in which genetic information is stored and passed between generations.

The double helix is like a ladder that has been twisted. The sides of the ladder are the sugar-phosphate backbones — the structural scaffold of the molecule. The rungs of the ladder are the base pairs — A:T and G:C hydrogen-bonded pairs that hold the two strands together and encode the genetic information. The helix completes one full twist every approximately 10 base pairs, a distance of 3.4 nanometres.

The two strands run in opposite directions — they are antiparallel. One strand runs in the 5’ to 3’ direction (these refer to which carbon of the deoxyribose sugar is at each end of the chain), and the complementary strand runs 3’ to 5’. This directionality matters enormously for DNA replication and transcription, and for CRISPR: the guide RNA must be designed in the correct orientation relative to the target strand.

💡 Analogy: The Twisted Ladder

Imagine a rope ladder laid flat on the ground. The left and right ropes are the sugar-phosphate backbones. Each rung is a base pair: on one side of each rung is an A or G, on the other side is the complementary T or C. Now grab one end of the ladder and twist: the left rope spirals around the right. That twisted ladder is the double helix. The information is all in the rungs — their sequence from top to bottom is the genetic code.

The Major and Minor Grooves

When the double helix twists, it creates two channels running along its length: the major groove (wider, ~2.2 nm) and the minor groove (narrower, ~1.2 nm). These grooves are not just structural features — they are the access points through which proteins read the DNA sequence without unwinding the helix. The shape and chemical signature of the base pairs are readable from the major groove, which is why most DNA-binding proteins recognise their target sequences by fitting into the major groove like a key into a lock.

Cas9 is no exception. Part of its recognition mechanism involves reading the PAM sequence (NGG) in the major groove of the DNA before unwinding the helix to allow guide RNA base pairing. Understanding the groove structure is key to understanding why Cas9 can only access certain target sites.
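The PAM requirement can be illustrated with a small search: find every 20-base stretch that is immediately followed by NGG. This is a simplified Python sketch, not a guide-design tool. It scans only the strand it is given (a real search would also scan the reverse complement), the input sequence is invented, and "protospacer" is the standard name for the 20 bases next to the PAM rather than a term used elsewhere in this article.

```python
import re

# A 20-nt protospacer immediately followed by an NGG PAM (N = any base).
# Wrapping the pattern in a lookahead lets overlapping sites all be found.
SITE = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")

def find_pam_sites(dna: str):
    """Return (start, protospacer, pam) for each candidate site on this strand."""
    return [(m.start(), m.group(1), m.group(2)) for m in SITE.finditer(dna.upper())]

# Hypothetical 27-base sequence, invented for illustration only.
dna = "TTGACGTTAGCCTAGGATCCAATGGCA"
for start, protospacer, pam in find_pam_sites(dna):
    print(start, protospacer, pam)
# 2 GACGTTAGCCTAGGATCCAA TGG
```

The sequence also contains an AGG earlier on, but there are fewer than 20 bases upstream of it, so it does not yield a full-length site.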

Section 4 — Chromosomes: Packing 2 Metres into 6 Micrometres

If you stretched out all the DNA in a single human cell, it would be approximately 2 metres long. The nucleus that contains it is roughly 6 micrometres in diameter — about 6 millionths of a metre. That’s a packing ratio of roughly 300,000 to one. Achieving this without tangling the DNA into an unusable knot requires extraordinary molecular engineering.
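The 2 metre figure can be checked with quick arithmetic. The Python sketch below uses the helix dimensions from Section 3 (3.4 nm per turn of roughly 10 base pairs) and assumes a diploid cell with two copies of the 3.2 billion base pair genome. The variable names are chosen for this example only.

```python
BP_PER_GENOME_COPY = 3.2e9    # genome size used in this article
COPIES_PER_CELL = 2           # assumption: a diploid cell holds two copies
RISE_PER_BP_NM = 3.4 / 10     # 3.4 nm per ~10 bp turn of the helix
NUCLEUS_DIAMETER_M = 6e-6     # 6 micrometres

# Total length of DNA in one cell, converted from nanometres to metres.
dna_length_m = BP_PER_GENOME_COPY * COPIES_PER_CELL * RISE_PER_BP_NM * 1e-9

# Ratio of stretched-out DNA length to the diameter of the nucleus.
packing_ratio = dna_length_m / NUCLEUS_DIAMETER_M

print(f"DNA per cell: {dna_length_m:.1f} m")
print(f"Length / nucleus diameter: {packing_ratio:,.0f}")
```

This gives a little over 2 metres and a ratio in the low hundreds of thousands, consistent with the rounded figures quoted above.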

The solution is a hierarchical packaging system. First, DNA winds around protein spools called histones. About 147 base pairs of DNA wrap around a histone octamer (a cluster of eight histone proteins) to form a nucleosome — the fundamental repeating unit of chromatin. Nucleosomes are sometimes described as “beads on a string,” where the string is DNA and the beads are histone octamers.

Nucleosomes then compact further into higher-order structures — the 30nm chromatin fibre, loops, domains, and ultimately the compact structure visible under a microscope during cell division: the chromosome. Human cells have 46 chromosomes (23 pairs), each containing a single, enormously long DNA molecule that has been compacted into a visible structure only a few micrometres long.

📦 DNA Packaging Levels

DNA double helix

2 nm wide. The raw molecule. 3.2 billion base pairs.

Nucleosome

11 nm. 147 bp of DNA wound around 8 histone proteins. Compacts DNA 7-fold.

Chromatin fibre

30 nm. Nucleosomes further compacted. About 40-fold compaction overall.

Chromatin loops

300 nm. Looped domains. Roughly 750-fold compaction overall.

Chromosome

~1,400 nm. Fully condensed during cell division. Total compaction ~8,000-fold.

🧬 Key Concept: Why Chromatin Packing Matters for CRISPR

Cas9 cannot cut DNA that is tightly wrapped around histones and inaccessible. The accessibility of a genomic region — whether it is in open chromatin (euchromatin) or tightly packed (heterochromatin) — significantly affects CRISPR editing efficiency. This is why CRISPR efficiency varies across the genome even for perfectly designed guide RNAs: some target sites are simply harder for Cas9 to access due to chromatin structure. This is an important consideration when designing CRISPR experiments or therapies targeting specific genomic regions.

Section 5 — Genes: The Instructions Within the Sequence

Not all DNA encodes proteins. In humans, only about 2% of the genome consists of protein-coding sequences. The rest includes regulatory regions that control when and where genes are turned on, structural sequences, RNA-coding genes, and large stretches whose function is still incompletely understood (once dismissively called “junk DNA,” though a substantial part of it is now known to be functionally important).

A gene is a defined segment of DNA that contains the instructions for making a specific functional product: most commonly a protein, but sometimes a functional RNA molecule. The human genome contains approximately 20,000 protein-coding genes, far fewer than scientists expected before the Human Genome Project was completed in 2003. Each protein-coding gene is transcribed and translated to produce a protein that performs a specific function.

Gene Structure: The Parts of a Gene

A protein-coding gene has several distinct regions, each with a specific role:

PROMOTER

A regulatory sequence upstream (before) the gene. Transcription factors and RNA polymerase bind here to initiate gene expression. Think of it as the “on switch.” The strength and regulation of the promoter determines how much protein is made and in which cell types.

EXONS

The segments of the gene that are retained in the mature mRNA after splicing and sent to the ribosome for translation. Most exon sequence codes for protein, although the first and last exons also carry the untranslated regions. The average human gene has about 8 exons.

INTRONS

Non-coding sequences that interrupt the exons. Transcribed into RNA but then spliced out before the mRNA leaves the nucleus. Their function is complex — some regulate gene expression, some encode small regulatory RNAs — but they do not encode protein sequence.

UTRs

Untranslated regions at the 5’ and 3’ ends of the mRNA. Not translated into protein, but contain important regulatory signals controlling mRNA stability, localisation, and translational efficiency.

⚠ Common Confusion: Genes vs DNA Are Not the Same Thing

“Gene” and “DNA” are not interchangeable. DNA is the physical molecule. A gene is a specific functional segment within the DNA sequence. Your genome contains 3.2 billion base pairs of DNA, but only about 20,000 protein-coding genes, whose coding sequences make up about 2% of that total. When CRISPR targets a gene, it targets a specific address within the much larger DNA molecule.

Section 6 — Transcription: How DNA Becomes RNA

DNA sits in the nucleus of the cell, heavily guarded and compacted. It never leaves. But the instructions it carries need to reach the ribosomes in the cytoplasm, where proteins are actually made. The solution is an intermediary: messenger RNA (mRNA). Making an mRNA copy of a gene is called transcription.

Transcription is carried out by an enzyme called RNA polymerase. The process begins when transcription factors bind to the gene’s promoter and recruit RNA polymerase to the transcription start site, just upstream of the coding sequence. RNA polymerase then unwinds the DNA double helix locally and reads one strand (the template strand) in the 3’ to 5’ direction, synthesising a complementary RNA molecule in the 5’ to 3’ direction.

The RNA produced is almost the same sequence as the non-template DNA strand — with one important difference: RNA uses uracil (U) instead of thymine (T). So wherever there was a T in the DNA coding strand, there is a U in the mRNA. This freshly made RNA is called pre-mRNA or primary transcript.
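The relationship between the two DNA strands and the RNA can be sketched in Python. This is a toy model with hypothetical function names: it applies only the pairing rule and ignores promoters, the polymerase itself, and all RNA processing.

```python
# Pairing used during transcription: each template DNA base specifies an RNA base.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3to5: str) -> str:
    """Build the RNA (5'->3') from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5.upper())

def coding_to_rna(coding_5to3: str) -> str:
    """Shortcut: the RNA matches the non-template strand with U in place of T."""
    return coding_5to3.upper().replace("T", "U")

coding = "ATGCGT"      # non-template (coding) strand, 5'->3'
template = "TACGCA"    # template strand, 3'->5'

print(transcribe(template))    # AUGCGU
print(coding_to_rna(coding))   # AUGCGU
```

Both routes give the same RNA, which is why gene sequences are normally quoted as the coding strand: it can be read almost directly as the mRNA.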

Before the mRNA can leave the nucleus, it undergoes splicing: molecular machinery called the spliceosome recognises the boundaries between exons and introns, cuts out all the introns, and joins the exons together. The result is a mature mRNA containing only exon sequence (the coding sequence flanked by the untranslated regions), capped at the 5’ end and polyadenylated at the 3’ end for stability. This mature mRNA is exported to the cytoplasm.
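Splicing can be pictured as cutting a string at known coordinates and joining the pieces that remain. In the Python sketch below, the transcript and exon coordinates are invented, and the boundaries are simply given rather than recognised from sequence signals as the spliceosome does. The 5’ cap and poly-A tail are not modelled.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon slices (0-based start, end-exclusive) in order, dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

# Hypothetical toy transcript: exons in upper case, introns in lower case.
pre_mrna = "AUGCAU" + "guaaguuuucag" + "GGAUCC" + "guaugacuuuag" + "UAA"
exons = [(0, 6), (18, 24), (36, 39)]

mature = splice(pre_mrna, exons)
print(mature)   # AUGCAUGGAUCCUAA
```

The mature sequence is less than half the length of the toy pre-mRNA. In real human genes the difference is usually far larger, because introns are typically much longer than exons.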

💡 Analogy: Transcription as Photocopying

Imagine the DNA is the master record book locked in a vault (the nucleus). The vault is too important to take out — it must stay safe. But you need the information in it. So instead of removing the book, you make a photocopy of the specific page you need. That photocopy is the mRNA. You take the photocopy out to your workbench (the cytoplasm) and use it to do the actual work. The master book never leaves the vault.

Section 7 — Translation: How RNA Becomes Protein

Once the mature mRNA reaches the cytoplasm, it is read by ribosomes — large molecular machines that translate the RNA sequence into a protein sequence. This process is called translation, and the rules that govern it constitute the genetic code.

The mRNA is read in groups of three nucleotides called codons. Each codon specifies one amino acid (or a stop signal). There are 4³ = 64 possible codons but only 20 amino acids, so most amino acids are encoded by multiple codons — the genetic code is said to be degenerate. Three codons (UAA, UAG, UGA) are stop codons that signal the ribosome to release the finished protein.

The ribosome moves along the mRNA one codon at a time. For each codon, a transfer RNA (tRNA) molecule with the matching anticodon sequence delivers the appropriate amino acid. A peptide bond forms between successive amino acids, building the protein chain. When the ribosome reaches a stop codon, the completed protein is released.
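The codon-by-codon reading described above can be sketched in Python. The table below is the standard genetic code written compactly with one-letter amino acid symbols and "*" for stop. The translate function is a simplification written for this example: it starts at the first AUG and ignores everything a real ribosome needs beyond the sequence.

```python
from itertools import product

# Standard genetic code: codons ordered with U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon; return one-letter amino acids."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""          # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break          # stop codon: release the chain
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE))               # 64
print(translate("AUGCAUGGAUCCUAA"))   # MHGS (Met-His-Gly-Ser)
```

Counting the table's values confirms the degeneracy described above: 64 codons map onto 20 amino acids plus three stop signals.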

🧬 Central Dogma: Information Flow in Biology

DNA (the master blueprint) → Transcription → mRNA (the working copy) → Translation → Protein (the molecular worker)

CRISPR intervenes at the very first step — editing the DNA before any transcription or translation happens. This means the change affects all proteins made from that gene, in every cell that carries the edit, forever.

Section 8 — Mutations: When the Code Goes Wrong

A mutation is any change to the DNA sequence. Copying errors arise constantly, and your genome is copied every time a cell divides. Proofreading and mismatch-repair enzymes correct most of them, leaving roughly one error per billion base pairs copied. The mutations that escape repair are usually either silent (no effect on protein function) or fall in non-critical regions. But some fundamentally alter protein function and cause disease.
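A back-of-envelope calculation shows what that error rate means per cell division. The Python sketch below multiplies the article's figures together and assumes a diploid cell with two genome copies. It is an order-of-magnitude estimate, not a measured mutation rate.

```python
ERRORS_PER_BP_COPIED = 1e-9     # roughly one error per billion base pairs
BP_PER_GENOME_COPY = 3.2e9      # genome size used in this article
GENOME_COPIES_PER_CELL = 2      # assumption: diploid cell

# Expected number of copying errors each time one cell replicates its DNA.
errors_per_division = (
    ERRORS_PER_BP_COPIED * BP_PER_GENOME_COPY * GENOME_COPIES_PER_CELL
)
print(f"Expected copying errors per division: about {errors_per_division:.0f}")
```

On these assumptions each division introduces a handful of new changes, which is why mutations accumulate steadily over a lifetime even though the machinery is extremely accurate.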

Types of Mutations: Not All Changes Are Equal

Point Mutation (Substitution)

A single base is changed to a different base. For example, A is replaced by T. The effect depends entirely on where the change occurs. In the third position of a codon (which is often redundant), the amino acid may not change at all — a silent mutation. In the first or second position, it often changes the amino acid — a missense mutation. Occasionally it creates a stop codon early — a nonsense mutation that truncates the protein.

Normal:  ...GAG... → Glutamic acid (Glu)
Mutant: ...GTG... → Valine (Val) — causes sickle cell disease

Insertion

One or more extra base pairs inserted into the sequence. If the insertion is not a multiple of 3, it causes a frameshift — all codons downstream are misread, typically producing a completely non-functional protein. This is what NHEJ-mediated CRISPR editing often produces: a small insertion that frameshifts and destroys the gene.

Normal:  ATG CAT GGA TCC → Met-His-Gly-Ser
+1 ins: ATG GCA TGG ATC C → Met-Ala-Trp-Ile... (a G inserted after ATG shifts every downstream codon)

Deletion

One or more base pairs removed from the sequence. A deletion of three base pairs (or any multiple of 3) removes one amino acid but keeps the reading frame intact — an in-frame deletion. A deletion of any other number causes a frameshift. CRISPR-NHEJ commonly produces 1-2 bp deletions that frameshift and knock out the gene.

Normal:  ATG CAT GGA TCC → Met-His-Gly-Ser
-1 del: ATG ATG GAT CC → Met-Met-Asp... (frameshift, wrong protein)
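The three mutation types above can be reproduced with a short Python sketch. It uses the standard genetic code in DNA letters, reads codons from the first base, and uses the same 12-base example sequence as the boxes above. The inserted base (G) is an arbitrary choice for illustration, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code in DNA letters; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_dna(coding: str) -> str:
    """Translate codon by codon from position 0.

    Stops at the first stop codon and ignores an incomplete trailing codon.
    """
    protein = []
    for i in range(0, len(coding) - 2, 3):
        amino_acid = CODON_TABLE[coding[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def classify_substitution(codon: str, mutant: str) -> str:
    """Label a change within one codon as silent, missense, or nonsense."""
    before, after = CODON_TABLE[codon.upper()], CODON_TABLE[mutant.upper()]
    if before == after:
        return "silent"
    return "nonsense" if after == "*" else "missense"

normal = "ATGCATGGATCC"
inserted = normal[:3] + "G" + normal[3:]   # +1 insertion (G chosen arbitrarily)
deleted = normal[:3] + normal[4:]          # -1 deletion of the C at index 3
in_frame = normal[:3] + normal[6:]         # -3 deletion removes the CAT codon

print(translate_dna(normal))     # MHGS
print(translate_dna(inserted))   # MAWI  (frameshift)
print(translate_dna(deleted))    # MMD   (frameshift)
print(translate_dna(in_frame))   # MGS   (one amino acid lost, frame kept)

print(classify_substitution("GAG", "GTG"))   # missense (Glu -> Val)
print(classify_substitution("GAG", "GAA"))   # silent   (Glu -> Glu)
print(classify_substitution("GAG", "TAG"))   # nonsense (Glu -> stop)
```

The contrast between the -1 and -3 deletions is the key point: only the deletion that is a multiple of three leaves the downstream amino acids unchanged.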

Section 9 — A Real Example: The Sickle Cell Mutation

Abstract descriptions of mutations are one thing. A real clinical example makes it concrete. Sickle cell disease — one of the most common inherited diseases in the world, affecting millions of people, and the first condition for which CRISPR therapy was approved — is caused by a single point mutation in a single gene.

The gene is HBB, which encodes the beta-globin protein — one of the two main components of adult haemoglobin, the molecule in red blood cells that carries oxygen. The mutation is at codon 6 of the HBB gene: a single A is changed to a T, converting the codon GAG (which encodes glutamic acid) to GTG (which encodes valine).

This is a tiny change — one letter out of 3.2 billion. But its consequences are catastrophic. Glutamic acid is a charged, hydrophilic amino acid that keeps haemoglobin molecules dissolved and separated. Valine is a hydrophobic amino acid that makes haemoglobin molecules stick to each other. When the cell is under low-oxygen conditions, sickle haemoglobin (HbS) polymerises into long, rigid fibres that distort the red blood cell into the characteristic sickle shape.

Sickled red blood cells are rigid, sticky, and fragile. They block capillaries (causing the severe pain crises that define the disease), rupture prematurely (causing anaemia), and damage organs over time. People with two copies of the mutation, one on each chromosome 11, develop severe disease. People with one copy have sickle cell trait, which usually causes no symptoms and gives partial protection against severe malaria.

🧬 The Sickle Cell Mutation: One Letter That Changes Everything

✅ Normal HBB Gene (Codon 6)

G-A-G

Encodes: Glutamic acid (charged, hydrophilic)

Result: Haemoglobin molecules stay dissolved. Red blood cells remain flexible and biconcave. Oxygen delivered normally.

❌ Sickle Cell HBB Gene (Codon 6)

G-T-G

Encodes: Valine (uncharged, hydrophobic)

Result: Haemoglobin molecules clump together under low oxygen. Red blood cells deform into sickle shape. Vessels blocked, anaemia, organ damage.

This is why CRISPR matters. Casgevy, the first approved CRISPR therapy, does not directly correct this mutation. Instead it reactivates fetal haemoglobin (HbF), which compensates for the defective adult haemoglobin. Standard adenine and cytosine base editors cannot turn the mutant T back into an A, so base-editing approaches in development instead convert the sickle codon GTG to GCG (alanine), producing a benign variant known as Hb G-Makassar. Prime editing aims to restore the normal GAG sequence itself.

Section 10 — How This All Connects Back to CRISPR

Let’s now tie the molecular biology we’ve built in this cluster directly to CRISPR gene editing. Every element of what you’ve just learned is directly relevant to how CRISPR works and why it is so significant.

DNA structure

Cas9 searches double-stranded DNA. It reads the major groove looking for the PAM sequence. It unwinds the helix to allow guide RNA base pairing. Every aspect of Cas9 mechanism depends on DNA’s double-helical structure.

Base pairing

The guide RNA finds its target through base pairing. The 20-nucleotide spacer forms complementary pairs with the target DNA strand: RNA A with DNA T, RNA U with DNA A, and G with C. This is why a guide RNA spacer is written as the complement of the strand it binds, which makes it identical to the opposite DNA strand with U in place of T.

Genes and proteins

CRISPR editing changes a DNA sequence → changes the mRNA transcribed from it → changes the protein produced. To design a therapeutic CRISPR intervention you must understand the gene’s structure, the disease mutation, and how the protein function will change with each type of edit.

Mutation types

NHEJ produces insertions and deletions (indels), often frameshifts, which knock out gene function. HDR with a donor template corrects point mutations. Base editors correct certain single-base substitutions (chiefly C→T and A→G changes). Prime editors handle small insertions, deletions, and any substitution. Each tool matches a mutation type.
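The base pairing point above can be made concrete with a small Python sketch that derives a spacer and the DNA strand it binds. The 20-base input is invented, the function names are hypothetical, and "protospacer" is the standard term for the 20 bases next to the PAM on the strand that matches the spacer. Scaffold sequence, PAM checking, and off-target screening are all left out.

```python
# RNA base -> the DNA base it pairs with in the RNA-DNA hybrid.
RNA_TO_DNA_PARTNER = {"A": "T", "U": "A", "G": "C", "C": "G"}

def spacer_from_protospacer(protospacer_5to3: str) -> str:
    """The spacer has the protospacer's sequence, with U in place of T."""
    return protospacer_5to3.upper().replace("T", "U")

def bound_dna_strand(spacer_5to3: str) -> str:
    """Return the DNA target strand (written 3'->5') that the spacer pairs with."""
    return "".join(RNA_TO_DNA_PARTNER[base] for base in spacer_5to3.upper())

# Hypothetical protospacer, invented for illustration only.
protospacer = "GACGTTAGCCTAGGATCCAA"
spacer = spacer_from_protospacer(protospacer)

print(spacer)                     # GACGUUAGCCUAGGAUCCAA
print(bound_dna_strand(spacer))   # CTGCAATCGGATCCTAGGTT (3'->5')
```

Reading the two printed lines one above the other shows each RNA base sitting opposite its DNA partner, which is the hybrid that Cas9 checks before cutting.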

References & Further Reading

Watson & Crick (1953). Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature 171:737. The double helix paper. One page. One of the most important papers ever published.
Pauling & Corey (1953). A Proposed Structure for the Nucleic Acids. PNAS. The competing (incorrect) triple-helix model, useful for understanding what Watson and Crick actually solved.
Ingram (1957). Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature 180:326. The paper that identified sickle cell as a molecular disease: the first demonstration that a genetic disease is caused by a single amino acid change.
International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature 431:931. The completed human genome reference sequence.
Alberts et al. Molecular Biology of the Cell (7th ed., 2022). The definitive cell biology textbook. Chapters 4-6 cover DNA, chromosomes, and gene expression in depth. An older edition (4th, 2002) is searchable for free through NCBI Bookshelf.
Khan Academy Biology. khanacademy.org/science/ap-biology. Free video explanations of every concept in this cluster. Excellent for visual learners.

📋 Key Takeaways — Cluster 2

DNA is a physical molecule made of four bases. A, T, G, C. Their sequence encodes all biological information. The human genome contains about 3.2 billion of them.
A always pairs with T. G always pairs with C. This base-pairing rule is how DNA replicates, how genes are transcribed, and how CRISPR guide RNA finds its target.
The double helix is two complementary strands wound together. The sugar-phosphate backbone is structural. The base pairs carry the information. The major groove is where proteins read the sequence.
Chromosomes pack 2 metres of DNA into 6 micrometres. Histone winding → nucleosomes → chromatin fibres → chromosomes. Tight packing affects CRISPR accessibility.
Information flows DNA → RNA → Protein. CRISPR edits the DNA, affecting everything downstream. Transcription makes mRNA copies. Translation converts mRNA into protein sequences.
Mutations come in three types. Substitutions change one base. Insertions and deletions (indels) often cause frameshifts that destroy gene function. NHEJ after CRISPR cutting typically produces indels.
One letter can cause devastating disease. The sickle cell A→T mutation at codon 6 of HBB: one base pair changed out of 3.2 billion, producing a disease that affects millions worldwide. This is why precision gene editing matters.
