- What DNA Actually Is
- The Four Bases & Base Pairing
- The Double Helix: Structure in 3D
- Chromosomes: Packaging 2 Metres into a Nucleus
- Genes: The Instructions Within the Sequence
- Transcription: DNA → RNA
- Translation: RNA → Protein
- Mutations: When the Code Goes Wrong
- A Real Example: The Sickle Cell Mutation
- How This All Connects Back to CRISPR
Section 1 — What DNA Actually Is
Nearly every cell in your body contains the complete instruction manual for building and running a human being (mature red blood cells, which discard their nucleus, are the best-known exception). That instruction manual is DNA: deoxyribonucleic acid. It is a molecule. Not a concept, not a metaphor, but a physical, chemical molecule that sits in the nucleus of each cell and can be extracted, purified, observed under an electron microscope, and directly manipulated. CRISPR manipulates it.
DNA is a polymer — a long chain built from repeating smaller units called nucleotides. Each nucleotide has three parts: a sugar molecule (deoxyribose), a phosphate group, and one of four possible nitrogen-containing bases. The sugar and phosphate groups link together to form the backbone of the DNA chain. The bases hang off the backbone like teeth on a comb. It is the sequence of these bases that encodes all biological information.
The four bases are Adenine (A), Thymine (T), Guanine (G), and Cytosine (C). That’s it. Everything about every living organism on Earth — every protein, every cell type, every inherited trait — is ultimately encoded in sequences of these four letters. The human genome contains approximately 3.2 billion of them in each cell. Written out at standard font size, that sequence would fill a stack of paperback novels roughly 200 metres tall.
Section 2 — Base Pairing: The Rule That Makes Everything Work
The most important property of DNA bases is that they pair with each other in a very specific, exclusive way. Adenine always pairs with Thymine. Guanine always pairs with Cytosine. These are not preferences or tendencies — they are hard chemical constraints determined by the shape and charge of the molecules. A always pairs with T. G always pairs with C. Full stop.
This base-pairing rule is called Watson-Crick base pairing, after James Watson and Francis Crick who proposed the double helix structure in 1953 (building on the X-ray crystallography data of Rosalind Franklin). The pairing is held together by hydrogen bonds — weak electromagnetic attractions between certain atoms. A-T pairs have two hydrogen bonds; G-C pairs have three, making G-C pairs slightly stronger and G-C rich regions of DNA slightly more stable.
Why does this matter so much? Because base pairing means that if you know the sequence of one strand of DNA, you automatically know the sequence of the other. The two strands are complementary: each base on one strand dictates its partner on the other. A sequence reading 5’-ATGCGT-3’ on one strand will always have 3’-TACGCA-5’ on the complementary strand. This complementarity is what makes DNA replication, transcription, and CRISPR targeting all possible.
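The complementarity rule is mechanical enough to write down in a few lines. The following is a minimal Python sketch; the function names `complement` and `reverse_complement` are illustrative choices, not part of any particular library.

```python
# Watson-Crick pairing rule: A<->T, G<->C
PAIR = str.maketrans("ATGC", "TACG")

def complement(strand: str) -> str:
    """Partner strand, base by base (reads 3'->5' when the input is 5'->3')."""
    return strand.upper().translate(PAIR)

def reverse_complement(strand: str) -> str:
    """Partner strand rewritten in the conventional 5'->3' direction."""
    return complement(strand)[::-1]

top = "ATGCGT"                      # 5'-ATGCGT-3'
print(complement(top))              # TACGCA  (3'-TACGCA-5')
print(reverse_complement(top))      # ACGCAT  (same strand, read 5'->3')
```

Both outputs describe the same physical strand. The second form matters in practice because sequences are conventionally written 5’ to 3’.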
Section 3 — The Double Helix: DNA’s Three-Dimensional Structure
A single strand of DNA is already a remarkable molecule. But DNA in cells is almost never left as a lone single strand: it exists as a double helix, two complementary strands wound around each other in a right-handed spiral, and it is opened up only locally and briefly when it is copied or read. This is the iconic structure that Watson, Crick, and Franklin worked out in 1953, and it is the actual physical form in which genetic information is stored and passed between generations.
The double helix is like a ladder that has been twisted. The sides of the ladder are the sugar-phosphate backbones — the structural scaffold of the molecule. The rungs of the ladder are the base pairs — A:T and G:C hydrogen-bonded pairs that hold the two strands together and encode the genetic information. The helix completes one full twist every approximately 10 base pairs, a distance of 3.4 nanometres.
The two strands run in opposite directions — they are antiparallel. One strand runs in the 5’ to 3’ direction (these refer to which carbon of the deoxyribose sugar is at each end of the chain), and the complementary strand runs 3’ to 5’. This directionality matters enormously for DNA replication and transcription, and for CRISPR: the guide RNA must be designed in the correct orientation relative to the target strand.
The Major and Minor Grooves
When the double helix twists, it creates two channels running along its length: the major groove (wider, ~2.2 nm) and the minor groove (narrower, ~1.2 nm). These grooves are not just structural features — they are the access points through which proteins read the DNA sequence without unwinding the helix. The shape and chemical signature of the base pairs are readable from the major groove, which is why most DNA-binding proteins recognise their target sequences by fitting into the major groove like a key into a lock.
Cas9 is no exception. Part of its recognition mechanism involves reading the PAM sequence (NGG) in the major groove of the DNA before unwinding the helix to allow guide RNA base pairing. Understanding the groove structure is key to understanding why Cas9 can only access certain target sites.
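The PAM requirement can be illustrated with a short search routine. This is a sketch under stated assumptions: it takes the NGG PAM from the text, assumes a 20-base target immediately 5’ of the PAM (the length usually used with the common Cas9, not a figure from this article), and scans only the strand it is given. The name `find_targets` is hypothetical.

```python
import re

def find_targets(seq: str, spacer_len: int = 20):
    """Yield (start, target, pam) for every site followed by an NGG PAM.

    Assumption: the target is `spacer_len` bases directly 5' of the PAM.
    Only the given strand is scanned.
    """
    seq = seq.upper()
    # A lookahead lets overlapping candidate sites all be reported.
    pattern = re.compile(rf"(?=([ACGT]{{{spacer_len}}})([ACGT]GG))")
    for m in pattern.finditer(seq):
        yield m.start(), m.group(1), m.group(2)

demo = "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"
for start, target, pam in find_targets(demo):
    print(start, target, pam)   # 0 ACGTACGTACGTACGTACGT TGG
```

A sequence with no GG dinucleotide in the right place yields no candidate sites at all, which is the practical meaning of "Cas9 can only access certain target sites." A fuller search would also scan the reverse complement strand.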
Section 4 — Chromosomes: Packing 2 Metres into 6 Micrometres
If you stretched out all the DNA in a single human cell, it would be approximately 2 metres long. The nucleus that contains it is roughly 6 micrometres in diameter — about 6 millionths of a metre. That’s a packing ratio of roughly 300,000 to one. Achieving this without tangling the DNA into an unusable knot requires extraordinary molecular engineering.
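The 2 metre figure can be checked from numbers given earlier: 3.4 nm per 10 base pairs and 3.2 billion base pairs per genome copy. The sketch below assumes a typical cell carries two copies of the genome (one per chromosome of each pair); the variable names are illustrative.

```python
BP_PER_TURN = 10            # base pairs per helical turn (from the text)
NM_PER_TURN = 3.4           # nanometres per helical turn (from the text)
HAPLOID_BP = 3.2e9          # base pairs in one genome copy (from the text)
COPIES = 2                  # assumption: two genome copies per cell
NUCLEUS_DIAMETER_M = 6e-6   # nucleus diameter in metres (from the text)

rise_nm = NM_PER_TURN / BP_PER_TURN               # about 0.34 nm per base pair
length_m = HAPLOID_BP * COPIES * rise_nm * 1e-9   # total DNA length in metres
ratio = length_m / NUCLEUS_DIAMETER_M             # length relative to nucleus width

print(f"{length_m:.2f} m of DNA, about {ratio:,.0f} times the nucleus diameter")
```

This gives a little over 2 metres and a ratio in the low hundreds of thousands, the same order of magnitude as the rounded figures quoted above.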
The solution is a hierarchical packaging system. First, DNA winds around protein spools called histones. About 147 base pairs of DNA wrap around a histone octamer (a cluster of eight histone proteins) to form a nucleosome — the fundamental repeating unit of chromatin. Nucleosomes are sometimes described as “beads on a string,” where the string is DNA and the beads are histone octamers.
Nucleosomes then compact further into higher-order structures: the 30 nm chromatin fibre, loops, domains, and ultimately the compact structure visible under a microscope during cell division, the chromosome. Human cells have 46 chromosomes (23 pairs), each containing a single, enormously long DNA molecule that has been compacted into a visible structure only a few micrometres long.
Section 5 — Genes: The Instructions Within the Sequence
Not all DNA encodes proteins. In humans, only about 2% of the genome consists of protein-coding sequences. The rest includes regulatory regions that control when and where genes are turned on, structural sequences, RNA-coding genes, and large stretches whose function is still incompletely understood (once dismissively called “junk DNA,” though it is now clear that at least part of it is functionally important).
A gene is a defined segment of DNA that contains the instructions for making a specific functional product, most commonly a protein but sometimes a functional RNA molecule. The human genome contains approximately 20,000 protein-coding genes, far fewer than scientists expected before the Human Genome Project was completed in 2003. Each protein-coding gene is transcribed and translated to produce a protein that performs a specific function.
Gene Structure: The Parts of a Gene
A protein-coding gene has several distinct regions, each with a specific role:
Promoter: A regulatory sequence upstream of (before) the gene. Transcription factors and RNA polymerase bind here to initiate gene expression. Think of it as the “on switch.” The strength and regulation of the promoter determine how much protein is made and in which cell types.
Exons: The protein-coding segments of the gene. After transcription, the exon sequences are retained in the mature mRNA that goes to the ribosome for translation. The average human gene has about 8 exons.
Introns: Non-coding sequences that interrupt the exons. They are transcribed into RNA but then spliced out before the mRNA leaves the nucleus. Their function is complex (some regulate gene expression, some encode small regulatory RNAs), but they do not encode protein sequence.
UTRs: Untranslated regions at the 5’ and 3’ ends of the mRNA. They are not translated into protein, but they contain important regulatory signals controlling mRNA stability, localisation, and translational efficiency.
Section 6 — Transcription: How DNA Becomes RNA
DNA sits in the nucleus of the cell, heavily guarded and compacted. It never leaves. But the instructions it carries need to reach the ribosomes in the cytoplasm, where proteins are actually made. The solution is an intermediary: messenger RNA (mRNA). Making an mRNA copy of a gene is called transcription.
Transcription is carried out by an enzyme called RNA polymerase. The process begins when transcription factors bind to the gene’s promoter and recruit RNA polymerase to the transcription start site. RNA polymerase then unwinds the DNA double helix locally and reads one strand (the template strand) in the 3’ to 5’ direction, synthesising a complementary RNA molecule in the 5’ to 3’ direction.
The RNA produced has almost the same sequence as the non-template DNA strand (also called the coding strand), with one important difference: RNA uses uracil (U) instead of thymine (T). So wherever there is a T in the coding strand, there is a U in the mRNA. This freshly made RNA is called pre-mRNA, or the primary transcript.
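The relationship between the two DNA strands and the RNA can be sketched in Python. The example reuses the six-base sequence from Section 2; the function names are illustrative, and the template is assumed to be supplied in 3’ to 5’ orientation.

```python
# Pairing used during transcription: template A->U, T->A, G->C, C->G
DNA_TO_RNA = str.maketrans("ATGC", "UACG")

def transcribe(template_3to5: str) -> str:
    """Build the RNA (5'->3') that pairs with a template strand given 3'->5'."""
    return template_3to5.upper().translate(DNA_TO_RNA)

def coding_to_mrna(coding_5to3: str) -> str:
    """Shortcut: the RNA matches the non-template strand with T replaced by U."""
    return coding_5to3.upper().replace("T", "U")

template = "TACGCA"   # 3'-TACGCA-5', the partner of 5'-ATGCGT-3'
print(transcribe(template))        # AUGCGU
print(coding_to_mrna("ATGCGT"))    # AUGCGU
```

Both routes give the same RNA, which is why sequences are usually quoted from the non-template strand.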
Before the mRNA can leave the nucleus, it undergoes splicing: molecular machinery called the spliceosome recognises the boundaries between exons and introns, cuts out all the introns, and joins the exons together. The result is a mature mRNA containing only exon sequence (the protein-coding region flanked by the untranslated regions), capped at the 5’ end and polyadenylated at the 3’ end for stability. This mature mRNA is exported to the cytoplasm.
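Splicing amounts to keeping some intervals of the transcript and discarding the rest. The toy sketch below assumes the exon coordinates are already known, whereas the real spliceosome finds them from sequence signals. The sequence, the coordinates, and the name `splice` are invented for illustration, and the 5’ cap and poly-A tail are not modelled.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

exon1 = "AUGGCC"
intron = "GUAAGUCCUUAG"   # toy intron, removed during splicing
exon2 = "AAAUGA"
pre_mrna = exon1 + intron + exon2

mature = splice(pre_mrna, [(0, 6), (18, 24)])
print(mature)   # AUGGCCAAAUGA
```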
Section 7 — Translation: How RNA Becomes Protein
Once the mature mRNA reaches the cytoplasm, it is read by ribosomes — large molecular machines that translate the RNA sequence into a protein sequence. This process is called translation, and the rules that govern it constitute the genetic code.
The mRNA is read in groups of three nucleotides called codons. Each codon specifies one amino acid (or a stop signal). There are 4³ = 64 possible codons but only 20 amino acids, so most amino acids are encoded by multiple codons — the genetic code is said to be degenerate. Three codons (UAA, UAG, UGA) are stop codons that signal the ribosome to release the finished protein.
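The counts in this paragraph can be reproduced by building the standard codon table and tallying it. The sketch below relies on the conventional compact encoding of the table (bases in U, C, A, G order, one-letter amino acid codes, `*` for stop); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for UUU, UUC, UUA, UUG, UCU, ... GGG; '*' = stop
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

counts = Counter(CODON_TABLE.values())
stops = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(CODON_TABLE))                       # 64
print(stops)                                  # ['UAA', 'UAG', 'UGA']
print(len(counts) - 1)                        # 20 amino acids (excluding stop)
print(counts["L"], counts["M"], counts["W"])  # 6 1 1
```

Leucine (L) has six codons while methionine (M) and tryptophan (W) have one each, which is what "degenerate" means in practice.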
The ribosome moves along the mRNA one codon at a time. For each codon, a transfer RNA (tRNA) molecule with the matching anticodon sequence delivers the appropriate amino acid. A peptide bond forms between successive amino acids, building the protein chain. When the ribosome reaches a stop codon, the completed protein is released.
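The codon-by-codon reading described above can be sketched as a simple loop. This is a simplification: it assumes the input already begins at the first codon to be read, and it returns one-letter amino acid codes. The name `translate` and the example mRNA are illustrative.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon or the end of the RNA."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):   # step one codon at a time
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":              # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGCAUGGAUCCUAA"))   # MHGS  (Met-His-Gly-Ser, then stop)
```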
CRISPR intervenes at the very first step — editing the DNA before any transcription or translation happens. This means the change affects all proteins made from that gene, in every cell that carries the edit, forever.
Section 8 — Mutations: When the Code Goes Wrong
A mutation is any change to the DNA sequence. Mutations occur constantly: even after proofreading and mismatch repair, your DNA replication machinery leaves roughly one error per billion base pairs copied, and your genome is copied every time a cell divides. Most copying errors are caught by proofreading and repair enzymes before they become permanent. Those that escape are usually either silent (no effect on protein function) or fall in non-critical regions. But some mutations fundamentally alter protein function and cause disease.
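To get a feel for the scale, the quoted error rate can be multiplied by the amount of DNA copied per division. This back-of-the-envelope sketch assumes two genome copies per cell and uses the 3.2 billion figure from Section 1; the variable names are illustrative.

```python
ERROR_RATE = 1e-9     # errors per base pair copied (figure quoted in the text)
HAPLOID_BP = 3.2e9    # base pairs per genome copy (from Section 1)
COPIES = 2            # assumption: two genome copies per cell

errors_per_division = ERROR_RATE * HAPLOID_BP * COPIES
print(f"about {errors_per_division:.1f} new errors per cell division")
```

On these assumptions, each cell division introduces a handful of new changes somewhere in the genome.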
Types of Mutations: Not All Changes Are Equal
Substitution (point mutation): A single base is changed to a different base. For example, A is replaced by T. The effect depends entirely on where the change occurs. In the third position of a codon (which is often redundant), the amino acid may not change at all: a silent mutation. In the first or second position, it usually changes the amino acid: a missense mutation. Occasionally it creates a premature stop codon: a nonsense mutation that truncates the protein.
Normal: ...GAG... → Glutamic acid (Glu)
Mutant: ...GTG... → Valine (Val), causes sickle cell disease
Insertion: One or more extra base pairs are inserted into the sequence. If the number of inserted bases is not a multiple of 3, the insertion causes a frameshift: all codons downstream are misread, typically producing a completely non-functional protein. Repair of a CRISPR cut by non-homologous end joining (NHEJ) often produces exactly this: a small insertion that frameshifts and destroys the gene.
Normal: ATG CAT GGA TCC → Met-His-Gly-Ser
+1 ins: ATG XCA TGG ATC C → Met-?-Trp-Ile... (X is the inserted base; the second amino acid depends on X, and every codon after it is misread)
Deletion: One or more base pairs are removed from the sequence. A deletion of three base pairs removes one amino acid but keeps the reading frame intact: an in-frame deletion. The same holds for any multiple of 3. A deletion of any other number causes a frameshift. CRISPR-NHEJ commonly produces 1-2 bp deletions that frameshift and knock out the gene.
Normal: ATG CAT GGA TCC → Met-His-Gly-Ser
-1 del: ATG ATG GAT CC → Met-Met-Asp... (frameshift, wrong protein)
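The three mutation types can be explored with a short Python sketch. The 12-base "normal" sequence ATG CAT GGA TCC is the unmutated sequence implied by the +1 and -1 examples, the inserted base A is an arbitrary stand-in for X, and the function names are illustrative. Sequences are coding-strand DNA, translated directly with the standard codon table.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate coding-strand DNA from its first base; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a single-codon change as silent, missense, or nonsense."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    return "nonsense" if alt_aa == "*" else "missense"

def indel_effect(n_bases: int) -> str:
    """An indel keeps the reading frame only if its length is a multiple of 3."""
    return "in-frame" if n_bases % 3 == 0 else "frameshift"

print(translate("ATGCATGGATCC"))            # MHGS  normal
print(translate("ATGACATGGATCC"))           # MTWI  +1 insertion (A after ATG)
print(translate("ATGATGGATCC"))             # MMD   -1 deletion
print(classify_substitution("GAG", "GTG"))  # missense (the sickle cell change)
print(classify_substitution("GAG", "GAA"))  # silent
print(classify_substitution("GAG", "TAG"))  # nonsense
print(indel_effect(1), indel_effect(3))     # frameshift in-frame
```

After either single-base indel, every amino acid past the first differs from the normal protein. A third-position substitution such as GAG to GAA changes nothing.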
Section 9 — A Real Example: The Sickle Cell Mutation
Abstract descriptions of mutations are one thing. A real clinical example makes it concrete. Sickle cell disease — one of the most common inherited diseases in the world, affecting millions of people, and the first condition for which CRISPR therapy was approved — is caused by a single point mutation in a single gene.
The gene is HBB, which encodes the beta-globin protein — one of the two main components of adult haemoglobin, the molecule in red blood cells that carries oxygen. The mutation is at codon 6 of the HBB gene: a single A is changed to a T, converting the codon GAG (which encodes glutamic acid) to GTG (which encodes valine).
This is a tiny change — one letter out of 3.2 billion. But its consequences are catastrophic. Glutamic acid is a charged, hydrophilic amino acid that keeps haemoglobin molecules dissolved and separated. Valine is a hydrophobic amino acid that makes haemoglobin molecules stick to each other. When the cell is under low-oxygen conditions, sickle haemoglobin (HbS) polymerises into long, rigid fibres that distort the red blood cell into the characteristic sickle shape.
Sickled red blood cells are rigid, sticky, and fragile. They block capillaries (causing the severe pain crises that define the disease), rupture prematurely (causing anaemia), and damage organs over time. People with two copies of the mutation, one on each chromosome 11, develop severe disease. People with one copy have sickle cell trait, which is usually symptom-free and gives partial protection against severe malaria.
This is why CRISPR matters. Casgevy, the first approved CRISPR therapy, does not directly correct this mutation. Instead it reactivates fetal haemoglobin (HbF), which naturally compensates for the defective adult haemoglobin. Future CRISPR-based approaches aim to act on the mutant codon itself: prime editing can in principle restore the normal sequence, while current base editors, which cannot make the T→A change needed to reverse the mutation, instead convert the sickle codon into a benign variant.
Section 10 — How This All Connects Back to CRISPR
Let’s now tie the molecular biology we’ve built in this cluster directly to CRISPR gene editing. Every element of what you’ve just learned is directly relevant to how CRISPR works and why it is so significant.
References & Further Reading
- Watson & Crick (1953) — Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature 171:737. — The double helix paper. One page. One of the most important papers ever published.
- Pauling & Corey (1953) — A Proposed Structure for the Nucleic Acids. PNAS. — The competing (incorrect) triple-helix model, useful for understanding what Watson and Crick actually solved.
- Ingram (1957) — Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature 180:326. — The paper that pinpointed the chemical difference between normal and sickle cell haemoglobin: the first demonstration that a genetic disease is caused by a single amino acid change.
- International Human Genome Sequencing Consortium (2004) — Finishing the euchromatic sequence of the human genome. Nature 431:931. — The completed human genome reference sequence.
- Alberts et al. — Molecular Biology of the Cell (7th ed., 2022). — The definitive cell biology textbook. Chapters 4-6 cover DNA, chromosomes, and gene expression in depth. An earlier edition (the 4th) is freely searchable online through NCBI Bookshelf.
- Khan Academy Biology — khanacademy.org/science/ap-biology — Free video explanations of every concept in this cluster. Excellent for visual learners.
Key Takeaways
- DNA is a physical molecule made of four bases. A, T, G, C. Their sequence encodes all biological information. 3.2 billion of them in every human cell.
- A always pairs with T. G always pairs with C. This base-pairing rule is how DNA replicates, how genes are transcribed, and how CRISPR guide RNA finds its target.
- The double helix is two complementary strands wound together. The sugar-phosphate backbone is structural. The base pairs carry the information. The major groove is where proteins read the sequence.
- Chromosomes pack 2 metres of DNA into 6 micrometres. Histone winding → nucleosomes → chromatin fibres → chromosomes. Tight packing affects CRISPR accessibility.
- Information flows DNA → RNA → Protein. CRISPR edits the DNA, affecting everything downstream. Transcription makes mRNA copies. Translation converts mRNA into protein sequences.
- Mutations come in three types. Substitutions change one base. Insertions and deletions (indels) often cause frameshifts that destroy gene function. NHEJ after CRISPR cutting typically produces indels.
- One letter can cause devastating disease. The sickle cell A→T mutation at codon 6 of HBB: one base pair changed out of 3.2 billion, producing a disease that affects millions worldwide. This is why precision gene editing matters.
