home>History of the human genome assembly 2000–2022

data visualization + art

The information graphic showing the history of the human genome assembly is part of my series of designs created for the Scientific American Graphic Science page. Together with Senior Graphics Editor Jen Christiansen, we've looked at everything from the evolution of the genomes of SARS-Cov-2 strains to how pets contribute to the bacterial flora in your home.

support my work ·

Most of the art is available for purchase as framed prints and, yes, even pillows. Sleep's never been more important — I take custom requests.

History of the Human Genome Assembly

22 years, 3,117,275,501 bases and 0 gaps later

Round numbers are always false.
— Samuel Johnson

menu methods posters resources credits

1 · The first gapless assembly of a human genome

1.1 · Finished... again?

2 · Creating the Scientific American graphic
3 · The human genome — pieces of different genomes

3.1 · Genomic libraries
3.2 · Focused efforts

4 · 2000 to 2013 — zero to genome hero
5 · 2022 — CHM13 fills in the gaps

keyboard_arrow_up

1 · The first gapless assembly of a human genome

In March 2022, a flurry of publications announced the first ever complete assembly of a human genome. This is a big deal because, in genomics, we don't throw around words like “complete” very often. Actually, never — until now.

Because the human genome — a human genome — is complete. All hail the CHM13v2 (Mar 2022) telomere-to-telomere (T2T) assembly.

keyboard_arrow_up

1.1 · Finished... again?

You've probably already heard — and have been hearing for the last 15 years — that the human genome has been sequenced. After all, how else can there be a publication with the title Finishing the euchromatic sequence of the human genome (2004 Nature 431:931–945).

In 2004, the only thing that stood between “finished” and “complete” was that pesky word “euchromatic”.

“Euchromatin is a lightly packed form of chromatin (DNA, RNA, and protein) that is enriched in genes, and is often (but not always) under active transcription. Euchromatin stands in contrast to heterochromatin, which is tightly packed and less accessible for transcription. 92% of the human genome is euchromatic.” — Wikipedia

For modeling and analysis — such as in cancer research, for example, which is what we do here — by far the most important parts of the human genome assembly are the parts that code for protein (transcribed regions and their ORFs), along with their adjacent regulatory sequences.

These regions in total occupy less than half of the genome. The parts that ultimately translated into protein exons account for just 2.58% of the genome. For the vast majority of genes, the sequence in these regions was indeed finished.

History of the Human Genome Assembly (22 years, 3,117,275,501 bases and 0 gaps later) -- science + art + data visualization / Martin Krzywinski / Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca

▲ IT'S IN THE GENES | The relationship between transcribed region (TX), open reading frame (ORF) and protein coding region (CDS) in a gene. Size of elements is illustrative. Modified from Wikipedia.

In the most recent assembly prior to the complete telomere-to-telomere CHM13v2 assembly was hg38 (Dec 2013), exons cover only 2.58% of the total sequence (excluding gaps) of the assembly. Notice how much of the open reading frame (ORF) is in introns, which are spliced out during post-transcriptional modification into mature mRNA.

total size 3,088,269,832 total sequence 2,937,639,113 100.00% transcripts 59,657 with complete coding region annotation tx region 33,246 unique transcript regions transcription 1,195,741,784 40.70% ORF 1,011,159,704 34.42% exons 75,884,758 2.58% introns 979,628,382 33.35%

The values in the table above are generated from the size of the union of assembly intervals over the set of genes, because some genes overlap The list of genes includes protein coding, non-coding and pseudogenes.

But with the CHMv2 assembly being more than 200 million bases larger (hello heterochromatin!), we expect to find more genes. And indeed, 3,604 genes, of which 140 are protein coding, are exclusive to CHMv2 (Table 1 in Nurk et al.).

keyboard_arrow_up

2 · Creating the Scientific American graphic

Our goal for the August 2022 Scientific American Graphic Science page was to present the CHMv2 assembly in context of the efforts of the past 22 years — as the final sequencing stage in the sequencing effort. See my other Graphic Science illustrations.

▲ HOW DID THE ASSEMBLY IMPROVE OVER THE YEARS? | Colors of each 250,000 base regions of each chromosome indicate when the region reached 50, 90 or 99%+ completion. Completion from previous years is carried over in grey. Scientific American (2022) 327(2):92.

buy artwork

Year-by-year history of the human genome assembly (1Mb bins) by Martin Krzywinski

▲ MY DATA + SCIENCE ART ON A WALL | Most of my artwork is available for sale as prints and other items. (buy artwork / see all my art)

keyboard_arrow_up

3 · The human genome — pieces of different genomes

The DNA samples used to create the reference human sequence assembly used DNA samples from multiple individuals.

keyboard_arrow_up

3.1 · Genomic libraries

The bulk of the genome was sequenced from the RPCI-11 genomic library, which is a collection of about 10,000 randomly sampled pieces (each on average 180,000 bases long) from the genome of an individual male. Other parts of the genome assembly were reconstructed by sequencing RPCI-13 (female) and Caltech-D (Table 2 in Zhao (2000)) libraries.

The human genome is a variable quantity — two individuals vary, on average, by 1 base in 1,000 (or, roughly, in about 3 million positions) — and much of human genomics deals with understanding and handling this variability.

As such, the term “the human genome” is a little misleading and it's important to acknowledge that, given this natural variation between individuals (and even between cells within an individual), there is really no such thing.

When we say “the genome” we invariably mean “the reference genome”. Even then, we need to specify which version of the reference we mean. At the moment (July 2022), the hg38 (Dec 2013) assembly is considered to be the canonical reference and this may not change anytime soon. It's not uncommon (especially for several months after the release of a new assembly) to find genomic resources that use older references — updating an analysis or data pipeline to a new reference assembly is not a trivial task.

For example, well after the release of hg38 (Dec 2013), the previous reference hg19 (Feb 2009) continued to be widely used for several years. To facilitate standardizing results to the same assembly, so-called liftOver annotations allow conversion of one assembly's coordinates to another.

keyboard_arrow_up

3.2 · Focused efforts

As the sequence assembly matured, regions were successively filled in by targeted efforts focusing on specific regions (e.g. DNA sequence analysis of human chromosome 9) or by specific institutions (e.g. A Japanese history of the Human Genome Project).

This kind of focused effort to improve the assembly quality of a region was always part of the process. In fact, already in 2000 you can see that significant portions of chromosomes 14, 20, 21 and 22 were completed to 99%+. In 2001, chromosomes 6 and 13 received a lot of attention.

In contrast to focused sequencing, other parts of the assembly were sequenced in a shotgun fashion — by randomly sampling as many regions in the genome as possible in an effort to spread out the coverage. Early on, the shotgun effort was accelerated so that the public assembly would provide more coverage (with commensurately higher quality) than the private sector assembly created by Celera.

Michael Smith Genome Sciences Center played an important role in the Human Genome Project. We generated the weekly assembly of the human genome fingerprint map using the FPC assembly software developed in collaboration with Washington University. We also generated a tiling set of 32,000 DNA fragments that completely covered the genome with minimal overlap.

keyboard_arrow_up

4 · 2000 to 2013 — zero to genome hero

Over the years 2000 to 2013, the human genome assembly progressively improved from a so-called “draft” assembly to one that, practically, may be called “finished” (finishish?). These terms can be defined in various ways — for example, based on how much of coding sequence is represented, the contiguity of the assembly, the total coverage of the genome, or some combination of these metrics.

Individual assemblies were indexed by hgN from hg1 (May 2000) to hg38 (Dec 2013). Differences in the index value do not necessarily represent how much of the assembly changed and some indexes were skipped. And I'm still trying to locate the sequence files of hg3 (July 2000).

The hg prefix was used by the team at the UCSC Genome Browser while NCBI had their own assembly build indexes (e.g. hg10 was NCBI build 28). In fact, as of July 2022, there are 1,268 assemblies of the human genome indexed by NCBI.

keyboard_arrow_up

5 · 2022 — CHM13 fills in the gaps

The hg38 (Dec 2013) assembly continues to be the canonical “reference sequence” since 2013 (and likely beyond 2022). An excellent assembly, it nevertheless contains 1,001 gaps totalling about 150 Mb across chromosomes 1–22, X and Y. as well as 430 unanchored and alternate pieces that totalled 121 Mb and themselves include 240 gaps of about 9 Mb.

▲ FILLING IN THE GAPS | Before the March 2022 telomere-to-telomere assembly, the most recent humang enome assembly was from 2013 (hg38). This assembly had 1,001 gaps in chromosomes 1–22, X and Y. Shown here is the size, location and distribution of these gaps. To make small regions visible, those smaller than 230 kb are shown at a fixed size.

These gaps were mostly in heterochromatic regions, which are notoriously difficult to sequence. When the size of the gap was not known, following tradition, it was set to 100 bases . Other sizes, such as 1kb, 10kb and 50kb signalled more information about the nature of the gap, such as a contig gap, short arm gap, telomere gap, and so on.

The CHM13v2 (Mar 2022) assembly filled in all these gaps. It is a completely gapless assembly.

VIEW ALL

news + thoughts

Propensity score weighting

Mon 04-05-2026

It is not certain that everything is uncertain. —Blaise Pascal

We have already explored how we can mitigate bias caused by confounding variables in observational studies using propensity score (PS) matching (PSM) and propensity score weighting (PSW). However, any statistical model is only as good as its assumptions and, if it is specified incorrectly, it can itself produce biased estimates of the treatment effect.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca

▲ Nature Methods Points of Significance column: Double Robustness. (read)

This month, we explore double robustness, a powerful statistical concept that provides a valuable “safety net” against the risk of an incorrect model. It offers two opportunities, instead of just one, to obtain a valid estimate of the treatment effect — making it possible to draw credible causal inferences from observational data without having to depend on a single set of modeling assumptions.

Kurz, C.F., Krzywinski, M. & Altman, N. (2026) Points of significance: Double Robustness. Nat. Methods 23:868–869.

Nature Biotechnology cover

Thu 23-04-2026

My cover design on the 7 April 2026 Nature Biotechnology issue shows the dendrogram that represents a cluster of uniquely expressed (or downregulated) genes in human naive stem cells induced from such cells. Within each dendrogram block, the genomic barcode sequence (sampled from Supplementary Table 1) is depicted with a Code 39 barcode. The highlighted barcode is one of those used for cell isolation.

Ishiguro S. et al. A multi-kingdom genetic barcoding system for precise clone isolation (2026) Nature Biotechnology 44:616–629.

▲ My Nature Biotechnology phylogenetic tree cover (volume 44, issue 4, 7 April 2026). (more)

Browse my gallery of cover designs.

▲ A catalogue of my journal and magazine cover designs. (more)

Happy 2026 π Day—
Art for the 5%

Fri 13-03-2026

Celebrate π Day (March 14th) and enjoy the art — but only if you're part of the 5%.

Go ahead, see what you can't see.

▲ 2026 π DAY | Art for the 5%. Shown in the style of Ishihara color test plates, the art is visible only to those with colour blindness. (details)

Ishihara's Tests for Colour Deficiency

Sun 08-03-2026

Authentic and accurate images of Ishihara's test plates photographed (and lovingly color-corrected) from the 38-plate Ishihara's Tests for Colour Deficiency.

I also provide the position, size, and color of each circle on each test plate.

▲ ISHIHARA'S TEST PLATE 6 | This plate is part of the set of transformation plates. If you see 5, you're ok. If you see 2, you're not. (details)

▲ ISHIHARA'S TEST PLATE 18 | This plate is part of the set of mysterious hidden plates. If you don't see anything, you're ok. If you see 5, you're not. (details)

Symmetric alternatives to the ordinary least squares regression

Wed 23-07-2025

What immortal hand or eye, could frame thy fearful symmetry? — William Blake, "The Tyger"

This month, we look at symmetric regression, which, unlike simple linear regression, it is reversible — remaining unaltered when the variables are swapped.

Simple linear regression can summarize the linear relationship between two variables `X` and `Y` — for example, when `Y` is considered the response (dependent) and `X` the predictor (independent) variable.

However, there are times when we are not interested (or able) to distinguish between dependent and independent variables — either because they have the same importance or the same role. This is where symmetric regression can help.

▲ Nature Methods Points of Significance column: Symmetric alternatives to the ordinary least squares regression. Geometry of quantities minimized in OLS and symmetric regression. OLS minimizes `\Sigma e_y^2` in `Y` ~ `X` and `\Sigma e_x^2` `X` ~ `Y`. Pythagorean regression minimizes AB (magenta). Geometric means regression (GMR) minimizes area of ABP (orange). Orthogonal regression (OR) minimizes HP (blue). (read)

Luca Greco, George Luta, Martin Krzywinski & Naomi Altman (2025) Points of significance: Symmetric alternatives to the ordinary least squares regression. Nat. Methods 22:1610–1612.

History of the Human Genome Assembly

22 years, 3,117,275,501 bases and 0 gaps later

contents

1 · The first gapless assembly of a human genome

1.1 · Finished... again?

2 · Creating the Scientific American graphic

3 · The human genome — pieces of different genomes

3.1 · Genomic libraries

3.2 · Focused efforts

4 · 2000 to 2013 — zero to genome hero

5 · 2022 — CHM13 fills in the gaps

Propensity score weighting

Nature Biotechnology cover

Happy 2026 π Day—Art for the 5%

Ishihara's Tests for Colour Deficiency

Symmetric alternatives to the ordinary least squares regression

Happy 2026 π Day—
Art for the 5%