On March 14th celebrate `\pi` Day. Hug `\pi`—find a way to do it.
Those who favour `\tau=2\pi` will have to postpone celebrations until June 28th. That's what you get for thinking that `\pi` is wrong. I sympathize with this position and have `\tau` day art too!
If you're not into details, you may opt to party on July 22nd, which is `\pi` approximation day (`\pi` ≈ 22/7). It's 20% more accurate than the official `\pi` day!
Finally, if you believe that `\pi = 3`, you should read why `\pi` is not equal to 3.
Well—well; the sad minutes are moving,
Though loaded with trouble and pain;
And some time the loved and the loving
Shall meet on the mountains again!
—Emily Bronte
Welcome to this year's celebration of `\pi` and mathematics.
The theme this year is Sanger sequencing — old-school, one base at a time.
This year's `\pi` poem is Loud Without The Wind Was Roaring by Emily Bronte.
This year's `\pi` day song is Movements by Luca Musto.
Also, the tabbed menu above is full. Gasp.
Here's a simplified explanation of how Sanger sequencing works.
I'm skipping any detail about primers, reaction conditions and the fact that some sequences will be complementary (e.g. A→T, C→G, G→C, T→A).
Let's suppose we want to determine the sequence in TTCAGT.
To do this, we make use of a DNA copying process called polymerase chain reaction (PCR). But the name here isn't important.
PCR will take our DNA and make millions of copies of it. This kind of PCR is good for one-to-many amplification but, in its basic form, is not that useful for us.
Normally, PCR works by using a template strand of DNA (that which is to be copied) and a protein called DNA polymerase (among others), which synthesizes a new strand on top of the template by stitching together a complementary sequence using free-floating nucleotides in the solution buffer.
However, we can change how the PCR copying process happens by adding a few extra molecular ingredients to the reaction buffer.
We add a small amount of "special" nucleotides (A*, C*, G* and T*), each of which terminates the PCR copy reaction when it is incorporated. These special nucleotides are available to the PCR machinery in the same way that the regular nucleotides A, C, G and T are. However, because the special bases are present at a much lower concentration (e.g. 1/100), they are incorporated into the new strand with a low probability.
In this new copy reaction, we will get all the possible subsequences that start at the first base:
T* TT* TTC* TTCA* TTCAG* TTCAGT*
For example, in the copied sequence TTC*, PCR has incorporated two regular T's followed by the terminating C*.
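The fragment list above can be reproduced with a short sketch. It is only an illustration: the function names and the per-base termination probability `p_terminate` are assumptions (1/101 is what you would get if terminators at 1/100 the concentration were picked up as readily as regular bases), and, like the explanation, it ignores primers and complementarity.

```python
import random

def terminated_fragments(template):
    """Every prefix of the template, ending in a terminating base (marked *)."""
    return [template[:k] + "*" for k in range(1, len(template) + 1)]

def copy_once(template, p_terminate, rng):
    """Copy the template base by base and stop when a terminator is incorporated.

    p_terminate is an assumed per-base chance of picking a special nucleotide.
    """
    for k in range(1, len(template) + 1):
        if rng.random() < p_terminate:
            return template[:k] + "*"
    return template  # full-length copy, no terminator incorporated

fragments = terminated_fragments("TTCAGT")
print(" ".join(fragments))  # T* TT* TTC* TTCA* TTCAG* TTCAGT*

# Simulate many copy reactions with an assumed termination probability.
rng = random.Random(0)
copies = [copy_once("TTCAGT", 1 / 101, rng) for _ in range(100_000)]
terminated = [c for c in copies if c.endswith("*")]
print(len(terminated), "of", len(copies), "copies were terminated")
```

Most simulated copies run to full length. Only the small terminated fraction carries a label and contributes to the signal.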
We now take these fragments (which are all floating around in a solution buffer) and order them by size using gel electrophoresis.
Briefly, this process takes advantage of the fact that (a) DNA molecules are negatively charged and (b) smaller molecules diffuse faster through a gel matrix than larger ones.
We diffuse the DNA molecules through a polyacrylamide gel. But waiting for diffusion would take forever. To speed things up, we apply voltage across the gel. This pulls the negatively charged DNA molecules to the positive terminal. Shorter fragments pass through the gel with minimal hindrance, but larger ones occasionally get caught up and temporarily stuck in the gel matrix and thus take longer to pass through.
If all this happens in a capillary, we get a procession of size-ordered fragments coming out the other end.
Finally, remember how I said that these terminating nucleotides were "special"? They fluoresce under a laser. We use this light to detect the fragment — which shows up as a fluorescence peak.
Ideally, if the signal is clean, we will see a (more or less) uniformly spaced series of smudges on the gel. The relative positions of the peaks tell us which DNA fragment comes next (e.g. TT* and TTCAGT* are separated by 3 peaks that correspond to TTC*, TTCA* and TTCAG*).
We are able to tell the bases apart because we run four parallel and independent copy processes, each having access to only one of the terminating nucleotides. For example, the T*, TT* and TTCAGT* peaks would all show up in the T* reaction but not in the other reactions.
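Reading the four reactions together amounts to sorting the peaks by fragment length. Here is a minimal sketch of that idea, assuming a perfectly clean signal. The helper names `lane_peaks` and `read_sequence` are hypothetical.

```python
def lane_peaks(template):
    """Fragment lengths at which each single-terminator reaction shows a peak."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(template, start=1):
        lanes[base].append(length)
    return lanes

def read_sequence(lanes):
    """Merge the four lanes by fragment length and read off the bases."""
    peaks = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in peaks)

lanes = lane_peaks("TTCAGT")
print(lanes["T"])            # [1, 2, 6] -> the T*, TT* and TTCAGT* fragments
print(read_sequence(lanes))  # TTCAGT
```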
This used to all be done manually, but by the late 1990s and early 2000s it all happened inside automated sequencers.
One of these sequencers was an ABI 3700. Below I show what the screen interface looked like during a run. Traditionally, the color assignments to the bases were A (green), C (blue), G (black/yellow) and T (red). Pure RGB for intensity.
The posters show `\pi` up to the Feynman Point, which is the run of six 9's at decimal places 762–767. This is a great place to stop because of the unexpected pattern of 9's at the end.
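If you want to reproduce the digits yourself, here is one possible sketch. It uses Machin's formula with integer arithmetic, which is my choice for illustration and not necessarily how the digits on the posters were obtained. The function names and the number of guard digits are assumptions.

```python
def arctan_inverse(x, scale):
    """arctan(1/x) * scale, from the Taylor series in integer arithmetic."""
    term = scale // x
    total = term
    x_squared = x * x
    divisor, sign = 3, -1
    while term:
        term //= x_squared
        total += sign * (term // divisor)
        divisor += 2
        sign = -sign
    return total

def pi_digits(n, guard=10):
    """First n digits of pi (including the leading 3) as a string."""
    scale = 10 ** (n - 1 + guard)  # extra guard digits absorb rounding error
    # Machin: pi/4 = 4*arctan(1/5) - arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inverse(5, scale) - arctan_inverse(239, scale))
    return str(pi_scaled // 10 ** guard)

digits = pi_digits(768)
print(digits[:10])  # 3141592653
print(digits[-6:])  # 999999
```

With the leading 3 counted as the first digit, decimal places 762–767 are `digits[762:768]`, so 768 digits end exactly at the last 9.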
Each digit is encoded by two bases:
digit  0   1   2   3   4   5   6   7   8   9
bases  GA  CA  TC  TT  GT  GC  AA  CC  TA  GG
With this scheme, 3.14 reads as TTCAGT. Hence, the title of the art "TTCAGT: a sequence of digits".
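The mapping is a simple lookup. A minimal sketch follows, with `encode` and `decode` as illustrative names. Skipping the decimal point is an assumption made so that "3.14" can be passed in directly.

```python
DIGIT_TO_BASES = {
    "0": "GA", "1": "CA", "2": "TC", "3": "TT", "4": "GT",
    "5": "GC", "6": "AA", "7": "CC", "8": "TA", "9": "GG",
}
BASES_TO_DIGIT = {bases: digit for digit, bases in DIGIT_TO_BASES.items()}

def encode(digits):
    """Two bases per digit; anything that is not a digit (e.g. '.') is skipped."""
    return "".join(DIGIT_TO_BASES[d] for d in digits if d.isdigit())

def decode(bases):
    """Read the sequence back two bases at a time."""
    pairs = (bases[i:i + 2] for i in range(0, len(bases), 2))
    return "".join(BASES_TO_DIGIT[pair] for pair in pairs)

print(encode("3.14"))    # TTCAGT
print(decode("TTCAGT"))  # 314
```

Because every digit maps to a distinct pair, the encoding can be reversed without ambiguity.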
This encoding was chosen so that the four bases appear in the sequence as evenly as possible. The number of peaks per base on the trace is
A 381 C 381 G 390 T 384
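These totals can be checked from the digit frequencies alone. The sketch below takes the digit counts for the first 768 digits from the Huffman table further down. The variable names are illustrative.

```python
from collections import Counter

DIGIT_TO_BASES = {
    0: "GA", 1: "CA", 2: "TC", 3: "TT", 4: "GT",
    5: "GC", 6: "AA", 7: "CC", 8: "TA", 9: "GG",
}
# How often each digit occurs among the first 768 digits of pi.
DIGIT_COUNTS = {1: 88, 9: 85, 2: 81, 4: 79, 3: 76, 6: 75, 8: 72, 7: 71, 0: 71, 5: 70}

base_counts = Counter()
for digit, n in DIGIT_COUNTS.items():
    for base in DIGIT_TO_BASES[digit]:
        base_counts[base] += n  # a digit like 6 (AA) adds two A's per occurrence

print(dict(sorted(base_counts.items())))
# {'A': 381, 'C': 381, 'G': 390, 'T': 384}
```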
I fixed 9 to be GG because G is traditionally shown as black in Sanger traces and I wanted to end on this color.
I also fixed 3 to be TT (traditionally red) so that the trace starts with two red (or magenta) peaks.
There are many other encodings possible.
One kind of encoding is Huffman coding, which builds a tree of unique, prefix-free codes from an alphabet of symbols and gives shorter codes to more frequent symbols. Check out the paper Toward a Better Compression for DNA Sequences Using Huffman Encoding. Try the online Huffman encoder.
Here's one of the optimal Huffman encodings of the first 768 digits of `\pi` into nucleotides.
digit  symbol  bases  count    share
1      1       C      88/768   11.5%
9      0       A      85/768   11.1%
2      33      TT     81/768   10.5%
4      32      TG     79/768   10.3%
3      31      TC     76/768    9.9%
6      30      TA     75/768    9.8%
8      23      GT     72/768    9.4%
7      22      GG     71/768    9.2%
0      21      GC     71/768    9.2%
5      20      GA     70/768    9.1%
The most common digits are 1 and 9, so these can be encoded by a single base (C and A, respectively). The remaining digits need two bases.
If we encoded each digit with two bases, then we'd need a string of 1,536 bases. But with the Huffman encoding, we only need 1,363 bases because we now realize a savings of 88 bases for 1 (which is now encoded by one base instead of two) and 85 bases for 9.
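The code lengths and the 1,363 total can be reproduced with a small four-symbol (quaternary) Huffman sketch. It is a generic textbook construction, not necessarily the tool used for the table. The function name `huffman_code_lengths` is illustrative, and the sketch computes only code lengths, not the specific base assignments.

```python
import heapq
from itertools import count

DIGIT_COUNTS = {1: 88, 9: 85, 2: 81, 4: 79, 3: 76, 6: 75, 8: 72, 7: 71, 0: 71, 5: 70}

def huffman_code_lengths(freqs, arity=4):
    """Code length per symbol for an `arity`-ary Huffman code."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(weight, next(tiebreak), [symbol]) for symbol, weight in freqs.items()]
    # Pad with zero-weight dummies so every merge combines exactly `arity` nodes.
    while (len(heap) - 1) % (arity - 1) != 0:
        heap.append((0, next(tiebreak), []))
    heapq.heapify(heap)
    lengths = {symbol: 0 for symbol in freqs}
    while len(heap) > 1:
        merged_weight, merged_symbols = 0, []
        for _ in range(arity):
            weight, _order, symbols = heapq.heappop(heap)
            merged_weight += weight
            merged_symbols += symbols
        for symbol in merged_symbols:
            lengths[symbol] += 1  # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (merged_weight, next(tiebreak), merged_symbols))
    return lengths

lengths = huffman_code_lengths(DIGIT_COUNTS)
fixed_total = 2 * sum(DIGIT_COUNTS.values())
huffman_total = sum(DIGIT_COUNTS[d] * lengths[d] for d in DIGIT_COUNTS)
print(fixed_total, huffman_total)  # 1536 1363
```

With ten digits and four bases, no padding is needed: the four rarest digits merge first, the next four merge second, and the two most common digits (1 and 9) end up directly under the root.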
For simplicity, the posters use two bases per digit.
The peaks were generated from a simple model that drew each peak as a Normal distribution.
The peaks that corresponded to a digit each had a mean height, width, and position, which were perturbed on a peak-by-peak basis using random values drawn from a Normal distribution.
For example, the peak height mean was `\bar{h} = 0.6` times row height with a standard deviation of `\sigma_h = 0.1\bar{h}`. The width of each peak was `\bar{w} = 0.15S`, where `S` is the spacing between peaks, with a standard deviation of `\sigma_w = 0.1\bar{w}`. The position standard deviation was `\sigma_x = 0.075S`.
For the last 20 peaks (10 digits), the peak height is reduced and the width is increased to taper off the signal.
For each signal peak, up to four noise peaks were added to the signal. The peaks were positioned at horizontal offsets of `-2, -1, +1, +2` peak spacings. Neighbour error peaks (offset by `-1` and `+1`) had a 50% probability of being drawn and the next-nearest neighbour error peaks (offset by `-2` and `+2`) had a 25% probability.
The error peaks were on average 10% (neighbours) or 5% (next-nearest neighbours) of the height of the signal peaks.
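The peak model described above can be sketched roughly as follows. This is not the code used for the posters. It assumes that the peak "width" is the standard deviation of the Normal curve, gives noise peaks a fixed relative height and the same width as their signal peak, and leaves out the tapering of the last 20 peaks and the base colours. All names are illustrative.

```python
import math
import random

# (offset in peak spacings, probability of being drawn, height relative to signal)
NOISE_PEAKS = [(-2, 0.25, 0.05), (-1, 0.50, 0.10), (+1, 0.50, 0.10), (+2, 0.25, 0.05)]

def make_peaks(n_peaks, spacing=1.0, row_height=1.0, seed=0):
    """Return a list of (kind, center, height, width) tuples."""
    rng = random.Random(seed)
    mean_h = 0.6 * row_height
    mean_w = 0.15 * spacing
    peaks = []
    for i in range(n_peaks):
        height = rng.gauss(mean_h, 0.1 * mean_h)
        width = rng.gauss(mean_w, 0.1 * mean_w)
        center = rng.gauss(i * spacing, 0.075 * spacing)
        peaks.append(("signal", center, height, width))
        for offset, prob, rel_height in NOISE_PEAKS:
            if rng.random() < prob:
                # Simplification: fixed relative height, signal peak's width.
                peaks.append(("noise", center + offset * spacing, rel_height * height, width))
    return peaks

def trace(x, peaks):
    """Sum of Normal-shaped peaks at position x (width used as the std. dev.)."""
    return sum(h * math.exp(-0.5 * ((x - c) / w) ** 2) for _, c, h, w in peaks)

peaks = make_peaks(52, seed=1)
signal = [p for p in peaks if p[0] == "signal"]
noise = [p for p in peaks if p[0] == "noise"]
print(len(signal), "signal peaks,", len(noise), "noise peaks")
```

Sampling `trace(x, peaks)` on a fine grid of `x` values gives one row of the trace. Changing `seed` gives a different but statistically similar row.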
These background peaks arise during the Sanger reaction for a variety of reasons. Typically the start of the trace is messy, but I don't account for this.
The posters are designed for 50 cm × 50 cm (19.7" × 19.7"). At this size the title font (Futura Medium) is 16 pt and the legend font (Futura Book) is 12 pt.
You can easily display the poster at half this size and still have the legend font readable.
There are 30 rows with up to 52 peaks per row. The first and last rows have fewer peaks.
It is not certain that everything is uncertain. —Blaise Pascal
We have already explored how we can mitigate bias caused by confounding variables in observational studies using propensity score (PS) matching (PSM) and propensity score weighting (PSW). However, any statistical model is only as good as its assumptions and, if it is specified incorrectly, it can itself produce biased estimates of the treatment effect.
This month, we explore double robustness, a powerful statistical concept that provides a valuable “safety net” against the risk of an incorrect model. It offers two opportunities, instead of just one, to obtain a valid estimate of the treatment effect — making it possible to draw credible causal inferences from observational data without having to depend on a single set of modeling assumptions.
Kurz, C.F., Krzywinski, M. & Altman, N. (2026) Points of significance: Double Robustness. Nat. Methods 23:868–869.
My cover design on the 7 April 2026 Nature Biotechnology issue shows the dendrogram that represents a cluster of uniquely expressed (or downregulated) genes in human naive stem cells induced from such cells. Within each dendrogram block, the genomic barcode sequence (sampled from Supplementary Table 1) is depicted with a Code 39 barcode. The highlighted barcode is one of those used for cell isolation.
Ishiguro S. et al. A multi-kingdom genetic barcoding system for precise clone isolation (2026) Nature Biotechnology 44:616–629.
Browse my gallery of cover designs.
Celebrate π Day (March 14th) and enjoy the art — but only if you're part of the 5%.
Go ahead, see what you can't see.
Authentic and accurate images of Ishihara's test plates photographed (and lovingly color-corrected) from the 38-plate Ishihara's Tests for Colour Deficiency.
I also provide the position, size, and color of each circle on each test plate.
What immortal hand or eye, could frame thy fearful symmetry? — William Blake, "The Tyger"
This month, we look at symmetric regression, which, unlike simple linear regression, is reversible: it remains unaltered when the variables are swapped.
Simple linear regression can summarize the linear relationship between two variables `X` and `Y` — for example, when `Y` is considered the response (dependent) and `X` the predictor (independent) variable.
However, there are times when we are not interested in (or able to make) a distinction between dependent and independent variables, either because they have the same importance or the same role. This is where symmetric regression can help.
Luca Greco, George Luta, Martin Krzywinski & Naomi Altman (2025) Points of significance: Symmetric alternatives to the ordinary least squares regression. Nat. Methods 22:1610–1612.
Fuelled by philanthropy, findings into the workings of BRCA1 and BRCA2 genes have led to groundbreaking research and lifesaving innovations to care for families facing cancer.
This set of 100 one-of-a-kind prints explores the structure of these genes. Each artwork is unique: if you put them all together, you get the full sequence of the BRCA1 and BRCA2 proteins.