On March 14th celebrate `\pi` Day. Hug `\pi`—find a way to do it.
Those who favour `\tau=2\pi` will have to postpone celebrations until June 28th. That's what you get for thinking that `\pi` is wrong. I sympathize with this position and have `\tau` day art too!
If you're not into details, you may opt to party on July 22nd, which is `\pi` approximation day (`\pi` ≈ 22/7). It's 20% more accurate than the official `\pi` day!
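The 20% figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that "official" `\pi` day stands for the approximation 3.14 and that accuracy is measured by absolute error; the variable names are mine.

```python
import math

# Absolute error of each calendar approximation of pi.
err_pi_day = abs(math.pi - 3.14)        # March 14th -> 3.14
err_approx_day = abs(math.pi - 22 / 7)  # July 22nd  -> 22/7

# Fractional reduction in error when using 22/7 instead of 3.14.
improvement = 1 - err_approx_day / err_pi_day

print(f"error of 3.14: {err_pi_day:.6f}")
print(f"error of 22/7: {err_approx_day:.6f}")
print(f"22/7 reduces the error by about {improvement:.0%}")
```

The reduction comes out at roughly one fifth, which is where the 20% comes from.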
Finally, if you believe that `\pi = 3`, you should read why `\pi` is not equal to 3.
Well—well; the sad minutes are moving,
Though loaded with trouble and pain;
And some time the loved and the loving
Shall meet on the mountains again!
—Emily Bronte
Welcome to this year's celebration of `\pi` and mathematics.
The theme this year is Sanger sequencing — old-school, one base at a time.
This year's `\pi` poem is Loud Without The Wind Was Roaring by Emily Bronte.
This year's `\pi` day song is Movements by Luca Musto.
Also, the tabbed menu above is full. Gasp.
I work in a genome center, so it was just a matter of time before one of the Pi Day celebrations encoded the digits of `\pi` as a sequence of nucleotides (A, T, G and C). It took 12 years.
Our first sequencer was the MegaBACE 1000, which now exists only in old photos.
Unfortunately, there was nothing "mega" about it. You were lucky to sequence 500 contiguous bases of DNA. So, halfakiloBACE?
Under the hood, the sequencing was done with the Sanger method.
Here's a simplified explanation of how Sanger sequencing works.
Let's suppose we want to determine the sequence in TTCAGT.
To do this, we make use of a DNA copying process called polymerase chain reaction (PCR), though the name isn't important here. PCR will take our DNA and make millions of copies of it. But this kind of PCR is a simple copy, which is not that useful for us.
What we do is change how the copying process happens. Instead of PCR simply assembling the same sequence as we fed it (e.g. TTCAGT), we'll mix into the process a small fraction of "special" nucleotides (A*, C*, G* and T*) which will terminate the copy reaction.
This way, we will get all the possible subsequences that start at the first base:
T* TT* TTC* TTCA* TTCAG* TTCAGT*
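The set of terminated fragments is simply every prefix of the sequence. Here is a minimal sketch that follows the simplified picture above (it ignores strand complementarity and reaction chemistry); the function name `terminated_fragments` is my own.

```python
def terminated_fragments(template: str) -> list[str]:
    """Return every prefix of `template`, with '*' marking the terminating base."""
    return [template[:i] + "*" for i in range(1, len(template) + 1)]


fragments = terminated_fragments("TTCAGT")
print(" ".join(fragments))
```

For TTCAGT this gives the six fragments listed above, one for each position at which the copy can stop.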
We now take these fragments (which are all in the same solution) and order them by size using gel electrophoresis. Briefly, in this process we take an agarose gel, apply a voltage across it and wait for the DNA molecules (which are negatively charged) to be pulled through the gel by the electric field. After some time, the shorter fragments (which pass through the gel matrix with minimal hindrance) travel further than the longer fragments (which get caught up in the gel matrix).
If all this happens in a capillary, we get a procession of size-ordered fragments coming out the other end.
Finally, remember how I said that these terminating nucleotides were "special"? Well, they fluoresce under a laser. We use this light to detect the fragment — which shows up as a fluorescence peak.
We are able to tell the bases apart because we run four parallel and independent copy processes, each having access to only one of the terminating nucleotides.
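To make the read-out concrete, here is a toy simulation of the four reactions. It is a sketch of the simplified explanation above and not of real sequencer software: each reaction only produces fragments that end in its own terminating base, sorting by length stands in for electrophoresis, and the names `run_reaction` and `call_bases` are hypothetical.

```python
def run_reaction(template: str, terminator: str) -> list[str]:
    """One reaction: copies stop only at positions holding `terminator`."""
    return [template[: i + 1] for i, base in enumerate(template) if base == terminator]


def call_bases(template: str) -> str:
    # Four independent reactions, one per terminating nucleotide.
    lanes = {base: run_reaction(template, base) for base in "ACGT"}
    # Each fragment becomes a peak: (fragment length, reaction it came from).
    peaks = [(len(frag), base) for base, frags in lanes.items() for frag in frags]
    # Electrophoresis orders fragments by size, shortest first.
    peaks.sort()
    # Reading the reaction label of each peak in order recovers the sequence.
    return "".join(base for _, base in peaks)


print(call_bases("TTCAGT"))
```

Because every fragment length occurs in exactly one reaction, the order of the peaks together with the reaction each came from is enough to spell out the sequence.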
This all used to be done manually, but by the late 1990s and early 2000s it happened inside automated sequencers.
One of these sequencers was an ABI 3700. Below I show what the screen interface looked like during a run. Traditionally, the color assignments to the bases were A (green), C (blue), G (black/yellow) and T (red). Pure RGB for intensity.
The posters show `\pi` up to the Feynman Point, which is a run of six 9s at decimal places 762–767.
Each digit is encoded by two bases:
0 GA 1 CA 2 TC 3 TT 4 GT 5 GC 6 AA 7 CC 8 TA 9 GG
With this scheme, 3.14 reads as TTCAGT. Hence the title of the art: "TTCAGT: a sequence of digits".
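The digit-to-base table can be written as a small lookup. This is a minimal sketch using the encoding listed above; the names `DIGIT_TO_BASES`, `encode` and `decode` are mine, and dropping the decimal point is an assumption that matches how 3.14 becomes TTCAGT.

```python
# Two-base code for each decimal digit, as listed in the text.
DIGIT_TO_BASES = {
    "0": "GA", "1": "CA", "2": "TC", "3": "TT", "4": "GT",
    "5": "GC", "6": "AA", "7": "CC", "8": "TA", "9": "GG",
}
BASES_TO_DIGIT = {bases: digit for digit, bases in DIGIT_TO_BASES.items()}


def encode(number: str) -> str:
    """Encode the digits of `number`, skipping anything that is not a digit."""
    return "".join(DIGIT_TO_BASES[ch] for ch in number if ch in DIGIT_TO_BASES)


def decode(sequence: str) -> str:
    """Read the sequence two bases at a time and map each pair back to a digit."""
    return "".join(BASES_TO_DIGIT[sequence[i : i + 2]] for i in range(0, len(sequence), 2))


print(encode("3.14"))
print(decode("TTCAGT"))
```

Since all ten two-base codes are distinct, the encoding is reversible: decoding TTCAGT gives back the digits 314.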
This encoding was chosen so that the counts of the four bases in the sequence are as balanced as possible. The number of peaks per base on the trace is:
A 381 C 381 G 390 T 384
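A tally like this can be recomputed from scratch. The sketch below makes some assumptions: the sequence covers the leading 3 plus 767 decimal places (through the end of the Feynman Point), the digits come from Machin's formula in integer arithmetic (my choice, not necessarily how the posters were generated), and the helper names are hypothetical.

```python
from collections import Counter

# Two-base code for each decimal digit, as listed in the text.
DIGIT_TO_BASES = {
    "0": "GA", "1": "CA", "2": "TC", "3": "TT", "4": "GT",
    "5": "GC", "6": "AA", "7": "CC", "8": "TA", "9": "GG",
}


def arctan_inv(x: int, unity: int) -> int:
    """Return arctan(1/x) scaled by `unity`, using the alternating Taylor series."""
    term = unity // x
    total = term
    x2 = x * x
    n, sign = 3, -1
    while term:
        term //= x2
        total += sign * (term // n)
        n += 2
        sign = -sign
    return total


def pi_digits(decimals: int, guard: int = 10) -> str:
    """Digits of pi as a string '3141...', with `decimals` digits after the leading 3."""
    unity = 10 ** (decimals + guard)  # guard digits absorb rounding in the series
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    scaled = 16 * arctan_inv(5, unity) - 4 * arctan_inv(239, unity)
    return str(scaled // 10**guard)


digits = pi_digits(767)  # through the end of the Feynman Point
sequence = "".join(DIGIT_TO_BASES[d] for d in digits)
counts = Counter(sequence)

print(len(digits), "digits ->", len(sequence), "bases")
print({base: counts[base] for base in "ACGT"})
```

The 768 digits give 1,536 bases, which is also the sum of the four counts quoted above. If the digit range assumed here matches the one used for the posters, the per-base tally should agree with those counts as well.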
I fixed 9 to be GG because G is traditionally shown as black in Sanger traces and I wanted to end on this color.
I also fixed 3 to be TT (traditionally red) so that the trace starts with two red (or magenta) peaks.
Explore the art posters.
Celebrate π Day (March 14th) and sequence digits like it's 1999. Let's call some peaks.
Points of Significance is an ongoing series of short articles about statistics in Nature Methods that started in 2013. Its aim is to provide clear explanations of essential concepts in statistics for a nonspecialist audience. The articles favor heuristic explanations and make extensive use of simulated examples and graphical explanations, while maintaining mathematical rigor.
Topics range from basic, but often misunderstood, such as uncertainty and P-values, to relatively advanced, but often neglected, such as the error-in-variables problem and the curse of dimensionality. More recent articles have focused on timely topics such as modeling of epidemics, machine learning, and neural networks.
In this article, we discuss the evolution of topics and details behind some of the story arcs, our approach to crafting statistical explanations and narratives, and our use of figures and numerical simulations as props for building understanding.
Altman, N. & Krzywinski, M. (2025) Crafting 10 Years of Statistics Explanations: Points of Significance. Annual Review of Statistics and Its Application 12:69–87.
I don’t have good luck in the match points. —Rafael Nadal, Spanish tennis player
In many experimental designs, we need to keep in mind the possibility of confounding variables, which may give rise to bias in the estimate of the treatment effect.
If the control and experimental groups aren't matched (or, roughly, similar enough), this bias can arise.
Sometimes this can be dealt with by randomizing, which on average can balance this effect out. When randomization is not possible, propensity score matching is an excellent strategy to match control and experimental groups.
Kurz, C.F., Krzywinski, M. & Altman, N. (2024) Points of significance: Propensity score matching. Nat. Methods 21:1770–1772.
P-values combined with estimates of effect size are used to assess the importance of experimental results. However, their interpretation can be invalidated by selection bias when testing multiple hypotheses, fitting multiple models or even informally selecting results that seem interesting after observing the data.
We offer an introduction to principled uses of p-values (targeted at the non-specialist) and identify questionable practices to be avoided.
Altman, N. & Krzywinski, M. (2024) Understanding p-values and significance. Laboratory Animals 58:443–446.
Variability is inherent in most biological systems due to differences among members of the population. Two types of variation are commonly observed in studies: differences among samples and the “error” in estimating a population parameter (e.g. mean) from a sample. While these concepts are fundamentally very different, the associated variation is often expressed using similar notation—an interval that represents a range of values with a lower and upper bound.
In this article we discuss how common intervals are used (and misused).
Altman, N. & Krzywinski, M. (2024) Depicting variability and uncertainty using intervals and error bars. Laboratory Animals 58:453–456.
We'd like to say a ‘cosmic hello’: mathematics, culture, palaeontology, art and science, and ... human genomes.