2026 π Day latest news buy art
syncopation & accordionCafe de Flore (Doctor Rockit)like France, but no dog poopmore quotes
very clickable

genomes + non-existence

Dark Matter of the Genome—sequences that are not there

Failing to fetch me at first keep encouraged,
Missing me one place search another,
I stop somewhere waiting for you.
—Walt Whitman, Song of Myself

One of the challenges in visualization is cueing what data is not being displayed, which is a common requirement when transitioning from overview to details, such as by zooming or filtering.

An interesting take on this is the enumeration and analysis of data that isn't merely not displayed but actually not there.

You can download the frequencies of all `k`-mers in the human genome for `k=1...12`.

I had an idea to try this and Michael Schatz pointed out an efficient algorithm [1] to find such missing sequences, or "nullomers" or, more generally, "unwords".

[1] Julia Herold, Stefan Kurtz and Robert Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics (2008) 9:167

I wanted to create some phylogenetic trees based on these nullomers but discovered that this has also been done [2]. It turns out that the nullomers tell us something about the sequence without being visible. The genome's dark matter?

[2] Davide Vergni, Daniele Santoni. Nullomers and High Order Nullomers in Genomic Sequences. (2016) PLoS ONE 11(12): e0164540.

At some point I think I'll make some art around this concept.

sequences that don't exist

Because the human genome sequence is thankfully finite, there are some sequences that do not appear in it. These are (vaguely creepily) coined nullomers.

For example, the hg38 human reference assembly (Dec 2013), has 3,049,315,783 bases across its 455 assembled pieces, which includes the 24 chromosomes (e.g., chr1–chr22, chrX, chrY), mitochondrial chromosome (chrM), alternative and unanchored scaffolds (e.g. chr1_GL383520v2_alt and chr1_KI270706v1_random.fa). In this number I'm exluding all the padding in the sequence, which is indicated by the base "n".

Not surprisingly, all possible `k`-mers (sequences of length `k`) for `k=2` exist. There are 3,018,334,391 of them and they break down in frequency as follows

gc  130065644 4.31%
ac  153830681 5.10%
gt  154194068 5.11%
cc  158048073 5.24%
gg  159302235 5.28%
tc  181782675 6.02%
ga  182772932 6.06%
ta  197567087 6.55%
ct  213517855 7.07%
ag  213785914 7.08%
ca  221181041 7.33%
tg  222266728 7.36%
at  233904527 7.75%
aa  296763858 9.83%
tt  299351073 9.92%
   ---------- -----
   3018334391  100%

The least common 2-mer is "gc", which isn't surprising since the GC content of the human genome is about 41%, and both "g" and "c" appear in essentailly equal measure.

c  623727342 20.45%
g  626335137 20.54%
a  898285419 29.46%
t  900967885 29.55%
  ---------- ------
  3049315783   100%

You can now ask "What is the smallest value of `k` for which we cannot find a `k`-mer in the human reference sequence?"

all `k`-mers up to `k=10` exist

It turns out that all possible sequences up to length 10 exist in the reference.

Depending on your familiarity with the genome and depth of intuition about statistics, this might be not at all surprising, mildly suprising or just plain confusing.

For example, if we consider `k=8`, there are `4^8 = 65536` such sequences. Alphabetically,

aaaaaaaa
aaaaaaac
aaaaaaag
aaaaaaat
aaaaaaca
...
ttttttgt
ttttttta
tttttttc
tttttttg
tttttttt

All these exist. The 8-mer "cgtacgct" is the most rare and "tttttttt" the most common.

cgtacgcg 304 0.00000997%
cgcgtacg 315 0.00001033%
cgtcgacg 322 0.00001056%
tcgacgcg 333 0.00001092%
cgatcgac 356 0.00001167%
...
tgtgtgtg 899632 0.02950286%
tatatata 925963 0.03036637%
atatatat 967263 0.03172078%
aaaaaaaa 4743237 0.15555144%
tttttttt 4773492 0.15654364%

If we repeat this all the way up to `k=10`, we find that all sequences exist. For example, there are `4^{10}=1,048,576` 10-mers and they also all exist.

cgacgatcga 2 0.00000007%
tcgcgacgta 2 0.00000007%
cgtaacgcgc 2 0.00000007%
tcgacgtacg 2 0.00000007%
tattcgcgcg 2 0.00000007%
...
cacacacaca 602118 0.01974610%
gtgtgtgtgt 604553 0.01982595%
tgtgtgtgtg 608965 0.01997064%
aaaaaaaaaa 3085421 0.10118453%
tttttttttt 3105565 0.10184514%

The least common 10-mer is "cgacgatcga", which appears only twice. In fact, the 5 10-mers shown above are the only ones that appear only twice. No 10-mer appears once.

Once we look at 11-mers, we finally see a sequence that doesn't appear. There are `4^{11} = 4,194,304` 11-mers of which 2,728 (0.07%) appear once and 4,705 (0.11%) appear twice. The most common ones are

ctgggattaca 409261 0.01342148%
tgtaatcccag 409332 0.01342380%
ctgtaatccca 420484 0.01378953%
tgggattacag 420548 0.01379163%
cacacacacac 512570 0.01680943%
gtgtgtgtgtg 517147 0.01695953%
acacacacaca 549346 0.01801548%
tgtgtgtgtgt 555966 0.01823258%
aaaaaaaaaaa 2563770 0.08407734%
ttttttttttt 2580640 0.08463058%

once you take out what's not there, you're left with what's there

Interestingly (maybe) 941 11-mers don't appear in the genome. I've listed them alphabetically below. The first one is the most interesting one, as it is the first alphabetically ordered sequence that does not appear in the genome: aaccgacgcgt.

If you look at the statistics of all the bases of these missing 11-mers, you find that they're the inverse of what is there.

  # 11-kmer nullomers     # bases in hg38
c 3166 30.59%             623727342 20.45%
g 3182 30.74%             626335137 20.54%
a 1986 19.19%             898285419 29.46%
t 2017 19.49%             900967885 29.55%

For example, since 41% of the bases in the genome are "g" or "c" we find `0.305 \approx (1-0.41)/2` of the missing bases to be "g" or "c" and similarly `0.19 \approx (1-0.59)/2` to be "a" or "t".

# all 11-mers sorted alphabetically that do not appear 
# in the human genome reference sequence (hg38 Dec 2013)
aaccgacgcgt aaccgcgcgta aaccgcgtacg aaccggttcgc aacgcgatacg aacgcgatcgg aacgcgcgaat aacgcggtcga aacgcgtagcg aacgcgtatcg aacggtcgtcg aacgttgcgcg aagcgtacgcg aagtcgcgcga aatacgcgcga aatacgtcgcg aatcgcgtacg aatcgcgtcga aatcgcgttcg aatcggttgcg aatcgtcgacg aatgtcgcgcg aattcgcgacg aattcgcgcga aattgtcgcga acacgcgtcga acccgtcgcga accgatacgcg accgatcgacg accgatcgtcg accgccgatcg accgcgacgaa accgcgcgatt accgcgtacgt accggtcggac accgtcgatcg accgtcgcgat accgtcgcgta accgttcgtcg acgaaacgtcg acgaatcgacg acgaccgcgta acgaccgttcg acgacgacgta acgacgccgta acgacgcgcta acgacgcgtat acgacggtacg acgatacgcga acgatacgcgg acgatacggcg acgatcggcga acgatcgtcgg acgatcgtgcg acgattcggcg acgcaacgtcg acgccggtcga acgcgacgatt acgcgacgcta acgcgacggta acgcgacgtaa acgcgataacg acgcgatacgt acgcgcgatac acgcgcgatat acgcgcgatta acgcggcgtaa acgcggtacga acgcggtacgt acgcgtaacga acgcgtaacgt acgcgtaccgg acgcgtacgat acgcgtcgtac acgctagtcga acgctcgaacg acggcgactaa acgggtcgacg acggtacgccg acggtacgtcg acggtcgagcg acggtcgcgta acgtaacgcgc acgtaccgtcg acgtacgaacg acgtacgatcg acgtagcgcgt acgtatcgccg acgtcgaccga acgtcgacgct acgtcgacgta acgtcgagtcg acgtcgcgata acgtcgcgcgt acgtcgcgtac acgtcggtcga acgtcgtacgc acgttaacgcg actaatcgcga actacgcgtcg acttcgcgcgt agcgtcgtacg agcgttcgacg agtacgtcgcg agtatcgcgac agtcgatcgcg agtcggtcgat agttacgcgcg ataccgcgtcg atacgaacgcg atacgacgcga atacgcgatcg atacgcgccga atacgcgcgac atacgcgcgag atacgcgttcg atacggcgcgt atacttcgccg atacttcgcgc atagactcgcg atatcgacgcg atatcgcgcgg atatcgcgcgt atatcgtcgcg atccgtcgcga atcgacgtgcg atcgcgaaacg atcgcgacgta atcgcgattcg atcgcgcgaag atcgcgcgtat atcgcgcgtcg atcgcggtcgt atcgcgtaacg atcgcgtaccg atcgcgtacgt atcgcgttccg atcggcgcgta atcgggtcgac atcggtacgcg atcggttgcga atcgtaccgcg atcgtccgacg atcgtcgaacg atcgtcgacga atcgtcgacgc atcgtcgatcg atcgtcgcgta atcgtctcgcg atgtatcgcgc atgtcgcgcga atgtcgtcgcg attcgacggcg attcgcgatcg attcgcgcgat attcgcggcgc attgtacgcgg atttcgacgcg atttcgcgacg atttcgcgcga atttcgtcgcg caacgcgctac cacgtcgcgaa cagtccgatcg catacgcgtcg catatcgcgcg ccgaatacgcg ccgaccgtacg ccgacgatcga ccgacgatcgt ccgacgcgata ccgacgcgatt ccgacgtaacg ccgacgtacga ccgatacgcga ccgatacgtcg ccgatagtcgg ccgatcgtccg ccgattacgcg ccgcgaatcga ccgcgacgtaa ccgcgataacg ccgcgatacga ccgcgcgatat ccgcgtaatcg ccgcgtcgaaa ccgcttaagcg ccggcgtaacg ccgtataacgg ccgtcgaaacg ccgtcgaacgc ccgtcgaccgg ccgtcgatcga ccgtcgccgaa ccgtcgcgtag ccgtcgcgtat ccgttacgtcg ccgttcgcacg ccgttgcgtcg cgaacggtcgc cgaacggtcgt cgaacggttcg cgaacgtacga cgaactaacgg cgaatacgcgc cgaataggcgg cgaatcgacga cgaatcgacgg cgaatcgagcg cgaatcgcgta cgaatcgtcga cgacagtacgg cgacatacgcg cgaccgactat cgaccgatacg cgaccgctata cgaccgtcgac cgaccgtcgat cgaccgttaac cgaccgttacg cgacgaacgag cgacgaacggt cgacgaccgta cgacgacgtat cgacgatacgg cgacgatcgaa cgacgatcgat cgacgatcggc cgacgattcgc cgacgcaacga cgacgcgacaa cgacgcgacat cgacgcgataa cgacgcgatac cgacgcgtaaa cgacgcgtata cgacgcgtcaa cgacgcgttac cgacgcgttta cgacgctatcg cgacggacgta cgacggatacg cgacggcgtat cgacgtaacgc cgacgtaacgg cgacgtaccga cgacgtaccgt cgacgtacgac cgacgtacgat cgacgtacggg cgacgtactcg cgacgtatcgg cgactacgcga cgactagtccg cgactatcgcg cgagcgtatcg cgagtcgaccg cgataccgcgt cgataccgtcg cgatacgaccg cgatacgcgag cgatacgcgat cgatacgcgga cgatacggcgt cgatacgttcg cgatagacggt cgatagcgcgt cgatagtcgga cgatagtcggt cgatatacgcg cgatatgcgcg cgatccgatac cgatccgtacg cgatcgaccga cgatcgaccgt cgatcgacgtc cgatcgcgaac cgatcggcgta cgatcggtcgt cgatcgtacgc cgatcgtacgg cgatcgtcggg cgatcgtcgta cgatcgtgcga cgatgcgcgac cgattacgcga cgattacgcgc cgattacgcgt cgattatcgcg cgattcgacgt cgattcggcga cgattgaccgc cgcaacggtac cgcaatagtcg cgcatatcgcg cgcattcgtcg cgccgaaacga cgccgatacga cgccgatcgaa cgccgatcgta cgccgtaatcg cgccgtacgta cgcgaaacgat cgcgaacgatt cgcgaacgtta cgcgaatcgaa cgcgaattcgt cgcgaccgaat cgcgaccgtaa cgcgacgaact cgcgacgatta cgcgacgcata cgcgacgctaa cgcgacgctat cgcgacggtta cgcgacgtaac cgcgacgttaa cgcgacttacg cgcgagcgata cgcgagcgtag cgcgataacga cgcgataacgc cgcgataattg cgcgatacacg cgcgatcaacg cgcgatcggta cgcgattcgat cgcgattgtcg cgcgcacgata cgcgcataaga cgcgcataata cgcgcgatatg cgcgcgatatt cgcgcgtatta cgcgcgttata cgcgctatacg cgcgctatccg cgcggtacgaa cgcggtacgta cgcggtatacg cgcggtcgatt cgcggttcgtt cgcgtaacgcg cgcgtaatacg cgcgtaatcga cgcgtaatcgt cgcgtaccgga cgcgtaccgtt cgcgtatcgcg cgcgtatcggt cgcgtatcgtt cgcgtattcgg cgcgtcaaacg cgcgtcgagac cgcgtcgcaat cgcgtcggatg cgcgtcgtact cgcgtcgttat cgcgttacgcg cgcgttatacg cgcgttattcg cgcgtttcgta cgctaacgtcg cgctcgacgaa cgctcgacgta cgctcgcgtat cgctcgtaacg cgctcgtacga cgctcgtatcg cgcttacgcga cggaccatacg cggactatcga cggagcgtacg cggatcgacga cggcgaacgta cggcgatctaa cggcgcgatag cggcgtaccga cggcgttaacg cggcttacgcg cggtaatcgcg cggtacgatcg cggtacgccga cggtatacggg cggtccgcata cggtcgaccgt cggtcgataac cggtcgattcg cggtcgcacga cggtcgcgtaa cggtcggtacg cggtcgtacga cggttacgtcg cggttcgatcg cggttcgtacg cgtaacgagcg cgtaacgccgt cgtaacgcgca cgtaacgcgct cgtaacgcgtt cgtaacgctcg cgtaacgtccg cgtaacgttcg cgtaccggtct cgtacgaacgc cgtacgaatcg cgtacgaccgt cgtacgacgaa cgtacgacgct cgtacgatacg cgtacgatcgg cgtacgatcgt cgtacgatgcg cgtacgccgtt cgtacgcgaat cgtacgcgact cgtacgcgata cgtacgcgatc cgtacggacgc cgtacggcggt cgtacggtcga cgtacggtcgt cgtacgtcgcg cgtacgtcggc cgtacgtgacg cgtatacgacg cgtatacgcga cgtatagcgcg cgtatatcggc cgtatcgcgtc cgtatcggtcg cgtatgtcgcg cgtattacgcg cgtattcgacg cgtattcgcgc cgtattgcgcg cgtcaatcgcg cgtcacgcgta cgtccgatcgt cgtccgtcgaa cgtcgaacgat cgtcgaattcg cgtcgaccggt cgtcgaccgtc cgtcgacgatc cgtcgacgcgt cgtcgacgctt cgtcgacgtac cgtcgactacg cgtcgactatc cgtcgagacgt cgtcgagcatc cgtcgataggc cgtcgattcga cgtcgcgaagc cgtcgcgacta cgtcgcgatac cgtcgcgatag cgtcgcgtagg cgtcgcgtatg cgtcgcgtatt cgtcgctcgaa cgtcggaatcg cgtcggtacga cgtcggtcgac cgtcggtcgat cgtcggttacg cgtcgtaccgg cgtcgttatac cgtcgttcgac cgtctcgtacg cgtgcgaacta cgtgtcgaacg cgtgtcgacga cgttaaacgcg cgttacgacga cgttacgcgtc cgttacggcgt cgttacggtcg cgttacgtcgt cgttagtcgcg cgttccgtcga cgttcgacgaa cgttcgacgcg cgttcgacggc cgttcgacggg cgttcgatcgt cgttcgcgaaa cgttcgcggaa cgttcgcggta cgttcgcgtat cgttcgctagg cgttgcgatcg cgttgcgtcgt cgtttacgcga cgtttcgaccg cgtttcgcgaa cgtttcgtcga ctaccgccgta ctacgaacggt ctacgcgcgaa ctacgcgcgac ctacgcgcgta ctacgcgtcga ctacgtcgacg ctactcgatcg ctagacgtacg ctataccgcgc ctatacgtccg ctatcgcgtcg ctatcgtcgcg ctattcgcgcg ctcgaccgtcg ctcgacgcgta ctcgacgtacg ctcgatcgacg ctcgatcgtcg ctcgcgacgta ctcgcgcggta ctcgtcgagta ctcgttcgtcg cttcgcgaacg gaccgatacgc gacgacgatcg gacgcgataat gacgcgattcg gacgcgcatag gacgcgcgtat gacgcgtaacg gacgtaacgcg gacgtacggcg gacgtacgtcg gacgtcgaacg gatacgtcgcg gatacttcgcg gatagcgcgtt gatagtcgacg gatcgcgaacg gatcgcgcgta gatcggtccga gatcggtcgta gatcgtcggtc gcccgttcgta gccgatcgttg gcgaacgcgta gcgaattacgc gcgacgatacg gcgacgatcga gcgacgcgata gcgacgcgtta gcgacgtaacg gcgattgacga gcgcatatcgt gcgccggtata gcgcgaatcga gcgcgacgata gcgcgacgtta gcgcgtaccga gcgcgtacgat gcgcgttcgat gcggtacgcgt gcggtcgacga gcggtcgtacg gcgtaacgcgc gcgtacaacga gcgtacgtcgc gcgtcgacaat gcgtcgattcg gcgtcgattgt gcgtcgtacgc gcgttacgcgt gcgttcgacgg gcgttcgcgta gctacgaaccg gctatacgcgt gctatcgcgcg gctcgtatcgt ggcgcgctata ggcgtcgatcg ggcgtcgcaat ggtacgcgacc ggtacgcgtaa ggtataccgcg ggtcgacgcga ggtcgattcgc ggtcgattgcg ggttccgtacg gtaatcgacga gtaccggcgta gtacgaggtcg gtacgccggtt gtacgcgaccg gtacgcgagta gtacgcgcgta gtacgcgcgtt gtacgctcgac gtacggcgatc gtacggcgtac gtacggtcgcg gtacgttcgcg gtagcggtacg gtataacgcgg gtatagcgacg gtatccgatcg gtatcgaaccg gtatcgacgcg gtatcgcgcga gtatcgcgtcg gtatcgtcgcg gtatcgttgcg gtatgccgcga gtatgcgaacg gtattatcgcg gtattcgacgc gtccgacgcga gtccgacgtcg gtccgagcgta gtccgatcgcg gtccgttacgc gtcgaaccgac gtcgaacgacg gtcgaacgcga gtcgacgacta gtcgacgatcg gtcgacgcgaa gtcgacgcgac gtcgacgtacg gtcgatcggta gtcgattcgag gtcgcaattcg gtcgcacgcga gtcgcccgata gtcgcgacaat gtcgcgacgta gtcgcgcataa gtcgcgccgta gtcgcgcgtaa gtcgcggataa gtcgcgtcgca gtcgcgttacg gtcgctatcgt gtcgtaacgcg gtcgtacgcga gtcgttacgcg gttatgcgcga gttattcgcgc gttcgaacgcg gttcgatcgga gttcgtacgcg gttcgtagcga taaccgtcgac taacgcgatcg taacgcgccga taacgcgcgat taacgcgtcga taacgtcgcga taacgtcgcgc taagtcgcgcg taatcgcgtcg taatcgtcgac taattcgacgc taattcgcgcg tacccgcggtt taccgcgcgat taccggtcgta taccgtacgcg taccgttcgcg tacgacgcgat tacgacgtacg tacgagctcgt tacgatcgacg tacgatcgtcg tacgcacgcga tacgccgacgc tacgcgacacg tacgcgaccga tacgcgacgag tacgcgacgca tacgcgatcga tacgcgatcgc tacgcgatgcg tacgcgattcg tacgcgcgaac tacgcgcgaca tacgcgcgagc tacgcgcgata tacgcgcgatt tacgcgcggtc tacgcgcgtaa tacgcggtacg tacgcgtaacc tacgcgtacga tacgcgtatcg tacgcgtcacg tacgcgtcgac tacgctacggc tacgctcggac tacggcgtacg tacgggcgacg tacgggcgtcg tacggtcgcga tacgtacgcga tacgtccgacg tacgtccgtcg tacgtcgagcg tacgtcgatcg tacgtcgcgca tacgtcgcgct tacgtcgcgta tacgtcgctcg tacgtcggtcg tacgtcgtacg tacgtcgtcgc tacgttacgcg tacgttcgacg tactacgcgcg tactatcgacg tactcgcgacg tagaccgacgc tagccggtacg tagccgttcga tagcgcgaatc tagcgcgacgt tagcgcgtcga tagcgtaccga tagctcgacga taggcgaaccg taggcgcgtaa tagtcgacgca tagttacgcgc tatacgcgaac tatacgcgcga tatacgcgtcg tatatgtcgcg tatccgatcgc tatccgcgcgt tatccggatcg tatccgtcgca tatcgccgact tatcgcgcgca tatcgcgcgct tatcgcggtcg tatcgcgtcga tatcgcgtcgt tatcgcgttcg tatcgctcgac tatcggcgatc tatcgtcgacg tatcgtcgccg tatcgttcgcg tatgcgaccgc tatgcgcgacg tatgcgcgcga tatgcgtcgcg tatggcgcgcc tatgtcgacgc tatgtcgcgat tatgtttcgcg tattacgcgcg tattatgcgcg tattcgacgcg tattcgcgacg tattcgcgcga tattcgcgcgg tattcgcggcg tattcgcgtcg tattcggcgcg tcatatcgcgc tccgatagtcg tccgcgactta tccgcggtacg tccgtaacgcg tccgtacgcga tccgtacgcgg tccgtcgaccg tccgtcgatcg tcgaacgatcg tcgaatacgcg tcgaccgcgta tcgaccgtcga tcgaccgtcgg tcgaccgttcg tcgacgaacga tcgacgaccga tcgacgaccgt tcgacgagtcg tcgacgatcgc tcgacgcgata tcgacgcggtt tcgacgcgtag tcgacggtacg tcgacggtatg tcgacgtaccg tcgacgtacga tcgacgtacgg tcgacgttacg tcgactaagcg tcgactagcgg tcgactcgacg tcgagccgacg tcgagcgcgta tcgagcgtacg tcgatacgccg tcgatacgcgg tcgatcgacgt tcgatcggata tcgatcgtacg tcgatcgtcgg tcgatgcgtcg tcgattacgcg tcgattagcgc tcgatttcgcg tcgcaatcggc tcgcacgatcg tcgcacgcgat tcgccgaatcg tcgccgtacgg tcgcgaaacgt tcgcgaaccga tcgcgaacgtt tcgcgaatacg tcgcgacccgt tcgcgaccgta tcgcgacgcaa tcgcgacgcga tcgcgacgtaa tcgcgacgtag tcgcgactcgt tcgcgagcgta tcgcgattcga tcgcgccgata tcgcgcgaacg tcgcgcgaata tcgcgcgacat tcgcgcgacta tcgcgcgagta tcgcgcggcta tcgcgcgtcaa tcgcgcgttga tcgcggatacg tcgcgtaatcg tcgcgtatacg tcgcgtatcgc tcgcgtcaacg tcgcgtccgat tcgcgtcgaac tcgcgtcggta tcgcgtcgtta tcgcgttaacg tcgcgtttcga tcgctcgcgat tcgctcgtcga tcggacgtacg tcggacgtagc tcggcgatacg tcggcgatata tcgggctatcg tcggtacgcgc tcggtacgcta tcggtcgacga tcggtcgtacg tcgtaacgcgc tcgtaatacgg tcgtacgaccg tcgtacgccga tcgtacggccg tcgtatcgccg tcgtatcgtcg tcgtccgatcg tcgtcgaaccg tcgtcgaatcg tcgtcgacgat tcgtcgattcg tcgtcgcgata tcgtcgcgcaa tcgtcgcgtac tcgtcggtacg tcgtcggtcga tcgtctcgcgt tcgttacgcgg tcgttcgaccg tcgttcgagcg tcgttcgcgta tcgttcgtacg tcgtttcgacg tctacgcgtcg tgacgaacgcg tgatcgcgacg tgatcgcgtag tgcgaacgacg tgcgcggcgta tgcgtaacgcg tgcgtcgtacg tgtacgcgacg tgtcgacgcga tgtcgacgcgt tgtcgcgcgat tgtcgcgcgta tgtcgcgtcgt ttaaccgtcga ttaacgcgcga ttaacgtcgcg ttaccgcgacg ttacgacgcgt ttacgcgacgc ttacgcgatcg ttacgcgcgcc ttacgcgcgga ttacgcggaca ttacgcgtacc ttacgtcgcga ttagcgcgtca ttaggtcgcgc ttagtacgcgc ttagtcgctcg ttatcgcgccg ttatcgcgcgc ttattcgcgcg ttcgacgaacg ttcgacgcgac ttcgacgcgag ttcgacgcgta ttcgacggacg ttcgagcgacg ttcgatccgtt ttcgccgatcg ttcgcgagacg ttcgcgattcg ttcgcgcatag ttcgcgcgaat ttcgcgcgata ttcgcgtacgg ttgattacgcg ttgcgcgatag ttgtcgacgcg tttagtcgcgc tttcgacgcaa tttcgacgcgt tttcgcgtacg tttcgtcgcgc tttcgtcgcgg
news + thoughts

Nature Biotechnology cover

Thu 23-04-2026

My cover design on the 7 April 2026 Nature Biotechnology issue shows the dendrogram that represents a cluster of uniquely expressed (or downregulated) genes in human naive stem cells induced from such cells. Within each dendrogram block, the genomic barcode sequence (sampled from Supplementary Table 1) is depicted with a Code 39 barcode. The highlighted barcode is one of those used for cell isolation.

Ishiguro S. et al. A multi-kingdom genetic barcoding system for precise clone isolation (2026) Nature Biotechnology 44:616–629.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
My Nature Biotechnology phylogenetic tree cover (volume 44, issue 4, 7 April 2026). (more)

Browse my gallery of cover designs.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
A catalogue of my journal and magazine cover designs. (more)

Happy 2026 π Day—
Art for the 5%

Fri 13-03-2026

Celebrate π Day (March 14th) and enjoy the art — but only if you're part of the 5%.

Go ahead, see what you can't see.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
2026 π DAY | Art for the 5%. Shown in the style of Ishihara color test plates, the art is visible only to those with colour blindness. (details)

Ishihara's Tests for Colour Deficiency

Sun 08-03-2026

Authentic and accurate images of Ishihara's test plates photographed (and lovingly color-corrected) from the 38-plate Ishihara's Tests for Colour Deficiency.

I also provide the position, size, and color of each circle on each test plate.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
ISHIHARA'S TEST PLATE 6 | This plate is part of the set of transformation plates. If you see 5, you're ok. If you see 2, you're not. (details)
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
ISHIHARA'S TEST PLATE 18 | This plate is part of the set of mysterious hidden plates. If you don't see anything, you're ok. If you see 5, you're not. (details)

Symmetric alternatives to the ordinary least squares regression

Wed 23-07-2025

What immortal hand or eye, could frame thy fearful symmetry? — William Blake, "The Tyger"

This month, we look at symmetric regression, which, unlike simple linear regression, it is reversible — remaining unaltered when the variables are swapped.

Simple linear regression can summarize the linear relationship between two variables `X` and `Y` — for example, when `Y` is considered the response (dependent) and `X` the predictor (independent) variable.

However, there are times when we are not interested (or able) to distinguish between dependent and independent variables — either because they have the same importance or the same role. This is where symmetric regression can help.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Symmetric alternatives to the ordinary least squares regression. Geometry of quantities minimized in OLS and symmetric regression. OLS minimizes `\Sigma e_y^2` in `Y` ~ `X` and `\Sigma e_x^2` `X` ~ `Y`. Pythagorean regression minimizes AB (magenta). Geometric means regression (GMR) minimizes area of ABP (orange). Orthogonal regression (OR) minimizes HP (blue). (read)

Luca Greco, George Luta, Martin Krzywinski & Naomi Altman (2025) Points of significance: Symmetric alternatives to the ordinary least squares regression. Nat. Methods 22:1610–1612.

Beyond Belief Campaign BRCA Art

Wed 11-06-2025

Fuelled by philanthropy, findings into the workings of BRCA1 and BRCA2 genes have led to groundbreaking research and lifesaving innovations to care for families facing cancer.

This set of 100 one-of-a-kind prints explore the structure of these genes. Each artwork is unique — if you put them all together, you get the full sequence of the BRCA1 and BRCA2 proteins.

Propensity score weighting

Mon 17-03-2025

The needs of the many outweigh the needs of the few. —Mr. Spock (Star Trek II)

This month, we explore a related and powerful technique to address bias: propensity score weighting (PSW), which applies weights to each subject instead of matching (or discarding) them.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Propensity score weighting. (read)

Kurz, C.F., Krzywinski, M. & Altman, N. (2025) Points of significance: Propensity score weighting. Nat. Methods 22:638–640.

Martin Krzywinski | contact | Canada's Michael Smith Genome Sciences CentrePHSA
Google whack “vicissitudinal corporealization”
{ 10.9.234.159 }