cancer research + art
Beyond Belief Campaign BRCA Artwork (A one-of-a-kind way of saying thank you) -- science + art + data visualization / Martin Krzywinski / Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
WHAT ARE THESE SHAPES? | A set of 100 unique artworks that explores the genes that guard us from cancer.

Beyond Belief Campaign BRCA Artwork

A one-of-a-kind way of saying thank you

Art is science in love.
— E.F. Weisslitz

In collaboration with the BC Cancer Foundation, I created a set of 100 one-of-a-kind artworks gifted to Board Members and volunteers of the Beyond Belief Campaign in recognition of their efforts and contributions.

The Beyond Belief Campaign was launched in 2022 and has raised nearly $500 million for cancer research.

Our artwork takes a new twist on the BRCA1 and BRCA2 genes. What makes each piece different? Read below to discover.

2 / 100
BEYOND BELIEF CAMPAIGN BRCA ART | This is a one-of-a-kind print from a set of 100 — it is a fragment of a larger whole. If you put all the prints together, you'd get the full sequence of the BRCA1 and BRCA2 proteins.
13 / 100
BEYOND BELIEF CAMPAIGN BRCA ART | Fuelled by philanthropy, findings into the workings of BRCA1 and BRCA2 have led to groundbreaking research and lifesaving innovations to care for families facing cancer.

If you're interested in the technical details, this section is for you.

1 · BRCA1 and BRCA2 sequence

Here are the various primary identifiers of the BRCA genes and associated proteins. Use these if you'd like to look up information about the BRCA genes in resources such as the UCSC Genome Browser or Ensembl.

gene   Ensembl transcript  Ensembl gene        Ensembl protein    NCBI transcript  NCBI gene   NCBI protein
BRCA1  ENST00000357654.9   ENSG00000012048.26  ENSP00000350283.3  NM_007294.4      GeneID:672  NP_009225.1
BRCA2  ENST00000380152.8   ENSG00000139618.19  ENSP00000369497.3  NM_000059.4      GeneID:675  NP_000050.3

The genomic regions and protein sequences of BRCA1 and BRCA2 were extracted using the UCSC Genome Table Browser. Data are based on the hg38 (Dec 2013) assembly.

1.1 · genomic sequence

The relevant UCSC Genome Table Browser form fields are shown below.

Group: Genes and Gene Predictions
Track: NCBI RefSeq
Table: UCSC RefSeq (refGene)

Position: BRCA1 > Lookup > chr17:43,044,295-43,125,364
Position: BRCA2 > Lookup > chr13:32,315,508-32,400,268

Output format: Sequence
Output filename: brca{1,2}-{genomic,protein}.fa
File type returned: Plain text

> Get Output

(x) Genomic

(x) Promoter/upstream 20kb
(x) 5' UTR exon
(x) CDS exon
(x) 3' UTR exon
(x) Intron
(x) Downstream 20kb

(x) one fasta per region
(x) split UTR/CDS

(x) exons in upper case, everything else in lowercase

Download genomic sequence: BRCA1, BRCA2.

1.2 · protein sequence

The relevant UCSC Genome Table Browser form fields are shown below.

Group: Genes and Gene Predictions
Track: NCBI RefSeq
Table: UCSC RefSeq (refGene)

Position: BRCA1 > Lookup > chr17:43,044,295-43,125,364
Position: BRCA2 > Lookup > chr13:32,315,508-32,400,268

Output format: Sequence
Output filename: brca{1,2}-{genomic,protein}.fa
File type returned: Plain text

> Get Output

(x) Protein

Download protein sequence: BRCA1, BRCA2.

2 · Detailed genomic region report

To make working with the genomic and protein sequence easier, I generated a detailed nucleotide-by-nucleotide report of the genomic regions of BRCA1 and BRCA2.

And, if you ever want to play around with visualization or art or just exploration of these genes, then these files contain a lot of what you might need, such as detailed accounting of positions and nucleotides in regions of the gene (padding, intron, exon, codons), as well as protein sequence and mutations.

Download detailed report: BRCA1, BRCA2.

I'll walk you through the format of this file.

Each line represents a nucleotide in the genome. It is annotated with all sorts of numerical and text values that identify its location relative to the genome, gene, gene region and codon.

There are four types of regions: pad (neighbourhood of the genome before and after the gene), exon (non-coding exon), intron and exoncds (coding exon).

I'll use the terms nucleotide and base interchangeably.

2.1 · Padding

The first region is the 20,000 bases of padding before the gene. Here, I show only the first and last three lines in the report for this region.

brca1 17 43125365 - 43145364 1 121070 NM_007294_0 pad 1 2 1 40000 1 20000 a - - - - -
brca1 17 43125366 - 43145363 2 121070 NM_007294_0 pad 1 2 2 40000 2 20000 c - - - - -
brca1 17 43125367 - 43145362 3 121070 NM_007294_0 pad 1 2 3 40000 3 20000 a - - - - -
...
brca1 17 43145362 - 43125367 19998 121070 NM_007294_0 pad 1 2 19998 40000 19998 20000 c - - - - -
brca1 17 43145363 - 43125366 19999 121070 NM_007294_0 pad 1 2 19999 40000 19999 20000 t - - - - -
brca1 17 43145364 - 43125365 20000 121070 NM_007294_0 pad 1 2 20000 40000 20000 20000 c - - - - -

The fields are

 1  brca1        gene name
 2  17           chromosome
 3  43125365     base position on + strand
 4  -            strand
 5  43145364     base position on - strand (use this for brca1)
 6  1            base index
 7  121070       bases (lines) in report
 8  NM_007294_0  NCBI gene identifier + "_REGION_INDEX"
 9  pad          region type
10  1            region type index
11  2            total number of regions of this type
12  1            base index within all regions of this type
13  40000        total bases within all regions of this type
14  1            base index within this region
15  20000        total bases within this region
16  a            nucleotide
17  -            --not used for non-coding regions--
18  -            --not used for non-coding regions--
19  -            --not used for non-coding regions--
20  -            --not used for non-coding regions--
21  -            --not used for non-coding regions--

Two genomic positions are needed because BRCA1 and BRCA2 are transcribed from different strands (BRCA1 -, BRCA2 +).

For BRCA1, the UCSC table browser returns the reverse-complemented sequence but reports coordinates on the + strand. In other words, in the FASTA record below, the first base "a" is the first base of the reverse complement of the sequence in the region 43,125,365–43,145,364. This means that the position of this base in the assembly is 43,145,364 and the bases run backwards within the coordinate interval.

>hg38_refGene_NR_027676_0 range=chr17:43125365-43145364 5'pad=0 3'pad=0 strand=- repeatMasking=none
acagagcgagactctgtctcaaaaaaaaaaaaaaagaaagaaaaaaaatt
cctctgaattgtaaagaagggagacagggaccactgataagacatggtct
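The coordinate bookkeeping above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the report: the function names `reverse_complement` and `assembly_position` are my own, and I assume 1-based, inclusive coordinates as in the FASTA header.

```python
# Complement lookup that preserves case (exons upper case, the rest lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def assembly_position(index: int, start: int, end: int, strand: str) -> int:
    """Map a 1-based index in the returned sequence to an assembly coordinate.

    On the - strand the first returned base sits at `end` and positions
    count down; on the + strand they count up from `start`.
    """
    if strand == "-":
        return end - (index - 1)
    return start + (index - 1)

# First and last base of the BRCA1 upstream padding (strand -)
print(assembly_position(1, 43125365, 43145364, "-"))      # 43145364
print(assembly_position(20000, 43125365, 43145364, "-"))  # 43125365
```

The two printed values match field 5 of the first and last padding lines in the report.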

2.2 · Exons

The next region is the first exon. This is a non-coding exon.

Nucleotides in exons are capitalized for convenience, regardless of whether the exon is coding or non-coding.

brca1 17 43125271 - 43125364 20001 121070 NM_007294_1 exon 1 3 1 1496 1 94 G - - - - -
brca1 17 43125272 - 43125363 20002 121070 NM_007294_1 exon 1 3 2 1496 2 94 C - - - - -
brca1 17 43125273 - 43125362 20003 121070 NM_007294_1 exon 1 3 3 1496 3 94 T - - - - -
...
brca1 17 43125362 - 43125273 20092 121070 NM_007294_1 exon 1 3 92 1496 92 94 A - - - - -
brca1 17 43125363 - 43125272 20093 121070 NM_007294_1 exon 1 3 93 1496 93 94 A - - - - -
brca1 17 43125364 - 43125271 20094 121070 NM_007294_1 exon 1 3 94 1496 94 94 G - - - - -

Fields 9–16 can be used to answer useful questions, such as "which exon are we in?" or "how many bases are in this exon?".

 9  exon  region type
10  1     exon 1/3
11  3
12  1     base 1/1496 across all exons
13  1496
14  1     base 1/94 in this exon
15  94
16  G     nucleotide

2.3 · Introns

We now come to our first intron in the gene. These are regions of the gene that are spliced out and do not contribute to the final protein sequence.

brca1 17 43124116 - 43125270 20095 121070 NM_007294_2 intron 1 22 1 73982 1 1155 g - - - - -
brca1 17 43124117 - 43125269 20096 121070 NM_007294_2 intron 1 22 2 73982 2 1155 t - - - - -
brca1 17 43124118 - 43125268 20097 121070 NM_007294_2 intron 1 22 3 73982 3 1155 a - - - - -
...
brca1 17 43125268 - 43124118 21247 121070 NM_007294_2 intron 1 22 1153 73982 1153 1155 a - - - - -
brca1 17 43125269 - 43124117 21248 121070 NM_007294_2 intron 1 22 1154 73982 1154 1155 a - - - - -
brca1 17 43125270 - 43124116 21249 121070 NM_007294_2 intron 1 22 1155 73982 1155 1155 g - - - - -

2.4 · Coding exons

The coding exons are the business part of the gene — these code for protein. Each three bases form a codon that encodes a particular amino acid. For example, GCC is translated to Alanine and GGC is translated to Glycine.

You can look up the relationship between codon sequence and amino acid in a codon table.

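Here are the first three codons and the last codon of the first coding exon of BRCA1. It's important to realize that codons may cross exon boundaries: codon 27 below starts in this exon (positions 1 and 2) and is completed in the next coding exon. Read about alternative splicing and cassette exons for more details.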

brca1 17 43124017 - 43124096 21269 121070 NM_007294_4 exoncds 1 22 1 5592 1 80 A 1 cmut 1 M Met
brca1 17 43124018 - 43124095 21270 121070 NM_007294_4 exoncds 1 22 2 5592 2 80 T 1 cmut 2 M Met \
      1 1864 1 30 label M/1_c.3G>T_p.Met1Ile_M_I_VCV000055072_rs80357475
brca1 17 43124019 - 43124094 21271 121070 NM_007294_4 exoncds 1 22 3 5592 3 80 G 1 cmut 3 M Met
brca1 17 43124020 - 43124093 21272 121070 NM_007294_4 exoncds 1 22 4 5592 4 80 G 2 - 1 D Asp
brca1 17 43124021 - 43124092 21273 121070 NM_007294_4 exoncds 1 22 5 5592 5 80 A 2 - 2 D Asp \
      2 1864 1 85 label D
brca1 17 43124022 - 43124091 21274 121070 NM_007294_4 exoncds 1 22 6 5592 6 80 T 2 - 3 D Asp
brca1 17 43124023 - 43124090 21275 121070 NM_007294_4 exoncds 1 22 7 5592 7 80 T 3 cmut 1 L Leu
brca1 17 43124024 - 43124089 21276 121070 NM_007294_4 exoncds 1 22 8 5592 8 80 T 3 cmut 2 L Leu \
      3 1864 1 156 label L/2_c.8T>G_p.Leu3Ter_L_*_VCV000055746_rs397509332
...
brca1 17 43124094 - 43124019 21346 121070 NM_007294_4 exoncds 1 22 78 5592 78 80 C 26 - 3 I Ile
brca1 17 43124095 - 43124018 21347 121070 NM_007294_4 exoncds 1 22 79 5592 79 80 T 27 - 1 C Cys
brca1 17 43124096 - 43124017 21348 121070 NM_007294_4 exoncds 1 22 80 5592 80 80 G 27 - 2 C Cys \
      27 1864 2 44 label C

In the report, coding exons have more fields, which are used to describe codons, amino acids and mutations.

 1  brca1
 2  17
 3  43124018
 4  -
 5  43124095
 6  21270
 7  121070
 8  NM_007294_4
 9  exoncds
10  1      1/22 exon
11  22
12  2      2/5592 base across all exons
13  5592
14  2      2/80 base in this exon
15  80
16  T      nucleotide
17  1      codon index
18  cmut   codon has a mutation
19  2      position in codon (1-3)
20  M      1-letter abbreviation for wild-type amino acid encoded by this codon
21  Met    3-letter abbreviation ...
22  1      1/1864 codon (the last is the stop codon)
23  1864
24  1      1/30 Met in protein
25  30
26  label
27  M/1_c.3G>T_p.Met1Ile_M_I_VCV000055072_rs80357475  mutation label
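If you'd like to read the report programmatically, here is a minimal Python sketch that turns one line into a record. The class and attribute names (`Base`, `pos_plus`, `pos_minus` and so on) are my own choices for illustration, not part of the file format. I also assume that the `\` in the listings is only a display line break, so the parser simply drops it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Base:
    gene: str
    chrom: str
    pos_plus: int            # field 3
    strand: str              # field 4
    pos_minus: int           # field 5
    index: int               # field 6, base index in the report
    region: str              # field 9: pad, exon, intron or exoncds
    region_index: int        # field 10
    region_count: int        # field 11
    nucleotide: str          # field 16
    codon_index: Optional[int] = None   # field 17 (coding exons only)
    codon_pos: Optional[int] = None     # field 19 (coding exons only)
    amino_acid: Optional[str] = None    # field 20 (coding exons only)
    label: Optional[str] = None         # field 27 (middle base of a codon)

def parse_line(line: str) -> Base:
    """Parse one whitespace-delimited report line (fields are 1-based in the text)."""
    f = [tok for tok in line.split() if tok != "\\"]  # drop display continuation
    base = Base(f[0], f[1], int(f[2]), f[3], int(f[4]), int(f[5]),
                f[8], int(f[9]), int(f[10]), f[15])
    if base.region == "exoncds":
        base.codon_index = int(f[16])
        base.codon_pos = int(f[18])
        base.amino_acid = f[19]
        if len(f) >= 27:
            base.label = f[26]
    return base
```

With records like these, questions such as "which codon is this base in?" become simple attribute lookups.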

2.5 · Stop codons

Some codons are special and terminate the protein sequence. These are stop codons and they are TAG, TAA and TGA.

For example, the stop codon in BRCA1 is TGA

brca1 17 43045800 - 43045680 99685 121070 NM_007294_46 exoncds 22 22 5590 5592 123 125 T 1864 - 1 * Ter
brca1 17 43045801 - 43045679 99686 121070 NM_007294_46 exoncds 22 22 5591 5592 124 125 G 1864 - 2 * Ter \
      1864 1864 1 1 label *
brca1 17 43045802 - 43045678 99687 121070 NM_007294_46 exoncds 22 22 5592 5592 125 125 A 1864 - 3 * Ter

and in BRCA2 it is TAA.

brca2 13 32398768 + 32398164 103261 124761 NM_000059_54 exoncds 26 26 10255 10257 607 609 T 3419 - 1 * Ter
brca2 13 32398769 + 32398163 103262 124761 NM_000059_54 exoncds 26 26 10256 10257 608 609 A 3419 - 2 * Ter \
      3419 3419 1 1 label *
brca2 13 32398770 + 32398162 103263 124761 NM_000059_54 exoncds 26 26 10257 10257 609 609 A 3419 - 3 * Ter
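To make the codon-to-amino-acid relationship concrete, here is a small Python sketch of the standard genetic code. It is an illustration rather than anything used to make the report: the names `CODON_TABLE` and `translate` are mine, and the 64-letter string lists the amino acids in the conventional T, C, A, G codon order.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for codons TTT, TTC, TTA, TTG, TCT, ... GGG (standard code; * = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

print(CODON_TABLE["GCC"], CODON_TABLE["GGC"])  # A G
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "*"))  # ['TAA', 'TAG', 'TGA']
```

Note that the input must be the spliced coding sequence: because codons can straddle exon boundaries, translating each coding exon separately would give the wrong answer.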

3 · Gene sequence as a path

The artworks show the genes drawn as a path. Instead of a linear representation, which might look something like this

CLOSEUP OF THE BRCA1 AND BRCA2 GENES | The structure of the genes comprises exons and introns. Most of the exons code for protein (very thick lines), but some do not (medium thickness lines). Introns (thin lines) do not code for protein. The BRCA1 gene is transcribed from the – strand. The BRCA2 gene is transcribed from the + strand. The magnification of this view is 4,000× relative to the karyotype view above.

I first compressed the introns (and possibly non-coding exons) so that more focus was placed on the coding exons. Keeping to the linear format, this might look something like this

EXPANDED VIEW OF THE BRCA1 AND BRCA2 GENES | Introns are regions of the gene that are removed (spliced) before translation into protein sequence. In this view, introns are compressed by a factor of 20, relative to other elements. This allocates more room on the path to exons, most of which code for protein.

But, I think you'll agree, that a linear representation isn't particularly interesting to the eye. It doesn't fill a square canvas well. And, since we'll be adding labels for amino acids on the coding exons, this linear format doesn't give us a lot of room to achieve this.

Visual representations of differences in SARS-CoV-2 genomes across variants.
I've drawn other sequences as paths — such as the genome of the SARS-CoV-2 virus.

So, instead, I rendered the sequence as a curved path. This is something I've done before for genomes and viruses.

3.1 · Path algorithm

The magic sauce in the path is its curvature — regions of the genome with more repeat content curve more. This is achieved by calculating the entropy of the sequence in a sliding window centered on the path position.

We start at the origin `\mathbf{p}_0 = (0,0)`. For each base, we advance the path to the next point `\mathbf{p}_{i+1} = \mathbf{p}_i + \delta\mathbf{p}` where `\delta\mathbf{p}` is the path step.

The step itself is made up of two parts $$\delta\mathbf{p} = d \times \mathbf{u_{\delta\theta}}$$

Here `d` is the distance of the step, which is a function of the type of region we're in.

type             d
pad              1
intron           0.2
non-coding exon  0.5
coding exon      1

and `\mathbf{u_{\delta\theta}}` is a unit vector offset from the current direction of the path by an angle `\delta\theta`. In other words, if points (`\mathbf{p}_{i-1}`,`\mathbf{p}_{i}`) are connected by a line at an angle `\theta`, then points (`\mathbf{p}_{i}`,`\mathbf{p}_{i+1}`) are connected by a line at an angle `\theta + \delta\theta`.

The angle change `\delta\theta` is determined by the entropy of the sequence (see below). For an entropy `H` calculated in a window of size `ws` centered on point `\mathbf{p}`, the change in angle is $$\delta\theta = a \times M \times (1-H)^b$$

where `a \in \{-1,1\}` depending on whether the GC content (fraction of bases that are either G or C) at the point is lower than average (`a=-1`) or higher than average (`a=1`), `M` is the maximum value for `\delta\theta` and `b` is the scaling power.
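Here is a minimal Python sketch of the stepping rule, under several assumptions that the text does not pin down: the entropy passed in is already scaled to the range 0 to 1 (so that `1-H` is never negative), `M` is in degrees, the path starts out pointing along the x axis, and a base whose GC content exactly equals the average gets `a = -1`. The names `STEP`, `delta_theta` and `build_path` are made up for this example.

```python
import math

# Step length d for each region type (keys follow the report's region names)
STEP = {"pad": 1.0, "intron": 0.2, "exon": 0.5, "exoncds": 1.0}

def delta_theta(h_norm, gc, gc_mean, M=13.0, b=1.15):
    """Angle change in degrees: a * M * (1 - H)^b, with H assumed to lie in [0, 1]."""
    a = 1 if gc > gc_mean else -1
    return a * M * max(0.0, 1.0 - h_norm) ** b

def build_path(region_types, h_norm, gc, M=13.0, b=1.15):
    """Return the list of points p_0, p_1, ... for one base per input element."""
    gc_mean = sum(gc) / len(gc)
    x, y, theta = 0.0, 0.0, 0.0          # p_0 = (0, 0)
    points = [(x, y)]
    for rtype, h, g in zip(region_types, h_norm, gc):
        theta += math.radians(delta_theta(h, g, gc_mean, M, b))
        d = STEP[rtype]
        x += d * math.cos(theta)
        y += d * math.sin(theta)
        points.append((x, y))
    return points

# Maximum-entropy sequence: no curvature, so the path is a straight line
print(build_path(["pad", "intron", "exoncds"], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5])[-1])
```

Low-entropy (repeat-rich) windows give values of `1-H` close to 1, so those stretches turn by nearly `M` per base and curl up tightly.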

3.2 · Sample paths

Initial parameter values were `ws = 110`, `M = 13` and `b=1.15` and these were adjusted slightly to generate 1,000 candidate paths. I looked through these paths and selected one for each gene that looked interesting and would compose nicely on a square canvas.

Below, I show a sample of 9 candidate paths for each gene.

3.3 · BRCA1 paths

`ws = 107`, `M=12.90`, `b=1.11`
`ws = 107`, `M=12.94`, `b=1.14`
`ws = 107`, `M=13.08`, `b=1.18`
`ws = 108`, `M=12.98`, `b=1.11`
`ws = 108`, `M=13.01`, `b=1.11`
`ws = 108`, `M=13.15`, `b=1.11`
`ws = 108`, `M=13.22`, `b=1.11`
`ws = 109`, `M=13.10`, `b=1.15`
`ws = 111`, `M=13.06`, `b=1.18`

3.4 · BRCA2 paths

`ws = 107`, `M=12.90`, `b=1.11`
`ws = 107`, `M=12.94`, `b=1.14`
`ws = 107`, `M=13.08`, `b=1.18`
`ws = 108`, `M=12.98`, `b=1.11`
`ws = 108`, `M=13.01`, `b=1.11`
`ws = 108`, `M=13.15`, `b=1.11`
`ws = 108`, `M=13.22`, `b=1.11`
`ws = 109`, `M=13.10`, `b=1.15`
`ws = 111`, `M=13.06`, `b=1.18`

3.5 · Final path selection

FINAL BRCA PATHS | These two paths were selected from a large number of candidates. I was seeking paths that looked interesting and fit nicely on a square canvas.

3.6 · Entropy

In this section, I briefly cover the idea of entropy and how I used it to generate the path curvature.

There are broadly two ways to think about entropy: as information (in information theory) or as disorder (in thermodynamics). These ideas are compatible but take on different mathematical formulations.

For our use, we'll use the definition of entropy as information. Consider a device that outputs letters, one at a time.

3.6.1 · Minimum entropy

Suppose the device outputs only `\text{A}`'s — no other letters. In other words, the probability of seeing an `\text{A}` is `p_\text{A} = 1`.

There is no uncertainty (or surprise) in the next letter (they're all `\text{A}`'s).

This system transmits no information. Its entropy is zero.

3.6.2 · Maximum entropy

Now suppose that the device outputs randomly selected letters from the set `\{\text{A},\text{C},\text{G},\text{T}\}`.

Because the probability of seeing each letter is `p = 1/4`, we have maximum uncertainty about what letter will come next.

This system transmits maximum information and hence its entropy is maximum.

3.6.3 · Calculating entropy

Let's revisit our random system but now change the probabilities of each letter: `p_\text{A} = 0.5`, `p_\text{C} = 0.3`, `p_\text{G} = 0.15` and `p_\text{T} = 0.05`. We no longer have maximum uncertainty. The most likely letter to appear next is `\text{A}` — in fact, half of the letters in the string are `\text{A}`'s.

The entropy of this particular string is somewhere between zero and the maximum possible value. To quantify it precisely, we need a way to calculate it. Here's the equation: $$H = - \sum_i p_i \log_2 p_i$$

where the sum is made over all the symbols in the string (we have four) and the individual `p_i` are the probabilities of each symbol. Specifically, $$H = - ( 0.5 \log_2 0.5 + 0.3 \log_2 0.3 + 0.15 \log_2 0.15 + 0.05 \log_2 0.05 ) = 1.65 $$

Because we used a base 2 logarithm, the units are bits (or shannons).

Now that we know the expression, we can go back and calculate the entropy of the previous example where the letter probabilities were all equal. $$H = - 4 \times 0.25 \log_2 0.25 = 2$$

This maximum value of 2 bits puts our 1.65 bits in perspective.
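The three cases above are easy to check numerically. This is a short Python sketch; the function name `entropy` is just for illustration, and terms with zero probability are skipped because they contribute nothing to the sum.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over symbols with p > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([0.5, 0.3, 0.15, 0.05]), 2))  # 1.65 (skewed letter frequencies)
print(entropy([0.25, 0.25, 0.25, 0.25]))          # 2.0  (maximum for four letters)
print(entropy([1.0]) == 0)                        # True (only A's, no information)
```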

3.7 · Calculating entropy of a sequence

At each point along the sequence, I look at a sliding window of `ws` bases, centered on the point. I then count all the dimers (AA, AC, AG, AT, CA, ..., TT) in the window and turn these counts into probabilities. These probabilities are then fed into the entropy formula shown above.

The choice of window size and dimers is arbitrary. There's also not anything particularly biological about it. You can generate paths based on smaller or larger windows and using frequencies of individual nucleotides or longer `k`-mers.
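A sliding-window dimer entropy might be sketched in Python as follows. Two details are my assumptions rather than something stated above: dimers are counted at every offset (so they overlap), and windows near the ends of the sequence are simply truncated. The function names are hypothetical.

```python
import math
from collections import Counter

def dimer_entropy(window: str) -> float:
    """Entropy in bits of the overlapping-dimer frequencies in one window."""
    dimers = [window[i:i + 2] for i in range(len(window) - 1)]
    n = len(dimers)
    counts = Counter(dimers)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def sliding_entropy(seq: str, ws: int = 110) -> list:
    """Dimer entropy in a window of about ws bases centered on each base."""
    seq = seq.upper()
    out = []
    for i in range(len(seq)):
        lo = max(0, i - ws // 2)
        hi = min(len(seq), i - ws // 2 + ws)
        out.append(dimer_entropy(seq[lo:hi]))
    return out

print(dimer_entropy("AAAAAAAA") == 0)     # True: a pure repeat has zero entropy
print(round(dimer_entropy("ACGT"), 3))    # 1.585: three equally likely dimers
```

With 16 possible dimers the largest possible value is `log2(16) = 4` bits, so dividing by 4 is one way to scale the result to the range 0 to 1.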

4 · Amino acid sequence

The wild-type BRCA1 protein has 1,863 amino acids and the BRCA2 protein has 3,418.

These amino acids are partitioned among all the artworks so that any given amino acid appears only on one artwork. This partitioning can be made in many ways, so I add an additional requirement that the amino acid labels on each artwork be as far apart from one another as possible.

The goal is to avoid overlapping labels, which would happen if these partitions were random because the path turns back on itself and adjacent amino acids are very close together on the path.

4.1 · Sequence partitioning

Let's look at BRCA1 to see how this was done. First, I randomly partition the protein sequence into 100 subsets, one per artwork. This is the initial guess of how to divide up the amino acids among the 100 artworks.

Then, for each artwork, I calculate the distance, `d_i`, on the canvas between the two closest labels. I then take the minimum of these values across all artworks, `\text{min}_i(d_i)`. We want this number to be as large as possible.

This problem cannot be solved analytically, but it can be relatively easily handled by a Monte Carlo simulation. This is a type of algorithm in which we guess at a solution, evaluate its quality and then repeat until we find a solution with the highest quality. In our case, the quality is the value of `\text{min}_i(d_i)`.

One kind of Monte Carlo method is simulated annealing, which works particularly well for this kind of problem.

Starting with the initial guess, I adjust it by swapping two amino acids between two partitions: the partition (and amino acid) with the smallest `d_i` and a random partition. I recalculate `\text{min}_i(d_i)` and if it is larger, I accept the new guess. If it is smaller, I accept it with a probability that is a function of (a) the decrease in `\text{min}_i(d_i)` and (b) the current iteration of the simulation. Initially, a large decrease is likely to be accepted, but as the simulation runs this probability decreases.

After repeating this thousands of times, I'm relatively confident that I found a pretty good solution, though there is no guarantee that it is the best one. It's good enough as long as the labels don't overlap!
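To give a flavour of the procedure, here is a compact Python sketch of simulated annealing for this kind of partitioning. It is a toy version under my own assumptions, not the production code: the acceptance rule is the usual Metropolis form `exp(change / temperature)` with a linear cooling schedule, and the names `closest_pair`, `quality` and `anneal` are invented for the example. It needs at least two partitions and at least as many labels as partitions.

```python
import math
import random

def closest_pair(members, pos):
    """Return (distance, position in `members`) of one label in the closest pair."""
    best = (math.inf, 0)
    for i in range(len(members)):
        for j in range(i + 1, len(members)):
            d = math.dist(pos[members[i]], pos[members[j]])
            if d < best[0]:
                best = (d, i)
    return best

def quality(parts, pos):
    """min_i(d_i): the smallest closest-pair distance over all artworks."""
    return min(closest_pair(p, pos)[0] for p in parts)

def anneal(pos, n_parts, n_iter=2000, t0=1.0, seed=0):
    rng = random.Random(seed)
    order = list(range(len(pos)))
    rng.shuffle(order)
    parts = [order[k::n_parts] for k in range(n_parts)]   # random initial guess
    score = quality(parts, pos)
    best_score, best_parts = score, [p[:] for p in parts]
    for it in range(n_iter):
        temp = t0 * (1 - it / n_iter) + 1e-9               # assumed linear cooling
        pairs = [closest_pair(p, pos) for p in parts]
        worst = min(range(n_parts), key=lambda k: pairs[k][0])
        other = rng.choice([k for k in range(n_parts) if k != worst])
        i, j = pairs[worst][1], rng.randrange(len(parts[other]))
        parts[worst][i], parts[other][j] = parts[other][j], parts[worst][i]
        new = quality(parts, pos)
        if new >= score or rng.random() < math.exp((new - score) / temp):
            score = new                                    # accept the swap
            if score > best_score:
                best_score, best_parts = score, [p[:] for p in parts]
        else:                                              # reject: undo the swap
            parts[worst][i], parts[other][j] = parts[other][j], parts[worst][i]
    return best_score, best_parts
```

For example, `anneal([(float(i), 0.0) for i in range(12)], 3)` splits twelve equally spaced labels into three groups of four and returns the best minimum distance found along with the groups.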

For example, here are the first 50 amino acids in BRCA1

>NP_009225.1
MDLSALRVEE VQNVINAMQK ILECPICLEL IKEPVSTKCD HIFCKFCMLK ...

and here's how they're assigned to the artwork instances. For example, instance 61/100 has the first amino acid (M), instance 65/100 has the second (D), and so on.

brca1 1 61 M
brca1 2 65 D
brca1 3 53 L
brca1 4 64 S
brca1 5 38 A
brca1 6 80 L
brca1 7 77 R
brca1 8 50 V
brca1 9 41 E
brca1 10 95 E
...

There's one final detail that I neglected to mention. Each artwork also shows two mutations (one per gene), and we don't want the amino acid labels to overlap with these. So, in the partitioning optimization I incorporate the list of mutations assigned to each artwork in the minimum distance calculation.

Download mapping between amino acid and artwork instance: BRCA1, BRCA2.

5 · Mutations

5.1 · ClinVar report

I downloaded a list of known mutations in BRCA1 and BRCA2 from ClinVar.

Mutations were filtered to be somatic + pathogenic + single nucleotide variant + reviewed by expert panel. ClinVar returned 501 and 623 mutations in BRCA1 and BRCA2, respectively.

Download ClinVar mutation report: BRCA1, BRCA2.

5.2 · Filtered report

I further refined the mutations to cross-reference to dbSNP and to streamline the output format.

Download filtered mutation report: BRCA1, BRCA2.

brca1 1 0.000001 3 G T 1 M I Met Ile rs80357475 # NM_007294.4(BRCA1):c.3G>T (p.Met1Ile)
brca1 2 0.000000 8 T G 3 L * Leu Ter rs397509332 # NM_007294.4(BRCA1):c.8T>G (p.Leu3Ter)
brca1 3 0.000200 34 C T 12 Q * Gln Ter rs80357134 # NM_007294.4(BRCA1):c.34C>T (p.Gln12Ter)
brca1 4 0.000009 53 T C 18 M T Met Thr rs80356929 # NM_007294.4(BRCA1):c.53T>C (p.Met18Thr)
brca1 5 0.000000 55 C T 19 Q * Gln Ter rs397509299 # NM_007294.4(BRCA1):c.55C>T (p.Gln19Ter)
...
brca1 465 0.000000 5536 C T 1846 Q * Gln Ter rs80356873 # NM_007294.4(BRCA1):c.5536C>T (p.Gln1846Ter)
brca1 466 0.006438 5541 C A 1847 C * Cys Ter rs397509295 # NM_007294.4(BRCA1):c.5541C>A (p.Cys1847Ter)
brca1 467 0.000000 5542 C T 1848 Q * Gln Ter rs886040303 # NM_007294.4(BRCA1):c.5542C>T (p.Gln1848Ter)
brca1 468 0.000000 5559 C G 1853 Y * Tyr Ter rs80357336 # NM_007294.4(BRCA1):c.5559C>G (p.Tyr1853Ter)
brca1 469 0.000000 5559 C A 1853 Y * Tyr Ter rs80357336 # NM_007294.4(BRCA1):c.5559C>A (p.Tyr1853Ter)

The fields are

 1  brca1       gene
 2  1           mutation index
 3  0.000001    dbSNP allele frequency
 4  3           base position in gene
 5  G           wild-type base
 6  T           mutated base
 7  1           amino acid position in protein
 8  M           wild-type amino acid
 9  I           mutated amino acid
10  Met         3-letter abbreviation of wild-type amino acid
11  Ile         3-letter abbreviation of mutated amino acid
12  rs80357475  dbSNP accession
13  #
14  NM_007294.4(BRCA1):c.3G>T (p.Met1Ile)  ClinVar mutation name
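A line of the filtered report can be read with a few lines of Python. This is a sketch with names of my own choosing (`Mutation`, `parse_mutation`, `is_terminating`). It assumes that everything after the `#` is the ClinVar name and that a mutated amino acid of `*` marks a mutation to a stop codon.

```python
from typing import NamedTuple

class Mutation(NamedTuple):
    gene: str
    index: int
    allele_freq: float
    base_pos: int
    ref_base: str
    alt_base: str
    aa_pos: int
    ref_aa: str
    alt_aa: str
    rsid: str
    name: str

def parse_mutation(line: str) -> Mutation:
    """Split the whitespace-delimited fields from the ClinVar name after '#'."""
    data, _, name = line.partition("#")
    f = data.split()
    return Mutation(f[0], int(f[1]), float(f[2]), int(f[3]), f[4], f[5],
                    int(f[6]), f[7], f[8], f[11], name.strip())

def is_terminating(m: Mutation) -> bool:
    """True when the mutated codon is a stop codon (shown as * in the report)."""
    return m.alt_aa == "*"

m = parse_mutation("brca1 2 0.000000 8 T G 3 L * Leu Ter rs397509332 "
                   "# NM_007294.4(BRCA1):c.8T>G (p.Leu3Ter)")
print(m.aa_pos, m.ref_aa, m.alt_aa, is_terminating(m))  # 3 L * True
```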

5.3 · Mutation selection

Each artwork has one mutation on each gene. However, because there are more than 100 known mutations, I needed a way to select 100 mutations per gene.

First, I picked all the missense mutations. These are mutations in which one amino acid is mutated to another. In the filtered ClinVar reports, there are 38 and 17 mutations of this kind on BRCA1 and BRCA2.

Because mutations are depicted on the artwork by a magenta circle with the mutated amino acid abbreviation, missense mutations are interesting because you see an actual letter in the mutation circle. This is in contrast to terminating (nonsense) mutations, which show a (much less exciting) "*" in the circle.

To top up these 38 BRCA1 and 17 BRCA2 mutations to 100, I selected from the terminating mutations. The selection was made so that the positions of the mutations were spread out as much as possible along the gene path (see below). This was done with simulated annealing in the same way as the amino acid sequence partitioning (see above).

Download list of mutations on each artwork: BRCA1 and BRCA2 mutations.

news + thoughts

Symmetric alternatives to the ordinary least squares regression

Wed 23-07-2025

What immortal hand or eye, could frame thy fearful symmetry? — William Blake, "The Tyger"

This month, we look at symmetric regression, which, unlike simple linear regression, is reversible: it remains unaltered when the variables are swapped.

Simple linear regression can summarize the linear relationship between two variables `X` and `Y` — for example, when `Y` is considered the response (dependent) and `X` the predictor (independent) variable.

However, there are times when we do not want (or are not able) to distinguish between dependent and independent variables, either because they have the same importance or the same role. This is where symmetric regression can help.

Nature Methods Points of Significance column: Symmetric alternatives to the ordinary least squares regression. Geometry of quantities minimized in OLS and symmetric regression. OLS minimizes `\Sigma e_y^2` in `Y` ~ `X` and `\Sigma e_x^2` in `X` ~ `Y`. Pythagorean regression minimizes AB (magenta). Geometric means regression (GMR) minimizes area of ABP (orange). Orthogonal regression (OR) minimizes HP (blue). (read)

Luca Greco, George Luta, Martin Krzywinski & Naomi Altman (2025) Points of significance: Symmetric alternatives to the ordinary least squares regression. Nat. Methods 22:1610–1612.

Beyond Belief Campaign BRCA Art

Wed 11-06-2025

Fuelled by philanthropy, findings into the workings of BRCA1 and BRCA2 genes have led to groundbreaking research and lifesaving innovations to care for families facing cancer.

This set of 100 one-of-a-kind prints explores the structure of these genes. Each artwork is unique: if you put them all together, you get the full sequence of the BRCA1 and BRCA2 proteins.

Propensity score weighting

Mon 17-03-2025

The needs of the many outweigh the needs of the few. —Mr. Spock (Star Trek II)

This month, we explore a related and powerful technique to address bias: propensity score weighting (PSW), which applies weights to each subject instead of matching (or discarding) them.

Nature Methods Points of Significance column: Propensity score weighting. (read)

Kurz, C.F., Krzywinski, M. & Altman, N. (2025) Points of significance: Propensity score weighting. Nat. Methods 22:638–640.

Happy 2025 π Day—
TTCAGT: a sequence of digits

Thu 13-03-2025

Celebrate π Day (March 14th) and sequence digits like it's 1999. Let's call some peaks.

2025 π DAY | TTCAGT: a sequence of digits. The digits of π are encoded into DNA sequence and visualized with Sanger sequencing. (details)

Crafting 10 Years of Statistics Explanations: Points of Significance

Sun 09-03-2025

I don’t have good luck in the match points. —Rafael Nadal, Spanish tennis player

Points of Significance is an ongoing series of short articles about statistics in Nature Methods that started in 2013. Its aim is to provide clear explanations of essential concepts in statistics for a nonspecialist audience. The articles favor heuristic explanations and make extensive use of simulated examples and graphical explanations, while maintaining mathematical rigor.

Topics range from basic, but often misunderstood, such as uncertainty and P-values, to relatively advanced, but often neglected, such as the error-in-variables problem and the curse of dimensionality. More recent articles have focused on timely topics such as modeling of epidemics, machine learning, and neural networks.

In this article, we discuss the evolution of topics and details behind some of the story arcs, our approach to crafting statistical explanations and narratives, and our use of figures and numerical simulations as props for building understanding.

Crafting 10 Years of Statistics Explanations: Points of Significance. (read)

Altman, N. & Krzywinski, M. (2025) Crafting 10 Years of Statistics Explanations: Points of Significance. Annual Review of Statistics and Its Application 12:69–87.

Propensity score matching

Mon 16-09-2024

I don’t have good luck in the match points. —Rafael Nadal, Spanish tennis player

In many experimental designs, we need to keep in mind the possibility of confounding variables, which may give rise to bias in the estimate of the treatment effect.

Nature Methods Points of Significance column: Propensity score matching. (read)

If the control and experimental groups aren't matched (or, roughly, similar enough), this bias can arise.

Sometimes this can be dealt with by randomizing, which on average can balance this effect out. When randomization is not possible, propensity score matching is an excellent strategy to match control and experimental groups.

Kurz, C.F., Krzywinski, M. & Altman, N. (2024) Points of significance: Propensity score matching. Nat. Methods 21:1770–1772.

Martin Krzywinski | Canada's Michael Smith Genome Sciences Centre | PHSA