
news + thoughts

Propensity score matching

Fri 25-10-2024

I don’t have good luck in the match points. —Rafael Nadal, Spanish tennis player

In many experimental designs, we need to keep in mind the possibility of confounding variables, which may give rise to bias in the estimate of the treatment effect.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Propensity score matching. (read)

If the control and experimental groups aren't matched (or, roughly, similar enough), this bias can arise.

Sometimes this can be dealt with by randomizing, which on average can balance this effect out. When randomization is not possible, propensity score matching is an excellent strategy to match control and experimental groups.
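As a rough illustration of the matching step, the sketch below pairs each treated subject with the closest unused control by propensity score. It assumes the scores have already been estimated (commonly by a logistic regression of treatment on the covariates). The function name, the greedy 1:1 strategy and the caliper value are illustrative assumptions, not the procedure from the column.

```python
def match_by_propensity(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on precomputed propensity scores.

    treated, controls: dicts mapping subject id -> propensity score in [0, 1].
    caliper: hypothetical maximum score difference allowed for a match.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls not yet used in a match
    pairs = []
    # Visit treated subjects in a fixed order so the result is reproducible.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # matching without replacement
    return pairs


# Made-up scores for three treated subjects and four controls.
treated = {"t1": 0.62, "t2": 0.35, "t3": 0.90}
controls = {"c1": 0.33, "c2": 0.60, "c3": 0.70, "c4": 0.10}
pairs = match_by_propensity(treated, controls)
print(pairs)
```

Subjects with no control inside the caliper (here `t3`) are left unmatched, which trades sample size for better balance between the groups.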

Kurz, C.F., Krzywinski, M. & Altman, N. (2024) Points of significance: Propensity score matching. Nat. Methods 21:1770–1772.

Nasa to send our human genome discs to the Moon

Sat 23-03-2024

We'd like to say a ‘cosmic hello’: mathematics, culture, palaeontology, art and science, and ... human genomes.

SANCTUARY PROJECT | A cosmic hello of art, science, and genomes. (details)
SANCTUARY PROJECT | Benoit Faiveley, founder of the Sanctuary project gives the Sanctuary disc a visual check at CEA LeQ Grenoble (image: Vincent Thomas). (details)
SANCTUARY PROJECT | Sanctuary team examines the Life disc at INRIA Paris Saclay (image: Benedict Redgrove) (details)

Comparing classifier performance with baselines

Fri 25-10-2024

All animals are equal, but some animals are more equal than others. —George Orwell

This month, we will illustrate the importance of establishing a baseline performance level.

Baselines are typically generated independently for each dataset using very simple models. Their role is to set the minimum level of acceptable performance and help with comparing relative improvements in performance of other models.

Nature Methods Points of Significance column: Comparing classifier performance with baselines. (read)

Unfortunately, baselines are often overlooked and, in the presence of a class imbalance, must be established with care.
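One very simple baseline is a model that always predicts the most common class. The sketch below (an assumed choice of baseline, with made-up labels) shows why accuracy alone flatters such a model when classes are imbalanced, while balanced accuracy does not.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    majority_label, count = Counter(labels).most_common(1)[0]
    return majority_label, count / len(labels)


def balanced_accuracy_of_constant(labels):
    """Balanced accuracy (mean per-class recall) of any constant prediction.

    A constant classifier has recall 1 on one class and 0 on every other class.
    """
    return 1 / len(set(labels))


# Hypothetical data set: 95 healthy subjects and 5 with the disease.
labels = ["healthy"] * 95 + ["disease"] * 5
baseline_label, baseline_accuracy = majority_baseline(labels)
baseline_balanced = balanced_accuracy_of_constant(labels)
print(baseline_label, baseline_accuracy, baseline_balanced)
```

Any candidate classifier for this data set would need to beat 95% accuracy, or 50% balanced accuracy, before its performance means anything.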

Megahed, F.M., Chen, Y-J., Jones-Farmer, A., Rigdon, S.E., Krzywinski, M. & Altman, N. (2024) Points of significance: Comparing classifier performance with baselines. Nat. Methods 21:546–548.


Happy 2024 π Day—
sunflowers ho!

Sat 09-03-2024

Celebrate π Day (March 14th) and dig into the digit garden. Let's grow something.

2024 π DAY | A garden of 1,000 digits of π. (details)

How Analyzing Cosmic Nothing Might Explain Everything

Thu 18-01-2024

Huge empty areas of the universe called voids could help solve the greatest mysteries in the cosmos.

My graphic accompanying How Analyzing Cosmic Nothing Might Explain Everything in the January 2024 issue of Scientific American depicts the entire Universe in a two-page spread — full of nothing.

How Analyzing Cosmic Nothing Might Explain Everything. Text by Michael Lemonick (editor), art direction by Jen Christiansen (Senior Graphics Editor), source: SDSS

The graphic uses the latest data from SDSS Data Release 12 and is an update to my Superclusters and Voids poster.

Michael Lemonick (editor) explains on the graphic:

“Regions of relatively empty space called cosmic voids are everywhere in the universe, and scientists believe studying their size, shape and spread across the cosmos could help them understand dark matter, dark energy and other big mysteries.

To use voids in this way, astronomers must map these regions in detail—a project that is just beginning.

Shown here are voids discovered by the Sloan Digital Sky Survey (SDSS), along with a selection of 16 previously named voids. Scientists expect voids to be evenly distributed throughout space—the lack of voids in some regions on the globe simply reflects SDSS’s sky coverage.”

voids

Sofia Contarini, Alice Pisani, Nico Hamaus, Federico Marulli, Lauro Moscardini & Marco Baldi (2023) Cosmological Constraints from the BOSS DR12 Void Size Function. Astrophysical Journal 953:46.

Nico Hamaus, Alice Pisani, Jin-Ah Choi, Guilhem Lavaux, Benjamin D. Wandelt & Jochen Weller (2020) Journal of Cosmology and Astroparticle Physics 2020:023.

Sloan Digital Sky Survey Data Release 12

constellation figures

Alan MacRobert (Sky & Telescope), Paulina Rowicka/Martin Krzywinski (revisions & Microscopium)

stars

Hoffleit & Warren Jr. (1991) The Bright Star Catalog, 5th Revised Edition (Preliminary Version).

cosmology

H0 = 67.4 km/(Mpc·s), Ωm = 0.315, Ωv = 0.685. Planck collaboration Planck 2018 results. VI. Cosmological parameters (2018).

Error in predictor variables

Fri 25-10-2024

It is the mark of an educated mind to rest satisfied with the degree of precision that the nature of the subject admits and not to seek exactness where only an approximation is possible. —Aristotle

In regression, the predictors are (typically) assumed to have known values that are measured without error.

Practically, however, predictors are often measured with error. This has a profound (but predictable) effect on the estimates of relationships among variables – the so-called “error in variables” problem.

Nature Methods Points of Significance column: Error in predictor variables. (read)

Error in measuring the predictors is often ignored. In this column, we discuss when ignoring this error is harmless and when it can lead to large bias that can lead us to miss important effects.
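A small simulation can make the bias concrete. The sketch below assumes the classical measurement error model (independent, zero-mean error added to the true predictor), under which the fitted slope shrinks by the factor var(x) / (var(x) + var(error)). All parameter values are made up for illustration.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_slopes(true_slope=2.0, sd_x=1.0, sd_error=1.0, n=20000, seed=1):
    """Fit the same response against the true and the error-prone predictor."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    y = [true_slope * x + rng.gauss(0.0, 0.5) for x in x_true]
    # The observed predictor is the true value plus measurement error.
    x_observed = [x + rng.gauss(0.0, sd_error) for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_observed, y)


slope_clean, slope_noisy = simulate_slopes()
# Expected attenuated slope: 2.0 * var(x) / (var(x) + var(error)) = 1.0
slope_expected = 2.0 * 1.0 / (1.0 + 1.0)
print(slope_clean, slope_noisy, slope_expected)
```

With equal variance in the predictor and in its measurement error, the estimated slope is roughly halved, which is how a real effect can be missed.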

Altman, N. & Krzywinski, M. (2024) Points of significance: Error in predictor variables. Nat. Methods 21:4–6.

Background reading

Altman, N. & Krzywinski, M. (2015) Points of significance: Simple linear regression. Nat. Methods 12:999–1000.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of significance: Logistic regression. Nat. Methods 13:541–542.

Das, K., Krzywinski, M. & Altman, N. (2019) Points of significance: Quantile regression. Nat. Methods 16:451–452.


Convolutional neural networks

Tue 02-01-2024

Nature uses only the longest threads to weave her patterns, so that each small piece of her fabric reveals the organization of the entire tapestry. – Richard Feynman

Following up on our Neural network primer column, this month we explore a different kind of network architecture: a convolutional network.

The convolutional network replaces the hidden layer of a fully connected network (FCN) with one or more filters (a kind of neuron that looks at the input within a narrow window).

Nature Methods Points of Significance column: Convolutional neural networks. (read)

Even though convolutional networks have far fewer neurons than an FCN, they can perform substantially better for certain kinds of problems, such as sequence motif detection.
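To make the idea of a filter concrete, here is a minimal sketch of a single one-dimensional filter sliding along a one-hot encoded DNA sequence. The filter weights are set by hand to match the motif TATA. In a trained network these weights would be learned, and the names below are illustrative only.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element indicator vector (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def convolve(seq, filt):
    """Slide the filter along the sequence and return one score per window."""
    x = one_hot(seq)
    width = len(filt)
    scores = []
    for start in range(len(x) - width + 1):
        window = x[start:start + width]
        # Element-wise product of filter and window, summed to a single score.
        score = sum(
            w * v
            for filt_row, x_row in zip(filt, window)
            for w, v in zip(filt_row, x_row)
        )
        scores.append(score)
    return scores


motif_filter = one_hot("TATA")  # hand-set weights: a perfect-match detector
scores = convolve("GGTATACC", motif_filter)
best_position = scores.index(max(scores))
print(scores, best_position)
```

The same four-position filter is reused at every window, which is why a convolutional layer needs far fewer weights than a fully connected one that sees the whole sequence at once.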

Derry, A., Krzywinski, M. & Altman, N. (2023) Points of significance: Convolutional neural networks. Nature Methods 20:1269–1270.

Background reading

Derry, A., Krzywinski, M. & Altman, N. (2023) Points of significance: Neural network primer. Nature Methods 20:165–167.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of significance: Logistic regression. Nature Methods 13:541–542.

Neural network primer

Thu 17-08-2023

Nature is often hidden, sometimes overcome, seldom extinguished. —Francis Bacon

In the first of a series of columns about neural networks, we introduce them with an intuitive approach that draws from our discussion about logistic regression.

Nature Methods Points of Significance column: Neural network primer. (read)

Simple neural networks are just a chain of linear regressions. And, although neural network models can get very complicated, their essence can be understood in terms of relatively basic principles.

We show how neural network components (neurons) can be arranged in the network and discuss the ideas of hidden layers. Using a simple data set we show how even a 3-neuron neural network can already model relatively complicated data patterns.
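A minimal sketch of what a 3-neuron network looks like in code, assuming each neuron is a weighted sum passed through a sigmoid (as in logistic regression). The weights are chosen by hand, not trained, so that two hidden neurons and one output neuron reproduce the XOR pattern, which no single neuron can model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def tiny_network(x1, x2):
    """Two hidden neurons feeding one output neuron (hand-set weights)."""
    h1 = neuron([x1, x2], [20.0, 20.0], -10.0)  # on if either input is on
    h2 = neuron([x1, x2], [20.0, 20.0], -30.0)  # on only if both inputs are on
    # Output is on when h1 is on and h2 is off.
    return neuron([h1, h2], [20.0, -20.0], -10.0)


predictions = {
    (x1, x2): round(tiny_network(x1, x2))
    for x1 in (0, 1)
    for x2 in (0, 1)
}
print(predictions)
```

The hidden layer turns the inputs into two intermediate features, and the output neuron combines them. Chaining simple units this way is what lets a small network capture a pattern that is not linearly separable.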

Derry, A., Krzywinski, M. & Altman, N. (2023) Points of significance: Neural network primer. Nature Methods 20:165–167.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of significance: Logistic regression. Nature Methods 13:541–542.

Cell Genomics cover

Mon 16-01-2023

Our cover on the 11 January 2023 Cell Genomics issue imagines the process of determining the parent-of-origin using differential methylation of alleles at imprinted regions (iDMRs) as a circuit.

Designed in collaboration with Carlos Urzua.

Our Cell Genomics cover depicts parent-of-origin assignment as a circuit (volume 3, issue 1, 11 January 2023). (more)

Akbari, V. et al. Parent-of-origin detection and chromosome-scale haplotyping using long-read DNA methylation sequencing and Strand-seq (2023) Cell Genomics 3(1).

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Science Advances cover

Thu 05-01-2023

My cover design on the 6 January 2023 Science Advances issue depicts DNA sequencing read translation in high-dimensional space. The image shows 672 bases of sequencing barcodes generated by three different single-cell RNA sequencing platforms, encoded as oriented triangles on the faces of three 7-dimensional cubes.

More details about the design.

My Science Advances cover that encodes sequence onto hypercubes (volume 9, issue 1, 6 January 2023). (more)

Kijima, Y. et al. A universal sequencing read interpreter (2023) Science Advances 9.

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Regression modeling of time-to-event data with censoring

Thu 17-08-2023

If you sit on the sofa for your entire life, you’re running a higher risk of getting heart disease and cancer. —Alex Honnold, American rock climber

In a follow-up to our Survival analysis — time-to-event data and censoring article, we look at how regression can be used to account for additional risk factors in survival analysis.

We explore accelerated failure time regression (AFTR) and the Cox Proportional Hazards model (Cox PH).

Nature Methods Points of Significance column: Regression modeling of time-to-event data with censoring. (read)

Dey, T., Lipsitz, S.R., Cooper, Z., Trinh, Q., Krzywinski, M. & Altman, N. (2022) Points of significance: Regression modeling of time-to-event data with censoring. Nature Methods 19:1513–1515.

Music video for Max Cooper's Ascent

Tue 25-10-2022

My 5-dimensional animation sets the visual stage for Max Cooper's Ascent from the album Unspoken Words. I have previously collaborated with Max on telling a story about infinity for his Yearning for the Infinite album.

I provide a walkthrough of the video, describe the animation system I created to generate the frames, and show you all the keyframes.

Frame 4897 from the music video of Max Cooper's Ascent.

The video recently premiered on YouTube.

Renders of the full scene are available as NFTs.


Gene Cultures exhibit — art at the MIT Museum

Tue 25-10-2022

I am more than my genome and my genome is more than me.

The MIT Museum reopened at its new location on 2nd October 2022. The new Gene Cultures exhibit featured my visualization of the human genome, which walks through the size and organization of the genome and some of the important structures.

My art at the MIT Museum Gene Cultures exhibit shows the scale and structure of the human genome. Pay no attention to the pink chicken.

Annals of Oncology cover

Wed 14-09-2022

My cover design on the 1 September 2022 Annals of Oncology issue shows 570 individual cases of difficult-to-treat cancers. Each case shows the number and type of actionable genomic alterations that were detected and the length of therapies that resulted from the analysis.

An organic arrangement of 570 individual cases of difficult-to-treat cancers showing genomic changes and therapies. Appears on Annals of Oncology cover (volume 33, issue 9, 1 September 2022).

Pleasance E et al. Whole-genome and transcriptome analysis enhances precision cancer treatment options (2022) Annals of Oncology 33:939–949.

My Annals of Oncology 570 cancer cohort cover (volume 33, issue 9, 1 September 2022). (more)

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Survival analysis—time-to-event data and censoring

Fri 05-08-2022

Love's the only engine of survival. —L. Cohen

We begin a series on survival analysis in the context of its two key complications: skew (which calls for the use of probability distributions, such as the Weibull, that can accommodate skew) and censoring (required because we almost always fail to observe the event in question for all subjects).

We discuss right, left and interval censoring and how mishandling censoring can lead to bias and loss of sensitivity in tests that probe for differences in survival times.
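To see how mishandling right censoring biases a result, the sketch below compares a Kaplan-Meier survival estimate with a naive one that treats every censored subject as if the event had occurred. The Kaplan-Meier estimator is not named in the text above. It is used here as a standard way to handle right censoring, and the data are made up.

```python
def kaplan_meier(observations):
    """Kaplan-Meier survival estimate for right-censored data.

    observations: list of (time, event) pairs, where event is True if the
    event was observed at that time and False if the subject was censored.
    Returns a list of (time, survival) pairs, one per observed event time.
    """
    at_risk = len(observations)
    survival = 1.0
    curve = []
    for t in sorted({time for time, _ in observations}):
        events = sum(1 for time, event in observations if time == t and event)
        leaving = sum(1 for time, _ in observations if time == t)
        if events:
            # Survival drops only at times where an event was observed.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without lowering survival.
        at_risk -= leaving
    return curve


data = [(2, True), (3, False), (4, True), (5, True), (6, False)]
km_curve = dict(kaplan_meier(data))
# Mishandled censoring: pretend every censored subject had the event.
naive_curve = dict(kaplan_meier([(t, True) for t, _ in data]))
print(km_curve[5], naive_curve[5])
```

At time 5 the naive estimate is lower than the Kaplan-Meier estimate, because counting censored subjects as events overstates how quickly subjects fail.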

Nature Methods Points of Significance column: Survival analysis—time-to-event data and censoring. (read)

Dey, T., Lipsitz, S.R., Cooper, Z., Trinh, Q., Krzywinski, M. & Altman, N. (2022) Points of significance: Survival analysis—time-to-event data and censoring. Nature Methods 19:906–908.


3,117,275,501 Bases, 0 Gaps

Sun 21-08-2022

See How Scientists Put Together the Complete Human Genome.

My graphic in Scientific American's Graphic Science section in the August 2022 issue shows the full history of the human genome assembly — from its humble shotgun beginnings to the gapless telomere-to-telomere assembly.

Read about the process and methods behind the creation of the graphic.

3,117,275,501 Bases, 0 Gaps. Text by Clara Moskowitz (Senior Editor), art direction by Jen Christiansen (Senior Graphics Editor), source: UCSC Genome Browser.

See all my Scientific American Graphic Science visualizations.

Anatomy of SARS-CoV-2

Tue 31-05-2022

My poster showing the genome structure and position of mutations on all SARS-CoV-2 variants appears in the March/April 2022 issue of American Scientist.

Deadly Genomes: Genome Structure and Size of Harmful Bacteria and Viruses (zoom)

An accompanying piece breaks down the anatomy of each genome — by gene and ORF, oriented to emphasize relative differences that are caused by mutations.

Deadly Genomes: Genome Structure and Size of Harmful Bacteria and Viruses (zoom)

Cancer Cell cover

Wed 04-01-2023

My cover design on the 11 April 2022 Cancer Cell issue depicts cellular heterogeneity as a kaleidoscope generated from immunofluorescence staining of the glial and neuronal markers MBP and NeuN (respectively) in a GBM patient-derived explant.

LeBlanc VG et al. Single-cell landscapes of primary glioblastomas and matched explants and cell lines show variable retention of inter- and intratumor heterogeneity (2022) Cancer Cell 40:379–392.E9.

My Cancer Cell kaleidoscope cover (volume 40, issue 4, 11 April 2022). (more)

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Nature Biotechnology cover

Sat 23-04-2022

My cover design on the 4 April 2022 Nature Biotechnology issue is an impression of a phylogenetic tree of over 200 million sequences.

Konno N et al. Deep distributed computing to reconstruct extremely large lineage trees (2022) Nature Biotechnology 40:566–575.

My Nature Biotechnology phylogenetic tree cover (volume 40, issue 4, 4 April 2022). (more)

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Nature cover — Gene Genie

Sat 23-04-2022

My cover design on the 17 March 2022 Nature issue depicts the evolutionary properties of sequences at the extremes of the evolvability spectrum.

Vaishnav ED et al. The evolution, evolvability and engineering of gene regulatory DNA (2022) Nature 603:455–463.

My Nature squiggles cover (volume 603, issue 7901, 17 March 2022). (more)

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Happy 2022 π Day—
three one four: a number of notes

Mon 14-03-2022

Celebrate π Day (March 14th) and finally hear what you've been missing.

“three one four: a number of notes” is a musical exploration of how we think about mathematics and how we feel about mathematics. It tells stories from the very beginning (314…) to the very (known) end of π (...264) as well as math (Wallis Product) and math jokes (Feynman Point), repetition (nn) and zeroes (null).

Listen to π in the style of 20th century classical music. (details)

The album is scored for solo piano in the style of 20th century classical music – each piece has a distinct personality, drawn from styles of Boulez, Feldman, Glass, Ligeti, Monk, and Satie.

Each piece is accompanied by a piku (or πku), a poem whose syllable count is determined by a specific sequence of digits from π.

Check out art from previous years: 2013 π Day and 2014 π Day, 2015 π Day, 2016 π Day, 2017 π Day, 2018 π Day, 2019 π Day, 2020 π Day and 2021 π Day.


PNAS Cover — Earth BioGenome Project

Fri 28-01-2022

My design appears on the 25 January 2022 PNAS issue.

My PNAS cover design captures the vision of the Earth BioGenome Project — to sequence everything. (more)

The cover shows a view of Earth that captures the vision of the Earth BioGenome Project — understanding and conserving genetic diversity on a global scale. Continents from the Authagraph projection, which preserves areas and shapes, are represented as a double helix of 32,111 bases. Short sequences of 806 unique species, sequenced as part of EBP-affiliated projects, are mapped onto the double helix of the continent (or ocean) where the species is commonly found. The length of the sequence is the same for each species on a continent (or ocean) and the sequences are separated by short gaps. Individual bases of the sequence are colored by dots. Species appear along the path in alphabetical order (by Latin name) and the first base of the first species is identified by a small black triangle.

Lewin HA et al. The Earth BioGenome Project 2020: Starting the clock. (2022) PNAS 119(4) e2115635118.

The COVID charts — hospitalization rates

Tue 25-01-2022

As part of the COVID Charts series, I fix a muddled and storyless graphic tweeted by Adrian Dix, British Columbia's Health Minister.

I show you how to fix color schemes to make them colorblind-accessible and effective in revealing patterns, how to reduce redundancy in labels (a key but overlooked part of many visualizations) and how to extract a story out of a table to frame the narrative.

Browse all the COVID charts.

Clear titles introduce the graphic, which starts with informative and non-obvious observations of the relationship between age, number of comorbidities, vaccination status and hospitalization rates. Supporting the story is a tidy table that gives you detailed statistics for each demographic. (more)

The class imbalance problem

Fri 15-10-2021

The exception proves the rule.

But when one class is rare, evaluating a classifier using accuracy can be misleading — because it can vary across classes. This is the class imbalance problem.

We discuss how a data set can be rebalanced by removing data (undersampling) or adding synthetic data (oversampling). This must be done with care: undersampling can result in the loss of information and oversampling can lead to overfitting.

We look at various resampling methods (e.g. SMOTE) and explore how they influence performance of a classifier as the imbalance ratio increases.
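The two rebalancing strategies can be sketched with plain random resampling. This is a simpler stand-in for SMOTE, which interpolates new synthetic points instead of duplicating existing ones. The function names and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(label)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for label in counts:
        pool = [s for s, l in zip(samples, labels) if l == label]
        for s in rng.sample(pool, target):  # sampling without replacement
            out_samples.append(s)
            out_labels.append(label)
    return out_samples, out_labels


samples = list(range(10))
labels = ["neg"] * 8 + ["pos"] * 2
over_x, over_y = random_oversample(samples, labels)
under_x, under_y = random_undersample(samples, labels)
print(Counter(over_y), Counter(under_y))
```

Oversampling keeps all 10 original samples but repeats the two positives many times, which is where the risk of overfitting comes from. Undersampling discards six of the eight negatives, which is the loss of information.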

Nature Methods Points of Significance column: The class imbalance problem. (read)

Megahed, F.M., Chen, Y-J., Megahed, A., Ong, Y., Altman, N. & Krzywinski, M. (2021) Points of significance: The class imbalance problem. Nature Methods 18:1270–1272.


Science cover — The Human Genome

Tue 24-08-2021

My cover design on the 24 Sep 2021 Science issue depicts the human genome as a spiral (scale: 1 million bases per centimeter), with colored segments representing different chromosomes. Circle size denotes the number of genes associated with Mendelian disorders and hollow circles indicate the number of mutation clusters from a pan-cancer analysis.

My Science spiral cover (volume 373, issue 6562, 24 Sep 2021). (more)

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Music for the Moon: Flunk's 'Down Here / Moon Above'

Sat 29-05-2021

The Sanctuary Project is a Lunar vault of science and art. It includes two fully sequenced human genomes, sequenced and assembled by us at Canada's Michael Smith Genome Sciences Centre.

The first disc includes a song composed by Flunk for the (eventual) trip to the Moon.

But how do you send sound to space? I describe the inspiration, process and art behind the work.

The song 'Down Here / Moon Above' from Flunk's new album History of Everything Ever is our song for space. It appears on the Sanctuary genome discs, which aim to send two fully sequenced human genomes to the Moon. (more)

Browse the genome discs.

Happy 2021 π Day—
A forest of digits

Sun 14-03-2021

Celebrate π Day (March 14th) and finally see the digits through the forest.

The 26th tree in the digit forest of π. Why is there a flower on the ground? (details)

This year is full of botanical whimsy. A Lindenmayer system forest – deterministic but always changing. Feel free to stop and pick the flowers from the ground.

The first 46 digits of π in 8 trees. There are so many more. (details)

And things can get crazy in the forest.

A forest of the digits of π, by ecosystem. (details)

Check out art from previous years: 2013 π Day and 2014 π Day, 2015 π Day, 2016 π Day, 2017 π Day, 2018 π Day and 2019 π Day.


Testing for rare conditions

Sun 30-05-2021

All that glitters is not gold. —W. Shakespeare

The sensitivity and specificity of a test do not necessarily correspond to its error rate. This becomes critically important when testing for a rare condition: a positive result from a test with 99% sensitivity and specificity has an even chance of being wrong when the condition prevalence is 1%.

We discuss the positive predictive value (PPV) and how practices such as screening can increase it.
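The even-chance figure follows directly from Bayes' rule. A minimal sketch of the calculation, with a function name chosen here for illustration:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a subject with a positive test truly has the condition."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Rare condition: 1% prevalence, 99% sensitivity and specificity.
ppv_rare = positive_predictive_value(0.99, 0.99, 0.01)
# Same test in a pre-screened group where prevalence is 10% (made-up figure).
ppv_screened = positive_predictive_value(0.99, 0.99, 0.10)
print(ppv_rare, ppv_screened)
```

At 1% prevalence the PPV is about 0.5: false positives from the large healthy group are as numerous as true positives from the small affected group. Testing a group in which the condition is more common raises the PPV without changing the test itself.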

Nature Methods Points of Significance column: Testing for rare conditions. (read)

Altman, N. & Krzywinski, M. (2021) Points of significance: Testing for rare conditions. Nature Methods 18:224–225.

Standardization fallacy

Tue 09-02-2021

We demand rigidly defined areas of doubt and uncertainty! —D. Adams

A popular notion about experiments is that it's good to keep variability in subjects low to limit the influence of confounding factors. This is called standardization.

Unfortunately, although standardization increases power, it can induce unrealistically low variability and lead to results that do not generalize to the population of interest and that may, in fact, be irreproducible.

Nature Methods Points of Significance column: Standardization fallacy. (read)

Not paying attention to these details and thinking (or hoping) that standardization is always good is the "standardization fallacy". In this column, we look at how standardization can be balanced with heterogenization to avoid this thorny issue.

Voelkl, B., Würbel, H., Krzywinski, M. & Altman, N. (2021) Points of significance: Standardization fallacy. Nature Methods 18:5–6.

Graphical Abstract Design Guidelines

Fri 13-11-2020

Clear, concise, legible and compelling.

Making a scientific graphical abstract? Refer to my practical design guidelines and redesign examples to improve organization, design and clarity of your graphical abstracts.

Graphical Abstract Design Guidelines — Clear, concise, legible and compelling.

"This data might give you a migrane"

Tue 06-10-2020

An in-depth look at my process of reacting to a bad figure — how I design a poster and tell data stories.

A poster of high BMI and obesity prevalence for 185 countries.

He said, he said — a word analysis of the 2020 Presidential Debates

Thu 01-10-2020

Building on the method I used to analyze the 2008, 2012 and 2016 U.S. Presidential and Vice Presidential debates, I explore word usage in the 2020 Debates between Donald Trump and Joe Biden.

Analysis of word usage by parts of speech for Trump and Biden reveals insight into each candidate.

Points of Significance celebrates 50th column

Mon 24-08-2020

We are celebrating the publication of our 50th column!

To all our coauthors — thank you and see you in the next column!

Nature Methods Points of Significance: Celebrating 50 columns of clear explanations of statistics. (read)

Uncertainty and the management of epidemics

Mon 24-08-2020

When modelling epidemics, some uncertainties matter more than others.

Public health policy is always hampered by uncertainty. During a novel outbreak, nearly everything will be uncertain: the mode of transmission, the duration and population variability of latency, infection and protective immunity and, critically, whether the outbreak will fade out or turn into a major epidemic.

The uncertainty may be structural (which model?), parametric (what is R0?), and/or operational (how well do masks work?).

This month, we continue our exploration of epidemiological models and look at how uncertainty affects forecasts of disease dynamics and optimization of intervention strategies.

Nature Methods Points of Significance column: Uncertainty and the management of epidemics. (read)

We show how the impact of the uncertainty on any choice in strategy can be expressed using the Expected Value of Perfect Information (EVPI), which is the potential improvement in outcomes that could be obtained if the uncertainty is resolved before making a decision on the intervention strategy. In other words, by how much could we potentially increase effectiveness of our choice (e.g. lowering total disease burden) if we knew which model best reflects reality?
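As a worked example of the EVPI idea, the sketch below uses a made-up table of total disease burden for two candidate models and two intervention strategies. EVPI is the expected burden of the best single choice made under uncertainty minus the expected burden when the best choice can be made for each model separately. All names and numbers are hypothetical.

```python
def evpi(model_probs, burden):
    """Expected value of perfect information when lower burden is better.

    model_probs: dict mapping model name -> probability that it is correct.
    burden: dict mapping model name -> {action name: disease burden}.
    """
    actions = list(next(iter(burden.values())).keys())
    # Deciding now: commit to the one action with the lowest expected burden.
    best_now = min(
        sum(model_probs[m] * burden[m][a] for m in model_probs)
        for a in actions
    )
    # Deciding after uncertainty is resolved: best action for each model,
    # averaged over how likely each model is.
    best_informed = sum(
        model_probs[m] * min(burden[m].values()) for m in model_probs
    )
    return best_now - best_informed


model_probs = {"model_A": 0.5, "model_B": 0.5}
burden = {
    "model_A": {"vaccinate": 10.0, "distance": 30.0},
    "model_B": {"vaccinate": 40.0, "distance": 20.0},
}
value = evpi(model_probs, burden)
print(value)
```

Here either action has an expected burden of 25 when chosen blind, but knowing the model would allow an expected burden of 15, so resolving the uncertainty is worth up to 10 units of burden.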

This column has an interactive supplemental component (download code) that allows you to explore the impact of uncertainty in R0 and immunity duration on timing and size of epidemic waves and the total burden of the outbreak and calculate EVPI for various outbreak models and scenarios.

Nature Methods Points of Significance column: Uncertainty and the management of epidemics. (Interactive supplemental materials)

Bjørnstad, O.N., Shea, K., Krzywinski, M. & Altman, N. (2020) Points of significance: Uncertainty and the management of epidemics. Nature Methods 17.

Background reading

Bjørnstad, O.N., Shea, K., Krzywinski, M. & Altman, N. (2020) Points of significance: Modeling infectious epidemics. Nature Methods 17:455–456.

Bjørnstad, O.N., Shea, K., Krzywinski, M. & Altman, N. (2020) Points of significance: The SEIRS model for infectious disease dynamics. Nature Methods 17:557–558.

Cover of Nature Genetics August 2020

Mon 03-08-2020

Our design on the cover of Nature Genetics's August 2020 issue is “Dichotomy of Chromatin in Color”. Thanks to Dr. Andy Mungall for suggesting this terrific title.

Dichotomy of Chromatin in Color. Nature Genetics, August 2020 issue. (read more)

The cover design accompanies our report in the issue Gagliardi, A., Porter, V.L., Zong, Z. et al. (2020) Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade–specific epigenome and transcriptome landscapes. Nature Genetics 52:800–810.

Poster Design Guidelines

Wed 15-07-2020

Clear, concise, legible and compelling.

The PDF template is a poster about making posters. It provides design, typography and data visualization tips with minimum fuss. Follow its advice until you have developed enough design sobriety and experience to know when to go your own way.

Poster Design Guidelines — Clear, concise, legible and compelling.

The SEIRS model for infectious disease dynamics

Thu 18-06-2020

Realistic models of epidemics account for latency, loss of immunity, births and deaths.

We continue with our discussion about epidemic models and show how births, deaths and loss of immunity can create epidemic waves—a periodic fluctuation in the fraction of population that is infected.
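A minimal sketch of an SEIRS model, assuming population fractions that sum to 1, equal birth and death rates, and no disease-induced mortality. The parameter values are placeholders, the time stepping is a plain Euler scheme, and this is not the code from the interactive supplement.

```python
def seirs_derivatives(state, beta, sigma, gamma, omega, mu):
    """Rates of change of the S, E, I and R fractions.

    beta: transmission rate, sigma: rate of leaving latency,
    gamma: recovery rate, omega: rate of immunity loss,
    mu: birth rate (= death rate), all per day.
    """
    s, e, i, r = state
    ds = mu - beta * s * i + omega * r - mu * s
    de = beta * s * i - (sigma + mu) * e
    di = sigma * e - (gamma + mu) * i
    dr = gamma * i - (omega + mu) * r
    return ds, de, di, dr


def simulate_seirs(days, dt=0.1, beta=0.5, sigma=1 / 5, gamma=1 / 7,
                   omega=1 / 180, mu=1 / (70 * 365)):
    """Euler integration; returns the final state and the infected fraction over time."""
    state = (0.999, 0.0, 0.001, 0.0)  # start with 0.1% infected
    infected = [state[2]]
    for _ in range(int(round(days / dt))):
        rates = seirs_derivatives(state, beta, sigma, gamma, omega, mu)
        state = tuple(x + dt * dx for x, dx in zip(state, rates))
        infected.append(state[2])
    return state, infected


final_state, infected = simulate_seirs(days=3 * 365)
print(final_state, max(infected))
```

Births and loss of immunity both feed the susceptible group back up after an outbreak, which is the mechanism that allows the infected fraction to rise again. Plotting `infected` against time is one way to look for the waves.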

Nature Methods Points of Significance column: The SEIRS model for infectious disease dynamics. (read)

This column has an interactive supplemental component (download code) that allows you to explore epidemic waves and introduces the idea of the phase plane, a compact way to understand the evolution of an epidemic over its entire course.

Nature Methods Points of Significance column: The SEIRS model for infectious disease dynamics. (Interactive supplemental materials)

Bjørnstad, O.N., Shea, K., Krzywinski, M. & Altman, N. (2020) Points of significance: The SEIRS model for infectious disease dynamics. Nature Methods 17:557–558.

Background reading

Bjørnstad, O.N., Shea, K., Krzywinski, M. & Altman, N. (2020) Points of significance: Modeling infectious epidemics. Nature Methods 17:455–456.

Gene Machines

Fri 05-06-2020

Shifting soundscapes, textures and rhythmic loops produced by laboratory machines.

In commemoration of the 20th anniversary of Canada's Michael Smith Genome Sciences Centre, Segue was commissioned to create an original composition based on audio recordings from the GSC's laboratory equipment, robots and computers—to make “music” from the noise they produce.

Gene Machines by Segue. Now available on vinyl.

Virus Mutations Reveal How COVID-19 Really Spread

Mon 01-06-2020

Genetic sequences of the coronavirus tell story of when the virus arrived in each country and where it came from.

Our graphic in Scientific American's Graphic Science section in the June 2020 issue shows a phylogenetic tree based on a snapshot of the data model from Nextstrain as of 31 March 2020.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Virus Mutations Reveal How COVID-19 Really Spread. Text by Mark Fischetti (Senior Editor), art direction by Jen Christiansen (Senior Graphics Editor), source: Nextstrain (enabled by data from GISAID).

Cover of Nature Cancer April 2020

Mon 27-04-2020

Our design on the cover of Nature Cancer's April 2020 issue shows mutation spectra of patients from the POG570 cohort of 570 individuals with advanced metastatic cancer.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Each ellipse system represents the mutation spectrum of an individual patient. Individual ellipses in the system correspond to the number of base changes in a given class and are layered by mutation count. Ellipse angle is controlled by the proportion of mutations in a class within the sample and its size is determined by a sigmoid mapping of mutation count scaled within the layer. The opacity of each system represents the duration since the diagnosis of advanced disease. (read more)

The cover design accompanies our report in the issue: Pleasance, E., Titmuss, E., Williamson, L. et al. (2020) Pan-cancer analysis of advanced patient tumors reveals interactions between therapy and genomic landscapes. Nat Cancer 1:452–468.

Modeling infectious epidemics

Tue 16-06-2020

Every day sadder and sadder news of its increase. In the City died this week 7496; and of them, 6102 of the plague. But it is feared that the true number of the dead this week is near 10,000 ....
—Samuel Pepys, 1665

This month, we begin a series of columns on epidemiological models. We start with the basic SIR model, which models the spread of an infection between three groups in a population: susceptible, infected and recovered.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Modeling infectious epidemics. (read)

We discuss conditions under which an outbreak occurs, estimates of spread characteristics and the effects that mitigation can have on disease trajectories. We show the trends that arise when "flattening the curve" by decreasing `R_0`.
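To make the effect of `R_0` concrete, here is a minimal sketch of the basic SIR model integrated with a simple Euler scheme. It is not the column's supplemental code: the function name `simulate_sir`, the 10-day infectious period, the initial infected fraction and the step size are all assumptions chosen for illustration.

```python
def simulate_sir(r0, infectious_period=10.0, i0=0.001, days=300, dt=0.1):
    """Integrate the SIR equations with a simple Euler scheme.

    Returns the peak infected fraction and the final susceptible fraction.
    All parameter values are illustrative assumptions.
    """
    gamma = 1.0 / infectious_period   # recovery rate
    beta = r0 * gamma                 # transmission rate, since R_0 = beta / gamma
    s, i, r = 1.0 - i0, i0, 0.0       # fractions of the population
    peak = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return peak, s


# Lower R_0 gives a lower peak; below 1 the infection never grows.
for r0 in (3.0, 2.0, 1.5, 0.9):
    peak, s_final = simulate_sir(r0)
    print(f"R0={r0}: peak infected={peak:.3f}, never infected={s_final:.3f}")
```

Running the loop shows the peak infected fraction falling as `R_0` decreases, and no outbreak at all once `R_0` drops below 1.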

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Modeling infectious epidemics. (read)

This column has an interactive supplemental component (download code) that allows you to explore how the model curves change with parameters such as infectious period, basic reproduction number and vaccination level.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Modeling infectious epidemics. (Interactive supplemental materials)

Bjørnstad, O.N., Shea, K., Krzywinski, M. & Altman, N. (2020) Points of significance: Modeling infectious epidemics. Nature Methods 17:455–456.

The Outbreak Poems

Sat 04-04-2020

I'm writing poetry daily to put my feelings into words more often during the COVID-19 outbreak.

Your hours
will
last me my years.
Hole in heart
is
bigger than you
were.
From hand to
heart
in a flutter.
Come fly in
my 
heart for a while.
Can't feel you
in
my hand, dying.
Need new words,
please,
for these feelings.
Mini dee
you
put you in me.

Read the poems and learn what a piku is.


Deadly Genomes: Genome Structure and Size of Harmful Bacteria and Viruses

Tue 17-03-2020

A poster full of epidemiological worry and statistics. Now updated with the genome of SARS-CoV-2 and COVID-19 case statistics as of 3 March 2020.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Deadly Genomes: Genome Structure and Size of Harmful Bacteria and Viruses (zoom)

Bacterial and viral genomes of various diseases are drawn as paths with color encoding local GC content and curvature encoding local repeat content. Position of the genome encodes prevalence and mortality rate.

The deadly genomes collection has been updated with posters of the genomes of SARS-CoV-2, the novel coronavirus that causes COVID-19.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Genomes of 56 SARS-CoV-2 coronaviruses that cause COVID-19.
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Ball of 56 SARS-CoV-2 coronaviruses that cause COVID-19.
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The first SARS-CoV-2 genome (MT019529) to be sequenced appears first on the poster.

Using Circos in Galaxy Australia Workshop

Wed 04-03-2020

A workshop on using the Circos Galaxy wrapper by Hiltemann and Rasche. Event organized by Australian Biocommons.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Using Circos in Galaxy Australia workshop. (zoom)

Download workshop slides.

Galaxy wrapper training materials, Saskia Hiltemann, Helena Rasche, 2020 Visualisation with Circos (Galaxy Training Materials).

Essence of Data Visualization in Bioinformatics Webinar

Thu 20-02-2020

My webinar on fundamental concepts in data visualization and visual communication of scientific data and concepts. Event organized by Australian Biocommons.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Essence of Data Visualization in Bioinformatics webinar. (zoom)

Download webinar slides.


Markov models — training and evaluation of hidden Markov models

Thu 20-02-2020

With one eye you are looking at the outside world, while with the other you are looking within yourself.
—Amedeo Modigliani

Following up on our Markov Chain column and Hidden Markov model column, this month we look at how Markov models are trained using the example of a biased coin.

We introduce the concepts of forward and backward probabilities and explicitly show how they are calculated in the training process using the Baum-Welch algorithm. We also discuss the value of ensemble models and the use of pseudocounts for cases where rare observations are expected but not necessarily seen.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Markov models — training and evaluation of hidden Markov models. (read)

Grewal, J., Krzywinski, M. & Altman, N. (2020) Points of significance: Markov models — training and evaluation of hidden Markov models. Nature Methods 17:121–122.

Background reading

Grewal, J., Krzywinski, M. & Altman, N. (2019) Points of significance: Hidden Markov models. Nature Methods 16:795–796.

Grewal, J., Krzywinski, M. & Altman, N. (2019) Points of significance: Markov Chains. Nature Methods 16:663–664.

Genome Sciences Center 20th Anniversary Clothing, Music, Drinks and Art

Tue 28-01-2020

Science. Timeliness. Respect.

Read about the design of the clothing, music, drinks and art for the Genome Sciences Center 20th Anniversary Celebration, held on 15 November 2019.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Luke and Mayia wearing limited edition volunteer t-shirts. The pattern reproduces the human genome with chromosomes as spirals. (zoom)

As part of the celebration and with the help of our engineering team, we framed 48 flow cells from the lab.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Precisely engineered frame mounts of flow cells used to sequence genomes in our laboratory. (zoom)

Each flow cell was accompanied by an interpretive plaque explaining the technology behind the flow cell and the sample information and sequence content.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The plaque at the back of one of the framed Illumina flow cells. This one has sequence from a patient's lymph node diagnosed with Burkitt's lymphoma. (zoom)

Scientific data visualization: Aesthetic for diagrammatic clarity

Mon 13-01-2020

The scientific process works because all its output is empirically constrained.

My chapter from The Aesthetics of Scientific Data Representation, More than Pretty Pictures, in which I discuss the principles of data visualization and connect them to the concept of "quality" introduced by Robert Pirsig in Zen and the Art of Motorcycle Maintenance.


Yearning for the Infinite — Aleph 2

Mon 18-11-2019

Discover Cantor's transfinite numbers through my music video for the Aleph 2 track of Max Cooper's Yearning for the Infinite (album page, event page).

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Yearning for the Infinite, Max Cooper at the Barbican Hall, London. Track Aleph 2. Video by Martin Krzywinski. Photo by Michal Augustini. (more)

I discuss the math behind the video and the system I built to create the video.

Hidden Markov Models

Mon 18-11-2019

Everything we see hides another thing, we always want to see what is hidden by what we see.
—Rene Magritte

A Hidden Markov Model extends a Markov chain to have hidden states. Hidden states model aspects of the system that cannot be directly observed. They themselves form a Markov chain, and each state may emit one or more observed values.

Hidden states in HMMs do not have to have meaning—they can be used to account for measurement errors, compress multi-modal observational data, or to detect unobservable events.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Hidden Markov Models. (read)

In this column, we extend the cell growth model from our Markov Chain column to include two hidden states: normal and sedentary.

We show how to calculate forward probabilities that can predict the most likely path through the HMM given an observed sequence.
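As a rough illustration of the forward calculation, the sketch below fills in the forward table for a two-state HMM. The state names follow the column (normal and sedentary), but every probability, the observation labels `grow` and `rest`, and the function name `forward` are assumptions made up for this example. Note that summing over previous states, as done here, gives the probability of the observed sequence; replacing the sum with a maximum gives the Viterbi variant that tracks the single most likely path.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return the forward table and the likelihood of the observed sequence.

    alpha[t][s] is the joint probability of the first t+1 observations
    and of being in hidden state s at time t.
    """
    alpha = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    for obs in observations[1:]:
        prev = alpha[-1]
        alpha.append({
            s: emit_p[s][obs] * sum(prev[r] * trans_p[r][s] for r in states)
            for s in states
        })
    likelihood = sum(alpha[-1].values())
    return alpha, likelihood


# Hypothetical model parameters (not taken from the column).
states = ("normal", "sedentary")
start_p = {"normal": 0.6, "sedentary": 0.4}
trans_p = {
    "normal": {"normal": 0.7, "sedentary": 0.3},
    "sedentary": {"normal": 0.4, "sedentary": 0.6},
}
emit_p = {
    "normal": {"grow": 0.8, "rest": 0.2},
    "sedentary": {"grow": 0.3, "rest": 0.7},
}

alpha, likelihood = forward(["grow", "grow", "rest"], states, start_p, trans_p, emit_p)
print(f"P(observed sequence) = {likelihood:.4f}")
```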

Grewal, J., Krzywinski, M. & Altman, N. (2019) Points of significance: Hidden Markov Models. Nature Methods 16:795–796.

Background reading

Grewal, J., Krzywinski, M. & Altman, N. (2019) Points of significance: Markov Chains. Nature Methods 16:663–664.

Hola Mundo Cover

Sat 21-09-2019

My cover design for Hola Mundo by Hannah Fry. Published by Blackie Books.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Hola Mundo by Hannah Fry. Cover design is based on my 2013 `\pi` day art. (read)

Curious how the design was created? Read the full details.


Markov Chains

Tue 30-07-2019

You can look back there to explain things,
but the explanation disappears.
You'll never find it there.
Things are not explained by the past.
They're explained by what happens now.
—Alan Watts

A Markov chain is a probabilistic model that is used to model how a system changes over time as a series of transitions between states. Each transition is assigned a probability that defines the chance of the system changing from one state to another.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Markov Chains. (read)

Together with the states, these transition probabilities define a stochastic model with the Markov property: transition probabilities depend only on the current state, so the future is independent of the past if the present is known.

Once the transition probabilities are defined in matrix form, it is easy to predict the distribution of future states of the system. We cover concepts of aperiodicity, irreducibility, limiting and stationary distributions and absorption.
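The matrix form can be sketched in a few lines. The two-state transition matrix `P` below is hypothetical, as are the helper names `step` and `evolve`; the point is only to show how repeated multiplication by the matrix predicts future state distributions and, for this aperiodic and irreducible chain, converges to the stationary distribution.

```python
def step(dist, P):
    """One transition: new_dist[j] = sum over i of dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def evolve(dist, P, n_steps):
    """Distribution over states after n_steps transitions."""
    for _ in range(n_steps):
        dist = step(dist, P)
    return dist


# Hypothetical two-state chain; each row holds the probabilities of
# leaving one state and must sum to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

start = [1.0, 0.0]            # system starts in state 0 with certainty
print(evolve(start, P, 1))    # distribution after one transition
print(evolve(start, P, 50))   # close to the stationary distribution (5/6, 1/6)
```

Starting from `[0.0, 1.0]` instead leads to the same long-run distribution, which is what makes it stationary.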

This column is the first part of a series and pairs particularly well with Alan Watts and Blond:ish.

Grewal, J., Krzywinski, M. & Altman, N. (2019) Points of significance: Markov Chains. Nature Methods 16:663–664.

1-bit zoomable gigapixel maps of Moon, Solar System and Sky

Mon 22-07-2019

Places to go and nobody to see.

Exquisitely detailed maps of places on the Moon, comets and asteroids in the Solar System and stars, deep-sky objects and exoplanets in the northern and southern sky. All maps are zoomable.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
3.6 gigapixel map of the near side of the Moon, annotated with 6,733. (details)
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
100 megapixel and 10 gigapixel map of the Solar System on 20 July 2019, annotated with 758k asteroids, 1.3k comets and all planets and satellites. (details)
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
100 megapixel and 10 gigapixel map of the Northern Celestial Hemisphere, annotated with 44 million stars, 74,000 deep-sky objects and 3,000 exoplanets. (details)
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
100 megapixel and 10 gigapixel map of the Southern Celestial Hemisphere, annotated with 69 million stars, 88,000 deep-sky objects and 1,000 exoplanets. (details)

Quantile regression

Sat 01-06-2019
Quantile regression robustly estimates the typical and extreme values of a response.

Quantile regression explores the effect of one or more predictors on quantiles of the response. It can answer questions such as "What is the weight of 90% of individuals of a given height?"

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Quantile regression. (read)

Unlike in traditional mean regression methods, no assumptions about the distribution of the response are required, which makes it practical, robust and amenable to skewed distributions.

Quantile regression is also very useful when extremes are interesting or when the response variance varies with the predictors.
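The idea behind quantile regression can be sketched through its loss function. Instead of squared error, it minimizes an asymmetric absolute error (often called the pinball or check loss). The example below covers only the simplest case with no predictors, where the minimizer is the sample quantile; the weights are invented, and the function names are assumptions for illustration.

```python
def pinball_loss(residual, tau):
    """Asymmetric absolute loss: under-predictions cost tau, over-predictions 1 - tau."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def total_loss(values, q, tau):
    """Total pinball loss when every value is predicted by the constant q."""
    return sum(pinball_loss(v - q, tau) for v in values)


def best_constant(values, tau):
    """The loss is piecewise linear, so a minimizer can be found among the data points."""
    return min(values, key=lambda q: total_loss(values, q, tau))


# Hypothetical weights (kg) of individuals of the same height.
weights = [52, 55, 58, 60, 61, 63, 66, 70, 75, 82, 95]

print(best_constant(weights, 0.5))   # the median
print(best_constant(weights, 0.9))   # the 90th percentile
```

In a full quantile regression the constant `q` is replaced by a function of the predictors, and the same loss is minimized over its coefficients.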

Das, K., Krzywinski, M. & Altman, N. (2019) Points of significance: Quantile regression. Nature Methods 16:451–452.

Background reading

Altman, N. & Krzywinski, M. (2015) Points of significance: Simple linear regression. Nature Methods 12:999–1000.


Analyzing outliers: Robust methods to the rescue

Sat 30-03-2019
Robust regression generates more reliable estimates by detecting and downweighting outliers.

Outliers can degrade the fit of linear regression models when the estimation is performed using the ordinary least squares. The impact of outliers can be mitigated with methods that provide robust inference and greater reliability in the presence of anomalous values.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Analyzing outliers: Robust methods to the rescue. (read)

We discuss MM-estimation and show how it can be used to keep your fitting sane and reliable.

Greco, L., Luta, G., Krzywinski, M. & Altman, N. (2019) Points of significance: Analyzing outliers: Robust methods to the rescue. Nature Methods 16:275–276.

Background reading

Altman, N. & Krzywinski, M. (2016) Points of significance: Analyzing outliers: Influential or nuisance. Nature Methods 13:281–282.

Two-level factorial experiments

Fri 22-03-2019
To find which experimental factors have an effect, simultaneously examine the difference between the high and low levels of each.

Two-level factorial experiments, in which all combinations of multiple factor levels are used, efficiently estimate factor effects and detect interactions—desirable statistical qualities that can provide deep insight into a system.

They offer two benefits over the widely used one-factor-at-a-time (OFAT) experiments: efficiency and ability to detect interactions.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Two-level factorial experiments. (read)

Since the number of factor combinations can quickly increase, one approach is to model only some of the factorial effects using empirically-validated assumptions of effect sparsity and effect hierarchy. Effect sparsity tells us that in factorial experiments most of the factorial terms are likely to be unimportant. Effect hierarchy tells us that low-order terms (e.g. main effects) tend to be larger than higher-order terms (e.g. two-factor or three-factor interactions).
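As a small worked example, the sketch below estimates the main effects and the interaction in a two-factor, two-level design. The response values are invented and the helper `effect` is a hypothetical name; levels are coded as -1 (low) and +1 (high), and each effect is the mean response where the product of the coded levels is +1 minus the mean where it is -1.

```python
def effect(runs, columns):
    """Mean response where the product of the chosen coded levels is +1,
    minus the mean response where it is -1."""
    hi, lo = [], []
    for levels, y in runs:
        sign = 1
        for c in columns:
            sign *= levels[c]
        (hi if sign > 0 else lo).append(y)
    return sum(hi) / len(hi) - sum(lo) / len(lo)


# Hypothetical 2x2 factorial: (level of A, level of B) and the response.
runs = [
    ((-1, -1), 20.0),
    ((+1, -1), 30.0),
    ((-1, +1), 25.0),
    ((+1, +1), 45.0),
]

print("main effect of A:", effect(runs, [0]))
print("main effect of B:", effect(runs, [1]))
print("A x B interaction:", effect(runs, [0, 1]))
```

All four runs contribute to every estimate, which is the source of the efficiency gain over OFAT, and the interaction term is something an OFAT experiment cannot estimate at all.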

Smucker, B., Krzywinski, M. & Altman, N. (2019) Points of significance: Two-level factorial experiments. Nature Methods 16:211–212.

Background reading

Krzywinski, M. & Altman, N. (2014) Points of significance: Designing comparative experiments. Nature Methods 11:597–598.

Happy 2019 `\pi` Day—
Digits, internationally

Tue 12-03-2019

Celebrate `\pi` Day (March 14th) and set out on an exploration of accents unknown (to you)!

This year is purely typographical, with something for everyone. Hundreds of digits and hundreds of languages.

A special kids' edition merges math with color and fat fonts.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
116 digits in 64 languages. (details)
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
223 digits in 102 languages. (details)

Check out art from previous years: 2013 `\pi` Day and 2014 `\pi` Day, 2015 `\pi` Day, 2016 `\pi` Day, 2017 `\pi` Day and 2018 `\pi` Day.


Tree of Emotional Life

Sun 17-02-2019

One moment you're :) and the next you're :-.

Make sense of it all with my Tree of Emotional Life, a hierarchical account of how we feel.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
A section of the Tree of Emotional Life.

Find and snap to colors in an image

Sat 29-12-2018

One of my color tools, the colorsnap application snaps colors in an image to a set of reference colors and reports their proportion.

Below is Times Square rendered using the colors of the MTA subway lines.


Colors used by the New York MTA subway lines.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Times Square in New York City.
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Times Square in New York City rendered using colors of the MTA subway lines.
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Granger rainbow snapped to subway lines colors from four cities. (zoom)

Take your medicine ... now

Wed 19-12-2018

Drugs could be more effective if taken when the genetic proteins they target are most active.

Design tip: rediscover CMYK primaries.

More of my Scientific American Graphic Science designs

Ruben et al. A database of tissue-specific rhythmically expressed human genes has potential applications in circadian medicine. Science Translational Medicine 10, Issue 458, eaat8806.


Predicting with confidence and tolerance

Wed 07-11-2018
I abhor averages. I like the individual case. —J.D. Brandeis.

We focus on the important distinction between confidence intervals, typically used to express uncertainty of a sampling statistic such as the mean, and prediction and tolerance intervals, used to make statements about the next value to be drawn from the population.

Confidence intervals provide coverage of a single point—the population mean—with the assurance that the probability of non-coverage is some acceptable value (e.g. 0.05). On the other hand, prediction and tolerance intervals both give information about typical values from the population and the percentage of the population expected to be in the interval. For example, a tolerance interval can be configured to tell us what fraction of sampled values (e.g. 95%) will fall into an interval some fraction of the time (e.g. 95%).

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Predicting with confidence and tolerance. (read)

Altman, N. & Krzywinski, M. (2018) Points of significance: Predicting with confidence and tolerance. Nature Methods 15:843–844.

Background reading

Krzywinski, M. & Altman, N. (2013) Points of significance: Importance of being uncertain. Nature Methods 10:809–810.

4-day Circos course

Wed 31-10-2018

A 4-day introductory course on genome data parsing and visualization using Circos. Prepared for the Bioinformatics and Genome Analysis course in Institut Pasteur Tunis, Tunis, Tunisia.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Composite of the kinds of images you will learn to make in this course.

Oryza longistaminata genome cake

Mon 24-09-2018

Data visualization should be informative and, where possible, tasty.

Stefan Reuscher from Bioscience and Biotechnology Center at Nagoya University celebrates a publication with a Circos cake.

The cake shows an overview of a de-novo assembled genome of a wild rice species Oryza longistaminata.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Circos cake celebrating Reuscher et al. 2018 publication of the Oryza longistaminata genome.

Optimal experimental design

Tue 31-07-2018
Customize the experiment for the setting instead of adjusting the setting to fit a classical design.

The presence of constraints in experiments, such as sample size restrictions, awkward blocking or disallowed treatment combinations may make using classical designs very difficult or impossible.

Optimal design is a powerful, general purpose alternative for high quality, statistically grounded designs under nonstandard conditions.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Optimal experimental design. (read)

We discuss two types of optimal designs (D-optimal and I-optimal) and show how they can be applied to a scenario with sample size and blocking constraints.

Smucker, B., Krzywinski, M. & Altman, N. (2018) Points of significance: Optimal experimental design. Nature Methods 15:599–600.

Background reading

Krzywinski, M. & Altman, N. (2014) Points of significance: Two factor designs. Nature Methods 11:1187–1188.

Krzywinski, M. & Altman, N. (2014) Points of significance: Analysis of variance (ANOVA) and blocking. Nature Methods 11:699–700.

Krzywinski, M. & Altman, N. (2014) Points of significance: Designing comparative experiments. Nature Methods 11:597–598.

The Whole Earth Cataloguer

Mon 30-07-2018
All the living things.

An illustration of the Tree of Life, showing some of the key branches.

The tree is drawn as a DNA double helix, with bases colored to encode ribosomal RNA genes from various organisms on the tree.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The circle of life. (read, zoom)

All living things on earth descended from a single organism called LUCA (last universal common ancestor) and inherited LUCA’s genetic code for basic biological functions, such as translating DNA and creating proteins. Constant genetic mutations shuffled and altered this inheritance and added new genetic material—a process that created the diversity of life we see today. The “tree of life” organizes all organisms based on the extent of shuffling and alteration between them. The full tree has millions of branches and every living organism has its own place at one of the leaves in the tree. The simplified tree shown here depicts all three kingdoms of life: bacteria, archaebacteria and eukaryota. For some organisms a grey bar shows when they first appeared in the tree in millions of years (Ma). The double helix winding around the tree encodes highly conserved ribosomal RNA genes from various organisms.

Johnson, H.L. (2018) The Whole Earth Cataloguer, Sactown, Jun/Jul, p. 89

Why we can't give up this odd way of typing

Mon 30-07-2018
All fingers report to home row.

An article about keyboard layouts and the history and persistence of QWERTY.

My Carpalx keyboard optimization software is mentioned along with my World's Most Difficult Layout: TNWMLC. True typing hell.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
TNWMLC requires seriously flexible digits. It’s 87% more difficult than using a standard Qwerty keyboard, according to Martin Krzywinski, who created it (Credit: Ben Nelms). (read)

McDonald, T. (2018) Why we can't give up this odd way of typing, BBC, 25 May 2018.


Molecular Case Studies Cover

Fri 06-07-2018

The theme of the April issue of Molecular Case Studies is precision oncogenomics. We have three papers in the issue based on work done in our Personalized Oncogenomics Program (POG).

The covers of Molecular Case Studies typically show microscopy images, with some shown in a more abstract fashion. There's also the occasional Circos plot.

I've previously taken a more fine-art approach to cover design, such as for those of Nature, Genome Research and Trends in Genetics. I've used microscopy images to create a cover for PNAS (the one that made biology look like astrophysics) and thought that this is the kind of material I'd start with for the MCS cover.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Cover design for Apr 2018 issue of Molecular Case Studies. (details)

Happy 2018 `\tau` Day—Art for everyone

Wed 27-06-2018
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
You know what day it is. (details)

Universe Superclusters and Voids

Mon 25-06-2018

A map of the nearby superclusters and voids in the Universe.

By "nearby" I mean within 6,000 million light-years.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The Universe — Superclusters and Voids. The two supergalactic hemispheres showing Abell clusters, superclusters and voids within a distance of 6,000 million light-years from the Milky Way. (details)

Datavis for your feet—the 178.75 lb socks

Sat 23-06-2018

In the past, I've been tangentially involved in fashion design. I've also been more directly involved in fashion photography.

It was now time to design my first ... pair of socks.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Some datavis for your feet: the 178.75 lb socks. (get some)

In collaboration with Flux Socks, the design features the colors and relative thicknesses of Rogue Olympic weightlifting plates. The first four plates in the stack are the 55, 45, 35 and 25 lb competition plates. The top four plates are the 10, 5, 2.5 and 1.25 lb change plates.

The perceived weight of each sock is 178.75 lb and 357.5 lb for the pair.

The actual weight is much less.

Genes Behind Psychiatric Disorders

Sun 24-06-2018

Find patterns behind gene expression and disease.

Expression, correlation and network module membership of 11,000+ genes and 5 psychiatric disorders in about 6" x 7" on a single page.

Design tip: Stay calm.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
An analysis of dust reveals how the presence of men, women, dogs and cats affects the variety of bacteria in a household. Appears on Graphic Science page in December 2015 issue of Scientific American.

More of my Scientific American Graphic Science designs

Gandal M.J. et al. Shared Molecular Neuropathology Across Major Psychiatric Disorders Parallels Polygenic Overlap. Science 359 693–697 (2018)

Curse(s) of dimensionality

Tue 05-06-2018
There is such a thing as too much of a good thing.

We discuss the many ways in which analysis can be confounded when data has a large number of dimensions (variables). Collectively, these are called the "curses of dimensionality".

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Curse(s) of dimensionality. (read)

Some of these are unintuitive, such as the fact that the volume of the hypersphere increases and then shrinks beyond about 7 dimensions, while the volume of the hypercube always increases. This means that high-dimensional space is "mostly corners" and the distance between points increases greatly with dimension. This has consequences on correlation and classification.
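One of these effects, the growth of distances with dimension, is easy to check numerically. The sketch below draws random points uniformly in the unit hypercube and reports the mean distance between pairs. The function name, the number of points and the seed are arbitrary assumptions, and this is an illustration rather than the analysis from the column.

```python
import math
import random


def mean_pairwise_distance(dim, n_points=100, seed=0):
    """Mean Euclidean distance between random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total, count = 0.0, 0
    for a in range(n_points):
        for b in range(a + 1, n_points):
            total += math.dist(points[a], points[b])
            count += 1
    return total / count


# The side of the cube stays at 1, yet typical distances keep growing.
for dim in (2, 10, 100):
    print(f"{dim:>3} dimensions: mean distance = {mean_pairwise_distance(dim):.2f}")
```

For large dimension the mean distance grows roughly as the square root of the number of dimensions, so points that look close in a few variables become far apart once many variables are included.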

Altman, N. & Krzywinski, M. (2018) Points of significance: Curse(s) of dimensionality. Nature Methods 15:399–400.

Statistics vs Machine Learning

Tue 03-04-2018
We conclude our series on Machine Learning with a comparison of two approaches: classical statistical inference and machine learning. The boundary between them is subject to debate, but important generalizations can be made.

Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves. Prediction aims at forecasting unobserved outcomes or future behavior. Typically we want to do both and know how biological processes work and what will happen next. Inference and ML are complementary in pointing us to biologically meaningful conclusions.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Statistics vs machine learning. (read)

Statistics asks us to choose a model that incorporates our knowledge of the system, and ML requires us to choose a predictive algorithm by relying on its empirical capabilities. Justification for an inference model typically rests on whether we feel it adequately captures the essence of the system. The choice of pattern-learning algorithms often depends on measures of past performance in similar scenarios.

Bzdok, D., Krzywinski, M. & Altman, N. (2018) Points of Significance: Statistics vs machine learning. Nature Methods 15:233–234.

Background reading

Bzdok, D., Krzywinski, M. & Altman, N. (2017) Points of Significance: Machine learning: a primer. Nature Methods 14:1119–1120.

Bzdok, D., Krzywinski, M. & Altman, N. (2018) Points of Significance: Machine learning: supervised methods. Nature Methods 15:5–6.

...more about the Points of Significance column

Happy 2018 `\pi` Day—Boonies, burbs and boutiques of `\pi`

Wed 14-03-2018

Celebrate `\pi` Day (March 14th) and go to brand new places. Together with Jake Lever, this year we shrink the world and play with road maps.

Streets are seamlessly streets from across the world. Finally, a halva shop on the same block!

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
A great 10 km run loop between Istanbul, Copenhagen, San Francisco and Dublin. Stop off for halva, smørrebrød, espresso and a Guinness on the way. (details)

Intriguing and personal patterns of urban development for each city appear in the Boonies, Burbs and Boutiques series.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
In the Boonies, Burbs and Boutiques of `\pi` we draw progressively denser patches using the digit sequence 159 to inform density. (details)

No color—just lines. Lines from Marrakesh, Prague, Istanbul, Nice and other destinations for the mind and the heart.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Roads from cities rearranged according to the digits of `\pi`. (details)

The art is featured in the Pi City on the Scientific American SA Visual blog.

Check out art from previous years: 2013 `\pi` Day and 2014 `\pi` Day, 2015 `\pi` Day, 2016 `\pi` Day and 2017 `\pi` Day.

Machine learning: supervised methods (SVM & kNN)

Thu 18-01-2018
Supervised learning algorithms extract general principles from observed examples guided by a specific prediction objective.

We examine two very common supervised machine learning methods: linear support vector machines (SVM) and k-nearest neighbors (kNN).

SVM is often less computationally demanding than kNN and is easier to interpret, but it can identify only a limited set of patterns. On the other hand, kNN can find very complex patterns, but its output is more challenging to interpret.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Machine learning: supervised methods (SVM & kNN). (read)

We illustrate SVM using a data set in which points fall into two categories, which are separated in SVM by a straight line "margin". SVM can be tuned using a parameter that influences the width and location of the margin, permitting points to fall within the margin or on the wrong side of the margin. We then show how kNN relaxes explicit boundary definitions, such as the straight line in SVM, and how kNN too can be tuned to create more robust classification.
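To show how little machinery kNN needs, here is a minimal sketch of a k-nearest-neighbors classifier. The training points, labels and the function name `knn_predict` are invented for illustration; the tuning parameter is `k`, the number of neighbors that vote.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set: (coordinates, label).
train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 5.5), "b"), ((6.5, 7.0), "b"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.1)))
```

No boundary is ever written down: the classification of each new point depends only on which training points happen to be nearby. Larger values of `k` smooth the decision and make it less sensitive to individual noisy points.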

Bzdok, D., Krzywinski, M. & Altman, N. (2018) Points of Significance: Machine learning: supervised methods. Nature Methods 15:5–6.

Background reading

Bzdok, D., Krzywinski, M. & Altman, N. (2017) Points of Significance: Machine learning: a primer. Nature Methods 14:1119–1120.

...more about the Points of Significance column


Human Versus Machine

Mon 18-12-2017
Balancing subjective design with objective optimization.

In a Nature graphics blog article, I present my process behind designing the stark black-and-white Nature 10 cover.

Nature 10, 18 December 2017

Machine learning: a primer

Thu 18-01-2018
Machine learning extracts patterns from data without explicit instructions.

In this primer, we focus on essential ML principles: a modeling strategy to let the data speak for themselves, to the extent possible.

The benefits of ML arise from its use of a large number of tuning parameters or weights, which control the algorithm’s complexity and are estimated from the data using numerical optimization. Often ML algorithms are motivated by heuristics such as models of interacting neurons or natural evolution—even if the underlying mechanism of the biological system being studied is substantially different. The utility of ML algorithms is typically assessed empirically by how well extracted patterns generalize to new observations.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Machine learning: a primer. (read)

We present a data scenario in which we fit to a model with 5 predictors using polynomials and show what to expect from ML when noise and sample size vary. We also demonstrate the consequences of excluding an important predictor or including a spurious one.

Bzdok, D., Krzywinski, M. & Altman, N. (2017) Points of Significance: Machine learning: a primer. Nature Methods 14:1119–1120.

...more about the Points of Significance column

Snowflake simulation

Sat 23-12-2017
Symmetric, beautiful and unique.

Just in time for the season, I've simulated a snow-pile of snowflakes based on the Gravner-Griffeath model.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
A few of the beautiful snowflakes generated by the Gravner-Griffeath model. (explore)

The work is described as a wintertime tale in In Silico Flurries: Computing a world of snow, co-authored with Jake Lever, on the Scientific American SA Blog.

Gravner, J. & Griffeath, D. (2007) Modeling Snow Crystal Growth II: A mesoscopic lattice map with plausible dynamics.


Genes that make us sick

Thu 02-11-2017
Where disease hides in the genome.

My illustration of the location of genes in the human genome that are implicated in disease appears in The Objects that Power the Global Economy, a book by Quartz.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The location of genes implicated in disease in the human genome, shown here as a spiral. (more...)

Ensemble methods: Bagging and random forests

Mon 16-10-2017
Many heads are better than one.

We introduce two common ensemble methods: bagging and random forests. Both of these methods repeat a statistical analysis on many bootstrap samples to improve the accuracy of the predictor. Our column shows these methods as applied to Classification and Regression Trees.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Ensemble methods: Bagging and random forests. (read)

For example, we can sample the space of values more finely when using bagging with regression trees because each sample has potentially different boundaries at which the tree splits.

Random forests generate a large number of trees by not only generating bootstrap samples but also randomly choosing which predictor variables are considered at each split in the tree.
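
The bagging idea can be sketched with one-split regression trees (stumps) on a single predictor. This is only an illustration: the helper names, the step-shaped toy data and the choice of 200 trees are assumptions, and with one predictor there is nothing to sample at each split, so the random forest step is not shown.

```python
import random

def fit_stump(points):
    """Fit a one-split regression tree to (x, y) pairs by minimizing SSE."""
    xs = sorted({x for x, _ in points})
    mean_all = sum(y for _, y in points) / len(points)
    best = (float("inf"), None, mean_all, mean_all)
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate split boundary
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]  # (threshold, left mean, right mean)

def predict_stump(stump, x):
    t, ml, mr = stump
    if t is None:  # sample had a single x value, so no split was possible
        return ml
    return ml if x <= t else mr

def bagged_predict(points, x, n_trees=200, seed=0):
    """Average the predictions of stumps grown on bootstrap samples."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trees):
        sample = [rng.choice(points) for _ in points]  # draw with replacement
        preds.append(predict_stump(fit_stump(sample), x))
    return sum(preds) / len(preds)

# Made-up step data: y jumps from 0 to 1 between x = 4 and x = 5.
points = [(i, 0.0 if i < 5 else 1.0) for i in range(10)]
```

Because each bootstrap sample can place its split at a different boundary, the averaged prediction changes in smaller steps across `x` than any single stump does.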

Krzywinski, M. & Altman, N. (2017) Points of Significance: Ensemble methods: bagging and random forests. Nature Methods 14:933–934.

Background reading

Krzywinski, M. & Altman, N. (2017) Points of Significance: Classification and regression trees. Nature Methods 14:757–758.

...more about the Points of Significance column

Classification and regression trees

Mon 16-10-2017
Decision trees are a powerful but simple prediction method.

Decision trees classify data by splitting it along the predictor axes into partitions with homogeneous values of the dependent variable. Unlike logistic or linear regression, CART does not develop a prediction equation. Instead, data are predicted by a series of binary decisions based on the boundaries of the splits. Decision trees are very effective and the resulting rules are readily interpreted.

Trees can be built using different metrics that measure how well the splits divide up the data classes: Gini index, entropy or misclassification error.
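
For reference, the three split metrics can be computed from the class counts in a node. The sketch below uses the standard textbook definitions; the function name `impurities` is an assumption for this example.

```python
import math

def impurities(counts):
    """Return (Gini index, entropy in bits, misclassification error) for class counts."""
    n = sum(counts)
    p = [c / n for c in counts if c > 0]  # class proportions, skipping empty classes
    gini = 1 - sum(q * q for q in p)
    entropy = -sum(q * math.log2(q) for q in p)
    misclass = 1 - max(p)
    return gini, entropy, misclass

mixed = impurities([5, 5])   # evenly mixed two-class node
pure = impurities([10, 0])   # node with a single class
```

All three measures are zero for a pure node and largest for an evenly mixed one, so a good split is one that lowers them in the child nodes.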

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Classification and regression trees. (read)

When the response variable is quantitative and not categorical, regression trees are used. Here, the data are still split but now the response variable is estimated by the average within the split boundaries. Tree growth can be controlled using the complexity parameter, a measure of the relative improvement of each new split.

Individual trees can be very sensitive to minor changes in the data and even better prediction can be achieved by exploiting this variability. Using ensemble methods, we can grow multiple trees from the same data.

Krzywinski, M. & Altman, N. (2017) Points of Significance: Classification and regression trees. Nature Methods 14:757–758.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541-542.

Altman, N. & Krzywinski, M. (2015) Points of Significance: Multiple Linear Regression Nature Methods 12:1103-1104.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603-604.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Model Selection and Overfitting. Nature Methods 13:703-704.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Regularization. Nature Methods 13:803-804.

...more about the Points of Significance column


Personalized Oncogenomics Program 5 Year Anniversary Art

Wed 26-07-2017

The artwork was created in collaboration with my colleagues at the Genome Sciences Centre to celebrate the 5 year anniversary of the Personalized Oncogenomics Program (POG).

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
5 Years of Personalized Oncogenomics Program at Canada's Michael Smith Genome Sciences Centre. The poster shows 545 cancer cases. (left) Cases ordered chronologically by case number. (right) Cases grouped by diagnosis (tissue type) and then by similarity within group.

The Personalized Oncogenomics Program (POG) is a collaborative research study including many BC Cancer Agency oncologists, pathologists and other clinicians along with Canada's Michael Smith Genome Sciences Centre with support from BC Cancer Foundation.

The aim of the program is to sequence, analyze and compare the genome of each patient's cancer (the entire DNA and RNA inside tumor cells) in order to understand what is enabling it and to identify less toxic and more effective treatment options.

Principal component analysis

Thu 06-07-2017
PCA helps you interpret your data, but it will not always find the important patterns.

Principal component analysis (PCA) simplifies the complexity in high-dimensional data by reducing its number of dimensions.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Principal component analysis. (read)

To retain trends and patterns in the reduced representation, PCA finds linear combinations of canonical dimensions that maximize the variance of the projection of the data.

PCA is helpful in visualizing high-dimensional data and scatter plots based on 2-dimensional PCA can reveal clusters.
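
In two dimensions the direction of maximum variance has a closed form, which makes for a compact sketch of what PCA computes. The function name `first_pc_2d` and the toy points are assumptions for this example; real data sets would normally go through a linear algebra library.

```python
import math

def first_pc_2d(points):
    """Return the unit vector along the first principal component of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Major-axis angle of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

# Made-up points lying on the line y = x.
direction = first_pc_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
```

Projecting each centered point onto `direction` gives the one-dimensional representation that keeps the most variance.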

Altman, N. & Krzywinski, M. (2017) Points of Significance: Principal component analysis. Nature Methods 14:641–642.

Background reading

Altman, N. & Krzywinski, M. (2017) Points of Significance: Clustering. Nature Methods 14:545–546.

...more about the Points of Significance column

`k` index: a weightlifting and CrossFit performance measure

Wed 07-06-2017

Similar to the `h` index in publishing, the `k` index is a measure of fitness performance.

To achieve a `k` index for a movement you must perform `k` unbroken reps at `k`% 1RM.

The expected value for the `k` index is probably somewhere in the range of `k = 26` to `k=35`, with higher values progressively more difficult to achieve.

In my `k` index introduction article I provide detailed explanation, rep scheme table and WOD example.


Dark Matter of the English Language—the unwords

Wed 07-06-2017

I've applied the char-rnn recurrent neural network to generate new words, names of drugs and countries.

The effect is intriguing and facetious—yes, those are real words.

But these are not: necronology, abobionalism, gabdologist, and nonerify.

These places only exist in the mind: Conchar and Pobacia, Hzuuland, New Kain, Rabibus and Megee Islands, Sentip and Sitina, Sinistan and Urzenia.

And these are the imaginary afflictions of the imagination: ictophobia, myconomascophobia, and talmatomania.

And these, of the body: ophalosis, icabulosis, mediatopathy and bellotalgia.

Want to name your baby? Or someone else's baby? Try Ginavietta Xilly Anganelel or Ferandulde Hommanloco Kictortick.

When taking new therapeutics, never mix salivac and labromine. And don't forget that abadarone is best taken on an empty stomach.

And nothing increases the chance of getting that grant funded more than proposing the study of a new –ome! We really need someone to look into the femome and manome.

Dark Matter of the Genome—the nullomers

Wed 31-05-2017

An exploration of things that are missing in the human genome. The nullomers.

Julia Herold, Stefan Kurtz and Robert Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics (2008) 9:167

Clustering

Sat 01-07-2017
Clustering finds patterns in data—whether they are there or not.

We've already seen how data can be grouped into classes in our series on classifiers. In this column, we look at how data can be grouped by similarity in an unsupervised way.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Clustering. (read)

We look at two common clustering approaches: `k`-means and hierarchical clustering. All clustering methods share the same approach: they first calculate similarity and then use it to group objects into clusters. The details of the methods, and outputs, vary widely.
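
A minimal sketch of `k`-means in one dimension is shown here, using absolute difference as the (dis)similarity. The function name `kmeans_1d`, the fixed starting centers and the toy values are assumptions for illustration only.

```python
def kmeans_1d(values, centers, n_iter=20):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, clusters = kmeans_1d(values, centers=[0.0, 6.0])
```

Note that the algorithm returns `k` clusters for any input, including data with no real group structure, so the result always needs to be interpreted with care.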

Altman, N. & Krzywinski, M. (2017) Points of Significance: Clustering. Nature Methods 14:545–546.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541-542.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603-604.

...more about the Points of Significance column


What's wrong with pie charts?

Thu 25-05-2017

In this redesign of a pie chart figure from a Nature Medicine article [1], I look at how to organize and present a large number of categories.

I first discuss some of the benefits of a pie chart—there are few and specific—and its shortcomings—there are few but fundamental.

I then walk through the redesign process by showing how the tumor categories can be shown more clearly if they are first aggregated into a small number of groups.

(bottom left) Figure 2b from Zehir et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. (2017) Nature Medicine doi:10.1038/nm.4333

Tabular Data

Tue 11-04-2017
Tabulating the number of objects in categories of interest dates back to the earliest records of commerce and population censuses.

After 30 columns, this is our first one without a single figure. Sometimes a table is all you need.

In this column, we discuss nominal categorical data, in which data points are assigned to categories in which there is no implied order. We introduce one-way and two-way tables and the `\chi^2` and Fisher's exact tests.
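
Both tests can be sketched for a two-way 2×2 table using only the standard library. The function names and the example table are assumptions for this illustration; the `\chi^2` statistic is computed without a continuity correction and the Fisher P value uses the common two-sided convention of summing all tables that are no more probable than the observed one.

```python
from math import comb

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    return sum((observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact P value for [[a, b], [c, d]] with fixed margins."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so that equally probable tables are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Made-up table: rows are groups, columns are outcomes.
stat = chi2_2x2(3, 1, 1, 3)
p_fisher = fisher_exact_2x2(3, 1, 1, 3)
```

With counts this small the expected cell values are all 2, which is the kind of situation where the exact test is usually preferred over the `\chi^2` approximation.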

Altman, N. & Krzywinski, M. (2017) Points of Significance: Tabular data. Nature Methods 14:329–330.

...more about the Points of Significance column

Happy 2017 `\pi` Day—Star Charts, Creatures Once Living and a Poem

Tue 14-03-2017


on a brim of echo,

capsized chamber
drawn into our constellation, and cooling.
—Paolo Marcazzan

Celebrate `\pi` Day (March 14th) with a star chart of the digits. The charts draw 40,000 stars generated from the first 12 million digits.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
12,000,000 digits of `\pi` interpreted as a star catalogue. (details)

The 80 constellations are extinct animals and plants. Here you'll find old friends and new stories. Read about how Desmodus is always trying to escape or how Megalodon terrorizes the poor Tecopa! Most constellations have a story.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Find friends and stories among the 80 constellations of extinct animals and plants. Oh look, a Dodo guarding his eggs! (details)

This year I collaborate with Paolo Marcazzan, a Canadian poet, who contributes a poem, Of Black Body, about space and things we might find and lose there.

Check out art from previous years: 2013 `\pi` Day, 2014 `\pi` Day, 2015 `\pi` Day and 2016 `\pi` Day.


Data in New Dimensions: convergence of art, genomics and bioinformatics

Tue 07-03-2017

Art is science in love.
— E.F. Weisslitz

A behind-the-scenes look at the making of our stereoscopic images, which were on display at the AGBT 2017 Conference in February. The art is a creative collaboration with Becton Dickinson and The Linus Group.

Its creation began with the concept of differences, and my writeup of the creative and design process focuses on storytelling and how the concept of differences is incorporated into the art.

Oh, and this might be a good time to pick up some red-blue 3D glasses.

BD Genomics 3D art exhibit - AGBT 2017 / Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
A stereoscopic image and its interpretive panel of single-cell transcriptomes of blood cells: diseased versus healthy control.

Interpreting P values

Thu 02-03-2017
A P value measures a sample’s compatibility with a hypothesis, not the truth of the hypothesis.

This month we continue our discussion about `P` values and focus on the fact that a `P` value is a probability statement about the observed sample in the context of a hypothesis, not about the hypothesis being tested.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Interpreting P values. (read)

Given that we are always interested in making inferences about hypotheses, we discuss how `P` values can be used to do this by way of the Benjamin-Berger bound, `\bar{B}`, on the Bayes factor, `B`.

Heuristics such as these are valuable in helping to interpret `P` values, though we stress that `P` values vary from sample to sample and hence many sources of evidence need to be examined before drawing scientific conclusions.
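
As a hedged sketch, the bound can be computed with the commonly quoted form `\bar{B} = -1/(e P \ln P)`, which applies for `P < 1/e`. The formula is not given in this post, so treat it as an assumption of this example rather than a quotation from the column; the function name is also hypothetical.

```python
import math

def bayes_factor_bound(p):
    """Upper bound -1/(e p ln p) on the Bayes factor favoring the alternative.

    Assumed form of the bound; defined here only for 0 < p < 1/e.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("bound is defined for 0 < p < 1/e")
    return -1 / (math.e * p * math.log(p))

bound_05 = bayes_factor_bound(0.05)
bound_01 = bayes_factor_bound(0.01)
```

Under this form, `P = 0.05` corresponds to a Bayes factor of at most about 2.5 in favor of the alternative, which is much weaker evidence than the phrase "1 in 20" might suggest.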

Altman, N. & Krzywinski, M. (2017) Points of Significance: Interpreting P values. Nature Methods 14:213–214.

Background reading

Altman, N. & Krzywinski, M. (2017) Points of Significance: P values and the search for significance. Nature Methods 14:3–4.

Krzywinski, M. & Altman, N. (2013) Points of significance: Significance, P values and t–tests. Nature Methods 10:1041–1042.

...more about the Points of Significance column

Snellen Charts—Typography to Really Look at

Sat 18-02-2017

Another collection of typographical posters. These ones really ask you to look.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Snellen charts designed using physical constants, Braille and elemental abundances in the universe and human body.

The charts show a variety of interesting symbols and operators found in science and math. The design is in the style of a Snellen chart and typeset with the Rockwell font.


Essentials of Data Visualization—8-part video series

Fri 17-02-2017

In collaboration with Phil Poronnik and Kim Bell-Anderson at the University of Sydney, I'm delighted to share with you our 8-part video series about thinking about drawing data and communicating science.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Essentials of Data Visualization: Thinking about drawing data and communicating science.

We've created 8 videos, each focusing on a different essential idea in data visualization: encoding, shapes, color, uncertainty, design, drawing missing or unobserved data, labels and process.

The videos were designed as teaching materials. Each video comes with a slide deck and exercises.

P values and the search for significance

Wed 31-05-2017
Little P value
What are you trying to say
Of significance?
—Steve Ziliak

We've written about P values before and warned readers about common misconceptions about them, which are so rife that the American Statistical Association itself has a long statement about them.

This month is our first of a two-part article about P values. Here we look at 'P value hacking' and 'data dredging', which are questionable practices that invalidate the correct interpretation of P values.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: P values and the search for significance. (read)

We also illustrate how P values can lead us astray by asking "What is the smallest P value we can expect if the null hypothesis is true but we have done many tests, either explicitly or implicitly?"
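
A small calculation shows the scale of the problem, assuming the tests are independent and the null hypothesis is true in every one, so that each `P` value is uniformly distributed on (0, 1). The function names are hypothetical and the independence assumption is a simplification.

```python
def prob_any_below(alpha, n_tests):
    """Probability that at least one of n independent null P values is below alpha."""
    return 1 - (1 - alpha) ** n_tests

def expected_min_p(n_tests):
    """Expected smallest of n independent Uniform(0, 1) P values."""
    return 1 / (n_tests + 1)

chance_20 = prob_any_below(0.05, 20)
smallest_20 = expected_min_p(20)
```

With 20 such tests the chance of seeing at least one `P < 0.05` is already well above one half, and the smallest `P` value is expected to be just under 0.05.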

Incidentally, this is our first column in which the standfirst is a haiku.

Altman, N. & Krzywinski, M. (2017) Points of Significance: P values and the search for significance. Nature Methods 14:3–4.

Background reading

Krzywinski, M. & Altman, N. (2013) Points of significance: Significance, P values and t–tests. Nature Methods 10:1041–1042.

...more about the Points of Significance column

Intuitive Design

Thu 03-11-2016

Appeal to intuition when designing with value judgments in mind.

Figure clarity and concision are improved when the selection of shapes and colors is grounded in the Gestalt principles, which describe how we visually perceive and organize information.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
One of the Gestalt principles tells us that the magenta and green shapes will be perceived as two groups, overriding the fact that the shapes within the group might be different. What the principle does not tell us is how the reader is likely to value each group. (read)

The Gestalt principles are value free. For example, they tell us how we group objects but do not speak to any meaning that we might intuitively infer from visual characteristics.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Intuitive Design. (read)

This month, we discuss how appealing to such intuitions—related to shapes, colors and spatial orientation— can help us add information to a figure as well as anticipate and encourage useful interpretations.

Krzywinski, M. (2016) Points of View: Intuitive Design. Nature Methods 13:895.

...more about the Points of View column


Regularization

Fri 04-11-2016

Constraining the magnitude of parameters of a model can control its complexity.

This month we continue our discussion about model selection and evaluation and address how to choose a model that avoids both overfitting and underfitting.

Ideally, we want to avoid having either an underfitted model, which is usually a poor fit to the training data, or an overfitted model, which is a good fit to the training data but not to new data.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Regularization (read)

Regularization is a process that penalizes the magnitude of model parameters. This is done by minimizing not only the SSE, `\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2`, as is done normally in a fit, but the SSE plus a penalty proportional to the sum of the model's squared parameters, `\mathrm{SSE} + \lambda \sum_i \hat{\beta}^2_i`.
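
For a model with a single slope and no intercept, the penalized quantity can be minimized in closed form, which gives a compact sketch of the effect of `\lambda`. The function name and the toy data are assumptions for this example.

```python
def ridge_slope(x, y, lam):
    """Slope b minimizing sum((y_i - b * x_i)^2) + lam * b^2 (no intercept).

    Setting the derivative to zero gives b = sum(x*y) / (sum(x^2) + lam).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Made-up data lying exactly on y = 2x.
x = [1, 2, 3]
y = [2, 4, 6]
slope_unpenalized = ridge_slope(x, y, lam=0)
slope_penalized = ridge_slope(x, y, lam=14)
```

With `lam=0` the ordinary least-squares slope is recovered, and increasing `lam` shrinks the slope toward zero.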

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Regularization. Nature Methods 13:803-804.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Model Selection and Overfitting. Nature Methods 13:703-704.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603-604.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541-542.

...more about the Points of Significance column

Model Selection and Overfitting

Fri 04-11-2016

With four parameters I can fit an elephant and with five I can make him wiggle his trunk. —John von Neumann.

By increasing the complexity of a model, it is easy to make it fit to data perfectly. Does this mean that the model is perfectly suitable? No.

When a model has a relatively large number of parameters, it is likely to be influenced by the noise in the data, which varies across observations, as much as any underlying trend, which remains the same. Such a model is overfitted—it matches training data well but does not generalize to new observations.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Model Selection and Overfitting (read)

We discuss the use of training, validation and testing data sets and how they can be used, with methods such as cross-validation, to avoid overfitting.
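
Cross-validation can be sketched as follows: each point is held out exactly once, the model is fit to the remaining points, and the prediction error on the held-out points is averaged. The function names, the interleaved fold assignment and the two candidate models are assumptions for this example.

```python
def k_fold_mse(x, y, fit, k=5):
    """Mean squared prediction error over k held-out folds.

    `fit(xs, ys)` must return a function that predicts y from x.
    """
    n = len(x)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point forms one fold
        train_x = [x[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        model = fit(train_x, train_y)
        total += sum((y[i] - model(x[i])) ** 2 for i in held_out)
    return total / n

def fit_mean(xs, ys):
    """Model that ignores x and predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return lambda x: my + slope * (x - mx)

# Made-up noiseless linear data.
x = list(range(10))
y = [2 * v + 1 for v in x]
mse_mean = k_fold_mse(x, y, fit_mean)
mse_line = k_fold_mse(x, y, fit_line)
```

Comparing the held-out error of candidate models, rather than their fit to the training data, is what guards against choosing an overfitted model.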

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Model Selection and Overfitting. Nature Methods 13:703-704.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603-604.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541-542.

...more about the Points of Significance column

Classifier Evaluation

Tue 13-09-2016

It is important to understand both what a classification metric expresses and what it hides.

We examine various metrics used to assess the performance of a classifier. We show that a single metric is insufficient to capture performance—for any metric, a variety of scenarios yield the same value.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Classifier Evaluation (read)

We also discuss ROC curves and AUC, and how their interpretation changes based on class balance.
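
To see how one metric can hide very different behavior, here is a sketch that computes several standard metrics from a confusion matrix. The function name and both sets of counts are made up for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Two hypothetical classifiers evaluated on 100 cases each.
balanced = metrics(tp=45, fp=5, fn=5, tn=45)    # classes are 50/50
imbalanced = metrics(tp=5, fp=5, fn=5, tn=85)   # positives are rare
```

Both scenarios have an accuracy of 0.9, but in the imbalanced one only half of the positive calls are correct and only half of the true positives are found.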

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603-604.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541-542.

...more about the Points of Significance column


Happy 2016 `\pi` Approximation, roughly speaking

Sun 24-07-2016

Today is the day and it's hardly an approximation. In fact, `22/7` is 20% more accurate a representation of `\pi` than `3.14`!
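
The 20% figure can be checked by comparing absolute errors; the variable names below are just for this example.

```python
import math

err_22_7 = abs(22 / 7 - math.pi)   # error of the fraction
err_3_14 = abs(3.14 - math.pi)     # error of the two-decimal value
reduction = 1 - err_22_7 / err_3_14
```

Here `reduction` is the fraction by which the error of `22/7` is smaller than that of `3.14`, and it comes out at roughly 0.2.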

Time to celebrate, graphically. This year I do so with perfect packing of circles that embody the approximation.

By warping the circle by 8% along one axis, we can create a shape whose ratio of circumference to diameter, taken as twice the average radius, is 22/7.

If you prefer something more accurate, check out art from previous `\pi` days: 2013 `\pi` Day and 2014 `\pi` Day, 2015 `\pi` Day, and 2016 `\pi` Day.

Logistic Regression

Tue 13-09-2016

Regression can be used on categorical responses to estimate probabilities and to classify.

The next column in our series on regression deals with how to classify categorical data.

We show how linear regression can be used for classification and demonstrate that it can be unreliable in the presence of outliers. Using a logistic regression, which fits a linear model to the log odds, improves robustness.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Logistic regression. (read)

Logistic regression is solved numerically and in most cases, the maximum-likelihood estimates are unique and optimal. However, when the classes are perfectly separable, the numerical approach fails because there is an infinite number of solutions.
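
The failure under perfect separation can be seen directly in the log-likelihood. The sketch below assumes a one-predictor logistic model with a fixed midpoint and made-up, perfectly separated data; the function name is hypothetical.

```python
import math

def log_likelihood(slope, x, y, midpoint=0.0):
    """Log-likelihood of a logistic model with P(y=1) = 1 / (1 + exp(-slope * (x - midpoint)))."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = slope * (xi - midpoint)
        # log P(y=1) = -log(1 + exp(-z)) and log P(y=0) = -log(1 + exp(z))
        total += -math.log1p(math.exp(-z if yi == 1 else z))
    return total

# Made-up data: every x < 0 is class 0 and every x > 0 is class 1.
x = [-2, -1, 1, 2]
y = [0, 0, 1, 1]
ll = [log_likelihood(s, x, y) for s in (1, 10, 100)]
```

The log-likelihood keeps increasing toward zero as the slope grows, so a numerical optimizer never settles on a finite estimate for these data.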

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541-542.

Background reading

Altman, N. & Krzywinski, M. (2016) Points of Significance: Regression diagnostics. Nature Methods 13:385-386.

Altman, N. & Krzywinski, M. (2015) Points of Significance: Multiple Linear Regression Nature Methods 12:1103-1104.

Altman, N. & Krzywinski, M. (2015) Points of significance: Simple Linear Regression Nature Methods 12:999-1000.

...more about the Points of Significance column

Visualizing Clonal Evolution in Cancer

Thu 02-06-2016

Genomic instability is one of the defining characteristics of cancer and within a tumor, which is an ever-evolving population of cells, there are many genomes. Mutations accumulate and propagate to create subpopulations and these groups of cells, called clones, may respond differently to treatment.

It is now possible to sequence individual cells within a tumor to create a profile of genomes. This profile changes with time, both in the kinds of mutation that are found and in their proportion in the overall population.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Ways to present temporal and phylogenetic evolution of clones in cancer. M Krzywinski (2016) Molecular Cell 62:652-656. (read)

Clone evolution diagrams visualize these data. These diagrams can be qualitative, showing only trends, or quantitative, showing temporal and population changes to scale. In this Molecular Cell forum article I provide guidelines for drawing these diagrams, focusing on how to use color and navigational elements, such as grids, to clarify the relationships between clones.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
How to draw clone evolution diagrams better. M Krzywinski (2016) Molecular Cell 62:652-656. (read)

I'd like to thank Maia Smith and Cydney Nielsen for assistance in preparing some of the figures in the paper.

Krzywinski, M. (2016) Visualizing Clonal Evolution in Cancer. Mol Cell 62:652-656.


Binning High-Resolution Data

Wed 01-06-2016

Limitations in print resolution and visual acuity impose limits on data density and detail.

Your printer can print at 1,200 or 2,400 dots per inch. At reading distance, your reader can resolve about 200–300 lines per inch. This large gap—how finely we can print and how well we can see—can create problems when we don't take visual acuity into account.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Binning high-resolution data. (read)

The column provides some guidelines—particularly relevant when showing whole-genome data, where the scale of elements of interest such as genes is below the visual acuity limit—for binning data so that they are represented by elements that can be comfortably discerned.
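
One way to sketch such binning is to choose the number of bins from the figure width and the resolvable line density, then summarize the data in each bin. The function name, the use of the maximum as the summary, and the 6-inch figure are assumptions for this example and are not recommendations taken from the column.

```python
def bin_max(positions, values, span, n_bins):
    """Summarize (position, value) data on [0, span) by the maximum in each of n_bins bins.

    Bins that receive no data are left as None.
    """
    bins = [None] * n_bins
    for pos, val in zip(positions, values):
        i = min(int(pos / span * n_bins), n_bins - 1)  # clamp pos == span into last bin
        if bins[i] is None or val > bins[i]:
            bins[i] = val
    return bins

# Hypothetical figure: 6 inches wide, assuming about 300 resolvable lines per inch.
n_bins_for_figure = 6 * 300

# Tiny made-up example with two bins over positions 0 to 10.
binned = bin_max([0, 1, 2, 3, 8, 9], [1, 5, 2, 3, 7, 4], span=10, n_bins=2)
```

Taking the maximum preserves isolated peaks that an average would dilute; whether that is the right summary depends on what the figure is meant to show.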

Krzywinski, M. (2016) Points of view: Binning high-resolution data. Nature Methods 13:463.

...more about the Points of View column

Regression diagnostics

Wed 11-05-2016

Residual plots can be used to validate assumptions about the regression model.

Continuing with our series on regression, we look at how you can identify issues in your regression model.

The difference between the observed value and the model's predicted value is the residual, `r_i = y_i - \hat{y}_i`, a very useful quantity for identifying the effects of outliers and trends in the data that might suggest your model is inadequate.
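
As a small sketch, the residuals of a straight-line fit can be computed as follows. The function names and the deliberately curved toy data are assumptions for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a straight line."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def residuals(x, y):
    """Observed minus predicted values for the fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Made-up data that follow y = x^2, fitted with a straight line.
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]
res = residuals(x, y)
```

The residuals are positive at both ends and negative in the middle, a U-shaped trend that signals the straight-line model is missing curvature.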

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Regression diagnostics. (read)

We also discuss normal probability plots (or Q-Q plots) and show how these can be used to check that the residuals are normally distributed, which is one of the assumptions of regression (constant variance being another).

Background reading

Altman, N. & Krzywinski, M. (2016) Points of Significance: Analyzing outliers: Influential or nuisance? Nature Methods 13:281-282.

Altman, N. & Krzywinski, M. (2015) Points of Significance: Multiple Linear Regression Nature Methods 12:1103-1104.

Altman, N. & Krzywinski, M. (2015) Points of significance: Simple Linear Regression Nature Methods 12:999-1000.

...more about the Points of Significance column

Analyzing Outliers: Influential or Nuisance?

Fri 08-04-2016

Some outliers influence the regression fit more than others.

This month our column addresses the effect that outliers have on linear regression.

You may be surprised, but not all outliers have the same influence on the fit (e.g. regression slope) or inference (e.g. confidence or prediction intervals). Outliers with large leverage—points that are far from the sample average—can have a very large effect. On the other hand, if the outlier is close to the sample average, it may not influence the regression slope at all.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Analyzing Outliers: Influential or Nuisance? (read)

Quantities such as Cook's distance and the so-called hat matrix, which defines leverage, are useful in assessing the effect of outliers.
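
For simple linear regression with an intercept, the diagonal of the hat matrix has the standard closed form `h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2`. The sketch below computes it for made-up predictor values; the function name is hypothetical and Cook's distance is not shown.

```python
def leverages(x):
    """Diagonal of the hat matrix for simple linear regression with an intercept."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1 / n + (xi - mx) ** 2 / sxx for xi in x]

# Made-up predictor values with one point far from the rest.
h = leverages([1, 2, 3, 4, 10])
```

The point at `x = 10` has by far the largest leverage, while the point at the sample average has the minimum possible value of `1/n`.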

Background reading

Altman, N. & Krzywinski, M. (2015) Points of Significance: Multiple Linear Regression Nature Methods 12:1103-1104.

Altman, N. & Krzywinski, M. (2015) Points of significance: Simple Linear Regression Nature Methods 12:999-1000.

...more about the Points of Significance column


Typographical posters of bird songs

Mon 28-03-2016

Chirp, chirp, chirp but much better looking.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The song of the Northern Flicker, Black-capped Chickadee, Olive-sided Flycatcher and Red-eyed Vireo. Sweet to the eye and ear. (details)

If you like these, check out my other typographical art posters.

Happy 2016 Pi Day—gravity of `\pi`

Mon 14-03-2016

Celebrate `\pi` Day (March 14th) with colliding digits in space. This year, I celebrate the detection of gravitational waves at the LIGO lab and simulate the effect of gravity on masses created from the digits of `\pi`.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
1,000 digits of `\pi` under the influence of gravity. (details)

Some strange things can happen.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
44 digits of `\pi` under the influence of gravity. (details)

The art is featured in the Gravity of Pi article on the Scientific American SA Visual blog.

Check out art from previous years: 2013 `\pi` Day, 2014 `\pi` Day and 2015 `\pi` Day.

Neural Circuit Diagrams

Sun 13-03-2016

Use alignment and consistency to untangle complex circuit diagrams.

This month we apply the ideas presented in our column about drawing pathways to neural circuit diagrams. Neural circuits are networks of cells or regions, typically with a large number of variables, such as cell and neurotransmitter type.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Neural circuit diagrams. (read)

We discuss how to effectively route arrows and how to avoid pitfalls of redundant encoding, and suggest ways to incorporate emphasis in the layout.

Hunnicutt, B.J. & Krzywinski, M. (2016) Points of View: Neural circuit diagrams. Nature Methods 13:189.

Background reading

Hunnicutt, B.J. & Krzywinski, M. (2016) Points of View: Pathways. Nature Methods 13:5.

Wong, B. (2010) Points of View: Gestalt principles (part 1). Nature Methods 7:863.

Wong, B. (2010) Points of View: Gestalt principles (part 2). Nature Methods 7:941.

...more about the Points of View column


Pathways

Mon 04-01-2016

Apply visual grouping principles to add clarity to information flow in pathway diagrams.

We draw on the Gestalt principles of connection, grouping and enclosure to construct practical guidelines for drawing pathways with a clear layout that maintains hierarchy.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Pathways. (read)

We include tips about how to use negative space and align nodes to emphasize groups and how to effectively draw curved arrows to clearly show paths.

Hunnicutt, B.J. & Krzywinski, M. (2016) Points of View: Pathways. Nature Methods 13:5.

Background reading

Wong, B. (2010) Points of View: Gestalt principles (part 1). Nature Methods 7:863.

Wong, B. (2010) Points of View: Gestalt principles (part 2). Nature Methods 7:941.

...more about the Points of View column

Multiple Linear Regression

Mon 04-01-2016

When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.

This month we continue with the topic of regression and expand the discussion of simple linear regression to include more than one variable. As it turns out, although the analysis and presentation of results builds naturally on the case with a single variable, the interpretation of the results is confounded by the presence of correlation between the variables.

By extending the example of the relationship of weight and height—we now include jump height as a second variable that influences weight—we show that the regression coefficient estimates can be very inaccurate and even have the wrong sign when the predictors are correlated and only one is considered in the model.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Multiple Linear Regression. (read)

Care must be taken! Accurate prediction of the response is not an indication that regression slopes reflect the true relationship between the predictors and the response.
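
The sign flip described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses made-up, noise-free numbers (not the data from the column): weight is constructed to rise with height and fall with jump height, and jump height is correlated with height. All variable names and values are assumptions chosen for illustration.

```python
# Hypothetical, noise-free data: weight depends on height (+) and jump height (-).
height = [160, 165, 170, 175, 180, 185]   # cm
jump = [40, 44, 43, 49, 50, 54]           # cm, positively correlated with height
TRUE_B_HEIGHT, TRUE_B_JUMP = 0.7, -0.2
weight = [-60 + TRUE_B_HEIGHT * h + TRUE_B_JUMP * j for h, j in zip(height, jump)]


def mean(v):
    return sum(v) / len(v)


def cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Simple regression of weight on jump height alone.
slope_jump_only = cross(jump, weight) / cross(jump, jump)

# Two-predictor regression: solve the 2x2 normal equations on centered data.
s_hh, s_jj, s_hj = cross(height, height), cross(jump, jump), cross(height, jump)
s_hw, s_jw = cross(height, weight), cross(jump, weight)
det = s_hh * s_jj - s_hj ** 2
b_height = (s_hw * s_jj - s_jw * s_hj) / det
b_jump = (s_jw * s_hh - s_hw * s_hj) / det

print(f"jump slope, jump only:        {slope_jump_only:+.3f}")
print(f"jump slope, height and jump:  {b_jump:+.3f}")
print(f"height slope, height and jump: {b_height:+.3f}")
```

With only jump height in the model, its slope comes out positive because jump height acts as a stand-in for height. With both predictors in the model, the constructed slopes of 0.7 and -0.2 are recovered.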

Altman, N. & Krzywinski, M. (2015) Points of Significance: Multiple Linear Regression Nature Methods 12:1103-1104.

Background reading

Altman, N. & Krzywinski, M. (2015) Points of significance: Simple Linear Regression Nature Methods 12:999-1000.

...more about the Points of Significance column

Circos and Hive Workshop—Poznan, Poland

Sun 13-12-2015

Taught how Circos and hive plots can be used to show sequence relationships at Biotalent Functional Annotation of Genome Sequences Workshop at the Institute for Plant Genetics in Poznan, Poland.

Students generated images published in Fast Diploidization in Close Mesopolyploid Relatives of Arabidopsis.

Workshop materials: slides, handout, Circos and hive plot files.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Drawing synteny between modern and ancient genomes with Circos.

Students also learned how to use hive plots to show synteny.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Hive plots are great at showing 3-way sequence comparisons. Here three modern species of Australian Brassicaceae (S. nutans, S. lineare, B. antipoda) are compared based on their common relationships to the ancestral karyotype.

Mandakova, T. et al. Fast Diploidization in Close Mesopolyploid Relatives of Arabidopsis The Plant Cell, Vol. 22: 2277-2290, July 2010


Play the Bacteria Game

Mon 14-12-2015

Choose your own dust adventure!

Nobody likes dusting but everyone should find dust interesting.

Working with Jeannie Hunnicutt and with Jen Christiansen's art direction, I created this month's Scientific American Graphic Science visualization based on a recent paper The Ecology of microscopic life in household dust.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
An analysis of dust reveals how the presence of men, women, dogs and cats affects the variety of bacteria in a household. Appears on Graphic Science page in December 2015 issue of Scientific American.

We have also written about the making of the graphic, for those interested in how these things come together.

This was my third information graphic for the Graphic Science page. Unlike the previous ones, it's visually simple and ... interactive. Or, at least, as interactive as a printed page can be.

More of my Scientific American Graphic Science designs

Barberan A et al. (2015) The ecology of microscopic life in household dust. Proc. R. Soc. B 282: 20151139.

Names for 5,092 colors

Tue 03-11-2015

A very large list of named colors generated from combining some of the many lists that already exist (X11, Crayola, Raveling, Resene, wikipedia, xkcd, etc).

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Confused? So am I. That's why I made a list.

For each color, coordinates in RGB, HSV, XYZ, Lab and LCH space are given along with the 5 nearest, as measured with ΔE, named neighbours.
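
As a rough illustration of a nearest-named-neighbour lookup, here is a minimal sketch. It assumes the simplest ΔE definition (ΔE76, the Euclidean distance in Lab space) and a tiny hypothetical name table. The post does not say which ΔE formula the list uses, and the real list has thousands of entries.

```python
import math


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIE Lab (D65 white point)."""
    def linear(c):
        c /= 255
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(c) for c in rgb)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b

    def f(t):
        d = 6 / 29
        return t ** (1 / 3) if t > d ** 3 else t / (3 * d * d) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e(rgb1, rgb2):
    """Delta E 1976: Euclidean distance between two colors in Lab space."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))


# Hypothetical miniature name table (the real list is far larger).
NAMED = {
    "red": (255, 0, 0),
    "lime": (0, 255, 0),
    "blue": (0, 0, 255),
    "orange": (255, 165, 0),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}


def nearest(rgb, k=5):
    """Names of the k named colors closest to rgb, nearest first."""
    return sorted(NAMED, key=lambda name: delta_e(rgb, NAMED[name]))[:k]


print(nearest((250, 10, 10), k=3))
```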

I also provide a web service. Simply call this URL with an RGB string.

Simple Linear Regression

Sat 07-11-2015

It is possible to predict the values of unsampled data by using linear regression on correlated sample data.

This month, we begin our column with a quote, shown here in its full context from Box's paper Science and Statistics.

In applying mathematics to subjects such as physics or statistics we make tentative assumptions about the real world which we know are false but which we believe may be useful nonetheless. The physicist knows that particles have mass and yet certain results, approximating what really happens, may be derived from the assumption that they do not. Equally, the statistician knows, for example, that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.
Box, G. J. Am. Stat. Assoc. 71, 791–799 (1976).

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Simple Linear Regression. (read)

This column is our first in the series about regression. We show that regression and correlation are related concepts—they both quantify trends—and that the calculations for simple linear regression are essentially the same as for one-way ANOVA.

While correlation provides a measure of a specific kind of association between variables, regression allows us to fit correlated sample data to a model, which can be used to predict the values of unsampled data.
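
The link between correlation and regression can be checked directly. The sketch below fits a least-squares line to made-up data (the values are assumptions, not taken from the column), predicts an unsampled point, and confirms that the slope equals the correlation coefficient scaled by the ratio of the spreads.

```python
import math

# Hypothetical sample data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

# Least-squares line y = intercept + slope * x.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Pearson correlation coefficient of the same data.
r = s_xy / math.sqrt(s_xx * s_yy)

# The slope is the correlation rescaled by the spread of y relative to x.
slope_from_r = r * math.sqrt(s_yy / s_xx)


def predict(new_x):
    """Predicted response at an unsampled value of x."""
    return intercept + slope * new_x


print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.4f} y(6)={predict(6):.2f}")
```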

Altman, N. & Krzywinski, M. (2015) Points of Significance: Simple Linear Regression Nature Methods 12:999-1000.

Background reading

Altman, N. & Krzywinski, M. (2015) Points of significance: Association, correlation and causation Nature Methods 12:899-900.

Krzywinski, M. & Altman, N. (2014) Points of significance: Analysis of variance (ANOVA) and blocking. Nature Methods 11:699-700.

...more about the Points of Significance column


Association, correlation and causation

Sat 07-11-2015

Correlation implies association, but not causation. Conversely, causation implies association, but not correlation.

This month, we distinguish between association, correlation and causation.

Association, also called dependence, is a very general relationship: one variable provides information about the other. Correlation, on the other hand, is a specific kind of association: an increasing or decreasing trend. Not all associations are correlations. Moreover, causality can be connected only to association.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Association, correlation and causation. (read)

We discuss how correlation can be quantified using correlation coefficients (Pearson, Spearman) and show how spurious correlations can arise in random data as well as very large independent data sets. For example, per capita cheese consumption is correlated with the number of people who died by becoming tangled in bedsheets.
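
To make the distinction concrete, here is a small sketch with constructed data (an illustration, not an example from the column). A parabola is a perfect association with zero Pearson correlation, while a cubic is monotonic, so its Spearman coefficient is 1 even though its Pearson coefficient is below 1. The rank helper assumes there are no tied values.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    out = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def spearman(u, v):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(u), ranks(v))


x = [-3, -2, -1, 0, 1, 2, 3]
parabola = [xi ** 2 for xi in x]   # y is fully determined by x, yet has no trend
cubic = [xi ** 3 for xi in x]      # monotonic but not linear

print("parabola:", pearson(x, parabola), spearman(x, parabola))
print("cubic:   ", pearson(x, cubic), spearman(x, cubic))
```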

Altman, N. & Krzywinski, M. (2015) Points of Significance: Association, correlation and causation Nature Methods 12:899-900.

...more about the Points of Significance column

Bayesian networks

Thu 01-10-2015

For making probabilistic inferences, a graph is worth a thousand words.

This month we continue with the theme of Bayesian statistics and look at Bayesian networks, which combine network analysis with Bayesian statistics.

In a Bayesian network, nodes represent entities, such as genes, and the influence that one gene has over another is represented by an edge and probability table (or function). Bayes' Theorem is used to calculate the probability of a state for any entity.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Bayesian networks. (read)

In our previous columns about Bayesian statistics, we saw how new information (likelihood) can be incorporated into the probability model (prior) to update our belief of the state of the system (posterior). In the context of a Bayesian network, relationships called conditional dependencies can arise between nodes when information is added to the network. Using a small gene regulation network we show how these dependencies may connect nodes along different paths.
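
A conditional dependency of this kind can be seen in a three-node toy network. The sketch below is an assumption-laden illustration, not the network from the column: hypothetical genes A and B are independently active and both feed into gene C, with made-up probabilities. Observing C makes A and B dependent, and additionally observing B "explains away" A.

```python
from itertools import product

# Hypothetical network A -> C <- B with made-up probabilities.
P_A = 0.1                      # P(A active)
P_B = 0.1                      # P(B active)
P_C_GIVEN = {                  # P(C active | A, B): C is active if A or B is
    (0, 0): 0.0,
    (0, 1): 1.0,
    (1, 0): 1.0,
    (1, 1): 1.0,
}


def joint(a, b, c):
    """Joint probability of one full assignment of (A, B, C)."""
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc_active = P_C_GIVEN[(a, b)]
    pc = pc_active if c else 1 - pc_active
    return pa * pb * pc


def conditional(target, evidence):
    """P(target | evidence) by enumerating all states; both are dicts like {"A": 1}."""
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        state = {"A": a, "B": b, "C": c}
        if all(state[k] == v for k, v in evidence.items()):
            p = joint(a, b, c)
            den += p
            if all(state[k] == v for k, v in target.items()):
                num += p
    return num / den


p_a = conditional({"A": 1}, {})
p_a_given_c = conditional({"A": 1}, {"C": 1})
p_a_given_c_and_b = conditional({"A": 1}, {"C": 1, "B": 1})
print(p_a, p_a_given_c, p_a_given_c_and_b)
```

Knowing that C is active raises the probability of A from 0.1 to about 0.53. Learning that B is also active returns it to 0.1, because B alone accounts for C.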

Background reading

Puga, J.L, Krzywinski, M. & Altman, N. (2015) Points of Significance: Bayesian Statistics Nature Methods 12:377-378.

Puga, J.L, Krzywinski, M. & Altman, N. (2015) Points of Significance: Bayes' Theorem Nature Methods 12:277-278.

...more about the Points of Significance column

Unentangling complex plots

Fri 10-07-2015

The Points of Significance column is on vacation this month.

Meanwhile, we're showing you how to manage small multiple plots in the Points of View column Unentangling Complex Plots.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Unentangling complex plots. (download, more about Points of View)

Data in small multiples can vary in range, noise level and trend. Gregor McInerny and I show how you can deal with this by cropping and scaling the multiples to a different range to emphasize relative changes while preserving the context of the full data range to show absolute changes.

McInerny, G. & Krzywinski, M. (2015) Points of View: Unentangling complex plots. Nature Methods 12:591.

...more about the Points of View column


Fixing Jurassic World science visualizations

Fri 10-07-2015

The Jurassic World Creation Lab webpage shows you how one might create a dinosaur from a sample of DNA. First extract, sequence, assemble and fill in the gaps in the DNA and then incubate in an egg and wait.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
We can't get dinosaur genomics right, but we can get it less wrong. (a) Corn genome used in Jurassic World Creation Lab website. Image is from the Science publication B73 Maize Genome: Complexity, Diversity, and Dynamics. Photo and composite by Universal Studios and Amblin Entertainment. (b) Random data on 8 chromosomes from chicken genome resized to triceratops genome size (3.2 Gb). Image by Martin Krzywinski. (c) Actual genome data for lizard genome, UCSC anoCar2.0, May 2010. Image by Martin Krzywinski. Triceratops outline in (b,c) from wikipedia.

With enough time, you'll grow your own brand new dinosaur. Or a stalk of corn ... with more teeth.

What went wrong? Let me explain.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Corn World: Teeth on the Cob.

Printing Genomes

Tue 07-07-2015

You've seen bound volumes of printouts of the human reference genome. But what if at the Genome Sciences Center we wanted to print everything we sequence today?

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Curiously, printing is 44 times as expensive as sequencing. (details)

Gene Volume Control

Thu 11-06-2015

I was commissioned by Scientific American to create an information graphic based on Figure 9 in the landmark Nature Integrative analysis of 111 reference human epigenomes paper.

The original figure details the relationships between more than 100 sequenced epigenomes and genetic traits, including diseases like Crohn's and Alzheimer's. These relationships were shown as a heatmap in which the epigenome-trait cell depicted the P value associated with tissue-specific H3K4me1 epigenetic modification in regions of the genome associated with the trait.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Figure 9 from Integrative analysis of 111 reference human epigenomes (Nature (2015) 518 317–330). (details)

As much as I distrust network diagrams, in this case this was the right way to show the data. The network was meticulously laid out by hand to draw attention to the layered groups of diseases and traits.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Network diagram redesign of the heatmap for a select set of traits. Only relationships with –log P > 3.9 are displayed. Appears on Graphic Science page in June 2015 issue of Scientific American. (details)

This was my second information graphic for the Graphic Science page. Last year, I illustrated the extent of differences in the gene sequence of humans, Denisovans, chimps and gorillas.


Sampling distributions and the bootstrap

Thu 11-06-2015

The bootstrap is a computational method that simulates new samples from observed data. These simulated samples can be used to determine how estimates from replicate experiments might be distributed and answer questions about precision and bias.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Sampling distributions and the bootstrap. (read)

We discuss both parametric and non-parametric bootstrap. In the former, observed data are fit to a model and then new samples are drawn using the model. In the latter, no model assumption is made and simulated samples are drawn with replacement from the observed data.
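
Both flavours of the bootstrap fit in a few lines. The sketch below uses a made-up sample and, for the parametric version, assumes a normal model. The seed, sample values and number of replicates are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the sketch is repeatable

# Hypothetical observed sample (made-up values).
data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 6.3, 5.0]
n = len(data)
N_BOOT = 1000

# Non-parametric bootstrap: resample the observed values with replacement.
nonparam_means = [statistics.mean(rng.choices(data, k=n)) for _ in range(N_BOOT)]

# Parametric bootstrap: fit a model (here a normal distribution, an assumption)
# and draw each new sample from the fitted model.
mu, sigma = statistics.mean(data), statistics.stdev(data)
param_means = [
    statistics.mean([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(N_BOOT)
]

# The spread of the simulated means estimates the standard error of the mean.
se_nonparam = statistics.stdev(nonparam_means)
se_param = statistics.stdev(param_means)
se_formula = sigma / n ** 0.5   # textbook standard error, for comparison

print(f"SE non-parametric: {se_nonparam:.3f}")
print(f"SE parametric:     {se_param:.3f}")
print(f"SE formula:        {se_formula:.3f}")
```

The same loop works for statistics that have no simple standard-error formula, such as the median. Replace `statistics.mean` with the estimator of interest.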

Kulesa, A., Krzywinski, M., Blainey, P. & Altman, N (2015) Points of Significance: Sampling distributions and the bootstrap Nature Methods 12:477-478.

Background reading

Krzywinski, M. & Altman, N. (2013) Points of Significance: Importance of being uncertain. Nature Methods 10:809-810.

...more about the Points of Significance column

Bayesian statistics

Sat 07-11-2015

Building on last month's column about Bayes' Theorem, we introduce Bayesian inference and contrast it to frequentist inference.

Given a hypothesis and a model, the frequentist calculates the probability of different data generated by the model, P(data|model). When the probability of obtaining the observed data from the model is small (e.g. below `alpha` = 0.05), the frequentist rejects the hypothesis.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Bayesian Statistics. (read)

In contrast, the Bayesian makes direct probability statements about the model by calculating P(model|data). In other words, given the observed data, the probability that the model is correct. With this approach it is possible to relate the probability of different models to identify one that is most compatible with the data.

The Bayesian approach is actually more intuitive. From the frequentist point of view, the probability used to assess the veracity of a hypothesis, P(data|model), commonly referred to as the P value, does not help us determine the probability that the model is correct. In fact, the P value is commonly misinterpreted as the probability that the hypothesis is right. This is the so-called "prosecutor's fallacy", which confuses the two conditional probabilities P(data|model) and P(model|data). It is the latter quantity that is more directly useful and calculated by the Bayesian.

Puga, J.L, Krzywinski, M. & Altman, N. (2015) Points of Significance: Bayesian Statistics Nature Methods 12:377-378.
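
The two quantities can be computed side by side for a toy problem. In this sketch, which is an assumption-based illustration rather than an example from the column, a coin is tossed 10 times and lands heads 8 times. Two hypothetical models are compared: a fair coin and a coin biased towards heads with probability 0.75, with equal prior probability.

```python
from math import comb

N_TOSSES, N_HEADS = 10, 8
MODELS = {"fair": 0.5, "biased": 0.75}   # P(heads) under each hypothetical model
PRIOR = {"fair": 0.5, "biased": 0.5}     # equal prior belief in each model


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Frequentist: probability of data at least this extreme under the fair model
# (a one-sided P value).
p_value_fair = sum(
    binom_pmf(k, N_TOSSES, MODELS["fair"]) for k in range(N_HEADS, N_TOSSES + 1)
)

# Bayesian: P(model | data) from P(data | model) and the prior.
likelihood = {m: binom_pmf(N_HEADS, N_TOSSES, p) for m, p in MODELS.items()}
evidence = sum(likelihood[m] * PRIOR[m] for m in MODELS)
posterior = {m: likelihood[m] * PRIOR[m] / evidence for m in MODELS}

print(f"P value under fair model: {p_value_fair:.4f}")
print(f"posterior: {posterior}")
```

The P value is about 0.055 and says nothing directly about the probability that the coin is fair. The posterior does: under these assumptions the fair model is left with a probability of about 0.13.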

Background reading

Puga, J.L, Krzywinski, M. & Altman, N. (2015) Points of Significance: Bayes' Theorem Nature Methods 12:277-278.

...more about the Points of Significance column

Bayes' Theorem

Wed 22-04-2015

In our first column on Bayesian statistics, we introduce conditional probabilities and Bayes' theorem

P(B|A) = P(A|B) × P(B) / P(A)

This relationship between conditional probabilities P(B|A) and P(A|B) is central in Bayesian statistics. We illustrate how Bayes' theorem can be used to quickly calculate useful probabilities that are more difficult to conceptualize within a frequentist framework.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Bayes' Theorem. (read)

Using Bayes' theorem, we can incorporate our beliefs and prior experience about a system and update it when data are collected.
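
A standard illustration of the formula is a diagnostic test. In the sketch below, B is "has the disease" and A is "tests positive". The prevalence and error rates are hypothetical numbers picked for illustration.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a


# Hypothetical numbers.
P_DISEASE = 0.01             # prior, P(B)
P_POS_GIVEN_DISEASE = 0.95   # P(A|B)
P_POS_GIVEN_HEALTHY = 0.05   # false positive rate

# Total probability of a positive test, P(A).
p_pos = P_POS_GIVEN_DISEASE * P_DISEASE + P_POS_GIVEN_HEALTHY * (1 - P_DISEASE)

# Posterior probability of disease given a positive test, P(B|A).
p_disease_given_pos = bayes(P_POS_GIVEN_DISEASE, P_DISEASE, p_pos)
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")
```

Even with an accurate test, the low prior keeps the posterior at about 0.16. The posterior can then serve as the prior when a second test result arrives.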

Puga, J.L, Krzywinski, M. & Altman, N. (2015) Points of Significance: Bayes' Theorem Nature Methods 12:277-278.

Background reading

Oldford, R.W. & Cherry, W.H. Picturing probability: the poverty of Venn diagrams, the richness of eikosograms. (University of Waterloo, 2006)

...more about the Points of Significance column


Happy 2015 Pi Day—can you see `pi` through the treemap?

Sat 14-03-2015

Celebrate `pi` Day (March 14th) by splitting its digits endlessly. This year I use a treemap approach to encode the digits in the style of Piet Mondrian.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Digits of `pi`, `phi` and `e`. (details)

The art has been featured in Ana Swanson's Wonkblog article at the Washington Post—10 Stunning Images Show The Beauty Hidden in `pi`.

I also have art from 2013 `pi` Day and 2014 `pi` Day.

Split Plot Design

Tue 03-03-2015

The split plot design originated in agriculture, where applying some factors on a small scale is more difficult than others. For example, it's harder to cost-effectively irrigate a small piece of land than a large one. These differences are also present in biological experiments. For example, temperature and housing conditions are easier to vary for groups of animals than for individuals.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Split plot design. (read)

The split plot design is an expansion on the concept of blocking—all split plot designs include at least one randomized complete block design. The split plot design is also useful for cases where one wants to increase the sensitivity in one factor (sub-plot) more than another (whole plot).

Altman, N. & Krzywinski, M. (2015) Points of Significance: Split Plot Design Nature Methods 12:165-166.

Background reading

1. Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

2. Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of variance (ANOVA) and blocking Nature Methods 11:699-700.

3. Blainey, P., Krzywinski, M. & Altman, N. (2014) Points of Significance: Replication Nature Methods 11:879-880.

...more about the Points of Significance column

Color palettes for color blindness

Tue 03-03-2015

In an audience of 8 men and 8 women, chances are 50% that at least one has some degree of color blindness. When encoding information or designing content, use colors that are color-blind safe.
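
The 50% figure is easy to check. The sketch below assumes commonly quoted prevalence values of about 8% for men and 0.5% for women. These rates are an assumption of the sketch and are not stated in the post.

```python
# Assumed prevalence of some degree of color blindness.
P_MEN = 0.08
P_WOMEN = 0.005


def p_at_least_one(n_men, n_women):
    """Probability that at least one person in the audience is color blind."""
    p_none = (1 - P_MEN) ** n_men * (1 - P_WOMEN) ** n_women
    return 1 - p_none


p_audience = p_at_least_one(8, 8)
print(f"P(at least one of 8 men and 8 women) = {p_audience:.3f}")
```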

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
A 12-color palette safe for color blindness

Points of Significance Column Now Open Access

Tue 10-02-2015

Nature Methods has announced the launch of a new statistics collection for biologists.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column is now open access. (column archive)

As part of that collection, the journal announced that the entire Points of Significance collection is now open access.

This is great news for educators—the column can now be freely distributed in classrooms.

...more about the Points of Significance column

Before and After—Designing Tiny Figures for Nature Methods

Tue 13-01-2015

I've posted a writeup about the design and redesign process behind the figures in our Nature Methods Points of Significance column.

I have selected several figures from our past columns and show how they evolved from their draft to published versions.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Fig 2 from Points of Significance: Nested designs. (Krzywinski, M. & Altman, N. (2014) Nature Methods 11:977-978.) (...more)

Clarity, concision and space constraints (we have only 3.4" of horizontal space) all have to be balanced for a figure to be effective.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Fig 2c (excerpt) from Points of Significance: Designing comparative experiments. (Krzywinski, M. & Altman, N. (2014) Nature Methods 11:597-598.) (...more)

It's nearly impossible to find case studies of scientific articles (or figures) through the editing and review process. Nobody wants to show their drafts. With this writeup I hope to add to this space and encourage others to reveal their process. Students love this. See whether you agree with my decisions!

Sources of Variation

Thu 08-01-2015

Past columns have described experimental designs that mitigate the effect of variation: random assignment, blocking and replication.

The goal of these designs is to observe a reproducible effect that can be due only to the treatment, avoiding confounding and bias. At the same time, we need to sample enough variability to estimate how much we expect the effect to differ if the measurements are repeated with similar but not identical samples (replicates).

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Sources of Variation. (read)

We need to distinguish between sources of variation that are nuisance factors in our goal to measure mean biological effects from those that are required to assess how much effects vary in the population.

Altman, N. & Krzywinski, M. (2015) Points of Significance: Sources of Variation Nature Methods 12:5-6.

Background reading

1. Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

2. Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of variance (ANOVA) and blocking Nature Methods 11:699-700.

3. Blainey, P., Krzywinski, M. & Altman, N. (2014) Points of Significance: Replication Nature Methods 11:879-880.

...more about the Points of Significance column


Two Factor Designs

Tue 09-12-2014

We've previously written about how to analyze the impact of one variable in our ANOVA column. Complex biological systems are rarely so obliging: multiple experimental factors interact to produce effects.

ANOVA is a natural way to analyze multiple factors. It can incorporate the possibility that the factors interact—the effect of one factor depends on the level of another factor. For example, the potency of a drug may depend on the subject's diet.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Two Factor Designs. (read)

We can increase the power of the analysis by allowing for interaction, as well as by blocking.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Two Factor Designs Nature Methods 11:1187-1188.

Background reading

Blainey, P., Krzywinski, M. & Altman, N. (2014) Points of Significance: Replication Nature Methods 11:879-880.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of variance (ANOVA) and blocking Nature Methods 11:699-700.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

...more about the Points of Significance column

Nested Designs—Assessing Sources of Noise

Mon 29-09-2014

Sources of noise in experiments can be mitigated and assessed by nested designs. This kind of experimental design naturally models replication, which was the topic of last month's column.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Nested designs. (read)

Nested designs are appropriate when we want to use the data derived from experimental subjects to make general statements about populations. In this case, the subjects are random factors in the experiment, in contrast to fixed factors, such as we've seen previously.

In ANOVA analysis, random factors provide information about the amount of noise contributed by each factor. This is different from inferences made about fixed factors, which typically deal with a change in mean. Using the F-test, we can determine whether each layer of replication (e.g. animal, tissue, cell) contributes additional variation to the overall measurement.

Krzywinski, M., Altman, N. & Blainey, P. (2014) Points of Significance: Nested designs Nature Methods 11:977-978.

Background reading

Blainey, P., Krzywinski, M. & Altman, N. (2014) Points of Significance: Replication Nature Methods 11:879-880.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of variance (ANOVA) and blocking Nature Methods 11:699-700.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

...more about the Points of Significance column

Replication—Quality over Quantity

Tue 02-09-2014

It's fitting that the column published just before Labor day weekend is all about how to best allocate labor.

Replication is used to decrease the impact of variability from parts of the experiment that contribute noise. For example, we might measure data from more than one mouse to attempt to generalize over all mice.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Replication. (read)

It's important to distinguish technical replicates, which attempt to capture the noise in our measuring apparatus, from biological replicates, which capture biological variation. The former give us no information about biological variation and cannot be used to directly make biological inferences. To do so is to commit pseudoreplication. Technical replicates are useful to reduce the noise so that we have a better chance to detect a biologically meaningful signal.

Blainey, P., Krzywinski, M. & Altman, N. (2014) Points of Significance: Replication Nature Methods 11:879-880.

Background reading

Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of variance (ANOVA) and blocking Nature Methods 11:699-700.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

...more about the Points of Significance column


Monkeys on a Hilbert Curve—Scientific American Graphic

Tue 19-08-2014

I was commissioned by Scientific American to create an information graphic that showed how our genomes are more similar to those of the chimp and bonobo than to the gorilla.

I had about 5 x 5 inches of print space to work with. For 4 genomes? No problem. Bring out the Hilbert curve!

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Our genomes are much more similar to the chimp and bonobo than to the gorilla. And, we're practically still Denisovans. (details)

To accompany the piece, I will be posting to the Scientific American blog about the process of creating the figure. And to emphasize that the genome is not a blueprint!

As part of this project, I created some Hilbert curve art pieces. And while exploring, found thousands of Hilbertonians!

Happy Pi Approximation Day— π, roughly speaking 10,000 times

Wed 13-08-2014

Celebrate Pi Approximation Day (July 22nd) with the art of arm waving. This year I take the first 10,000 most accurate approximations (m/n, m=1..10,000) and look at their accuracy.
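
One way to generate such a list is sketched below. It assumes that, for each numerator m, the approximation of interest is the fraction m/n whose denominator n brings it closest to π. This is my reading of "m/n, m=1..10,000" and not necessarily how the artwork was computed. The sketch also tracks which fractions set a new accuracy record.

```python
import math

M_MAX = 10_000

approximations = []   # (m, n, error): best denominator n for each numerator m
records = []          # (m, n) pairs that are more accurate than every earlier one
best_error = math.inf

for m in range(1, M_MAX + 1):
    # The best denominator is one of the two integers that bracket m / pi.
    lo = max(1, math.floor(m / math.pi))
    n = min((lo, lo + 1), key=lambda d: abs(m / d - math.pi))
    error = abs(m / n - math.pi)
    approximations.append((m, n, error))
    if error < best_error:
        best_error = error
        records.append((m, n))

print("record-setting approximations:", records)
```

The familiar 22/7 and 355/113 both appear among the record setters. No numerator up to 10,000 improves on 355/113.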

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Accuracy of the first 10,000 m/n approximations of Pi. (details)

I turned to the spiral again after applying it to stacked ring plots of frequency distributions in Pi for the 2014 Pi Day.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Frequency distribution of digits of Pi in groups of 4 up to digit 4,988. (details)

Analysis of Variance (ANOVA) and Blocking—Accounting for Variability in Multi-factor Experiments

Mon 07-07-2014

Our 10th Points of Significance column! Continuing with our previous discussion about comparative experiments, we introduce ANOVA and blocking. Although this column appears to introduce two new concepts (ANOVA and blocking), you've seen both before, though under a different guise.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Analysis of variance (ANOVA) and blocking. (read)

If you know the t-test you've already applied analysis of variance (ANOVA), though you probably didn't realize it. In ANOVA we ask whether the variation within our samples is compatible with the variation between our samples (sample means). If the samples don't all have the same mean then we expect the latter to be larger. The ANOVA test statistic (F) assigns significance to the ratio of these two quantities. When we only have two samples and apply the t-test, `t^2 = F`.
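
The identity between the squared t statistic and F for two samples can be verified numerically. The sketch below uses made-up samples and the pooled-variance (equal variance) form of the two-sample t-test.

```python
import math

# Hypothetical samples.
a = [4.1, 5.0, 5.9, 6.2]
b = [6.8, 7.4, 8.1, 9.0]


def mean(v):
    return sum(v) / len(v)


def sum_sq(v):
    """Sum of squared deviations from the sample mean."""
    m = mean(v)
    return sum((x - m) ** 2 for x in v)


n1, n2 = len(a), len(b)
df_within = n1 + n2 - 2

# Two-sample t statistic with pooled variance.
pooled_var = (sum_sq(a) + sum_sq(b)) / df_within
t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA F statistic: between-sample over within-sample mean squares.
grand_mean = mean(a + b)
ss_between = n1 * (mean(a) - grand_mean) ** 2 + n2 * (mean(b) - grand_mean) ** 2
ss_within = sum_sq(a) + sum_sq(b)
f_stat = (ss_between / 1) / (ss_within / df_within)   # 1 = (2 groups - 1)

print(f"t^2 = {t ** 2:.6f}, F = {f_stat:.6f}")
```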

ANOVA naturally incorporates and partitions sources of variation—the effects of variables on the system are determined based on the amount of variation they contribute to the total variation in the data. If this contribution is large, we say that the variation can be "explained" by the variable and infer an effect.

We discuss how data collection can be organized using a randomized complete block design to account for sources of uncertainty in the experiment. This process is called blocking because we are blocking the variation from a known source of uncertainty from interfering with our measurements. You've already seen blocking in the paired t-test example, in which the subject (or experimental unit) was the block.

We've worked hard to bring you 20 pages of statistics primers (though it feels more like 200!). The column is taking a month off in August, as we shrink our error bars.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of Variance (ANOVA) and Blocking Nature Methods 11:699-700.

Background reading

Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Comparing Samples — Part I — t-tests Nature Methods 11:215-216.

Krzywinski, M. & Altman, N. (2013) Points of Significance: Significance, P values and t-tests Nature Methods 10:1041-1042.

...more about the Points of Significance column


Designing Experiments—Coping with Biological and Experimental Variation

Thu 29-05-2014

This month, Points of Significance begins a series of articles about experimental design. We start by returning to the two-sample and paired t-tests for a discussion of biological and experimental variability.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Designing Comparative Experiments. (read)

We introduce the concept of blocking using the paired t-test as an example and show how biological and experimental variability can be related using the correlation coefficient, ρ, and how its value impacts the relative performance of the paired and two-sample t-tests.

We also emphasize that when reporting data analyzed with the paired t-test, differences in sample means (and their associated 95% CI error bars) should be shown—not the original samples—because the correlation in the samples (and its benefits) cannot be gleaned directly from the sample data.
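
The benefit of pairing when samples are correlated can be seen in a small sketch. The before and after values below are made up: subjects differ a lot from one another, but each changes by a similar small amount.

```python
import math
import statistics

# Hypothetical paired measurements on five subjects.
before = [10.0, 12.0, 15.0, 18.0, 20.0]
after = [11.0, 13.5, 16.0, 19.5, 21.0]
n = len(before)

# Two-sample t statistic (equal sample sizes, pooled variance).
var_b, var_a = statistics.variance(before), statistics.variance(after)
pooled_var = (var_b + var_a) / 2
mean_diff = statistics.mean(after) - statistics.mean(before)
t_two_sample = mean_diff / math.sqrt(pooled_var * 2 / n)

# Paired t statistic, computed on the within-subject differences.
diffs = [a - b for a, b in zip(after, before)]
t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Correlation between the paired samples.
mb, ma = statistics.mean(before), statistics.mean(after)
cov = sum((b - mb) * (a - ma) for b, a in zip(before, after)) / (n - 1)
rho = cov / math.sqrt(var_b * var_a)

# Variance of the differences shrinks as the correlation grows.
var_diff_from_rho = var_b + var_a - 2 * rho * math.sqrt(var_b * var_a)

print(f"two-sample t = {t_two_sample:.2f}, paired t = {t_paired:.2f}, rho = {rho:.3f}")
```

Because ρ is close to 1 here, the variance of the differences is tiny compared with the variance of either sample, and the paired statistic is far larger than the two-sample one. Plots of the two original samples would overlap heavily and hide this, which is why the differences should be shown.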

Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

Background reading

Krzywinski, M. & Altman, N. (2014) Points of Significance: Comparing Samples — Part I — t-tests Nature Methods 11:215-216.

Krzywinski, M. & Altman, N. (2013) Points of Significance: Significance, P values and t-tests Nature Methods 10:1041-1042.

Have skew, will test

Wed 28-05-2014

Our May Points of Significance Nature Methods column jumps straight into dealing with skewed data with Non Parametric Tests.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Non Parametric Testing. (read)

We introduce non-parametric tests and simulate data scenarios to compare their performance to the t-test. You might be surprised—the t-test is extraordinarily robust to distribution shape, as we've discussed before. When data is highly skewed, non-parametric tests perform better and with higher power. However, if sample sizes are small they are limited to a small number of possible P values, of which none may be less than 0.05!
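
The limit on attainable P values follows from counting rank arrangements. The sketch below assumes an exact two-sided rank-sum (Wilcoxon/Mann-Whitney) test with no tied values. The column may discuss other tests as well.

```python
from math import comb


def smallest_two_sided_p(n1, n2):
    """Smallest attainable two-sided P value of an exact rank-sum test (no ties).

    Under the null hypothesis every assignment of ranks to the first sample is
    equally likely, and only the two most extreme assignments give the
    smallest P value.
    """
    return 2 / comb(n1 + n2, n1)


for size in (3, 4, 5):
    print(f"n1 = n2 = {size}: smallest P = {smallest_two_sided_p(size, size):.4f}")
```

With three observations per sample the smallest possible P value is 0.1, so the test can never reach significance at 0.05. Four per sample is the minimum.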

Krzywinski, M. & Altman, N. (2014) Points of Significance: Non Parametric Testing Nature Methods 11:467-468.

Background reading

Krzywinski, M. & Altman, N. (2014) Points of Significance: Comparing Samples — Part I — t-tests Nature Methods 11:215-216.

Krzywinski, M. & Altman, N. (2013) Points of Significance: Significance, P values and t-tests Nature Methods 10:1041-1042.

Mind your p's and q's

Sat 29-03-2014

In the April Points of Significance Nature Methods column, we continue our discussion of comparing samples and consider what happens when we run a large number of tests.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Comparing Samples — Part II — Multiple Testing. (read)

Observing statistically rare test outcomes is expected if we run enough tests. These are statistically, not biologically, significant. For example, if we run N tests, the smallest P value that we have a 50% chance of observing is `1 - exp(-ln 2 / N)`. For `N = 10^k` this P value is approximately `P_k = 10^(-k) ln 2` (e.g. for `10^4 = 10,000` tests, `P_4 = 6.9 × 10^(-5)`).
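
The expression can be checked numerically. The sketch below assumes N independent tests of true null hypotheses, so that each P value is uniformly distributed, and confirms that the smallest of them falls below the computed value half the time.

```python
import math


def median_min_p(n_tests):
    """P value that the smallest of n_tests null P values undercuts half the time."""
    # Equivalent to 1 - exp(-ln 2 / N); expm1 keeps precision for large N.
    return -math.expm1(-math.log(2) / n_tests)


N = 10 ** 4
p = median_min_p(N)

# Chance that at least one of N independent uniform P values is below p.
chance_at_least_one = 1 - (1 - p) ** N

print(f"N = {N}: p = {p:.3e}, chance of observing one = {chance_at_least_one:.3f}")
print(f"approximation ln(2)/N = {math.log(2) / N:.3e}")
```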

We discuss common correction schemes such as Bonferroni, Holm, Benjamini & Hochberg and Storey's q and show how they impact the false positive rate (FPR), false discovery rate (FDR) and power of a batch of tests.
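Three of these schemes can be sketched as adjusted P values in a few lines. This is an illustrative implementation of the standard textbook definitions, not code from the column; Storey's q is left out because it additionally needs an estimate of the proportion of true nulls. All function names are hypothetical.

```python
def bonferroni(p):
    """Multiply each P value by the number of tests (capped at 1)."""
    n = len(p)
    return [min(1.0, n * x) for x in p]

def holm(p):
    """Step-down Holm adjustment: controls the family-wise error rate."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])  # indices, smallest P first
    adjusted = [0.0] * n
    running = 0.0
    for rank, i in enumerate(order):
        # The multiplier shrinks from n to 1; the running maximum keeps the
        # adjusted values monotone in the original ordering.
        running = max(running, (n - rank) * p[i])
        adjusted[i] = min(1.0, running)
    return adjusted

def benjamini_hochberg(p):
    """Step-up Benjamini-Hochberg adjustment: controls the false discovery rate."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    adjusted = [0.0] * n
    running = 1.0
    for rank in range(n - 1, -1, -1):  # walk from the largest P downward
        i = order[rank]
        running = min(running, p[i] * n / (rank + 1))
        adjusted[i] = running
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
print("BH:        ", benjamini_hochberg(p_values))
```

On the same input, Bonferroni is the most conservative, Holm is never more conservative than Bonferroni, and Benjamini-Hochberg is the most permissive because it controls a different quantity (FDR rather than the chance of any false positive).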

Krzywinski, M. & Altman, N. (2014) Points of Significance: Comparing Samples — Part II — Multiple Testing Nature Methods 11:355-356.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Comparing Samples — Part I — t-tests Nature Methods 11:215-216.

Krzywinski, M. & Altman, N. (2013) Points of Significance: Significance, P values and t-tests Nature Methods 10:1041-1042.


Happy Pi Day— go to planet π

Fri 21-03-2014

Celebrate Pi Day (March 14th) with the art of folding numbers. This year I take the number up to the Feynman Point and apply a protein folding algorithm to render it as a path.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Digits of Pi form landmass and shoreline. (details)

For those of you who liked the minimalist and colorful digit grid, I've expanded on the concept to show stacked ring plots of frequency distributions.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Frequency distribution of digits of Pi in groups of 6 up to the Feynman Point. (details)

And if spirals are your thing...

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Frequency distribution of digits of Pi in groups of 4 up to digit 4,988. (details)

Have data, will compare

Fri 07-03-2014

In the March Points of Significance Nature Methods column, we continue our discussion of t-tests from November (Significance, P values and t-tests).

We look at how the uncertainty of two variables combines, how this increases the uncertainty when two samples are compared, and highlight the differences between the two-sample and paired t-tests.
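The difference between the two tests can be seen in a small sketch. The data are made up for illustration and the function names are hypothetical; the sample sizes are equal, in which case the pooled and unpooled forms of the two-sample statistic coincide.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(x, y):
    """t statistic for independent samples: variances of the two means add."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(y) - mean(x)) / se

def paired_t(x, y):
    """t statistic for paired samples: work with per-pair differences."""
    d = [b - a for a, b in zip(x, y)]
    return mean(d) / sqrt(variance(d) / len(d))

# Hypothetical before/after measurements on the same five subjects.
before = [10.0, 12.0, 14.0, 16.0, 18.0]
after = [11.0, 13.5, 15.0, 17.5, 19.0]

print(f"two-sample t = {two_sample_t(before, after):.3f}")
print(f"paired t     = {paired_t(before, after):.3f}")
```

Because the subject-to-subject spread is large compared with the consistent within-subject change, the paired statistic is much larger than the two-sample one for the same numbers.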

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Comparing Samples — Part I. (read)

When performing any statistical test, it's important to understand and satisfy its requirements. The t-test is very robust with respect to some of its assumptions, but not others. We explore which.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Comparing Samples — Part I Nature Methods 11:215-216.

Krzywinski, M. & Altman, N. (2013) Points of Significance: Significance, P values and t-tests Nature Methods 10:1041-1042.

Circos at British Library Beautiful Science Exhibit

Thu 06-03-2014

Beautiful Science explores how our understanding of ourselves and our planet has evolved alongside our ability to represent, graph and map the mass data of the time. The exhibit runs 20 February — 26 May 2014 and is free to the public. There is a good Nature blog writeup about it, a piece in The Guardian, and a great video that explains the exhibit narrated by Johanna Kieniewicz, the curator.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Circos at the British Library Beautiful Science exhibit. (about exhibit)
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Mailed invitation to the exhibit features my science art. (zoom)

I am privileged to contribute an information graphic to the exhibit in the Tree of Life section. The piece shows how sequence similarity varies across species as a function of evolutionary distance. The installation is a set of six 30 × 30 cm backlit panels. They look terrific.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Circos Circles of Life installation at Beautiful Science exhibit at the British Library. (zoom)

Think outside the bar—box plots

Fri 31-01-2014

Quick, name three chart types. Line, bar and scatter come to mind. Perhaps you said pie too—tsk tsk. Nobody ever thinks of the box plot.

Box plots reveal details about data without overloading a figure with a full frequency distribution histogram. They're easy to compare and now easy to make with BoxPlotR (try it). In our fifth Points of Significance column, we take a break from the theory to explain this plot type and, I hope, convince you that it's worth thinking about.
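What a box plot encodes can be sketched with the standard library. This assumes Tukey-style whiskers (the most extreme data points within 1.5 × IQR of the box), which is one common convention and not necessarily the only one BoxPlotR offers; the quartile method is also a choice, and the function name is hypothetical.

```python
from statistics import quantiles

def box_plot_stats(data):
    """Quartiles, Tukey whisker ends and outliers for a single sample."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # lowest point still inside the fence
        "whisker_high": max(inside),   # highest point still inside the fence
        "outliers": sorted(x for x in data if x < low_fence or x > high_fence),
    }

# Made-up sample with one obvious outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
for name, value in box_plot_stats(sample).items():
    print(f"{name}: {value}")
```

Six numbers and a list of outliers summarize the sample, which is why several box plots can sit side by side where several histograms could not.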

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Visualizing samples with box plots. (read)

The February issue of Nature Methods kicks the bar chart two more times: Dan Evanko's Kick the Bar Chart Habit editorial and a Points of View: Bar charts and box plots column by Mark Streit and Nils Gehlenborg.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Visualizing samples with box plots Nature Methods 11:119-120.

Wired Data|Life 2013 talk

Thu 05-12-2013

I recently presented at the Wired Data|Life 2013 conference, sharing my thoughts on The Art and Science of Data Visualization.

For specialists, visualizations should expose detail to allow for exploration and inspiration. For enthusiasts, they should provide context, integrate facts and inform. For the layperson, they should capture the essence of the topic, narrate a story and delight.

Wired's Brandon Keim wrote up a short article about me and some of my work—Circle of Life: The Beautiful New Way to Visualize Biological Data.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The Art and Science of Data Visualization (PDF)

Power and Sample Size

Fri 31-01-2014

Experimental designs that lack power cannot reliably detect real effects. Power of statistical tests is largely unappreciated and many underpowered studies continue to be published.

This month, Naomi and I explain what power is, how it relates to Type I and Type II errors and sample size. By understanding the relationship between these quantities you can design a study that has both low false positive rate and high power.
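The relationship between these quantities can be sketched for the simplest case. The code assumes a one-sided z-test with known standard deviation, which is a simplification chosen for illustration and may differ from the tests used in the column; the function names and the example effect size are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Probability of rejecting the null when the true mean shift is `effect`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)   # rejection threshold under the null
    shift = effect * sqrt(n) / sigma         # where the test statistic is centred
    return 1 - STD_NORMAL.cdf(z_crit - shift)

def sample_size(effect, sigma, alpha=0.05, power=0.8):
    """Smallest n that reaches the requested power for the same test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / effect) ** 2)

n_needed = sample_size(effect=0.5, sigma=1.0)
print("n for 80% power:", n_needed)
for n in (5, 10, n_needed, 50):
    print(f"n = {n:2d}: power = {power_one_sided_z(0.5, 1.0, n):.3f}")
```

When the effect is zero the "power" equals alpha, the false positive rate; for a fixed alpha, power can only be raised by a larger sample, a larger effect or less noise.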

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Power and Sample Size. (read)

Krzywinski, M. & Altman, N. (2013) Points of Significance: Power and Sample Size Nature Methods 10:1139-1140.


20 imperatives of science—limits of evidence

Fri 22-11-2013

20 Tips for Interpreting Scientific Claims is a wonderful comment in Nature warning us about the limits of evidence.

I've made a poster (download hires PDF, PNG) of this list, grouping them into categories that are my own. Thrust this into everyone's hands, including your own.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
20 tips for interpreting scientific claims. From Sutherland et al, Nature 2013. (PDF, PNG, read article)

Sutherland WJ, Spiegelhalter D & Burgman M (2013) Policy: Twenty tips for interpreting scientific claims. Nature 503:335–337.

Significance, P values and t-tests

Fri 31-01-2014

Have you wondered how statistical tests work? Why does everyone want such a small P value?

This month, Naomi and I explain how significance is measured in statistics and remind you that it does not imply biological significance. You'll also learn why the t-distribution is so important and why its shape is similar to that of a normal distribution, but not quite.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Significance, P values and t-tests. (read)

Krzywinski, M. & Altman, N. (2013) Points of Significance: Significance, P values and t-tests Nature Methods 10:1041-1042.

Drinks & Science Workshop: Effective Presentations and Slides

Thu 10-10-2013

Your slides are not your presentation. They are a representation of your presentation.

Effective presentations require that you have a clear narrative—control detail and emphasis to deliver your message. Engage the audience early. Don't dump on them.

Effective slides are visual cues. Show only what you can't easily say. Text should act as emphasis. Don't read.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Drinks & Science Workshop: Effective Presentations and Slides. Science Online Vancouver. (workshop slides)

A workshop I gave on Oct 8th at Science Online Vancouver at Science World.


Error Bars

Mon 30-09-2013

Error bar overlap does not imply lack of significance. Error bar gap does not imply significance. Chances are you find these statements surprising.

You've seen and used error bars. But do you understand how to interpret them in the context of statistical significance? This month we address the most common (and commonly misunderstood) method of visualizing uncertainty.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Error Bars. (read)

We discuss error bars based on standard deviation, standard error of the mean and confidence intervals. It turns out that none of these behave as our intuition would wish.
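The three kinds of error bar differ only in how the half-width is computed, as this minimal sketch shows. The sample is made up, the function name is hypothetical, and the critical value 2.262 is the two-sided 95% t value for 9 degrees of freedom, hard-coded because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_95_DF9 = 2.262  # two-sided 95% t critical value, 9 degrees of freedom

def error_bars(sample, t_crit):
    """Half-widths of SD, SEM and confidence-interval error bars."""
    n = len(sample)
    sd = stdev(sample)       # spread of the data
    sem = sd / sqrt(n)       # uncertainty of the mean, shrinks with n
    return {
        "mean": mean(sample),
        "sd": sd,
        "sem": sem,
        "ci95_half_width": t_crit * sem,
    }

sample = [9.0, 10.0, 11.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0, 10.0]
for name, value in error_bars(sample, T_CRIT_95_DF9).items():
    print(f"{name}: {value:.3f}")
```

For the same ten points the SD bar is about three times as long as the SEM bar, so a figure that does not say which bar is drawn cannot be interpreted.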

Krzywinski, M. & Altman, N. (2013) Points of Significance: Error Bars Nature Methods 10:921-922.

Launch of Nature Methods Statistics Column

Mon 30-09-2013

This month, Nature Methods is launching Points of Significance, a new column to educate, enlighten and, if possible, entertain bench scientists about statistics.

I will be working closely with Naomi Altman from The Pennsylvania State University and Dan Evanko, the Chief Editor at Nature Methods, to make the column engaging and useful.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of Significance column: Importance of Being Uncertain. (read)

Our first publication — The Importance of Being Uncertain — acknowledges not only the imperative of being right about how we're wrong, but also our appreciation for Oscar Wilde.

Krzywinski, M. & Altman, N. (2013) Points of Significance: Importance of Being Uncertain Nature Methods 10:809-810.

Points of View — The Collection

Tue 30-07-2013

Interested in data visualization? The Points of View columns are an excellent way to learn practical tips and design principles that help you communicate clearly. All the columns are now available as a collection, and open access during August 2013.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The full collection of Nature Methods Points of View columns is now available for free for the month of August. (collection, more about Points of View)

The columns were written by Bang Wong, Martin Krzywinski, Nils Gehlenborg, Cydney Nielsen, Noam Shoresh, Rikke Schmidt Kjærgaard, Erica Savig and Alberto Cairo.


Storytelling with Graphics

Tue 30-07-2013

This month, Alberto Cairo and I examine the importance of storytelling in presenting data. A strong narrative captures the reader's attention, informs and inspires.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Storytelling. (download, more about Points of View)

Instead of "explain, not merely show," seek to "narrate, not merely explain."

Analyze as a specialist, present as a communicator

Thu 25-07-2013

The distinction between the specialist and the communicator was made by Alberto Cairo at the 2013 Bloomberg Design Conference. I have used this principle to structure my talk to the UBC Tableau Users Group.

Design is algorithmics for the page. Use its principles to inform how to choose from among the options offered by your software. Recognize the limitations of your tool, as well as those features that are ineffective.

Don't practise visual intuitics—use shapes whose size and proportion can be well judged.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
What we see isn't always what it is. The luminance effect powerfully affects our interpretation of tone and color. (download talk)

Real Human Genome Art

Tue 16-07-2013

A collaboration of science and art with Joanna Rudnick and Aaron De La Cruz.

The science of cancer genomics will be interpreted by individuals whose lives are affected by genomic mutations using the art style of Aaron De La Cruz.

Beautiful, meaningful and personal.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
A day of collaboration between science, art and people affected with cancer-causing mutations. (...more)

Multidimensional data

Thu 27-06-2013

This month, Erica Savig and I look at the design process for a figure from her paper Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators. The underlying data set has 1.2 billion individual observations, categorized by drug, cell line, protein and stimulation condition.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Multidimensional data. (download, more about Points of View)

Bodenmiller B, Zunder ER, Finck R et al. 2012 Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators Nature Biotechnology 30:858-867.

Although spatial encoding is the most perceptually accurate, in this case it's not the best channel to display quantitative information. Instead, the x/y position on the page is used to organize small multiples of the network of affected proteins.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Data meets pointillism. The full data set was used to create the cover of the September 2012 issue of Nature Biotechnology. (about the cover)

Choosing Plotting Symbols

Thu 30-05-2013

In this month's column, Bang and I consider how to choose effective plotting symbols in the Points of View column Plotting Symbols.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Plotting Symbols. (download, more about Points of View)

Choose symbols that overlap without ambiguity and communicate relationships in data.

Figure Design and Writing — Two Goals, One Process

Mon 29-04-2013

This month I look at how creating effective figures is similar to the process of writing well in the Points of View column Elements of Visual Style.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column: Elements of Visual Style. (download, more about Points of View)

Using Strunk's Elements of Style as an example of writing guidelines, I look at how these can be translated to creating figures.


VIZBI 2013 Keynote—Visual Design Principles

Wed 27-03-2013

When we create figures, we must communicate and design. In my talk I discuss some of the rules that turn graphical improvisation into a structured and reproducible process.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Try to focus on a spot in these posters that celebrate Pi day. (download talk)

The fractal tree was created with OneZoom, which received the best poster award at the conference.

Happy Pi Day— 3.14

Thu 14-03-2013

Celebrate Pi Day (March 14th) with funky modern posters. Transcend, don't repeat, yourself and watch the dots shimmer.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Try to focus on a spot in these posters that celebrate Pi day. (download posters)

The posters were inspired by the beautiful AIDS posters by Elena Miska.

For the Love of Type

Thu 07-03-2013

I am always drawn to type and periodically I must do something about it.

If you were a type, what type would you be? Me, Gill Sans on weekdays and Perpetua on the weekend.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Finding intrigue and consensus within and among letters. (typography posters)

Return of Nature Methods Points of View

Tue 26-02-2013

I take over from Bang Wong as primary contributor to the Points of View column, a monthly advice and opinion piece about data visualization and information and figure design in molecular biology.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature Methods Points of View column returns. (read more, Nature Methods blog)

Nature Encode Explorer

Tue 26-02-2013
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature uses Circos motif on cover and interactive ENCODE data explorer. (read more)

Nature's special issue dedicated to the Encode Project uses the Circos motif on its cover as well as the interactive Encode Explorer, which is available as an app at iTunes.

Bloomberg Businessweek Design Conference

Wed 23-01-2013

Together with Alberto Cairo, and then in conversation with Sam Grobart, I presented about science and design at Bloomberg's Businessweek Design Conference in San Francisco.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Science loves design, but doesn't always know it. (download talk)

ICDM2012 Keynote — Needles in Stacks of Needles

Thu 13-12-2012

My ICDM2012 keynote on genomics and data mining: Needles in Stacks of Needles.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Computers compute but humans are ultimately responsible for identifying what is relevant and useful. (abstract, download talk, ICDM2012)

Genome Research cover

Wed 14-11-2012

Creating strings of genome jewellery. Read about how it was done.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Cover image accompanying Spark: A navigational paradigm for genomic data exploration. Genome Research 22 (11). (details, Genome Research)

The design accompanies Cydney Nielsen's Spark manuscript, which appeared in Genome Research.

Biovis 2012 — Getting into Visualization of Large Biological Data Sets

Tue 16-10-2012

Guidelines for data encoding and visualization in biology, presented at Biovis 2012 (Visweek 2012).

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
20 imperatives of information design. (Krzywinski et al biovis2012)

2012 Presidential Debates — a Lexical Analysis

Thu 04-10-2012

Building on the method I used to analyze the 2008 debates, I look at the 2012 Debates between Obama and Romney, lexically speaking. Obama speaks to "folks", while Romney fearmongers with "kill" and "hurt".

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Analysis of word usage by parts of speech for Obama and Romney reveals insight into each candidate.

Trends in Genetics cover

Fri 28-09-2012

Making things round, not square. Read about how it was done.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Nature and human life are as various as our several constitutions. Who shall say what prospect life offers to another? —Henry David Thoreau

A Circos-based design for the cover of the human genetics special issue of Trends in Genetics (Trends in Genetics October 2012, 28 (10)).


Science needs words

Thu 03-05-2012

And usually, really long and funny ones.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Scientists love new words, when the old ones aren't long enough.

My neologisms were picked up by James Gorman of the New York Times in an article Ome, the sound of the scientific universe expanding.

PNAS cover

Tue 01-05-2012

Biology or astrophysics? Read about how it was done.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Hint: biology.

The image was published on the cover of PNAS (PNAS 1 May 2012; 109 (18))

the art of numbers

Sat 14-04-2012

Numerology is bogus but art based on numbers has a beautiful random quality. Oh, and none of the metaphysical baggage.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Distribution of the first 3,422, 13,689 and 123,201 digits of π.
Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Progression and transition probabilities of digits in e, φ and π.

accidental similarity number

Tue 20-03-2012

The quantity formed by the overlap of two or more numbers.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The accidental similarity number of π, φ and e.

the 4ness of pi

Fri 13-04-2012

How much 4ness does π have?

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The iness of each digit of π generalizes 4ness. It measures the similarity of the digit to its neighbours.

Compare the iness of π to that of the other famous transcendental number, e, and the mysterious but attractive Golden Ratio, φ.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The iness of e and φ.

ASCII Illustration—Outer Space, Sequence and Typography

Mon 23-01-2012

I have found a way to combine my curiosity about space, fear of large sequence assemblies and love of typography in a single illustration. Inspired by typographical portraits, I wanted to automate representing an image with multiple font weights, while sampling characters from a quote or debate transcripts.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Part of the Pioneer plaque rendered with the sequence of human chromosome 1, using 4 layers of sizes (17pt, 33pt, 59pt and 93pt) and 8 weights of Gotham.

Tangerine Tango—Color of 2012

Tue 17-01-2012

If you made widgets, you could be justified in campaigning for a widget of the year. Business acumen suggests it should be one of your widgets. Pantone has done exactly that, naming their 17-1463 color (tangerine tango) as color of the year 2012.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Tangerine Tango - Pantone's color of the year.

I prefer green—green jive.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Green Jive - My own color of the year.

World's Most Expensive Photograph

Thu 10-11-2011

I really like the world's most expensive photograph, Rhein II by Andreas Gursky. Cautious use of the word "expensive" should be practised — in this case, merely meaning that only one person saw the $4.3 million price tag. Others saw lower prices, or no price tag at all.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Rhein II by Andreas Gursky. $4.3 million.

Here's my own attempt at such compositions.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Near Jokulsarlon on the way to Hofn, Iceland.

Adobe Swatches for Brewer Palettes

Fri 28-10-2011

I could not find Illustrator swatch files for this awesome color resource, so I created them myself.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Brewer palettes are ideal for information design. Download Illustrator swatch files (.ase .ai).

If you're interested in color and design and don't know about Brewer palettes, see my presentation.


Global Visualization of Google Searches by Language

Fri 28-10-2011

World-wide Google searches, categorized by one of 21 languages, are visualized with WebGL, available from Chrome Experiments. The data prompts some fascinating questions: (a) In what two places in the US are Google searches in Chinese performed? (b) What are the most remote locations from which Google searches were detected? (c) Why is Istanbul the 3rd top location for searches? (d) Why is Miami in the top 10?

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Global visualization of Google searches by language reveals English dominates (42% searches) with Spanish a distant second (14%) and German and French third (7% each).

Download geotagged data.

PSA Genomics Workshop Slides

Fri 28-10-2011

Designing effective visualizations in the biological sciences.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Neither communication nor design are purely subjective.

Circos and Hive Plots: Challenging visualization paradigms in genomics and network analysis.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Neither communication nor design are purely subjective.

Tor tor & Loa Loa — 546 Organisms with Same Genus and Species

Sat 09-07-2011

In a recent conversation, I was challenged to name as many organisms with the same genus and species as I could. As I am not a biologist, and especially not a taxonomist, my responses were limited to organisms with sequenced genomes I had come across in the literature. Immediately to mind sprang Gallus gallus (chicken) and ... nothing else. Well, that was embarrassing.

I was suddenly taken up by the urge to find all instances of this occurrence. Using resources at the NCBI Taxonomy Browser, I downloaded the NCBI taxonomy table, which contains 1,097,405 entries in the names.dmp file (not all of these are unique genus/species combinations).

To my surprise I discovered that my performance in this challenge was beyond dismal. In fact, there are 380 genera which contain organisms that have the same genus and species name. Most of them (317) include a single organism, but some have many. For example the genus Salamandra has 14 organisms with the species salamandra, including Salamandra salamandra, Salamandra salamandra crespoi and Salamandra salamandra morenica. The genus Regulus has 13 organisms, including Regulus regulus azoricus, Regulus regulus japonensis and Regulus regulus regulus (these are all Goldcrests).
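A search of this kind can be sketched as follows. This is not the script used for the counts above; it assumes the names.dmp layout of pipe-separated fields (tax ID, name, unique name, name class), keeps only rows of class "scientific name", and treats the first two words as genus and species. The function name is hypothetical, and the sample rows use placeholder tax IDs.

```python
import io
from collections import defaultdict

def tautonyms(lines):
    """Yield (tax_id, name) for scientific names whose genus and species match."""
    for line in lines:
        # Fields are separated by tab-pipe-tab and the row ends with tab-pipe.
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        if len(fields) < 4 or fields[3] != "scientific name":
            continue
        words = fields[1].split()
        if len(words) >= 2 and words[0].lower() == words[1].lower():
            yield int(fields[0]), fields[1]

# Tiny stand-in for names.dmp; tax IDs here are placeholders, not real IDs.
sample = io.StringIO(
    "1\t|\tGallus gallus\t|\t\t|\tscientific name\t|\n"
    "1\t|\tchicken\t|\t\t|\tgenbank common name\t|\n"
    "2\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "3\t|\tSalamandra salamandra\t|\t\t|\tscientific name\t|\n"
    "4\t|\tSalamandra salamandra crespoi\t|\t\t|\tscientific name\t|\n"
)

by_genus = defaultdict(list)
for tax_id, name in tautonyms(sample):
    by_genus[name.split()[0]].append(name)

for genus, names in sorted(by_genus.items()):
    print(genus, len(names), names)
```

Grouping by genus gives the per-genus tallies; counting names with or without the subspecies word gives the two totals discussed in the text.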

In total, there are 546 unique entries, when organisms with a unique subspecies name are considered distinct. If subspecies is not considered, the number of organisms with the same genus as species (i.e., regardless of subspecies) is 383. Here are organisms whose genus/species name is shorter than 6 letters (82 entries).

Shortest Species/Genus Duplicates (82, 5 letters or less)

Agama agama, Alces alces, Alle alle, Alosa alosa, Anser anser, Appia appia, Apus apus, Arita arita, Arius arius, Aroma aroma, Axis axis, Badis badis, Bagre bagre, Bison bison, Boops boops, Brama brama, Bubo bubo, Bufo bufo, Bulla bulla, Buteo buteo, Butis butis, Catla catla, Chaca chaca, Conta conta, Crex crex, Cynea cynea, Dama dama, Dario dario, Diuca diuca, Dives dives, Ensis ensis, Equus equus, Ficus ficus, Gemma gemma, Gesta gesta, Glis glis, Gobio gobio, Grus grus, Guira guira, Gulo gulo, Hara hara, Hucho hucho, Huso huso, Indri indri, Irus irus, Juga juga, Labeo labeo, Lima lima, Loa loa, Lota lota, Lutra lutra, Lynx lynx, Meles meles, Melo melo, Meza meza, Mitu mitu, Mola mola, Molva molva, Mops mops, Myaka myaka, Naja naja, Nasua nasua, Papio papio, Pauxi pauxi, Perna perna, Pica pica, Pipa pipa, Pipra pipra, Plica plica, Rapa rapa, Rita rita, Sarda sarda, Sisko sisko, Solea solea, Sula sula, Suta suta, Tinca tinca, Todus todus, Tor tor, Uncia uncia, Vimba vimba, Volva volva.

Longest Species/Genus Duplicates (5, 14 letters or more)

Coccothraustes coccothraustes

Labiostrongylus labiostrongylus

Macrobilharzia macrobilharzia

Macropostrongylus macropostrongylus

Xanthocephalus xanthocephalus

The nematode worm Macropostrongylus macropostrongylus has the honour of being the longest genus/species duplicate organism. Given this distinction, it is surprising that Pubmed returns only 2 papers that refer to it.

Dataset

Download the full list. The number next to each ENTRY field is the NCBI Taxonomy ID for the organism. In a small number of cases there are ambiguities in parsing the data file (e.g. Troglodytes cf. troglodytes PS-2, Troglodytes sp. troglodytes PS-1). I left these in.


Visual Acuity and Sequence Visualization

Tue 03-04-2012

Visual acuity limits of the human eye restrict the resolution at which we can comfortably visualize data.

In this short guide, I explain why dividing a scale into no more than 500 divisions is a good idea.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Visualizing 1.5 Mb (S. cerevisiae chrIV) in a 183 mm wide figure (size limit in Nature for double column figures) restricts scale division to 2.9 kb to ensure comfortable reading.
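The arithmetic behind a limit like this can be sketched with the 500-division rule of thumb. The function name is hypothetical, and the rule gives a rounder figure (about 3 kb) than the 2.9 kb quoted in the caption, which presumably comes from the more detailed acuity calculation in the guide.

```python
def scale_divisions(span_bp, figure_width_mm, max_divisions=500):
    """Size of one scale division in bases and in millimetres on the page."""
    bp_per_division = span_bp / max_divisions
    mm_per_division = figure_width_mm / max_divisions
    return bp_per_division, mm_per_division

# S. cerevisiae chrIV (about 1.5 Mb) in a 183 mm wide double-column figure.
bp, mm = scale_divisions(span_bp=1_500_000, figure_width_mm=183)
print(f"{bp / 1000:.1f} kb per division, {mm:.3f} mm per division")
```

Any feature much smaller than one division, here roughly 3 kb, cannot be comfortably resolved at this figure size.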

2011 EMBO Journal Cover Contest

Tue 14-06-2011

For the EMBO Journal 2011 Cover Contest, I prepared two entries, one for the scientific category and one for the non-scientific category.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The non-scientific entry is an abstract photo of fiber optics. The scientific entry was an information graphic showing a hive panel of genomic annotations in human, mouse and dog genomes. The hive panel is based on the use of the newly introduced hive plot.

The 2011 winners have been announced. My non-scientific entry (photo of fiber optics) received honourable mention and was included in the Favourites of the Jury gallery.

New Circos Domains

Tue 03-05-2011

Until now, Circos did not have its own domain name, having been served from the lengthy and boring http://mkweb.bcgsc.ca/circos.

Circos - circular and genome visualization - now at the new domain circos.ca

Recently, I was surprised to find out that the following domains were available

All these now point to the Circos site.


ee spammings - beautiful language of spam poetry

Mon 02-05-2011

ee spammings are spam edited into a format reminiscent of the poetry of ee cummings. Unwanted solicitations for questionable endeavours and products suddenly turn into heady words of the new literature. Art suddenly freed from the husk of spam.


Literature 2.0 — from unlikely origins.


Here's one example that emphasizes that today is ok.

i got 
to touch you

i like us 
and know the more. 

believe
       recontact 
me

today ok!
but matters

waiting for
           happy

I now have over 20 ee spammings — enjoy them all.

Neologisms - New Words, Much Needed

Tue 26-04-2011

What do inconversible, mystific, postpetizer, prenopsis and suscitate have in common?

They are words that don't exist, but should. Learn new words.

Hive Plot Ads

Sat 16-04-2011

Download large ads: 00 01 02 03 04


World's Most Popular Questions
Today's Zeitgeist

Mon 28-03-2011

What are the world's top questions?

Using Google's autocomplete feature, I have tabulated the world's most popular questions. By combining an interrogative term, such as what, who or why, with a term from a related set, such as do I, can I, and can't I, it is possible to sample the space of questions and obtain the most popular for a given start word combination.
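Generating the start word combinations is a simple cross product, sketched below. The word lists contain only the examples named above, the function name is hypothetical, and the step that submits each prefix to autocomplete is deliberately left out.

```python
from itertools import product

interrogatives = ["what", "who", "why"]
related_terms = ["do I", "can I", "can't I"]

def question_prefixes(first_words, second_terms):
    """Every combination of an interrogative with a related term."""
    return [f"{first} {second}" for first, second in product(first_words, second_terms)]

prefixes = question_prefixes(interrogatives, related_terms)
print(len(prefixes), "prefixes")
for prefix in prefixes:
    print(prefix)
```

Each prefix would then be typed into the search box and the suggested completions recorded.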

I have tabulated the most popular questions by category.

  • general
  • limits & desires
  • love
  • money
  • career & education
  • health
  • sizes & extremes
  • religion & faith

Science

What kind of questions about science are people asking? From the Career & Education section,

  • Can biology lead to new theorems?
  • Can physics explain miracles?
  • Can math be fun?
  • Can science and religion coexist?
  • Can history repeat itself?
  • Can psychology be morally neutral?

Curios

What are the strangest questions? I'll let you explore, but these have me wondering:

  • Has the world gone mad or is it me?
  • Why can't I hold all these limes?
  • What happens if I make a formal commitment to Satan?
  • Why can't I sell my kidney?
  • Who is the most powerful Jedi?
  • Can Jesus microwave a burrito?
  • Where is the hardest part of your head?

Circos Table Browser

Thu 24-03-2011

Circos can be used to visualize tabular data, such as spreadsheets.

Circos - Circular Visualization of Tabular Data - Martin Krzywinski

1,000s of tables have already been visualized. Has yours?

648 Ratios

Thu 17-02-2011

Hive plots are excellent at visualizing ratios. They're not just an anti-hairball network visualization agent.

Below, 3 × 8 × 27 = 648 ratios (axes × ribbons × plots) are visualized.

Hive Plots - Network Visualization - Ratio Visualization - Martin Krzywinski

The image above compares the relative ratios of region annotations in human, mouse and dog genomes.


Cáceres Creativa - Model and Strategy for Urban Innovation

Fri 11-02-2011

Cáceres is a small city of 100,000 inhabitants in western Spain, where the city government is promoting Cáceres Creativa, a project in which citizens collaboratively build a sustainable future for the city by activating the creative capacity of the population.

The project has been published as a book (excerpt), which provides a basis for working with city residents and businesses in this collaborative design.

Circos proved useful in showing the complex relationships established in an environment such as a city, which combines flows of energy and resources, physical items and intellectual concepts. The online Circos tableviewer was used to generate the images.

Storage Cluster

Fri 11-02-2011

Taking photos of inanimate objects is rewarding. Your subject doesn't complain, nor move, and a coffee break fits naturally into the workflow at any time. In this case, the inanimate object is over 3 PB (3,000 TB) of storage composed of a variety of NetApp appliances.

Genome Sciences Center Genesis Compute Cluster

Using three gelled Hensel Integras (500 Ws monoheads; here I'm using only the modelling light for illumination, along with red, blue and green filters) (lighting details), I spent some time getting to know the components up close.

Genome Sciences Center Storage Cluster / Lumondo Photography / Martin Krzywinski

See more photos.

All photos by Martin Krzywinski (Lumondo Photography).

Genesis 1.0

Fri 11-02-2011

Our new compute cluster has been released to the user community.

Genome Sciences Center Genesis Compute Cluster

This cluster consists of 420 compute nodes, each with 12 cores and 48 GB RAM, totaling 5,040 cores and 20 TB RAM. Each node has 160 GB of local /tmp space, and all nodes are tied together over a 40 Gb/s InfiniBand network.
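As a quick sanity check, the totals follow directly from the per-node figures. The variable names in this small Python sketch are only illustrative:

```python
# Per-node figures quoted in the post.
nodes = 420
cores_per_node = 12
ram_gb_per_node = 48

total_cores = nodes * cores_per_node
total_ram_gb = nodes * ram_gb_per_node

print(total_cores)   # 5040
print(total_ram_gb)  # 20160 GB, which rounds to the quoted 20 TB
```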

Genome Sciences Center Genesis Compute Cluster / Lumondo Photography / Martin Krzywinski

The nodes all have access, over the InfiniBand network, to a dedicated storage system running GPFS with a total of 700 TB of usable scratch space. The filesystem is served by 8 IBM x3850 servers. All nodes run CentOS 5.4 and use open source Grid Engine 6.2u5 as their scheduler.

Genome Sciences Center Genesis Compute Cluster / Lumondo Photography / Martin Krzywinski

Lighting details and more photos.

All photos by Martin Krzywinski (Lumondo Photography).

1 First the server room was expanded
2 It was empty and without racks, and the lights were dim. Sysadmins scurried about and unpacked equipment
3 The circuit was closed and there were electrons
4 IT staff were pleased and accounts were handed out to users
5 Who had work they called "important"
6 But which the IT staff merely called "jobs".


Photo

Thu 17-02-2011

Periodically, I take my camera and point it at things. Here, I'll share a few favourites from my creations.

Lumondo Photography - Martin Krzywinski - Diving Horror

This image — I will keep the subject a mystery — gives me the same feeling as some of the Hubble images. For this shot, I didn't need to reach orbit.

Lumondo Photography - Martin Krzywinski

Other images in this series are available on flickr.

I also like geometry and lines. This shot is a tense composition of the Hancock Building at Copley Square in Boston.

Lumondo Photography - Martin Krzywinski - Boston Copley Square

and an assortment of baggage carts at St Pancras station (London), which catches the eye.

Lumondo Photography - Martin Krzywinski - St Pancras Station - London

I like to collect time in a photo, be it uniformly as in this diptych of street and traffic lights from a moving car

Lumondo Photography - Martin Krzywinski - Driving Lights

or blended, as in this skyline of Vancouver showing the flow of time from 5.30pm to 9.30pm.

Lumondo Photography - Martin Krzywinski - High Dynamic Time Range Photography - Vancouver Skyline

WIZARD — Longest English Reverse Complement

Wed 26-01-2011

DNA is composed of two strands, which are complementary. Given a sequence, its reverse complement is created by swapping A/T and G/C and writing the remapped sequence backwards (e.g. ATGC is first remapped to TACG and then reversed to GCAT).
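A minimal sketch of these two steps in Python, assuming uppercase input over A, C, G and T only. The function name `reverse_complement` is just an illustrative choice:

```python
# Complement map for the four DNA bases (assumes uppercase A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Swap A/T and G/C, then write the remapped sequence backwards."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original sequence, since both the swap and the reversal undo themselves.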

Consider the corresponding concept applied to English words (or any language, for that matter). First, construct the complementarity map, which assigns to the nth letter of the alphabet the (N-n+1)th letter, given an alphabet of N letters.

abcdefghijklmnopqrstuvwxyz
||||||||||||||||||||||||||
zyxwvutsrqponmlkjihgfedcba

For example, a becomes z, b becomes y, and so on. To create a reverse complement of a word, apply this mapping and then reverse the new word (e.g. 'dog' is remapped to 'wlt' and then reversed to obtain 'tlw').
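The same idea for words can be sketched in Python. This assumes the plain 26-letter lowercase alphabet, and the names `WORD_COMPLEMENT` and `word_reverse_complement` are made up for the example:

```python
import string

ALPHABET = string.ascii_lowercase
# Pair the alphabet with itself written backwards: a<->z, b<->y, ..., z<->a.
WORD_COMPLEMENT = str.maketrans(ALPHABET, ALPHABET[::-1])

def word_reverse_complement(word: str) -> str:
    """Complement each letter, then reverse the remapped word."""
    return word.lower().translate(WORD_COMPLEMENT)[::-1]

print(word_reverse_complement("dog"))  # tlw
```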

So far, that's not very exciting.

But consider the question: what is the longest English word that is a palindrome under this set of rules (reverse complementarity)? In other words, it's the same forward and backward after complementing the letters. Clearly "dog" is not such a palindrome since its reverse complement is "tlw".

The answer? wizard and hovels.

wizard
||||||
draziw -> 'wizard' backwards

It's an amazingly fitting answer, since a wizard is someone with special powers.

A few interesting 4-letter words that are their own reverse complement palindromes are bevy, grit, trig and wold. Common surnames that match are Ghrist, Elizarov and Prawdzik. Female first name Zola and male first name Iver are also reverse complement palindromes, as are trolig (Norwegian for 'likely', as well as an IKEA curtain product) and aviverez (2nd person plural future of 'aviver', French for 'brighten').

I've scanned a very large word list (4,138,000 unique English and foreign words) and identified 108 reverse complement palindromes. If you find a new entry longer than 6 letters, let me know.
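A scan of this kind could look roughly like the Python sketch below. It is not the script used for the search: the function names are hypothetical, and the tiny `sample` list stands in for a real word list file.

```python
import string

ALPHABET = string.ascii_lowercase
COMPLEMENT = str.maketrans(ALPHABET, ALPHABET[::-1])

def is_rc_palindrome(word: str) -> bool:
    """True if the word equals its own reverse complement (a-z only)."""
    w = word.lower()
    return w.isascii() and w.isalpha() and w == w.translate(COMPLEMENT)[::-1]

def longest_rc_palindromes(words):
    """Return the longest reverse complement palindromes, sorted."""
    hits = sorted({w.lower() for w in words if is_rc_palindrome(w)})
    if not hits:
        return []
    longest = max(len(w) for w in hits)
    return [w for w in hits if len(w) == longest]

# Tiny stand-in list; a real scan would read words from a file instead.
sample = ["dog", "wizard", "hovels", "bevy", "grit", "trig", "wold", "cat"]
print(longest_rc_palindromes(sample))  # ['hovels', 'wizard']
```

One consequence of the map is that no letter is its own complement, so every reverse complement palindrome has an even number of letters.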

Typefaces that are worth it

Mon 28-03-2011

Finding just the right font is hard work. There are so many to choose from. Or are there?

If the typeface is not on this list, don't use it (except Bodoni; I hate Bodoni, don't use it). If you need a shorter list, consult the quintessential 15 serif and 15 sans-serif fonts.

You'll notice a rotating image of type faces at the top of this page. Here's the full list.

Comic Sans font Dax font Frutiger font Gill Sans font Gotham font Helvetica font Syntax font The Sans font

I love Gotham and have used it in visualization projects. It's more rational than Helvetica and still enjoys a freshness that has evaporated from Helvetica after near-ubiquitous use. Don't get me wrong, there is still not enough Helvetica in the world, but more Gotham would be nice.


Paper

Mon 24-01-2011

Anyone who has met me quickly learns that I have a personal and antagonistic relationship with Comic Sans, the typeface that shouldn't have been.

In a recent article in the journal Cognition, Fortune favors the bold (and the italicized): Effects of disfluency on educational outcomes, Diemand-Yauman et al. suggest that rendering educational materials in a hard-to-read font, and thereby recruiting the effects of disfluency ("the subjective experience of difficulty associated with cognitive operations"), improves retention of material.

Regardless of whether the effect is real, there must be better ways to improve education than through bad design.

Kittens

Mon 24-01-2011

Surely you like kittens. So don't hurt your audience.

Edward Tufte says no to PowerPoint.

Fri 21-01-2011

Side Interest Spawns Brazilian Fashion Line

In a cosmically improbable confluence of multidisciplinary pursuits, my work on keyboard layouts, which as one of its fruits has produced the TNWMLC keyboard layout — the most difficult for English typing — has been incorporated into the eponymously named Brazilian fashion line by Julia Valle.

TNWMLC Fashion Line by Julia Valle, using work of Martin Krzywinski and the carpalx project.


Spatter of Network Communities

Mon 24-01-2011

Looking into network data sets for the linear layout project, I found pretty hairballs which make a juicy spatter pattern.

Martin Krzywinski | contact | Canada's Michael Smith Genome Sciences Centre | BC Cancer Research Center | BC Cancer | PHSA