
Statistics for aneuploidy level `h` = 1 2 3 4 5 6 7 8 9 10

Diploid Genome Coverage Tables

For a position `x` in a genome present in `h` copies (chromosomes) and sequenced at redundancy `\rho`, the probability that `x` is covered at least `\phi` times in each copy is `P_{h,\phi}`, given by $$ P_{h,\phi} = \left( 1 - \sum_{k=0}^{\phi-1} \frac{1}{k!} \left( \frac{\rho}{h} \right)^k e^{-\rho/h} \right)^h \tag{1} $$

For more details, see Wendl, M.C. and R.K. Wilson. 2008. Aspects of coverage in medical DNA sequencing. BMC Bioinformatics 9: 239.

For a given sequencing redundancy `\rho` (e.g. `\rho`-fold, as determined by the length of the haploid genome) of a diploid genome, the fraction of the diploid genome represented by at least `\phi` reads is reported by `P_{h,\phi}`. Coverage by fewer than `\phi` reads is reported as `1-P_{h,\phi}`. Coverage by exactly `\phi` reads is `P_{h,\phi} - P_{h,\phi+1}`. Entries for which fractional coverage is `\lt 10^{-5}` are not shown.
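The three quantities described above can be computed directly from equation (1). The following is a minimal Python sketch, not the Perl script offered on this page; the function names `poisson_cdf`, `coverage_at_least` and `table_row` are illustrative. It assumes the sum in equation (1) runs over `k = 0 \ldots \phi-1`, which reproduces the tabulated values.

```python
import math

def poisson_cdf(n, lam):
    """P(X <= n) for X ~ Poisson(lam); returns 0.0 when n < 0."""
    term = math.exp(-lam)          # k = 0 term: lam^0 / 0! * e^(-lam)
    total = 0.0
    for k in range(n + 1):
        total += term
        term *= lam / (k + 1)      # next term: lam^(k+1) / (k+1)! * e^(-lam)
    return total

def coverage_at_least(rho, h, phi):
    """Equation (1): probability of at least phi reads on each of h copies."""
    return (1.0 - poisson_cdf(phi - 1, rho / h)) ** h

def table_row(rho, h, phi):
    """Return (phi, exactly phi, fewer than phi, at least phi)."""
    p = coverage_at_least(rho, h, phi)
    exactly = p - coverage_at_least(rho, h, phi + 1)
    return phi, exactly, 1.0 - p, p

# Print the first rows of the diploid (h = 2) table for 1-fold redundancy.
for phi in range(4):
    print("%d\t%.4f\t%.4f\t%.4f" % table_row(1, 2, phi))
```

The column order matches the tables on this page: exactly `\phi`, fewer than `\phi`, at least `\phi`.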

A rudimentary Monte Carlo simulation of genome coverage is also available, and is a useful supplement to the exact probabilities shown here.
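To illustrate what such a simulation might look like, here is a small Python sketch. It is not the simulation offered on this page and rests on simplifying assumptions: reads are one base long, each read lands uniformly at random on one of the `h` copies, and the genome length (`genome_len`) is a small hypothetical value chosen for speed.

```python
import random

def simulate_coverage(rho, h, phi, genome_len=20000, seed=1):
    """Estimate the fraction of positions with >= phi reads on every copy.

    Assumption: single-base reads placed uniformly at random; rho is
    measured against the haploid length, so rho * genome_len reads are
    shared among h copies.
    """
    rng = random.Random(seed)
    depth = [[0] * genome_len for _ in range(h)]
    n_reads = int(rho * genome_len)
    for _ in range(n_reads):
        copy = rng.randrange(h)
        pos = rng.randrange(genome_len)
        depth[copy][pos] += 1
    covered = sum(
        1 for pos in range(genome_len)
        if all(depth[c][pos] >= phi for c in range(h))
    )
    return covered / genome_len

# Compare with the tabulated P_{2,1} = 0.6035 for 3-fold redundancy.
print(simulate_coverage(3, 2, 1))
```

With these settings the estimate should fall close to the analytical value, with sampling noise that shrinks as `genome_len` grows.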

CUSTOM DEPTH AND PLOIDY

To create a table with a specific ploidy (e.g. 12) and haploid-equivalent (see below) depth (e.g. `200 \times`), use

http://mkweb.bcgsc.ca/coverage/?aneuploidy=12&depth=200
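If you need several such tables, the URL can be assembled programmatically. This is a small sketch that only builds the string; the query parameters `aneuploidy` and `depth` are taken from the URL above, while the helper name `coverage_table_url` is made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://mkweb.bcgsc.ca/coverage/"

def coverage_table_url(aneuploidy, depth):
    """Build the table URL for a given ploidy and haploid-equivalent depth."""
    query = urlencode({"aneuploidy": aneuploidy, "depth": depth})
    return BASE_URL + "?" + query

print(coverage_table_url(12, 200))
```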

EXAMPLE 1

Suppose you carried out 3-fold redundant (`\rho=3`) sequencing of a haploid genome (`h=1`). 95.02% of the genome will be covered by at least one read (`P_{1,1}`) while 22.40% will be covered by exactly 3 reads (`P_{1,3} - P_{1,4}`).
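The two figures in this example can be checked with a few lines of Python. This is a sketch under the assumption that the sum in equation (1) runs over `k = 0 \ldots \phi-1`; the function name `coverage_at_least` is illustrative.

```python
import math

def coverage_at_least(rho, h, phi):
    """Equation (1): probability of at least phi reads on each of h copies."""
    lam = rho / h
    term = math.exp(-lam)      # Poisson term for k = 0
    below = 0.0                # accumulates P(fewer than phi reads) per copy
    for k in range(phi):
        below += term
        term *= lam / (k + 1)
    return (1.0 - below) ** h

rho, h = 3, 1
at_least_one = coverage_at_least(rho, h, 1)
exactly_three = coverage_at_least(rho, h, 3) - coverage_at_least(rho, h, 4)
print("at least 1 read: %.2f%%" % (100 * at_least_one))
print("exactly 3 reads: %.2f%%" % (100 * exactly_three))
```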

EXAMPLE 2

You are sequencing a sample with a tumor content of 25% and want to know the sequencing depth required to detect heterozygous mutations in the tumor. This scenario is equivalent to a genome with aneuploidy `h = 8`: a heterozygous tumor allele sits on one of the two copies in 25% of the cells, so it accounts for 1 in 8 copies at its locus. If you sequence at `\rho = 200`, then about 95% of the bases will be covered at a depth of at least `\phi = 14` (`P_{8,14} = 0.9494`). If you're satisfied with `\phi = 5`, then you only need `\rho = 100`, since `P_{8,5} = 0.9580`.
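The same calculation can be sketched in Python. The effective ploidy is taken here as the normal ploidy divided by the tumor fraction (2 / 0.25 = 8), which is one way to read the equivalence stated in this example; the variable and function names are illustrative, and the sum in equation (1) is assumed to run over `k = 0 \ldots \phi-1`.

```python
import math

def coverage_at_least(rho, h, phi):
    """Equation (1): probability of at least phi reads on each of h copies."""
    lam = rho / h
    term = math.exp(-lam)      # Poisson term for k = 0
    below = 0.0                # accumulates P(fewer than phi reads) per copy
    for k in range(phi):
        below += term
        term *= lam / (k + 1)
    return (1.0 - below) ** h

tumor_fraction = 0.25
normal_ploidy = 2
# Assumption: a heterozygous tumor allele is 1 of (ploidy / fraction) copies.
effective_h = round(normal_ploidy / tumor_fraction)

p_200_14 = coverage_at_least(200, effective_h, 14)
p_100_5 = coverage_at_least(100, effective_h, 5)
print("h = %d" % effective_h)
print("P_{8,14} at rho = 200: %.4f" % p_200_14)
print("P_{8,5}  at rho = 100: %.4f" % p_100_5)
```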

ANALYTICAL vs STOCHASTIC

View plot that compares analytical vs stochastic results.

HAPLOID vs DIPLOID

View plot that compares 100x and 200x coverage of haploid and diploid genomes.

CODE

Download Perl scripts for analytical (to produce the tables below for any `\rho`) and stochastic coverage calculations.

sequencing redundancy for a diploid genome

View table for sequencing redundancy `\rho` = 1 2 3 4 5 6 7 8 9 10 20 25 50 75 100 of a diploid genome.

IMPORTANT The redundancy is always calculated using the size of the haploid genome. For example, if we collect 600 Gb of reads, our sequencing redundancy is `600 / 3 = 200 \times`. We've used the length of the haploid genome (3 Gb) in the calculation. If we now apply this `200 \times` sequencing to a diploid genome, our average coverage will not be `200 \times` but slightly less than `100 \times`.
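The arithmetic above can be sketched as follows. The numbers (600 Gb of reads, a 3 Gb haploid genome) come from the text; the variable names and the final check are illustrative, and the sum in equation (1) is assumed to run over `k = 0 \ldots \phi-1`.

```python
import math

def coverage_at_least(rho, h, phi):
    """Equation (1): probability of at least phi reads on each of h copies."""
    lam = rho / h
    term = math.exp(-lam)      # Poisson term for k = 0
    below = 0.0                # accumulates P(fewer than phi reads) per copy
    for k in range(phi):
        below += term
        term *= lam / (k + 1)
    return (1.0 - below) ** h

reads_gb = 600.0      # total sequence collected
haploid_gb = 3.0      # haploid genome length
ploidy = 2            # diploid genome

rho = reads_gb / haploid_gb        # redundancy, always per haploid length
per_copy = rho / ploidy            # mean depth on each chromosome copy
p_100 = coverage_at_least(rho, ploidy, 100)

print("redundancy: %.0fx" % rho)
print("mean depth per copy: %.0fx" % per_copy)
print("fraction covered >= 100 times on both copies: %.4f" % p_100)
```

Less than half of the diploid genome reaches a depth of 100 on both copies, which is why the depth you can count on falls somewhat below `\rho / h`.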

sequencing redundancy 1-fold (`\rho / h = 0.5`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.8452	0.0000	1.0000
1	0.1467	0.8452	0.1548
2	0.0079	0.9919	0.0081
3	0.0002	0.9998	0.0002

sequencing redundancy 2-fold (`\rho / h = 1.0`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.6004	0.0000	1.0000
1	0.3298	0.6004	0.3996
2	0.0634	0.9302	0.0698
3	0.0061	0.9936	0.0064
4	0.0003	0.9996	0.0004
5	0.0000	1.0000	0.0000

sequencing redundancy 3-fold (`\rho / h = 1.5`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.3965	0.0000	1.0000
1	0.4080	0.3965	0.6035
2	0.1590	0.8045	0.1955
3	0.0322	0.9635	0.0365
4	0.0040	0.9957	0.0043
5	0.0003	0.9997	0.0003
6	0.0000	1.0000	0.0000

sequencing redundancy 4-fold (`\rho / h = 2.0`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.2524	0.0000	1.0000
1	0.3948	0.2524	0.7476
2	0.2483	0.6472	0.3528
3	0.0841	0.8955	0.1045
4	0.0176	0.9796	0.0204
5	0.0025	0.9972	0.0028
6	0.0003	0.9997	0.0003
7	0.0000	1.0000	0.0000

sequencing redundancy 5-fold (`\rho / h = 2.5`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.1574	0.0000	1.0000
1	0.3346	0.1574	0.8426
2	0.2998	0.4921	0.5079
3	0.1493	0.7919	0.2081
4	0.0469	0.9412	0.0588
5	0.0101	0.9882	0.0118
6	0.0016	0.9982	0.0018
7	0.0002	0.9998	0.0002
8	0.0000	1.0000	0.0000

sequencing redundancy 6-fold (`\rho / h = 3.0`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.0971	0.0000	1.0000
1	0.2615	0.0971	0.9029
2	0.3087	0.3586	0.6414
3	0.2083	0.6673	0.3327
4	0.0903	0.8756	0.1244
5	0.0271	0.9659	0.0341
6	0.0059	0.9930	0.0070
7	0.0010	0.9989	0.0011
8	0.0001	0.9999	0.0001
9	0.0000	1.0000	0.0000

sequencing redundancy 7-fold (`\rho / h = 3.5`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.0595	0.0000	1.0000
1	0.1938	0.0595	0.9405
2	0.2854	0.2533	0.7467
3	0.2465	0.5388	0.4612
4	0.1393	0.7853	0.2147
5	0.0551	0.9246	0.0754
6	0.0160	0.9797	0.0203
7	0.0035	0.9957	0.0043
8	0.0006	0.9993	0.0007
9	0.0001	0.9999	0.0001
10	0.0000	1.0000	0.0000

sequencing redundancy 8-fold (`\rho / h = 4.0`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.0363	0.0000	1.0000
1	0.1385	0.0363	0.9637
2	0.2447	0.1748	0.8252
3	0.2595	0.4195	0.5805
4	0.1832	0.6790	0.3210
5	0.0916	0.8622	0.1378
6	0.0339	0.9538	0.0462
7	0.0096	0.9878	0.0122
8	0.0022	0.9974	0.0026
9	0.0004	0.9995	0.0005
10	0.0001	0.9999	0.0001

sequencing redundancy 9-fold (`\rho / h = 4.5`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.0221	0.0000	1.0000
1	0.0964	0.0221	0.9779
2	0.1986	0.1185	0.8815
3	0.2504	0.3170	0.6830
4	0.2136	0.5674	0.4326
5	0.1307	0.7811	0.2189
6	0.0597	0.9117	0.0883
7	0.0210	0.9715	0.0285
8	0.0059	0.9925	0.0075
9	0.0013	0.9984	0.0016
10	0.0002	0.9997	0.0003
11	0.0000	1.0000	0.0000

sequencing redundancy 10-fold (`\rho / h = 5.0`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.0134	0.0000	1.0000
1	0.0658	0.0134	0.9866
2	0.1545	0.0792	0.9208
3	0.2260	0.2338	0.7662
4	0.2271	0.4598	0.5402
5	0.1656	0.6870	0.3130
6	0.0909	0.8525	0.1475
7	0.0388	0.9434	0.0566
8	0.0132	0.9822	0.0178
9	0.0036	0.9954	0.0046
10	0.0008	0.9990	0.0010
11	0.0002	0.9998	0.0002
12	0.0000	1.0000	0.0000

sequencing redundancy 20-fold (`\rho / h = 10.0`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
0	0.0001	0.0000	1.0000
1	0.0009	0.0001	0.9999
2	0.0045	0.0010	0.9990
3	0.0150	0.0055	0.9945
4	0.0371	0.0206	0.9794
5	0.0720	0.0576	0.9424
6	0.1137	0.1297	0.8703
7	0.1486	0.2433	0.7567
8	0.1629	0.3919	0.6081
9	0.1513	0.5549	0.4451
10	0.1200	0.7062	0.2938
11	0.0819	0.8261	0.1739
12	0.0485	0.9081	0.0919
13	0.0251	0.9566	0.0434
14	0.0114	0.9816	0.0184
15	0.0046	0.9930	0.0070
16	0.0016	0.9976	0.0024
17	0.0005	0.9993	0.0007
18	0.0002	0.9998	0.0002
19	0.0000	0.9999	0.0001
20	0.0000	1.0000	0.0000

sequencing redundancy 25-fold (`\rho / h = 12.5`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
1	0.0001	0.0000	1.0000
2	0.0006	0.0001	0.9999
3	0.0024	0.0007	0.9993
4	0.0076	0.0031	0.9969
5	0.0188	0.0107	0.9893
6	0.0385	0.0294	0.9706
7	0.0668	0.0679	0.9321
8	0.0995	0.1348	0.8652
9	0.1281	0.2342	0.7658
10	0.1436	0.3623	0.6377
11	0.1410	0.5059	0.4941
12	0.1217	0.6469	0.3531
13	0.0929	0.7686	0.2314
14	0.0629	0.8615	0.1385
15	0.0380	0.9244	0.0756
16	0.0205	0.9624	0.0376
17	0.0100	0.9829	0.0171
18	0.0044	0.9929	0.0071
19	0.0018	0.9973	0.0027
20	0.0006	0.9991	0.0009
21	0.0002	0.9997	0.0003
22	0.0001	0.9999	0.0001
23	0.0000	1.0000	0.0000

sequencing redundancy 50-fold (`\rho / h = 25.0`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
7	0.0000	0.0000	1.0000
8	0.0001	0.0000	1.0000
9	0.0003	0.0002	0.9998
10	0.0007	0.0004	0.9996
11	0.0017	0.0012	0.9988
12	0.0034	0.0028	0.9972
13	0.0066	0.0063	0.9937
14	0.0118	0.0129	0.9871
15	0.0194	0.0247	0.9753
16	0.0300	0.0441	0.9559
17	0.0432	0.0741	0.9259
18	0.0583	0.1173	0.8827
19	0.0737	0.1756	0.8244
20	0.0873	0.2493	0.7507
21	0.0969	0.3366	0.6634
22	0.1008	0.4334	0.5666
23	0.0984	0.5342	0.4658
24	0.0901	0.6326	0.3674
25	0.0774	0.7227	0.2773
26	0.0625	0.8001	0.1999
27	0.0475	0.8626	0.1374
28	0.0339	0.9101	0.0899
29	0.0228	0.9440	0.0560
30	0.0145	0.9668	0.0332
31	0.0087	0.9813	0.0187
32	0.0049	0.9900	0.0100
33	0.0026	0.9949	0.0051
34	0.0013	0.9975	0.0025
35	0.0006	0.9989	0.0011
36	0.0003	0.9995	0.0005
37	0.0001	0.9998	0.0002
38	0.0001	0.9999	0.0001
39	0.0000	1.0000	0.0000
40	0.0000	1.0000	0.0000

sequencing redundancy 75-fold (`\rho / h = 37.5`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
14	0.0000	0.0000	1.0000
15	0.0000	0.0000	1.0000
16	0.0001	0.0001	0.9999
17	0.0002	0.0001	0.9999
18	0.0003	0.0003	0.9997
19	0.0007	0.0006	0.9994
20	0.0013	0.0013	0.9987
21	0.0023	0.0026	0.9974
22	0.0039	0.0049	0.9951
23	0.0063	0.0088	0.9912
24	0.0099	0.0152	0.9848
25	0.0147	0.0250	0.9750
26	0.0210	0.0398	0.9602
27	0.0288	0.0608	0.9392
28	0.0379	0.0896	0.9104
29	0.0478	0.1275	0.8725
30	0.0579	0.1754	0.8246
31	0.0672	0.2332	0.7668
32	0.0748	0.3005	0.6995
33	0.0799	0.3753	0.6247
34	0.0818	0.4552	0.5448
35	0.0803	0.5370	0.4630
36	0.0755	0.6172	0.3828
37	0.0681	0.6927	0.3073
38	0.0588	0.7608	0.2392
39	0.0487	0.8196	0.1804
40	0.0387	0.8683	0.1317
41	0.0295	0.9071	0.0929
42	0.0216	0.9366	0.0634
43	0.0152	0.9582	0.0418
44	0.0102	0.9734	0.0266
45	0.0066	0.9837	0.0163
46	0.0041	0.9903	0.0097
47	0.0025	0.9944	0.0056
48	0.0014	0.9969	0.0031
49	0.0008	0.9984	0.0016
50	0.0004	0.9991	0.0009
51	0.0002	0.9996	0.0004
52	0.0001	0.9998	0.0002
53	0.0001	0.9999	0.0001
54	0.0000	1.0000	0.0000
55	0.0000	1.0000	0.0000

sequencing redundancy 100-fold (`\rho / h = 50.0`)

`\phi`	`P_{h,\phi} - P_{h,\phi+1}`	`1-P_{h,\phi}`	`P_{h,\phi}`
23	0.0000	0.0000	1.0000
24	0.0000	0.0000	1.0000
25	0.0001	0.0001	0.9999
26	0.0001	0.0001	0.9999
27	0.0003	0.0003	0.9997
28	0.0005	0.0005	0.9995
29	0.0008	0.0010	0.9990
30	0.0014	0.0018	0.9982
31	0.0022	0.0032	0.9968
32	0.0034	0.0054	0.9946
33	0.0051	0.0088	0.9912
34	0.0075	0.0139	0.9861
35	0.0107	0.0214	0.9786
36	0.0148	0.0322	0.9678
37	0.0198	0.0470	0.9530
38	0.0257	0.0668	0.9332
39	0.0325	0.0925	0.9075
40	0.0398	0.1250	0.8750
41	0.0472	0.1647	0.8353
42	0.0544	0.2120	0.7880
43	0.0609	0.2664	0.7336
44	0.0660	0.3273	0.6727
45	0.0693	0.3932	0.6068
46	0.0706	0.4625	0.5375
47	0.0696	0.5331	0.4669
48	0.0665	0.6027	0.3973
49	0.0616	0.6692	0.3308
50	0.0553	0.7308	0.2692
51	0.0480	0.7861	0.2139
52	0.0404	0.8341	0.1659
53	0.0330	0.8746	0.1254
54	0.0261	0.9075	0.0925
55	0.0200	0.9336	0.0664
56	0.0148	0.9535	0.0465
57	0.0107	0.9684	0.0316
58	0.0074	0.9790	0.0210
59	0.0050	0.9865	0.0135
60	0.0033	0.9915	0.0085
61	0.0021	0.9948	0.0052
62	0.0013	0.9969	0.0031
63	0.0008	0.9982	0.0018
64	0.0005	0.9990	0.0010
65	0.0003	0.9994	0.0006
66	0.0001	0.9997	0.0003
67	0.0001	0.9998	0.0002
68	0.0000	0.9999	0.0001
69	0.0000	1.0000	0.0000
70	0.0000	1.0000	0.0000
