This section celebrates the words of William Shakespeare.
If you love letters in just the right combination, these pages and the art are for you. If you like to delve into the words yourself, use my plain-text annotated version of all his plays.
The posters are available for purchase.
Here I've made all 37 of Shakespeare's plays available in a single plain-text file. Each spoken line and each annotation (e.g. start of scene, character exit, etc.) is provided on its own indexed line.
I am grateful to Liam Larsen's Kaggle project, which was the only easily parsable plain-text version of Shakespeare that I've been able to find. Liam's file didn't include Henry IV Part 2, which I've added to my file, parsed from the Shakespeare pages at MIT.
My format differs from Liam's. I provide more information about what each line represents and annotate some lines with flags to indicate the start or end of a segment, such as a scene, an act, or a character's appearance.
If you spot any errors or inconsistencies in the file, please let me know.
Here's a snippet of the first and last records from A Comedy of Errors. The field delimiter is a pipe "|".
A_Comedy_of_Errors | play_start | 1966
A_Comedy_of_Errors | act_start | 274 | 1
A_Comedy_of_Errors | scene_start | 1026 | 1 | 1 | A hall in DUKE SOLINUS'S palace.
A_Comedy_of_Errors | enter | 1 | 1 | DUKE SOLINUS, AEGEON, Gaoler, Officers, and other Attendants
A_Comedy_of_Errors | line | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | AEGEON | +a,+ca,+cp,+cs,+p,+s | Proceed, Solinus, to procure my fall
A_Comedy_of_Errors | line | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | AEGEON | | And by the doom of death end woes and all.
A_Comedy_of_Errors | line | 1 | 1 | 3 | 3 | 3 | 3 | 1 | 1 | 1 | DUKE_SOLINUS | +ca,+cp,+cs | Merchant of Syracuse, plead no more;
A_Comedy_of_Errors | line | 1 | 1 | 4 | 4 | 4 | 4 | 1 | 2 | 2 | DUKE_SOLINUS | | I am not partial to infringe our laws:
...
A_Comedy_of_Errors | line | 5 | 1 | 453 | 1963 | 453 | 1023 | 99 | 1 | 314 | DROMIO_OF_SYRACUSE | -ca,-cp,-cs | We'll draw cuts for the senior: till then lead thou first.
A_Comedy_of_Errors | line | 5 | 1 | 454 | 1964 | 454 | 1024 | 63 | 1 | 185 | DROMIO_OF_EPHESUS | | Nay, then, thus:
A_Comedy_of_Errors | line | 5 | 1 | 455 | 1965 | 455 | 1025 | 63 | 2 | 186 | DROMIO_OF_EPHESUS | | We came into the world like brother and brother;
A_Comedy_of_Errors | line | 5 | 1 | 456 | 1966 | 456 | 1026 | 63 | 3 | 187 | DROMIO_OF_EPHESUS | -a,-ca,-cp,-cs,-p,-s | And now let's go hand in hand, not one before another.
A_Comedy_of_Errors | exeunt | 5 | 1 | all
...
Every line has the format
play_name | record_type | ...
where record_type is one of
play_start - start of the play
act_start - start of an act
scene_start - start of a scene
prologue - start of prologue
enter - a character enters
exit - character or characters exit
exeunt - character or characters exit
line - spoken line
misc - action, emote, death, alarm, or other non-spoken event
The exit and exeunt labels are interchangeable. Although strictly exit is singular and exeunt is plural, there are exit lines in which multiple characters leave. A misc record may also correspond to an entrance, re-entrance or exit.
Depending on the record_type, the line has a different number of fields.
# * indicates the field may be blank (e.g. speaker)
play_start | spoken_lines_in_play
act_start | spoken_lines_in_act | act_number
scene_start | spoken_lines_in_scene | act_number | scene_number | scene_description
prologue | act_number | 0
enter | act | scene | speaker* | description
exit | act | scene | speaker* | description
exeunt | act | scene | speaker* | description
line | act | scene | line_in_play | line_in_act | line_in_scene | speaker_appearance | line_in_speaker_appearance | speaker_line | flag* | line_text
misc | act | scene | speaker* | description
All counts start at 1, except the prologue scene number which is 0.
Only spoken lines count towards the line count.
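As a rough illustration of reading these records, here is a minimal sketch that pulls the speaker and the text out of every spoken line. The function name speaker_and_text is hypothetical. The field positions are an assumption read off the sample records from A Comedy of Errors, in which the speaker is the 12th pipe-delimited field and the text is the 14th. The sketch also assumes the text itself never contains a pipe.

```shell
# Print "speaker<TAB>text" for each spoken-line record read from stdin.
# Assumed positions, read off the sample records (not a documented guarantee):
#   field 2 = record_type, field 12 = speaker, field 14 = line text.
speaker_and_text() {
  awk -F'|' '
    # Strip the blanks that surround each pipe-delimited field.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    trim($2) == "line" { print trim($12) "\t" trim($14) }
  '
}
```

It could be run as `speaker_and_text < shakespeare.all.plays.plain.text.txt | head`.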
Every speaker has three line counters. speaker_appearance gives the index of the speaker's appearance (a contiguous set of lines). line_in_speaker_appearance counts the lines within that appearance. speaker_line counts the total lines spoken by the speaker across the play. For example, at the start of the Comedy of Errors:
# Aegeon's first appearance of 2 lines (running total for Aegeon: 2 lines)
... 1 | 1 | 1 | AEGEON | +a,+ca,+cp,+cs,+p,+s | Proceed, Solinus, to procure my fall
... 1 | 2 | 2 | AEGEON | | And by the doom of death end woes and all.
# Duke Solinus's first appearance of 23 lines (running total for Duke Solinus: 23 lines)
... 1 | 1 | 1 | DUKE_SOLINUS | +ca,+cp,+cs | Merchant of Syracuse, plead no more;
... 1 | 2 | 2 | DUKE_SOLINUS | | I am not partial to infringe our laws:
... 1 | 3 | 3 | DUKE_SOLINUS | | The enmity and discord which of late
...
... 1 | 21 | 21 | DUKE_SOLINUS | | Thy substance, valued at the highest rate,
... 1 | 22 | 22 | DUKE_SOLINUS | | Cannot amount unto a hundred marks;
... 1 | 23 | 23 | DUKE_SOLINUS | | Therefore by law thou art condemned to die.
# Aegeon's second appearance of 2 lines (running total for Aegeon: 4 lines)
... 2 | 1 | 3 | AEGEON | | Yet this my comfort: when your words are done,
... 2 | 2 | 4 | AEGEON | | My woes end likewise with the evening sun.
# Duke Solinus's second appearance of 3 lines (running total for Duke Solinus: 26 lines)
... 2 | 1 | 24 | DUKE_SOLINUS | | Well, Syracusian, say in brief the cause
... 2 | 2 | 25 | DUKE_SOLINUS | | Why thou departed'st from thy native home
... 2 | 3 | 26 | DUKE_SOLINUS | | And for what cause thou camest to Ephesus.
# Aegeon's third appearance of 65 lines (running total for Aegeon: 69 lines)
... 3 | 1 | 5 | AEGEON | | A heavier task could not have been imposed
... 3 | 2 | 6 | AEGEON | | Than I to speak my griefs unspeakable:
... 3 | 3 | 7 | AEGEON | | Yet, that the world may witness that my end
...
... 3 | 63 | 67 | AEGEON | | Of Corinth that, of Epidaurus this:
... 3 | 64 | 68 | AEGEON | | But ere they came,--O, let me say no more!
... 3 | 65 | 69 | AEGEON | | Gather the sequel by that went before.
# Duke Solinus's third appearance of 2 lines (running total for Duke Solinus: 28 lines)
... 3 | 1 | 27 | DUKE_SOLINUS | | Nay, forward, old man; do not break off so;
... 3 | 2 | 28 | DUKE_SOLINUS | | For we may pity, though not pardon thee.
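The way the three counters advance can be sketched as follows. This is only an illustration of the counting rule described above, not the code used to build the file. The function name speaker_counters is hypothetical, and the input is assumed to be one speaker name per spoken line, in order, for a single play. Whether an intervening stage direction ends an appearance is not stated in the description, so the sketch ignores stage directions entirely.

```shell
# Read one speaker name per spoken line (single play, in order) and print
#   speaker_appearance line_in_speaker_appearance speaker_line speaker
# Assumption: a new appearance starts whenever the speaker changes.
speaker_counters() {
  awk '
    NF == 0 { next }                      # skip empty input lines
    {
      s = $0
      if (s != prev) {                    # speaker changed: new appearance
        appearance[s]++
        in_appearance = 0
      }
      in_appearance++                     # position within this appearance
      total[s]++                          # running total for this speaker
      print appearance[s], in_appearance, total[s], s
      prev = s
    }
  '
}
```

Feeding it AEGEON, AEGEON, DUKE_SOLINUS, AEGEON (one per line) yields the counters 1 1 1, 1 2 2, 1 1 1 and 2 1 3, the same pattern as in the listing above.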
The flag field is zero or more of
# first line
+p in play
+a in act
+s in scene
# last line
-p in play
-a in act
-s in scene
# first line of speaker in
+cp play
+ca act
+cs scene
# last line of speaker in
-cp play
-ca act
-cs scene
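A plain text search for a short flag such as "-a" or "-s" could also match a hyphenated word in the spoken text. A minimal sketch of a stricter selector, which compares the flag against the comma-separated entries of the flag field only, might look like this. The function name records_with_flag is hypothetical, and the position of the flag field (13th pipe-delimited field) is an assumption read off the sample records.

```shell
# Print the spoken-line records whose flag field contains exactly the flag $1.
# Assumed positions, read off the sample records:
#   field 2 = record_type, field 13 = comma-separated flags.
records_with_flag() {
  awk -F'|' -v want="$1" '
    {
      type = $2; flags = $13
      gsub(/[ \t]/, "", type)             # drop blanks around the values
      gsub(/[ \t]/, "", flags)
      if (type != "line") next
      n = split(flags, f, ",")            # n is 0 when the field is blank
      for (i = 1; i <= n; i++)
        if (f[i] == want) { print; break }
    }
  '
}
```

For example, `records_with_flag -p < shakespeare.all.plays.plain.text.txt` would list the last line of each play.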
Searching for "-cp" and "death" gives you every character's last line in a play, restricted to those last lines that contain "death".
> grep -e "-cp" shakespeare.all.plays.plain.text.txt | grep death
A_Winters_Tale | line | 5 | 1 | 242 | 2968 | 242 | 242 | 4 | 6 | 24 | Lord | -ca,-cp,-cs | With divers deaths in death.
Antony_and_Cleopatra | line | 4 | 14 | 114 | 2877 | 532 | 114 | 27 | 2 | 47 | EROS | -ca,-cp,-cs | Of Antony's death.
As_you_like_it | line | 5 | 4 | 17 | 2477 | 234 | 17 | 24 | 1 | 75 | SILVIUS | +cs,-ca,-cp,-cs | Though to have her and death were both one thing.
Coriolanus | line | 5 | 4 | 40 | 3542 | 470 | 40 | 12 | 5 | 38 | Messenger | -ca,-cp,-cs | They'll give him death by inches.
Henry_IV,_Part_1 | line | 5 | 3 | 14 | 2776 | 258 | 14 | 11 | 3 | 41 | SIR_WALTER_BLUNT | -ca,-cp,-cs | Lord Stafford's death.
Henry_VI_Part_1 | line | 1 | 3 | 85 | 418 | 418 | 85 | 1 | 6 | 6 | Officer | -ca,-cp,-cs | henceforward, upon pain of death.
Henry_VI_Part_3 | line | 2 | 2 | 65 | 859 | 274 | 65 | 1 | 3 | 3 | PRINCE | -ca,-cp,-cs | And in that quarrel use it to the death.
King_Lear | line | 4 | 6 | 276 | 2874 | 616 | 276 | 38 | 5 | 76 | OSWALD | -ca,-cp,-cs | Upon the British party: O, untimely death!
Merchant_of_Venice | line | 5 | 1 | 311 | 2650 | 311 | 311 | 36 | 4 | 84 | NERISSA | -ca,-cp,-cs | After his death, of all he dies possess'd of.
Richard_II | line | 4 | 1 | 19 | 1914 | 19 | 19 | 6 | 12 | 22 | BAGOT | -ca,-cp,-cs | In this your cousin's death.
Richard_III | line | 4 | 4 | 200 | 2840 | 500 | 200 | 44 | 13 | 142 | DUCHESS_OF_YORK | -ca,-cp,-cs | Shame serves thy life and doth thy death attend.
Timon_of_Athens | line | 2 | 2 | 94 | 709 | 132 | 94 | 4 | 2 | 7 | Page | -ca,-cp,-cs | dog's death. Answer not; I am gone.
Titus_Andronicus | line | 3 | 1 | 242 | 1281 | 242 | 242 | 1 | 7 | 7 | Messenger | -ca,-cp,-cs | More than remembrance of my father's death.
Searching for "-cp" and sorting by the speaker's line count ranks characters by the number of lines they speak in a play. Here are the top 10:
> grep -e "-cp" shakespeare.all.plays.plain.text.txt | sort -nr -k21,21 | head -10
Hamlet | line | 5 | 2 | 374 | 3963 | 681 | 374 | 358 | 7 | 1498 | HAMLET | -ca,-cp,-cs | Which have solicited. The rest is silence.
Othello | line | 5 | 2 | 350 | 3483 | 494 | 350 | 272 | 2 | 1099 | IAGO | -ca,-cp,-cs | From this time forth I never will speak word.
Henry_V | line | 5 | 2 | 372 | 3216 | 503 | 373 | 147 | 6 | 1029 | KING_HENRY_V | -ca,-cp,-cs | EPILOGUE
Othello | line | 5 | 2 | 411 | 3544 | 555 | 411 | 274 | 2 | 887 | OTHELLO | -ca,-cp,-cs | Killing myself, to die upon a kiss.
Measure_for_measure | line | 5 | 1 | 578 | 2838 | 578 | 578 | 194 | 16 | 857 | DUKE_VINCENTIO | -a,-ca,-cp,-cs,-p,-s | What's yet behind, that's meet you all should know.
Antony_and_Cleopatra | line | 4 | 15 | 70 | 3003 | 658 | 70 | 202 | 9 | 849 | MARK_ANTONY | -ca,-cp,-cs | I can no more.
Timon_of_Athens | line | 5 | 1 | 246 | 2361 | 247 | 247 | 207 | 10 | 824 | TIMON | -ca,-cp,-cs | Sun, hide thy beams! Timon hath done his reign.
Richard_II | line | 5 | 5 | 113 | 2742 | 507 | 113 | 98 | 8 | 758 | KING_RICHARD_II | -ca,-cp,-cs | Whilst my gross flesh sinks downward, here to die.
King_Lear | line | 5 | 3 | 367 | 3480 | 458 | 367 | 187 | 7 | 752 | KING_LEAR | -ca,-cp,-cs | Look there, look there!
Julius_Caesar | line | 5 | 5 | 57 | 2566 | 349 | 57 | 194 | 3 | 728 | BRUTUS | -ca,-cp,-cs | I kill'd not thee with half so good a will.
Hamlet has 1,498 lines, about 36% more than the next character, Iago in Othello, who has 1,099.
Who has the longest delivery? To find out, just sort on the line_in_speaker_appearance field.
> grep -w line shakespeare.all.plays.plain.text.txt | sort -nr -k19,19 | head -1
Henry_IV,_Part_2 | line | 1 | 2 | 229 | 496 | 455 | 229 | 10 | 139 | 202 | FALSTAFF | | so both the degrees prevent my curses. Boy!
It's Sir John Falstaff in Henry IV Part 2, who delivers 139 consecutive lines in his 10th delivery.
After that, it's King Henry V, who delivers 83 consecutive lines in his 2nd delivery.
> grep -w line shakespeare.all.plays.plain.text.txt | sort -nr -k19,19 | grep -v FALSTAFF | head -1
Henry_IV,_Part_2 | line | 5 | 2 | 146 | 2941 | 227 | 146 | 2 | 83 | 101 | KING_HENRY_V | -ca,-cp,-cs,-s | God shorten Harry's happy life one day!
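Excluding each winner with grep -v has to be repeated for every rank. As an alternative, here is a minimal sketch that keeps the longest delivery of each speaker and ranks them in one pass. The function name longest_deliveries is hypothetical, and the field positions (10 = line_in_speaker_appearance, 12 = speaker) are assumptions read off the sample records. Unlike grep -v, it treats the same character in two different plays as two separate entries.

```shell
# Print the N longest deliveries (default 10) as "length play | speaker".
# Assumed positions, read off the sample records:
#   field 1 = play, field 2 = record_type,
#   field 10 = line_in_speaker_appearance, field 12 = speaker.
longest_deliveries() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    trim($2) == "line" {
      key = trim($1) " | " trim($12)
      len = $10 + 0                       # force a numeric value
      if (len > best[key]) best[key] = len
    }
    END { for (k in best) print best[k], k }
  ' | LC_ALL=C sort -k1,1nr | head -n "${1:-10}"
}
```

It could be run as `longest_deliveries 5 < shakespeare.all.plays.plain.text.txt`.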
Hamlet has 358 turns to speak, the most of any character. To see this, sort on the speaker_appearance field.
> grep -w line shakespeare.all.plays.plain.text.txt | sort -nr -k17,17 | head -1
Hamlet | line | 5 | 2 | 374 | 3963 | 681 | 374 | 358 | 7 | 1498 | HAMLET | -ca,-cp,-cs | Which have solicited. The rest is silence.
After Hamlet, it's Othello who has 274 turns to speak.
> grep -w line shakespeare.all.plays.plain.text.txt | sort -nr -k17,17 | grep -v HAMLET | head -1
Othello | line | 5 | 2 | 411 | 3544 | 555 | 411 | 274 | 2 | 887 | OTHELLO | -ca,-cp,-cs | Killing myself, to die upon a kiss.
Let's count the number of lines in which each character says "death".
# number of times "death" is spoken by character
> grep -w line shakespeare.all.plays.plain.text.txt | grep -i death | cut -d "|" -f 1,12 | suc | sort -nr | head -15
21 Romeo_and_Juliet | ROMEO
18 Measure_for_measure | DUKE_VINCENTIO
16 Julius_Caesar | BRUTUS
15 Henry_VI_Part_1 | TALBOT
14 Romeo_and_Juliet | FRIAR_LAURENCE
14 Richard_III | GLOUCESTER
13 Hamlet | KING_CLAUDIUS
12 Antony_and_Cleopatra | MARK_ANTONY
10 Richard_II | KING_RICHARD_II
10 Henry_VI_Part_2 | KING_HENRY_VI
10 Hamlet | HAMLET
9 Romeo_and_Juliet | JULIET
9 Measure_for_measure | ISABELLA
8 Richard_III | QUEEN_MARGARET
8 Richard_III | DUCHESS_OF_YORK
Romeo has 21 lines in which he says "death" (a line in which the word appears twice is counted only once). After that, it's Duke Vincentio with 18 lines and Brutus with 16 lines.
If we instead count the lines containing "death" in each play, then Romeo and Juliet wins with 73 lines, followed closely by Richard III with 72.
# number of times "death" appears in a line
> grep -w line shakespeare.all.plays.plain.text.txt | grep -i death | cut -d "|" -f 1 | suc | sort -nr
73 Romeo_and_Juliet
72 Richard_III
63 Henry_VI_Part_2
45 Henry_VI_Part_1
43 Richard_II
42 Measure_for_measure
42 Henry_VI_Part_3
39 Hamlet
35 Antony_and_Cleopatra
34 King_John
31 Julius_Caesar
28 Titus_Andronicus
27 Cymbeline
24 Henry_IV,_Part_2
23 A_Winters_Tale
22 King_Lear
22 Coriolanus
21 Macbeth
21 Henry_IV,_Part_1
18 Pericles
17 Much_Ado_about_nothing
17 Alls_well_that_ends_well
16 Troilus_and_Cressida
15 Othello
15 Henry_V
14 A_Midsummer_nights_dream
12 Merchant_of_Venice
10 Twelfth_Night
10 Henry_VIII
9 A_Comedy_of_Errors
8 Timon_of_Athens
8 Loves_Labours_Lost
7 Two_Gentlemen_of_Verona
7 As_you_like_it
6 The_Tempest
6 Taming_of_the_Shrew
6 Merry_Wives_of_Windsor
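The suc step in the commands above is not a standard utility. Assuming it stands for a sort-and-count step, the per-play tally could be sketched with awk alone, as below. The function name count_word_by_play is hypothetical, and the field positions (1 = play, 14 = text) are assumptions read off the sample records. Unlike the grep pipeline, this sketch looks for the word only in the spoken text, not in the whole record. Like the pipeline, it counts a line once even if the word appears in it twice.

```shell
# Count, per play, the spoken lines whose text contains the word $1
# (case-insensitive). Output: "count play", highest count first.
# Assumed positions, read off the sample records:
#   field 1 = play, field 2 = record_type, field 14 = line text.
count_word_by_play() {
  awk -F'|' -v word="$1" '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    trim($2) == "line" && index(tolower($14), tolower(word)) > 0 {
      n[trim($1)]++                       # one count per matching line
    }
    END { for (p in n) print n[p], p }
  ' | LC_ALL=C sort -k1,1nr -k2,2
}
```

It could be run as `count_word_by_play death < shakespeare.all.plays.plain.text.txt`.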