Word Analysis of 2008 U.S. Presidential Debates
Barack Obama vs. John McCain (combined debates)
There be dragons here. The results here are based on a combined transcript from all debates between these candidates. When interpreting metric values from this analysis (e.g. fraction of unique words), keep in mind that you are looking at results based on speech of multiple debates, not one. In the limit of an infinite number of debates, most metrics will converge to a value that is characteristic of the speaker (e.g. total vocabulary size).
Word Statistics
Debate Word Count
Summary Word Count
The summary word count reports the total number of words and the
number of unique, non-stop words
used by each candidate. Word number is expressed as both absolute and relative values.
Table 1. Number of all words and unique words used by each speaker.
Table 1 Analysis
Across all the debates, the candidates delivered just short of 42,000 words. With each debate approximately 1.5 hours in length, the number of unique words delivered by both candidates corresponds to a delivery rate of one unique word every 4.4 seconds (3 debates x 1.5 hours x 3600 s = 16,200 seconds). The average speech rate was 2.6 words per second.
Obama delivered +9.0% more words than McCain and had a larger overall vocabulary, by +3.1%.
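The timing figures in this analysis follow from simple arithmetic. The sketch below reproduces them using only the rounded numbers quoted in the text (about 42,000 words, three debates of about 1.5 hours each, one unique word every 4.4 seconds). The variable names are illustrative, and the unique-word count it prints is only what those rounded figures imply, not a value read from Table 1.

```python
# Rounded figures quoted in the Table 1 analysis.
total_words = 42_000            # "just short of 42,000 words"
debates = 3
hours_per_debate = 1.5          # approximate length of each debate
seconds_per_unique_word = 4.4   # quoted unique-word delivery interval

total_seconds = debates * hours_per_debate * 3600
words_per_second = total_words / total_seconds
# Implied by the rounded inputs above; not a value taken from the table.
implied_unique_words = total_seconds / seconds_per_unique_word

print(f"total speaking time: {total_seconds:,.0f} s")
print(f"average speech rate: {words_per_second:.1f} words/s")
print(f"implied unique words: about {implied_unique_words:,.0f}")
```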
Table 1 Legend
Stop Word Contribution
In the table below, the candidates' delivery is partitioned into stop and non-stop words. Stop words are frequently-used bridging words (e.g. pronouns and conjunctions) and do not carry inherent meaning. The fraction of words that are stop words is one measure of the complexity of speech.
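As a rough illustration of this partition, the sketch below splits a piece of text into stop and non-stop words and reports the stop word fraction. The stop list, the tokenizing rule and the function name `partition_words` are assumptions made for the example; the list actually used in this analysis is not reproduced here.

```python
import re

# Assumption: a tiny illustrative stop list, not the one used in the analysis.
STOP_WORDS = {"a", "an", "and", "the", "i", "we", "you", "it",
              "that", "to", "of", "is", "in"}

def partition_words(text, stop_words=STOP_WORDS):
    """Count total, stop and non-stop words in a piece of text."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    stop = [w for w in words if w in stop_words]
    non_stop = [w for w in words if w not in stop_words]
    return {
        "total": len(words),
        "stop": len(stop),
        "non_stop": len(non_stop),
        "unique_non_stop": len(set(non_stop)),
        "stop_fraction": len(stop) / len(words) if words else 0.0,
    }

summary = partition_words(
    "We need to invest in energy, and we need to invest in health care."
)
print(summary)
```

In this toy sentence half of the 14 words are stop words, and the 7 non-stop words contain 5 unique ones.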
Table 2. Expanded analysis of total, stop and non-stop word count.
Table 2 Analysis
Overall, Obama's stop word fraction was slightly higher than McCain's. However, Obama delivered more words throughout the debates and displayed a greater range of vocabulary, with 2,369 unique non-stop words, +3.4% more than McCain.
Table 2 Legend
All further analysis uses debate content that has been filtered for stop words.
Word frequency
The word frequency table summarizes the frequency with which words were used, specifically the average word frequency and the weighted cumulative frequencies at the 50th and 90th percentiles. The average word frequency indicates how many times, on average, a word is used. For a given fraction of the entire delivery, the weighted cumulative frequency indicates the largest word frequency within this fraction (details about weighted cumulative distribution).
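One way to read the weighted cumulative frequency is sketched below. It assumes that words are ordered from least to most frequent and that each word is weighted by its own count, so that the 50% value is the frequency reached once half of all delivered words have been accumulated. This is an interpretation of the description above, not necessarily the exact procedure used for Table 3, and `frequency_summary` is a hypothetical name.

```python
from collections import Counter

def frequency_summary(words, percentiles=(50, 90)):
    """Average word frequency and weighted cumulative frequencies."""
    counts = Counter(words)
    if not counts:
        raise ValueError("no words to summarize")
    total = sum(counts.values())
    summary = {"average": total / len(counts)}
    # Walk from rare to common words, accumulating the words delivered.
    ordered = sorted(counts.values())
    for p in percentiles:
        running = 0
        for freq in ordered:
            running += freq
            if running * 100 >= p * total:
                summary[p] = freq
                break
    return summary

words = ["tax"] * 5 + ["energy"] * 3 + ["health"] + ["care"]
result = frequency_summary(words)
print(result)  # {'average': 2.5, 50: 3, 90: 5}
```

Here half of the ten delivered words come from words used three times or fewer, and 90% from words used five times or fewer.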
Table 3. Average, 50%, and 90% weighted cumulative word frequencies (content filtered for stop words).
Table 3 Analysis
Absolute values of word frequency statistics for a combined debate transcript are not useful because they are directly proportional to the length of the concatenated transcript. In the limit of a large number of debates, total vocabulary size approaches a limit, and as word count goes up so does word frequency.
However, a comparison between the candidates can still be made. Obama's word frequency is slightly higher than McCain's, but not by much (+3.3%).
Table 3 Legend
Sentence Size
Table 4. Number of words in a sentence, as measured by average number of words, 50% and 90% weighted cumulative values for three word groups (all words, stop words and non-stop words).
Table 4 Analysis
Obama consistently delivers larger sentences, at 8.2 words, compared to McCain, at 7.0 words. Obama's sentence size distribution has a greater component of large sentences. 90% of his speech is in sentences ≤26 words in length, whereas McCain fits 90% of his speech in sentences ≤22 words in length.
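Assuming the 50% and 90% values in Table 4 are weighted by the number of words in each sentence (so that they describe fractions of speech rather than fractions of sentences), the calculation can be sketched as follows. The sentence splitting rule is deliberately naive, and both function names are hypothetical.

```python
import re

def sentence_lengths(text):
    """Word counts of sentences, split naively on ., ! and ?."""
    sentences = re.split(r"[.!?]+", text)
    return [len(s.split()) for s in sentences if s.split()]

def weighted_percentile(lengths, percent):
    """Sentence length below which `percent` of all words are delivered."""
    if not lengths:
        raise ValueError("no sentences")
    total = sum(lengths)
    running = 0
    for n in sorted(lengths):
        running += n  # each sentence is weighted by its word count
        if running * 100 >= percent * total:
            return n

text = ("No. We can do it. I think that is wrong. "
        "We really need to invest in clean energy right now.")
lengths = sentence_lengths(text)
average = sum(lengths) / len(lengths)
print(lengths, average)
print(weighted_percentile(lengths, 50), weighted_percentile(lengths, 90))
```

With sentence lengths of 1, 4, 5 and 10 words, half of the speech sits in sentences of 5 words or fewer, while the 90% mark is only reached by including the 10-word sentence.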
Table 4 Legend
Part of Speech Analysis
In this section, words are broken down by their part of speech (POS). The four POS groups examined are nouns, verbs, adjectives and adverbs. Conjunctions and prepositions are not considered. The first category (n+v+adj+adv) is composed of all four POS groups.
Part of Speech Count
Table 5. Count of words (total and unique) categorized by part of speech (POS).
Table 5 Analysis
This is a great table for the combined debate analysis because it shows the part of speech breakdown across three independent samples of speech and is therefore a more robust measure of the candidates' natural style than a sampling from a single event.
McCain uses more nouns than Obama, with 54.3% of his parts of speech being nouns (remember, in this analysis I only consider nouns, verbs, adjectives and adverbs and all to the exclusion of other parts of speech), whereas Obama's fraction is 52.3%. McCain's +3.8% increase suggests speech with a greater emphasis on concrete concepts.
Verb usage is also greater by McCain, at 26.3% vs Obama's 25.6%. The difference is +2.7%, smaller than for nouns.
Once we get into adjectives and adverbs, however, it's a different story. Obama's use of adjectives and adverbs is significantly higher than McCain's. Obama's adjective fraction is +13.0% larger than McCain's and his adverb fraction is +16.1% larger than McCain's. This suggests that Obama's speech is more nuanced and that he captures and delivers more texture in his nouns and verbs than McCain.
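The signed percentages quoted in this analysis are relative differences between the two candidates' fractions, not differences in percentage points. A minimal check using the noun and verb fractions quoted above (the helper name `relative_increase` is an assumption for the example):

```python
def relative_increase(a, b):
    """Percent by which a exceeds b, relative to b."""
    return (a / b - 1) * 100

# Noun and verb fractions (percent of n+v+adj+adv words) quoted in the text.
noun_gap = relative_increase(54.3, 52.3)   # McCain vs Obama
verb_gap = relative_increase(26.3, 25.6)   # McCain vs Obama

print(f"nouns: {noun_gap:+.1f}% relative, {54.3 - 52.3:.1f} percentage points")
print(f"verbs: {verb_gap:+.1f}% relative, {26.3 - 25.6:.1f} percentage points")
```

The relative values match the +3.8% and +2.7% reported for nouns and verbs.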
Table 5 Legend
Part of Speech Frequency
Table 5. Frequency of words by part of speech (POS).
Table 5 Analysis
Obama's overall part of speech frequency is slightly higher than McCain's, but not by much (+2.3%). He consistently has slightly greater repetition of nouns and verbs, at +4.0% and +2.0% more than McCain, respectively.
Obama's adjective and adverb use frequency is much higher than McCain's, however, at +9.6% and +4.8%, respectively. This increase reflects the greater proportion of adjectives and adverbs in Obama's speech.
Table 5 Legend
Part of Speech Pairing
Through word pairing, I attempt to capture the contextual use of parts of speech within a sentence and extract concepts from the text. Specifically, unique pairs of words indicate complexity and inter-relatedness between concepts in a sentence.
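A minimal sketch of such pairing is shown below. It assumes that a pair is any two different words occurring in the same sentence, that word order within a pair does not matter, and that part of speech tags are already available; how pairs were actually formed for Table 6 is not specified here. The function name and the sample sentences are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def pair_counts(tagged_sentences):
    """Count unordered word pairs per sentence, grouped by POS pairing."""
    pairs = defaultdict(Counter)
    for sentence in tagged_sentences:
        for (w1, p1), (w2, p2) in combinations(sentence, 2):
            if w1 == w2:
                continue
            # Sort by POS, then word, so each pair has one canonical form.
            (pa, wa), (pb, wb) = sorted([(p1, w1), (p2, w2)])
            pairs[f"{pa}/{pb}"][f"{wa} {wb}"] += 1
    return pairs

# Hypothetical, already-tagged sentences (stop words removed).
sentences = [
    [("health", "noun"), ("care", "noun"), ("costs", "noun")],
    [("health", "noun"), ("care", "noun"), ("improve", "verb")],
]
pairs = pair_counts(sentences)
for category, counter in sorted(pairs.items()):
    print(category, "total:", sum(counter.values()), "unique:", len(counter))
print(pairs["noun/noun"]["care health"])  # 2
```

The total count reflects how often pairings occur, while the unique count reflects how many distinct pairings were made.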
Table 6a (Barack Obama). Word pairs (total and unique) categorized by part of speech (POS) for Barack Obama.
Table 6b (John McCain). Word pairs (total and unique) categorized by part of speech (POS) for John McCain.
Table 6c (Barack Obama vs John McCain). Word Pairs (total and unique) categorized by part of speech (POS) for both candidates.
Table 6 Analysis
Obama has a larger delivery of all pairings. The largest difference is in adverb/adverb pairings, with Obama having twice as many as McCain.
When compared to Obama, McCain has significantly fewer pairings that include adjectives and adverbs. While for combinations of nouns and verbs McCain is at 76-85% of Obama, when adjectives and adverbs are brought into the mix McCain is at 50-72%.
These numbers starkly illustrate Obama's greater penchant for precision and modification.
Table 6a,b Legend
Table 6c Legend
Word usage
This section enumerates words that were unique to a candidate
(i.e. used by one candidate but not the other). For a given part of
speech, the table breaks down the number of words that were spoken by
only one of the candidates or by both candidates (intersection). The last
row includes all words (union).
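The exclusive, intersection and union groups map directly onto set operations. The sketch below uses invented word lists and a hypothetical `exclusivity` helper. Because no stemming is applied, inflected forms such as "agree" and "agreed" count as different words.

```python
def exclusivity(words_a, words_b):
    """Split two speakers' vocabularies into exclusive and shared sets."""
    a, b = set(words_a), set(words_b)
    return {
        "only_a": a - b,   # spoken by A but never by B
        "only_b": b - a,   # spoken by B but never by A
        "both": a & b,     # intersection
        "all": a | b,      # union
    }

# Invented word lists for illustration only.
speaker_a = ["agree", "invest", "tax", "energy", "agree"]
speaker_b = ["agreed", "tax", "nuclear", "energy"]

groups = exclusivity(speaker_a, speaker_b)
for name, words in groups.items():
    share = 100 * len(words) / len(groups["all"])
    print(f"{name}: {sorted(words)} ({share:.1f}% of all unique words)")
```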
Table 7. Total and unique words used exclusively by a candidate or by both candidates.
Table 7 Analysis
This is another table that benefits from a combined debate treatment. Here we can see the number of words, by part of speech, spoken exclusively by one candidate, or by both. Presumably, as the number of debates increases, the number of words spoken by one candidate but not the other steadily decreases, until it reaches some core value that represents words truly unique to that candidate (e.g. the other candidate does not know the word, or consciously avoids using it).
The key values to draw your attention to are the number of exclusive unique words (first two rows, second column for each part of speech). This number corresponds to the exclusive contribution by each candidate to the vocabulary of the speech.
For example, of the 1,859 unique nouns used in the debate, 629 (33.8%) were spoken by both candidates, 600 (32.3%) by McCain only and 576 (31.0%) by Obama only. McCain thus contributed more nouns to the debate, and his repetition of these words was lower than Obama's (55.6% vs 62.4%).
When it comes to verbs, however, Obama's contribution is higher, at 34.2% of all debate verbs vs 29.9% for McCain. Note that verbs were the parts of speech that had the lowest shared fraction — only 28.2% of verbs in the debate were spoken by both candidates.
Obama also contributed a greater variety of adjectives and adverbs to the debate. In particular, Obama's contribution to adverbs was 34.2% compared to 24.1% for McCain. In other words, for every 3 adverbs used by Obama not spoken by McCain, McCain had only 2 not spoken by Obama.
The profile presented in this table closely matches the result of previous work by Pennebaker, in which McCain is concluded to be a categorical thinker (heavy noun use), while Obama is fluid and contextual (verb and modifier use).
Table 7c Legend
Noun Phrase Usage
Noun phrases were extracted from the text and analyzed for frequency, word count, unique word count and richness.
Top-level noun phrases are those without a parent noun phrase (a parent phrase is a similar, longer phrase). Derived noun phrases are those with a parent (more details about noun phrase analysis).
The top-level noun phrases can be interpreted as independent concepts. Derived noun phrases can be interpreted as variants on concepts embodied by the top-level phrases.
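The distinction can be sketched as follows, under the assumption that a "similar, longer" parent is a phrase that contains the shorter phrase as a contiguous run of words. The actual similarity rule used in this analysis may differ, and the phrases and function names below are invented for illustration.

```python
def is_parent(parent, child):
    """True if `parent` is longer and contains `child` as a word run."""
    p, c = parent.split(), child.split()
    if len(p) <= len(c):
        return False
    return any(p[i:i + len(c)] == c for i in range(len(p) - len(c) + 1))

def classify(phrases):
    """Split unique phrases into top-level (no parent) and derived."""
    unique = set(phrases)
    top_level, derived = [], []
    for phrase in sorted(unique):
        if any(is_parent(other, phrase) for other in unique):
            derived.append(phrase)
        else:
            top_level.append(phrase)
    return top_level, derived

phrases = ["health care", "health care system", "nuclear power",
           "tax cut", "middle class tax cut"]
top_level, derived = classify(phrases)
print("top-level:", top_level)
print("derived:", derived)
```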
Noun Phrase Count
This table reports the absolute number of noun phrases, which is related to the number of total words (specifically, nouns) delivered. The next table presents the number of phrases relative to the number of nouns.
Table 8. Number of noun phrases.
Table 8 Analysis
Obama delivered +3.8% more noun phrases than McCain. He had +6.5% more top-level noun phrases and +2.1% more derived noun phrases. The increase of top-level noun phrases is greater than the increase of derived noun phrases, suggesting greater variation in concept usage.
Table 8c Legend
Noun Phrase Richness
The previous table presented the total number of noun phrases, which can be equated to individual concepts. In this table, this value is shown relative to the number of nouns used. The interpretation of this ratio is that of richness: how many noun phrases were constructed per noun.
Table 9. Number of noun phrases relative to the number of nouns.
Table 9 Analysis
The number of noun phrases relative to the number of nouns remains relatively constant.
Table 9c Legend
Noun Phrase Frequency and Size
Table 10. Noun phrase frequency, word count and unique word count.
Table 10 Analysis
Noun phrase frequency and size remain relatively constant.
Table 10c Legend
Windbag Index
The Windbag Index is a compound measure that characterizes the complexity of speech. A low index is indicative of succinct speech with a low degree of repetition and a large number of independent concepts.
Table 11. Windbag Index for each speaker. The higher the value, the greater the degree of repetition in the speech.
Table 11 Analysis
This index is not particularly well suited for a combined analysis, because it is expected that the candidates repeat themselves across three debates. The same points will be brought up, the same questions asked, and so on. Naturally, the more words are said the more words are repeated, since the pool of unique words is fixed.
The Windbag Index is +15.6% greater for Obama. Although he does better for verbs, and 2/3 of the noun phrase metrics, his uniqueness scores in other categories are lower.
Table 11c Legend
Tag Clouds
In the tag clouds below, the size of the word is proportional to
the number of times it was used by a candidate (tag cloud details).
Not all words from a group used to draw the cloud fit in the
image. Specifically, less frequently used words for large word groups
fall outside the image.
Debate Tag Clouds for Each Candidate — All Words
Each candidate's debate portion was extracted and frequencies were
compiled for each part of speech (noun, verb, adjective, adverb), with
words colored by their part of speech category. The words in these
tag clouds include words unique to one candidate as well as words used by
both candidates. For other tag clouds below, only words unique to a
candidate are used.
Keep in mind that the word sizes between tag clouds cannot be
directly compared, since the minimum and maximum size of the words in
each tag cloud is the same. However, the distribution of sizes within
a tag cloud reflects the frequency distribution of words (tag cloud details).
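To see why sizes cannot be compared between clouds, consider the sketch below. It assumes a simple linear mapping from word frequency to font size between a fixed minimum and maximum; the actual scaling used for these images is described in the tag cloud details and may differ. The frequencies, size limits and function name are invented.

```python
def font_sizes(frequencies, min_size=10, max_size=48):
    """Map word frequencies linearly onto a fixed font size range."""
    lo, hi = min(frequencies.values()), max(frequencies.values())
    span = hi - lo
    sizes = {}
    for word, freq in frequencies.items():
        # If every word has the same frequency, draw all at maximum size.
        scale = (freq - lo) / span if span else 1.0
        sizes[word] = min_size + scale * (max_size - min_size)
    return sizes

# Invented frequencies for two separate clouds.
cloud_a = {"important": 40, "energy": 20, "tax": 10}
cloud_b = {"nuclear": 8, "spending": 5, "badly": 2}

sizes_a = font_sizes(cloud_a)
sizes_b = font_sizes(cloud_b)
print(sizes_a)
print(sizes_b)
```

The most frequent word in each cloud is drawn at the same maximum size even though one was used 40 times and the other only 8, so only the relative sizes within a single cloud are meaningful.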
Debate Tag Cloud for Barack Obama — all words
Debate Tag Cloud for John McCain — all words
Debate Tag Cloud Analysis
Across all the debates, Obama maintains "important" as his most important (ha ha) word. Note "energy", "health", "economic", "care", "tax" and "people" are central concepts.
In stark contrast, McCain truly feels that "nuclear" is an important topic, relatively as important as "Obama".
Debate Tag Clouds for Each Candidate — Unique Words
The tag clouds below show only words used exclusively by a candidate. For
example, if candidate A used the word "invest" (any number of times),
but the other candidate B did not, then the word will appear in the
unique word tag cloud for candidate A.
Debate Tag Cloud for Barack Obama — words unique to Barack Obama
Debate Tag Cloud for John McCain — words unique to John McCain
Unique Word Tag Cloud Analysis
The unique word clouds are particularly informative in a combined debate analysis. The more words said, the fewer words are attributed to only one candidate and these gain importance with increased number of debate samples. Remember, these are words spoken by one candidate, but not the other, across all debates.
Obama's unique words have a large noun component, with words such as "notion", "fundamentals", "consequence", and "wages". His most prominent unique word was the verb "agree", which McCain did not use (note: there is no stemming done in the analysis — McCain did use "agreed"). Obama's use of "potentially" suggests openness to complications and the unforeseen.
McCain's unique words on the other hand focus nearly exclusively on verbs. He uses strong action words such as "opposes" and "legitimize" which suggest a confrontational and unilateral view. His top unique adverb was "badly", which suggests an attack stance (presumably the word is used in context of his opponent).
Part of Speech Tag Clouds
In these tag clouds, words by both candidates were categorized on the
basis of exclusivity to a candidate. Words unique to each candidate
are drawn with a different color. Words used by both candidates are
shown in grey.
The size of the word is relative to the frequency for the candidate
— word sizes between candidates should not be used to indicate
difference in absolute frequency.
Words were further categorized by part of speech (noun, verb,
adjective, adverb) and individual tag clouds were prepared for each
category.
The last tag cloud in this section uses all (noun + verb +
adjective + adverb) parts of speech.
Tag Cloud of noun words, by speaker
Noun Tag Cloud Analysis
Do you see many blue words? Those are nouns exclusive to McCain and there is hardly a blue word in sight. It is shocking how overwhelmingly Obama's delivery drowns out McCain's contribution in the realm of nouns across all the debates.
The third debate saw a cloud like this, but McCain at least managed to get a few words into the cloud.
Tag Cloud of verb words, by speaker
Verb Tag Cloud Analysis
For verbs, McCain's contribution was overwhelming — a situation opposite to that of nouns. Take a look, however, at what Obama brings to the cloud: words like "agree", "invest", "recognize", "focused" and "thinking". Obama's contribution is that of conciliation and careful consideration.
Tag Cloud of adjective words, by speaker
Adjective Tag Cloud Analysis
The split in adjective contribution is more even between the debaters. McCain's curious repetition of "angry", "excess" and "afraid" contrasts with Obama's central use of "enormous" as well as "strategic", "easy" and "local".
Tag Cloud of adverb words, by speaker
Adverb Tag Cloud Analysis
McCain, though delivering fewer adverbs than Obama, repeats them quite a bit. Here, his relative usage contribution outweighs Obama's. Contrast McCain's "badly" to Obama's "potentially". McCain comes across as a hard-liner whereas Obama comes across as moderate.
Tag Cloud of all words, by speaker
All Tag Cloud Analysis
When all parts of speech are compared, Obama is easily the greater verbal force. McCain's contribution is absolutely swamped out by Obama's unique words.
Word Pair Vignette Tag Clouds for Each Candidate
Tag Cloud of word pairs by Barack Obama
adjective/adjective by Barack Obama
adjective/adverb by Barack Obama
adjective/noun by Barack Obama
adjective/verb by Barack Obama
adverb/adverb by Barack Obama
adverb/noun by Barack Obama
adverb/verb by Barack Obama
noun/noun by Barack Obama
noun/verb by Barack Obama
verb/verb by Barack Obama
Word Pair Tag Cloud Analysis for Barack Obama.
An interesting adjective/adverb pairing frequent for Obama is "military never", as well as "correct quickly". Across all debates, the top pairings suggest focus on "care health" (large noun/noun component), and "think understand" (large verb/verb component).
Tag Cloud of word pairs by John McCain
adjective/adjective by John McCain
adjective/adverb by John McCain
adjective/noun by John McCain
adjective/verb by John McCain
adverb/adverb by John McCain
adverb/noun by John McCain
adverb/verb by John McCain
noun/noun by John McCain
noun/verb by John McCain
verb/verb by John McCain
Word Pair Tag Cloud Analysis for John McCain.
McCain's repetition of "nuclear power" and "national security" drowns out any mention of economy or domestic policy. His largest verb/verb pairing is "america united" (compare this to "think understand" for Obama), and a large component to adverb/verb is "completely control". McCain's stance is one of nationalism and certainty.
Downloads
debate transcript (courtesy of CNN).
parsed word lists (analyzed transcript, including words by speaker, by POS, and all POS pairings).
tag cloud images
data structure
Please see the methods section for details about these files.