Can an actuary / mathematician / data analyst say anything objective and data-oriented about the 2024 US presidential campaign?
Yes, if I confine my remarks to a numerical text analysis of the candidates' words, rather than attempt to comment on the candidates' political and economic views. This analysis is based solely on data that I collected from the two convention speeches.
I performed a sentiment analysis of the candidates' emotional words. Here is a graphical summary, discussion to follow:

[Figure: Sentiment analysis]
[Figure: Most frequent words]
[Figure: Summary statistics]
This is intended as an objective analysis; I am not trying to make either candidate look good or bad. I have used these same text analysis techniques in other projects, such as analyzing Hamlet, analyzing short stories, and analyzing the tweets of a radio talk show host.
For each of Trump and Harris, I started with a transcript of their convention speeches. I believe the transcripts are of the speeches as delivered, not as prepared, based on their first few sentences. I used R packages such as tm and tidytext to tokenize the documents into individual sentences, words, and characters. I was guided by Silge and Robinson's Text Mining with R and Williams's Data Science Desktop Survival Guide.
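As a minimal sketch of this tokenization step (using a made-up two-line stand-in for a transcript, not the actual speech text):

```r
library(tidytext)
library(dplyr)

# Hypothetical stand-in for a transcript: one row per "line" of text
text_df <- data.frame(line = 1:2,
                      text = c("Thank you very much, thank you.",
                               "Tonight we celebrate our country."))

# One row per word: lowercased, punctuation stripped
words <- unnest_tokens(text_df, word, text, token = "words")
count(words, word, sort = TRUE)   # "thank" and "you" each appear twice
```

The same call with `token = "sentences"` or `token = "characters"` produces the sentence and character tokenizations used in the summary statistics.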
A summary and comparison of the tokenization of the speeches follows, repeated from above.
Sentiment analysis is the analysis of text to determine its emotional tone. Linguistics experts have built dictionaries that associate a large list of words with eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) as well as with negative and positive sentiments. For example, the word "ache" is associated with sadness. Some words carry more than one emotion; for example, "admirable" is associated with both joy and trust. Further, some words have negative or positive associations.
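To make the dictionary lookup concrete, here is a small sketch using the NRC lexicon via tidytext (the first call may prompt you to download the lexicon through the textdata package):

```r
library(tidytext)

# The NRC word-emotion association lexicon: one row per (word, sentiment) pair
nrc <- get_sentiments("nrc")

# "ache" is tagged with sadness; "admirable" with both joy and trust
subset(nrc, word %in% c("ache", "admirable"))
```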
There are some limitations in sentiment analysis. The sentiment dictionary does not recognize sarcasm, and because I limit my analysis to single words I do not pick up negation (such as "not expensive") or other instances where the emotion spans several words. A conclusion from the sentiment distribution graph is that the candidates are surprisingly similar in most of these emotions. The biggest differences are that Trump has a greater portion of his words categorized as anger and negative than Harris has.
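One way around the single-word limitation, had I wanted it, is bigram tokenization, which keeps adjacent word pairs together so negations like "not expensive" can be flagged. A sketch, with made-up text:

```r
library(tidytext)
library(dplyr)
library(tidyr)

df <- data.frame(text = "The tickets were not expensive and the show was not bad.")

# Tokenize into two-word sequences, then keep pairs led by a negator
df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("not", "no", "never"))   # "not expensive", "not bad"
```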
Most frequent positive and negative words
Trump has a larger percentage of negative words (negative divided by the sum of positive and negative) than Harris: 43% versus 37%. These positive and negative lists seem consistent with my memory of their speeches.
[Figure: Most frequent words]
[Figure: Distribution of word sizes]
Final thoughts
It is hard to be indifferent about the 2024 US presidential election. You have your opinion, and I have mine.
Much of what the candidates say is opinion, or plans for if elected, and these are not things we can verify.
Some things the candidates state as if they were facts are open to interpretation. A good example is "You are better (or worse) off financially today than four years ago." I can choose one measure, collect some data, and show I am better off; or I can choose a very different measure, collect some data, and show I am worse off.
Some things the candidates state as facts ARE verifiable. I am in no position to do such verifying, but a number of third parties do. Here are a few links. I cannot vouch for their reliability or bias.
The following is my R code:
# Trump convention speech, July 19, 2024
# Harris convention speech, August 23, 2024
library(tidytext)
speaker <- readline(prompt = "Enter Trump, or enter Harris: ")
docs <- Corpus(VectorSource(text_df$text))
custom_colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
common_theme <- theme(
docs_df <- data.frame(text = sapply(docs, as.character)) # Convert Corpus to data.frame
wordcountdist <- wordcountfile %>% count(numbchars)
syls <- nsyllable(wordfile$word, language = "en")
# Flesch-Kincaid reading ease formula
# Function to find the grade based on score; vlookup
score_to_find <- flesch
# delete stop words
wordfreq <- wordfile
unique_words <- nrow(wordfreq)
graphtitle <- paste(speaker, "Word Frequency")
# sentiments; note mother is both positive and negative!
sentiment_colors <- c(
title <- paste(speaker, "- Sentiment Plot")
df4 <- df3 %>% filter(sentiment == "positive" | sentiment == "negative")
title <- paste(speaker, "- Most Frequent Positive and Negative Words")
if (speaker == "Trump"){
# the results of stemming and lemmatizing were not used in the report
# lemmatize
# if word is not in dictionary, then leave word as is; otherwise, use stemmed word.
# End
# https://www.nytimes.com/2024/07/19/us/politics/trump-rnc-speech-transcript.html
# https://singjupost.com/full-transcript-kamala-harriss-2024-dnc-speech/?singlepage=1
library(tm)
library(dplyr)
library(nsyllable)
library(SnowballC)
library(ggplot2)
library(forcats)
library(ggpubr)
if (speaker == "Trump" | speaker == "Harris") print(speaker) else print("Invalid input")
trump_file <- "C:/Users/Jerry/Desktop/Harris_Trump/trump_convention_speech.txt"
harris_file <- "C:/Users/Jerry/Desktop/Harris_Trump/harris_convention_speech.txt"
textfile <- ifelse(speaker=="Trump", trump_file, harris_file)
textfile <- readLines(textfile)
text_df <- data.frame(line = 1:length(textfile), text=textfile)
names(text_df)[2] <- "text"
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation, ucp=TRUE)
docs <- tm_map(docs, stripWhitespace)
inspect(docs[1:8])
"#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
"#6a3d9a", "#ff9e1b", "#f6c6c7", "#8dd3c7", "#ffffb3",
"#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3e2cd",
"#ccebc5")
legend.position = "none", # "NULL" in quotes is not a valid value; "none" hides the legend
plot.title = element_text(size=15, face="bold"),
plot.subtitle = element_text(size=12.5, face="bold"),
axis.title = element_text(size=15, face="bold"),
axis.text = element_text(size=15, face="bold"),
legend.title = element_text(size=15, face="bold"),
legend.text = element_text(size=15, face="bold"))
wordfile <- unnest_tokens(docs_df, word, text, token = "words")
wordfile %>% count(word, sort=TRUE)
wordcountfile <- mutate(wordfile, numbchars = nchar(word)) # characters per word
long1 <- wordcountfile[which(wordcountfile$numbchars == max(wordcountfile$numbchars)),1] # longest word
long2 <- wordcountfile[which(wordcountfile$numbchars == max(wordcountfile$numbchars)),2]
numberchars <- sum(wordcountfile$numbchars)
numberwords <- sum(count(wordcountfile, word, sort=TRUE)$n) # no. words
avgcharperword <- round(numberchars / numberwords, digits=2)
sentencefile <- unnest_tokens(text_df, sentence, text, token = "sentences")
sentencecount <- sum(count(sentencefile, sentence, sort=TRUE)$n)
avgwordpersent <- round(numberwords/sentencecount,2)
wordcountdist$numbchars <- as.factor(wordcountdist$numbchars)
title <- paste(speaker, "- Distribution of Word Size")
subtitle <- paste("Longest word: ", long1, long2, "characters")
ggplot(wordcountdist, aes(numbchars, n, fill=numbchars)) +
geom_bar(stat="identity", position = "dodge", width=0.5) +
labs(title=title, subtitle=subtitle) +
xlab("number of characters per word") + ylab("") +
scale_fill_manual(values = custom_colors) +
theme(legend.position = "none") +
common_theme
syls[which(wordfile$word == "jd")] <- 2 # used because nsyllable generated error here
syls[which(wordfile$word == "nd")] <- 1 # used because nsyllable generated error here
syls[which(wordfile$word == "st")] <- 3 # s/b 21st; used because nsyllable generated error here
syls[which(wordfile$word == "gasolinepowered")] <- 5 # used because nsyllable erred here
long2 <- min(syls[(syls == max(syls, na.rm=TRUE))], na.rm=TRUE)
w <- min(which(syls == long2))
long1 <- wordfile$word[w]
avgsylperword <- round(sum(syls, na.rm=TRUE)/numberwords, digits = 2)
avgsylperword
syls <- data.frame(syls) %>% count(syls)
syls$syls <- as.factor(syls$syls)
colnames(syls) <- c("syllables", "n")
title <- paste(speaker, "- Distribution of No. Syllables per Word")
subtitle <- paste("Most syllables: ", long1, long2, "syllables")
ggplot(syls, aes(syllables, n, fill = syllables)) +
geom_bar(stat="identity", position = "dodge", width=0.5) +
labs(title=title, subtitle=subtitle) +
xlab("number of syllables per word") + ylab("") +
scale_fill_manual(values = custom_colors) +
theme(legend.position = "none") +
common_theme
flesch <- round(206.835 - 1.015*(numberwords/sentencecount) - 84.6*avgsylperword, 2) # Flesch reading ease; avgsylperword is syllables per word (at this point syls is a count table, so sum(syls$n)/numberwords would be ~1)
flesch
flesch_df <- data.frame(score = c(0,30,50,60,70,80,90,100),
grade = c("College graduate","College","10th - 12th grade","8th - 9th grade",
"7th grade","6th grade","5th grade","below 5th grade"))
find_grade <- function(score, flesch_df) {
idx <- findInterval(score, flesch_df$score)
if (idx == 0) {
return("below 5th grade") # Handle case where score is below the minimum
} else {
return(flesch_df$grade[idx])
}
}
flesch_grade <- find_grade(score_to_find, flesch_df)
flesch_grade
docs_df <- data.frame(text = sapply(docs, as.character)) # Convert Corpus to data.frame
wordfile <- unnest_tokens(docs_df, word, text, token = "words")
stop_words <- data.frame(tidytext::stop_words) # more words than tm
my_stop_words <- data.frame(word = c("theyre", "hes", "dont",
"didnt","youre","cant", "im","whats", "weve", "theyve", "youve",
"couldnt", "wont", "youd"))
wordfile <- anti_join(wordfile, stop_words)
wordfile <- anti_join(wordfile, my_stop_words)
wordfreq <- count(wordfreq, word, sort=TRUE) # word frequency excl stop words
wordfreqdf <- data.frame(wordfreq)
portion_unique_words <- round(unique_words / numberwords, digits=2)
wordfreqdf20 <- wordfreqdf[1:21,] # Think about threshold
wordfreqdf20
wordfreqdf20$word <- fct_reorder(wordfreqdf20$word, wordfreqdf20$n, .desc = FALSE)
ggplot(data=wordfreqdf20, aes(x=word, y=n, fill=word)) +
geom_bar(stat="identity", position = "dodge", width=0.5) +
coord_flip() +
common_theme +
xlab("") + ylab("Frequency") +
ggtitle(graphtitle) +
scale_fill_manual(values = custom_colors) +
theme(legend.position = "none")
df1 <- data.frame(wordfile)
colnames(df1) <- "word"
df2 <- get_sentiments("nrc")
df3 <- merge(x=df1, y=df2, by="word", all.x=TRUE, stringsAsFactors=FALSE)
df3 <- subset(df3, !is.na(sentiment))
table(df3$sentiment)
w <- data.frame(table(df3$sentiment))
colnames(w) <- c("sentiment", "n")
"Anger" = "red",
"Anticipation" = "green",
"Disgust" = "brown",
"Fear" = "purple",
"Joy" = "yellow",
"Negative" = "gray",
"Positive" = "lightblue",
"Sadness" = "blue",
"Surprise" = "pink",
"Trust" = "turquoise")
ggplot(w, aes(sentiment, n)) +
geom_bar(stat = "identity", position = "dodge", width = 0.5, fill = sentiment_colors) +
ggtitle(title) +
ylab("") +
common_theme +
theme(axis.text.x = element_text(angle = 45, hjust=1))
w <- with(df4, table(sentiment))
neg <- w[1]
pos <- w[2]
neg_ratio <- round(w[1] / (w[1] + w[2]), digits=2)
df5 <- df4 %>% group_by(sentiment) %>% count(word, sort=TRUE)
pos_freq <- df5 %>% filter(sentiment=="positive") %>% top_n(10, wt = n) %>% slice_head(n = 10)
neg_freq <- df5 %>% filter(sentiment=="negative") %>% top_n(10, wt = n) %>% slice_head(n = 10) # ties
pos_freq$word <- fct_reorder(pos_freq$word, pos_freq$n, .desc = FALSE)
neg_freq$word <- fct_reorder(neg_freq$word, neg_freq$n, .desc = FALSE)
p1 <- ggplot(pos_freq, aes(word, n)) +
geom_bar(stat="identity", position = "dodge", width=0.5, fill="darkgreen") +
ggtitle("Positives") +
common_theme +
xlab("") +
coord_flip()
p2 <- ggplot(neg_freq, aes(word, n)) +
geom_bar(stat="identity", position = "dodge", width=0.5, fill="red") +
ggtitle("Negatives") +
common_theme +
xlab("") +
coord_flip()
plot <- ggarrange(p1,p2, ncol=2, nrow=1, legend=NULL)
annotate_figure(plot, top = text_grob(title,
color = "black", face = "bold", size = 14))
t <- data.frame(speaker, numberwords, avgwordpersent, avgcharperword, avgsylperword, flesch, flesch_grade, portion_unique_words, neg_ratio) # data_frame() is deprecated
print(t)
} else {h <- data.frame(speaker, numberwords, avgwordpersent, avgcharperword, avgsylperword, flesch, flesch_grade, portion_unique_words, neg_ratio)
conclusion <- data.frame(rbind(t,h))
conclusion <- t(conclusion)
colnames(conclusion) <- c("Trump", "Harris")
conclusion <- conclusion[-1,]
print(conclusion)
}
# stemming
wordfile <- wordfile %>%
mutate(stem = wordStem(word)) %>%
count(stem, sort = TRUE)
df1 <- wordfile # df1 has col named stem
url <- "https://raw.githubusercontent.com/michmech/lemmatization-lists/master/lemmatization-en.txt"
df2 <- read.table(url, header = FALSE, sep = "\t", quote = "", stringsAsFactors = FALSE)
names(df2) <- c("stem", "word")
df3 <- merge(x = df1, y = df2, by = "stem", all.x = TRUE, stringsAsFactors=FALSE)
df3$word <- ifelse(is.na(df3$word), df3$stem, df3$word) # if word is not in dictionary, leave it as is; otherwise use the dictionary word
Sunday, August 25, 2024
Text analysis of 2024 US Presidential convention speeches
Sunday, August 11, 2024
Is the Mona Lisa thinking about irrational numbers?
As a math teacher I sometimes share the following problem-solving strategy: If you are really stuck on a problem, let it sit, come back the next day, and maybe you will have a fresh insight. This often works for me.
58 years later ...
I recently applied this strategy with a problem posed by my beloved Bryant High School math teacher Anthony Holman. However, he posed it in 1966, so it took me 58 years before I had any insight. Unfortunately, Mr. Holman passed away in 1985, so I never had the opportunity to discuss this with him.
In calculus class Mr. Holman taught us about e, sometimes called Euler’s number, which is the base of the natural log function. He told us e, like π, would appear in many unlikely places in math, and of course he was right. Then he hypothesized that e and π and other famous irrational numbers were probably part of the great works of art like the Mona Lisa.
That was the only time I remember him making that statement. I had no idea how to approach it. This was 1966, long before computers, even mainframe computers, were readily available to the public. I forgot about his statement for about 55 years.
More recently I have been programming in R, and I have learned a little about computer images. A computer image can be represented as an array of many small elements called pixels, and every pixel is a base 16 number of the hex code of that pixel's color. I decided to interpret Mr. Holman’s hypothesis as whether a finite sequence of the digits of a number like π were contained within the pixel hex code of the Mona Lisa. I decided to separately test four famous irrational numbers: π, e, φ, and √2.
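A pixel's hex code is just its red, green, and blue values written in base 16, and R can move between the two representations directly; a small sketch:

```r
# A pixel with full red, no green, no blue:
rgb(255, 0, 0, maxColorValue = 255)   # "#FF0000"

# And the hex triplet back as a base 10 integer:
strtoi("FF0000", base = 16L)          # 16711680
```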
A quick note about φ: This is the Golden Ratio phi, defined as the ratio of a/b such that φ = a/b = (a + b)/a. The Golden Ratio is well-known to appear in numerous artworks including the Mona Lisa, where various measures of the subject’s face appear in Golden ratio proportions. I am assuming this is not what Mr. Holman intended in his hypothesis.
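For reference, the defining ratio gives a closed form for φ, which is where the digits I search for below come from; a quick check in R:

```r
# phi = a/b with a/b = (a + b)/a  =>  phi^2 = phi + 1  =>  phi = (1 + sqrt(5)) / 2
phi <- (1 + sqrt(5)) / 2
sprintf("%.10f", phi)   # "1.6180339887"
```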
I am making a few assumptions here. An irrational number is an infinite, non-repeating decimal. I cannot test whether an infinite sequence of digits appears within some other sequence of numbers, so I am truncating these irrational numbers to ten decimal places each, which I think is sufficient for this exercise. The tenth decimal place is truncated, not rounded. Of course the test may pass at ten decimal places and then fail at the eleventh.
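Truncation, as opposed to rounding, amounts to chopping the digit string; a sketch with π:

```r
# Keep "3." plus ten decimal places, truncated (not rounded),
# then drop the decimal point to get the search string
pi_str  <- "3.14159265358979"     # more digits than needed
trunc10 <- substr(pi_str, 1, 12)  # "3.1415926535"
gsub("\\.", "", trunc10)          # "31415926535"
```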
Another assumption is that I have a sufficiently high-quality image of the original Mona Lisa. The painting was completed in approximately 1517. It has aged and there has been some physical restoration. A computer image is the result of a photograph, and these photographs also contain some digital retouching. So the result may not be equivalent to the original. Finally, there is a variety of available resolutions including one that has a size of 90 MB. I do not have sufficient computer memory to handle that file size and the calculations of a file that large.
I am also assuming it is sufficient to do this problem with just the Mona Lisa.
In my first attempt, I was able to get the base 16 number of the hex code of every pixel of my Mona Lisa file. The file looks like this for the first six pixels. I show each pixel's base 16 hex code and its base 10 decimal equivalent.
I converted each of the four irrational numbers from base 10 to base 16. Each irrational number is nine characters in base 16. I created one very long string of the pixel hex codes, and I searched for each irrational number within that string. The string has 1.7 million characters, and I was searching for a nine-character sequence. However, none of the irrationals was contained in the string. I decided this first attempt was faulty: the Mona Lisa was created in a base 10 world, and it didn’t make sense to force the irrationals into an artificial base 16 comparison.
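For the curious, converting an irrational's fractional part to base 16 works by repeatedly multiplying by 16 and peeling off the integer part. This is an illustrative sketch for π, not the code I used for the report:

```r
# Fractional base conversion: multiply by 16, the integer part is the next hex digit
frac <- pi - 3
hex_digits <- character(8)
for (i in 1:8) {
  frac <- frac * 16
  hex_digits[i] <- sprintf("%X", floor(frac))
  frac <- frac - floor(frac)
}
paste0("3.", paste(hex_digits, collapse = ""))   # "3.243F6A88"
```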
In my second attempt, I converted each base 16 pixel to base 10, to compare against base 10 irrationals. I think this is a more natural comparison, and if an irrational is going to be contained in a larger sequence then it would be as a base ten irrational.
Unfortunately, this also failed for each of the four irrationals. So, sadly, I cannot confirm Mr. Holman’s hypothesis. Maybe I should let this problem sit for a few more years and come back when I have a fresh idea (or when I buy a computer with more memory). Maybe Mr. Holman has solved it in Heaven. Nevertheless, I’m sure he is smiling at my efforts.
The following is my R code:
# Mona Lisa problem matching base 10 codes
library(imager)
# url1 is huge: 89.9 MB; url2 is 70 KB.
# use url2 unless you have a lot of memory.
url1 <- "https://upload.wikimedia.org/wikipedia/commons/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg"
url2 <- "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg/402px-Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg"
img <- load.image(url2)
plot(img)
img_data <- as.data.frame(img)
head(img_data) # x, y, cc(color channel: r, g, b, transparent), intensity (normalized 0,1)
dim(img_data) # 722,394 x 4
library(dplyr)
img_data <- arrange(img_data, x, y, cc)
# Pivot the data to wide format so each pixel has its R, G, B values in separate columns
library(tidyr)
img_wide <- pivot_wider(img_data, names_from = cc, values_from = value, values_fill = list(value = 0)) # if there are missing values, fill them with 0
dim(img_wide) # 240,798 x 5
# Convert normalized values to 0-255 range
img_wide$R <- img_wide$`1` * 255
img_wide$G <- img_wide$`2` * 255
img_wide$B <- img_wide$`3` * 255
# Convert RGB values to hexadecimal format
img_wide$hex <- rgb(img_wide$R, img_wide$G, img_wide$B, maxColorValue = 255)
img_wide$hex <- gsub('#', '', img_wide$hex)
# Convert hex values to decimal format
library(Rmpfr) # extended floating point computations
options(scipen = 999)
img_wide$dec <- as.numeric(mpfr(img_wide$hex, base=16))
# Check the structure of img_wide
head(img_wide)
# Concatenate dec values into a single string
dec_string <- paste(img_wide$dec, collapse = "")
nchar(dec_string) # 1701600
test_value1 <- "04111665813" # starts in position 2
pi_base10 <- "31415926535" # https://www.angio.net/pi/digits/50.txt 10 decimal places, truncated not rounded
e_base10 <- "27182818284" # https://www.i4cy.com/euler/
phi_base10 <- "16180339887" # https://nerdparadise.com/math/reference/phi10000
root_two_base10 <- "14142135623" # https://nerdparadise.com/math/reference/2sqrt10000
# Check if the dec string contains the digits of pi, e, phi, root 2
check_irrational <- function(value, name) {
if (grepl(value, dec_string)) {
cat("The first ten digits of", name, "are present in the pixel data.\n")
start_position <- regexpr(value, dec_string)
cat("The match starts at position:", start_position, "\n")
} else {
cat("The first ten digits of", name, "are not present in the pixel data.")
}
}
check_irrational(test_value1, "test1")
check_irrational(pi_base10, "π")
check_irrational(e_base10, "e")
check_irrational(phi_base10, "φ")
check_irrational(root_two_base10, "√2")
options(scipen = 0)
# END
#######################################