Online College Math Teacher

Sunday, August 25, 2024

Text analysis of 2024 US Presidential convention speeches

Text analysis of 2024 US Presidential convention speeches, by Jerry Tuttle

Can an actuary / mathematician / data analyst say anything objective and data-oriented about the 2024 US presidential campaign?

Yes, if I confine my remarks to a numerical text analysis of the candidates' words, rather than attempt to comment on the candidates' political and economic views. This analysis is based solely on data that I collected from the two convention speeches.

I performed a sentiment analysis of the candidates' emotional words. Here is a graphical summary, discussion to follow:

Sentiment analysis

I also calculated the candidates' most frequent words, excluding stop words like "a", "and", "the" and so on that don't carry much useful information:

Most frequent words

I calculated statistics of things like average sentence length and average word length:

Summary statistics

This blog is my attempt at some objective and data-oriented analysis of the two presidential candidates from their convention sppeches. This is a text analysis of their speeches. Text analysis is a branch of data analysis that takes unstructured text documents, breaks them into sentences and words, and attempts to find some meaning. Text analysis is used by marketers to evaluate customer opinions, by police departments to monitor potential or actual criminal activites, by researchers to evaluate things like whether Shakespeare really wrote all the plays attributed to him and which Beatles songs were mostly written by John versus which by Paul, and by many others.

This is intended as an objective analysis. I am not trying to make either candidate look good, or bad. I have used these same text analysis techniques in other projects such as analyzing Hamlet, analying short stories, and analyzing the Twitter tweets of a radio talk show host.

For each of Trump and Harris, I started with a transcript of their convention sppeches. I believe the transcripts are their spoken transcripts, not their written transcripts, based on their very first few sentences. I used various computer packages in R such as tm and tidytext to tokenize the documents into individual sentences, words, and characters. I was guided by the works of Silge and Robinson in Text Mining with R and Williams in Data Science Desktop Survival Guide.

A summary and a comparison of the of the tokenization of the speeches is the following, repeated from before.

Obviously Trump spoke for much longer than Harris, so he had many more total words. Harris had a larger average sentence length (18.06 words per sentence versus 10.41), but the two were close in average characters per word and average number of syllables per word. Their numerical Flesch scores were about equal. Flesch score is a measure of the grade level required to understand the text, and we would expect a low grade level in a political speech to the public, in contrast to say a student college paper. Harris had a larger percentage of unique or different words in her speech than Trump (22% to 12%). Trump had a larger percentage of negative words than Harris (43% to 37%), which I will discuss below.

Sentiment analysis ia analyzing text to determine its emotional tone. Linguistics experts have built dictionaries that associate a large list of words with eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and also negative and positive sentiments. For example, the word "ache" is associated with sadness. Some words have more than one emotion; for example "admirable" is associated with both joy and trust. Further, some words have either negative or positive associations.

There are some limitations in sentiment analysis. The sentiment dictionary does not recognize sarcasm, and I am limiting my analysis to single words so I am not picking up negation (such as "not expensive") or other instances where the emotion requires several words. A conclusion from the sentiment distribution graph is that the candidates are surprisingly similar in most of these emotions. The biggest differences are that Trump has a greater portion of his words categorized as anger and negative than Harris has.

Most frequent positive and megative words

Trump's frequent positive words include "love", adjectives "beautiful" and "wonderful", and also "money". His negative words include "tax", "invasion", "inflation", "war", "illegal", and "crime".

Harris's frequent positive words include "freedom", "love", "opportunity", and "forward". Her negative words include "tax", "enforcement", "abuse", and "abortion". Interestingly "mother" is both a positive word and a negative word, as is "vote". (I rechecked the sentiment dictionary to confirm these.)

Trump has a larger percentage of negative words (negative divided by positive plus negative) than Harris (43% to 37%). These positive and negative lists seem consistent with my memory of their speeches.

Most frequent words

Word frequency provides some clue on what is important to each candidate. Not surprisingly, both candidates frequently used words like "America", "Americans", "country", "nation", and "border". Trump used adjectives like "beautiful" and "incredible". Harris spent considerable time talking about Trump and also about her mother. Harris frequently used "middle", "class", "women", and "law".

Distribution of word sizes

Distribution of syllables per word

The distributions of size of words and of syllables per word are pretty similar for both candidtaes.

Final thoughts

It is hard to be indifferent about the 2024 US presidential election. You have your opinion, and I have mine. Much of what the candidates say is their opinion, or their plan if elected, and these things are not things we can verify.

Some things the candidates say as if they are facts, are stated in a way that is open to interpretation. A good example is that "You are better (or worse) off financially today than four years ago." I can choose one measure, collect some data, and show I am better off; or I can choose a very different measure, collect some data, and show I am worse off.

Some things the candidates say as facts ARE verifiable. I am in no position to do such verifying, but a number of third parties do this. Here are a few links. I can not vouch for their reliability or bias.

Text analysis is a popular data analysis technique, and I hope I have added some insight into the candidates in an objective manner.

The following is my R code:

# Trump convention speech, July 19, 2024
# https://www.nytimes.com/2024/07/19/us/politics/trump-rnc-speech-transcript.html

# Harris convention speech, August 23, 2024
# https://singjupost.com/full-transcript-kamala-harriss-2024-dnc-speech/?singlepage=1

library(tidytext)
library(tm)
library(dplyr)
library(nsyllable)
library(SnowballC)
library(ggplot2)
library(forcats)
library(ggpubr)

speaker <- readline(prompt = "Enter Trump, or enter Harris: ")
if (speaker == "Trump" | speaker == "Harris") print(speaker) else print("Invalid input")
trump_file <- "C:/Users/Jerry/Desktop/Harris_Trump/trump_convention_speech.txt"
harris_file <- "C:/Users/Jerry/Desktop/Harris_Trump/harris_convention_speech.txt"
textfile <- ifelse(speaker=="Trump", trump_file, harris_file)
textfile <- readLines(textfile)
text_df <- data.frame(line = 1:length(textfile), text=textfile)
names(text_df)[2] <- "text"

docs <- Corpus(VectorSource(text_df$text))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation, ucp=TRUE)
docs <- tm_map(docs, stripWhitespace)
inspect(docs[1:8])

custom_colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
      "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
      "#6a3d9a", "#ff9e1b", "#f6c6c7", "#8dd3c7", "#ffffb3",
      "#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3e2cd",
      "#ccebc5")

common_theme <- theme(
      legend.position="NULL",
      plot.title = element_text(size=15, face="bold"),
      plot.subtitle = element_text(size=12.5, face="bold"),
      axis.title = element_text(size=15, face="bold"),
      axis.text = element_text(size=15, face="bold"),
      legend.title = element_text(size=15, face="bold"),
      legend.text = element_text(size=15, face="bold"))

docs_df <- data.frame(text = sapply(docs, as.character)) # Convert Corpus to data.frame
wordfile <- unnest_tokens(docs_df, word, text, token = "words")
wordfile %>% count(word, sort=TRUE)
wordcountfile <- mutate(wordfile, numbchars = nchar(word)) # characters per word
long1 <- wordcountfile[which(wordcountfile$numbchars == max(wordcountfile$numbchars)),1] # longest word long2 <- wordcountfile[which(wordcountfile$numbchars == max(wordcountfile$numbchars)),2]
numberchars <- sum(wordcountfile$numbchars)
numberwords <- sum(count(wordcountfile, word, sort=TRUE)$n) # no. words
avgcharperword <- round(numberchars / numberwords, digits=2)
sentencefile <- unnest_tokens(text_df, sentence, text, token = "sentences")
sentencecount <- sum(count(sentencefile, sentence, sort=TRUE)$n)
avgwordpersent <- round(numberwords/sentencecount,2)

wordcountdist <- wordcountfile %>% count(numbchars)
wordcountdist$numbchars <- as.factor(wordcountdist$numbchars)
title <- paste(speaker, "- Distribution of Word Size")
subtitle <- paste("Longest word: ", long1, long2, "characters")
ggplot(wordcountdist, aes(numbchars, n, fill=numbchars)) +
      geom_bar(stat="identity", position = "dodge", width=0.5) +
      labs(title=title, subtitle=subtitle) +
      xlab("number of characters per word") + ylab("") +
      scale_fill_manual(values = custom_colors) +
      theme(legend.position = "none") +
      common_theme

syls <- nsyllable(wordfile$word, language = "en")
syls[which(wordfile$word == "jd")] <- 2 # used because nsyllable generated error here
syls[which(wordfile$word == "nd")] <- 1 # used because nsyllable generated error here
syls[which(wordfile$word == "st")] <- 3 # s/b 21st; used because nsyllable generated error here
syls[which(wordfile$word == "gasolinepowered")] <- 5 # used because nsyllable erred here
long2 <- min(syls[(syls == max(syls, na.rm=TRUE))], na.rm=TRUE)
w <- min(which(syls == long2))
long1 <- wordfile$word[w]
avgsylperword <- round(sum(syls)/numberwords, digits = 2)
avgsylperword
syls <- data.frame(syls) %>% count(syls)
syls$syls <- as.factor(syls$syls)
colnames(syls) <- c("syllables", "n")
title <- paste(speaker, "- Distribution of No. Syllables per Word")
subtitle <- paste("Most syllables: ", long1, long2, "syllables")
ggplot(syls, aes(syllables, n, fill = syllables)) +
      geom_bar(stat="identity", position = "dodge", width=0.5) +
      labs(title=title, subtitle=subtitle) +
      xlab("number of syllables per word") + ylab("") +
      scale_fill_manual(values = custom_colors) +
      theme(legend.position = "none") +
      common_theme

# Flesch-Kincaid reading ease formula
flesch <- round(206.835 - 1.015*(numberwords/sentencecount) - 84.6*(sum(syls$n)/numberwords),2) # Flesch reading ease
flesch
flesch_df <- data.frame(score = c(0,30,50,60,70,80,90,100),
grade = c("College graduate","College","10th - 12th grade","8th - 9th grade",
"7th grade","6th grade","5th grade","below 5th grade"))

# Function to find the grade based on score; vlookup
find_grade <- function(score, flesch_df) {
idx <- findInterval(score, flesch_df$score)
if (idx == 0) {
return("below 5th grade") # Handle case where score is below the minimum
} else {
return(flesch_df$grade[idx])
}
}

score_to_find <- flesch
flesch_grade <- find_grade(score_to_find, flesch_df)
flesch_grade

# delete stop words
docs_df <- data.frame(text = sapply(docs, as.character)) # Convert Corpus to data.frame
wordfile <- unnest_tokens(docs_df, word, text, token = "words")
stop_words <- data.frame(tidytext::stop_words) # more words than tm
my_stop_words <- data.frame(word = c("theyre", "hes", "dont",
"didnt","youre","cant", "im","whats", "weve", "theyve", "youve",
"couldnt", "wont", "youd"))
wordfile <- anti_join(wordfile, stop_words)
wordfile <- anti_join(wordfile, my_stop_words)

wordfreq <- wordfile
wordfreq <- count(wordfreq, word, sort=TRUE) # word frequency excl stop words
wordfreqdf <- data.frame(wordfreq)

unique_words <- nrow(wordfreq)
portion_unique_words <- round(unique_words / numberwords, digits=2)
wordfreqdf20 <- wordfreqdf[1:21,] # Think about threshold
wordfreqdf20

graphtitle <- paste(speaker, "Word Frequency")
wordfreqdf20$word <- fct_reorder(wordfreqdf20$word, wordfreqdf20$n, .desc = FALSE)
ggplot(data=wordfreqdf20, aes(x=word, y=n, fill=word)) +
      geom_bar(stat="identity", position = "dodge", width=0.5) +
      coord_flip() +
      common_theme +
      xlab("") + ylab("Frequency") +
      ggtitle(graphtitle) +
      scale_fill_manual(values = custom_colors) +
      theme(legend.position = "none")

# sentiments; note mother is both positive and negative!
df1 <- data.frame(wordfile)
colnames(df1) <- "word"
df2 <- get_sentiments("nrc")
df3 <- merge(x=df1, y=df2, by="word", all.x=TRUE, stringsAsFactors=FALSE)
df3 <- subset(df3, !is.na(sentiment))
table(df3$sentiment)
w <- data.frame(table(df3$sentiment))
colnames(w) <- c("sentiment", "n")

sentiment_colors <- c(
   "Anger" = "red",
   "Anticipation" = "green",
   "Disgust" = "brown",
   "Fear" = "purple",
   "Joy" = "yellow",
   "Negative" = "gray",
   "Positive" = "lightblue",
   "Sadness" = "blue",
   "Surprise" = "pink",
   "Trust" = "turquoise")

title <- paste(speaker, "- Sentiment Plot")
ggplot(w, aes(sentiment, n)) +
      geom_bar(stat = "identity", position = "dodge", width = 0.5, fill = sentiment_colors) +
      ggtitle(title) +
      ylab("") +
      common_theme +
      theme(axis.text.x = element_text(angle = 45, hjust=1))

df4 <- df3 %>% filter(sentiment == "positive" | sentiment == "negative")
w <- with(df4, table(sentiment))
neg <- w[1]
pos <- w[2]
neg_ratio <- round(w[1] / (w[1] + w[2]), digits=2)
df5 <- df4 %>% group_by(sentiment) %>% count(word, sort=TRUE)
pos_freq <- df5 %>% filter(sentiment=="positive") %>% top_n(10, wt = n) %>% slice_head(n = 10)
neg_freq <- df5 %>% filter(sentiment=="negative") %>% top_n(10, wt = n) %>% slice_head(n = 10) # ties
pos_freq$word <- fct_reorder(pos_freq$word, pos_freq$n, .desc = FALSE)
neg_freq$word <- fct_reorder(neg_freq$word, neg_freq$n, .desc = FALSE)

title <- paste(speaker, "- Most Frequent Positive and Negative Words")
p1 <- ggplot(pos_freq, aes(word, n)) +
      geom_bar(stat="identity", position = "dodge", width=0.5, fill="darkgreen") +
      ggtitle("Positves") +
      common_theme +
      xlab("") +
      coord_flip()
p2 <- ggplot(neg_freq, aes(word, n)) +
      geom_bar(stat="identity", position = "dodge", width=0.5, fill="red") +
      ggtitle("Negatives") +
      common_theme +
      xlab("") +
      coord_flip()
plot <- ggarrange(p1,p2, ncol=2, nrow=1, legend=NULL)
annotate_figure(plot, top = text_grob(title,
    color = "black", face = "bold", size = 14))

if (speaker == "Trump"){
   t <- data_frame(speaker, numberwords, avgwordpersent, avgcharperword, avgsylperword, flesch, flesch_grade, portion_unique_words, neg_ratio)
   print(t)
   } else {h <- data_frame(speaker, numberwords, avgwordpersent, avgcharperword, avgsylperword, flesch, flesch_grade, portion_unique_words, neg_ratio)
    conclusion <- data.frame(rbind(t,h))
    conclusion <- t(conclusion)
    colnames(conclusion) <- c("Trump", "Harris")
    conclusion <- conclusion[-1,]
    print(conclusion)
}

# the results of stemming and lemmatizing were not used in the report
# stemming
wordfile <- wordfile %>%
mutate(stem = wordStem(word)) %>%
count(stem, sort = TRUE)

# lemmatize
df1 <- wordfile # df1 has col named stem
url <- "https://raw.githubusercontent.com/michmech/lemmatization-lists/master/lemmatization-en.txt"
df2 <- read.table(url, header = FALSE, sep = "\t", quote = "", stringsAsFactors = FALSE)
names(df2) <- c("stem", "word")
df3 <- merge(x = df1, y = df2, by = "stem", all.x = TRUE, stringsAsFactors=FALSE)

# if word is not in dictionary, then leave word as is; otherwise, use stemmed word.
df3$word <- ifelse(is.na(df3$word), df3$stem, df3$stem)

# End

Sunday, August 11, 2024

Is the Mona Lisa thinking about irrational numbers?

Is the Mona Lisa thinking about irrational numbers? by Jerry Tuttle

As a math teacher I sometimes share the following problem-solving strategy: If you are really stuck on a problem, let it sit, come back the next day, and maybe you will have a fresh insight. This often works for me.

58 years later ...

I recently applied this strategy with a problem posed by my beloved Bryant High School math teacher Anthony Holman. However, he posed it in 1966, so it took me 58 years before I had any insight. Unfortunately, Mr. Holman passed away in 1985, so I never had the opportunity to discuss this with him.

In calculus class Mr. Holman taught us about e, sometimes called Euler’s number, which is the base of the natural log function. He told us e, like π , would appear in many unlikely places in math, and of course he was right. Then he hypothesized that e and π and other famous irrational numbers were probably part of the great works of art like the Mona Lisa.

That was the only time I remember him making that statement. I had no idea how to approach it. This was 1966, long before computers, even mainframe computers, were readily available to the public. I forgot about his statement for about 55 years.

More recently I have been programming in R, and I have learned a little about computer images. A computer image can be represented as an array of many small elements called pixels, and every pixel is a base 16 number of the hex code of that pixel's color. I decided to interpret Mr. Holman’s hypothesis as whether a finite sequence of the digits of a number like π were contained within the pixel hex code of the Mona Lisa. I decided to separately test four famous irrational numbers: π, e, φ, and √2.

A quick note about φ: This is the Golden Ratio phi, defined as the ratio of a/b such that φ = a/b = (a + b)/a. The Golden Ratio is well-known to appear in numerous artworks including the Mona Lisa, where various measures of the subject’s face appear in Golden ratio proportions. I am assuming this is not what Mr. Holman intended in his hypothesis.

I am making a few assumptions here. An irrational number is an infinite, non-repeating decimal. I can not test whether an infinite sequence of digits appears within some other sequence of numbers. So I am truncating these irrational numbers to ten decimal places each, which I think is sufficient for this exercise. The tenth decimal place is truncated, not rounded. Of course the test may pass at ten decimal places and then fail at the eleventh.

Another assumption is that I have a sufficiently high-quality image of the original Mona Lisa. The painting was completed in approximately 1517. It has aged and there has been some physical restoration. A computer image is the result of a photograph, and these photographs also contain some digital retouching. So the result may not be equivalent to the original. Finally, there is a variety of available resolutions including one that has a size of 90 MB. I do not have sufficient computer memory to handle that file size and the calculations of a file that large.

I am also assuming it is sufficient to do this problem with just the Mona Lisa.

In my first attempt, I was able to get the base 16 number of the hex code of every pixel of my Mona Lisa file. The file looks like this for the first six pixels. I show each pixel's base 16 hex code and its base 10 decimal equivalent.

I converted each of the four irrational numbers from base 10 to base 16. Each irrational number is nine characters in base 16. I created one very long string of the pixel hex codes, and I searched whether each irrational number was contained in the hex string. The string has 1.7 million characters, and I am searching for a nine character sequence. However, the result was none of the irrationals was contained in the string.

I decided the first attempt was faulty. The Mona Lisa was created in a base 10 world, and it didn’t make sense to force the irrationals into an artificial base 16 comparison.

In my second attempt, I converted each base 16 pixel to base 10, to compare against base 10 irrationals. I think this is a more natural comparison, and if an irrational is going to be contained in a larger sequence then it would be as a base ten irrational.

Unfortunately this also failed for each of the four irrationals. So sadly, I can not confirm Mr. Holman’s hypothesis. Maybe I should let this problem sit for a few more years and come back when I have a fresh idea (or when I buy a computer with more memory). Maybe Mr. Holman has solved it in Heaven. Nevertheless, I’m sure he is smiling at my efforts.

The following is my R code:

# Mona Lisa problem matching base 10 codes
library(imager)
# url1 is huge: 89.9 MB; url2 is 70 KB.
# use url2 unless you have a lot of memory.

url1 <- "https://upload.wikimedia.org/wikipedia/commons/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg"
url2 <- "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg/402px-Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg"
img <- load.image(url2)
plot(img)

img_data <- as.data.frame(img)
head(img_data) # x, y, cc(color channel: r, g, b, transparent), intensity (normalized 0,1)
dim(img_data) # 722,394 x 4
library(dplyr)
img_data <- arrange(img_data, x, y, cc)

# Pivot the data to wide format so each pixel has its R, G, B values in separate columns
library(tidyr)
img_wide <- pivot_wider(img_data, names_from = cc, values_from = value, values_fill = list(value = 0)) # # If there are missing values, fill them with 0
dim(img_wide) # 240,798 x 5

# Convert normalized values to 0-255 range
img_wide$R <- img_wide$`1` * 255
img_wide$G <- img_wide$`2` * 255
img_wide$B <- img_wide$`3` * 255

# Convert RGB values to hexadecimal format
img_wide$hex <- rgb(img_wide$R, img_wide$G, img_wide$B, maxColorValue = 255)
img_wide$hex <- gsub('#', '', img_wide$hex)

# Convert hex values to decimal format
library(Rmpfr) # extended floating point computations
options(scipen = 999)
img_wide$dec <- as.numeric(mpfr(img_wide$hex, base=16))

# Check the structure of img_wide
head(img_wide)
# Concatenate dec values into a single string
dec_string <- paste(img_wide$dec, collapse = "")
nchar(dec_string) # 1701600

test_value1 <- "04111665813" # starts in position 2
pi_base10 <- "31415926535" # https://www.angio.net/pi/digits/50.txt 10 decimal places, truncated not rounded
e_base10 <- "27182818284" # https://www.i4cy.com/euler/
phi_base10 <- "16180339887" # https://nerdparadise.com/math/reference/phi10000
root_two_base10 <- "14142135623" # https://nerdparadise.com/math/reference/2sqrt10000

# Check if the dec string contains the digits of pi, e, phi, root 2

check_irrational <- function(value, name) {
if (grepl(value, dec_string)) {
cat("The first ten digits of", name, "are present in the pixel data.\n")
start_position <- regexpr(value, dec_string)
cat("The match starts at position:", start_position, "\n")
} else {
cat("The first ten digits of", name, "are not present in the pixel data.")
}
}

check_irrational(test_value1, "test1")
check_irrational(pi_base10, "π")
check_irrational(e_base10, "e")
check_irrational(phi_base10, "φ")
check_irrational(root_two_base10, "√2")

options(scipen = 0)

# END

#######################################

Wednesday, July 24, 2024

The distribution has changed; and pretty tables in base R

The distribution has changed; and pretty tables in base R, by Jerry Tuttle

So you spent hours, or maybe days, cranking out thousands of numbers, you submit it to your boss just at the deadline, your boss quickly peruses your exhibit of numbers, points to a single number and says, "This number doesn't look right." Bosses have an uncanny ability to do this.

Your boss is pointing to something like this: Your company sells property insurance on both personal and commercial properties. The average personal property premium increased 10% in 2024. The average commercial property premium increased 10% in 2024. But you say the combined average property premium decreased 3% in 2024. You realize that negative 3% does not look right.

You might have made an input error or a calculation error, but you don't want to admit to that. So you blurt out, "That's because the distribution has changed." And to your relief, the boss buys into that.

"The distribution has changed" is probably a pretty good answer in more instances than you realize. A more common example is if your investment portfolio starts at 90% stocks and 10% bonds, but you have a good year with stocks, at year-end the distribution of your portfolio has changed to 94% stocks and 6% bonds, and you may want to rebalance your portfolio. How about the distribution of population by state? It has definitely changed since the prior Census. It might be hard to think of a non-trivial example of something where the distribution has not changed.

The key calculations are the 2023 average premium is (2000 * 90 + 10000 * 10) / (90 + 10) = 2800, the 2024 average premium is (2200 * 110 + 11000 * 7) / (110 + 7) = 2726, and so the average premium change is 2726 / 2800 = .97 (rounded) = - 3%. And the distribution between personal and commercial DID change, as measured either by the distribution of number of policies (was 90% personal, now 94%) or by the distribution of premium (was 64% personal, now 76%). So because the distribution has changed towards more smaller personal policies, this pulls the combined average premium down, even though the average premium for each of the two separate subgroups has increased. (There are alternatives to the key calculations, such as weighting the percentage changes, instead of weighting the average premiums.)

Years ago I posed a dilemma like this during job interviews for actuarial trainees to see how well they would respond to a sort of non-routine problem, but I decided it was too difficult.

I did the above exhibit in Excel because it was quick. It was also pretty easy to add custom colors to alternating rows, which I did via FONT > FILL > MORE COLORS > CUSTOM > enter HEX CODE. Here I chose cornsilk #fff8dc and cyan #00ffff for fun.

Then I wondered how easy it would be to make a pretty table in R. If you google something like "pretty tables in R", you will find a number of R packages that create HTML type code that can be saved as an HTML file, a PDF file, or another file format. Much has been written about these packages, but they seem a little complicated for basic work, and further, I like the idea of staying exclusively within the R environment. When I realized a table is just a collection of rectangles, it occurred to me that the base R commands of rectangle and text are pretty much all I need.

Here is a table of sample rectangles with text, written in R. The rectangle syntax is rect(xleft, ybottom, xright, ytop, col, border) and the text syntax is text(x, y, labels, col, cex, font). The numerical axis is helpful when first defining the rectangles, but can be deleted by adding axes = FALSE to the plot function for the final table.

# rectangle syntax: rect(xleft, ybottom, xright, ytop)
plot(x = c(0, 500), y = c(0, 500), type= "n", xlab = "", ylab = "", main = "Sample rectangles with text", cex=2, font=2)

rect(0, 0, 250, 250, col = "#E41A1C", border = "blue")
text(125, 125, "red rectangle, white font", col="white", cex=1.15, font=2)

rect(250, 0, 500, 250, col = "yellow", border = "blue")
text(375, 125, "yellow rectangle, blue font", col="navyblue", cex=1.15, font=2)

rect(0, 250, 250, 500, col = "cornsilk", border = "blue")
text(125, 375, "cornsilk rectangle, black font", col="black", cex=1., font=2)

rect(250, 250, 500, 500, col = "cyan", border = "blue")
text(375, 375, "cyan rectangle, purple font", col="purple", cex=1.15, font=2)

The R equivalent of the Excel exhibit is the following. Note that all code is in base R.

title <- c("Subgroups increase, but the average decreases")
df <- data.frame(Personal = c(2000, 90, 2200, 110, 10),
Commercial = c(10000, 10, 11000, 7, 10),
Weighted = c(2800, 100, 2726, 117, -3))

rownames(df) = c("2023 Avg Prem", "2023 No. Policies","2024 Avg Prem", "2024 No. Policies", "Avg Prem % Change")
df

# rectangle syntax: rect(xleft, ybottom, xright, ytop)
op <- par(bg = "thistle")
col1 = "cornsilk"
col2 = "cyan"

plot(x = c(0, 500), y = c(0, 700), type= "n", xlab = "", ylab = "", axes = FALSE)
height = 100

rect(0, 6*height, 500, 7*height, col = col1, border = "blue")
text(250, 6.5*height, title, col="black", cex=1.25, font=2)

rect(0, 5*height, 200, 6*height, col = col2, border = "blue")
rect(200, 5*height, 300, 6*height, col = col2, border = "blue")
rect(300, 5*height, 400, 6*height, col = col2, border = "blue")
rect(400, 5*height, 500, 6*height, col = col2, border = "blue")

text(100, 5.5*height, "", col="blue")
text(250, 5.5*height, colnames(df)[1], col="blue")
text(350, 5.5*height, colnames(df)[2], col="blue")
text(450, 5.5*height, colnames(df)[3], col="blue")

rect(0, 4*height, 200, 5*height, col = col1, border = "blue")
rect(200, 4*height, 300, 5*height, col = col1, border = "blue")
rect(300, 4*height, 400, 5*height, col = col1, border = "blue")
rect(400, 4*height, 500, 5*height, col = col1, border = "blue")

text(100, 4.5*height, rownames(df)[1], col="blue")
text(250, 4.5*height, df[1,1], col="blue")
text(350, 4.5*height, df[1,2], col="blue")
text(450, 4.5*height, df[1,3], col="blue")

rect(0, 3*height, 200, 4*height, col = col2, border = "blue")
rect(200, 3*height, 300, 4*height, col = col2, border = "blue")
rect(300, 3*height, 400, 4*height, col = col2, border = "blue")
rect(400, 3*height, 500, 4*height, col = col2, border = "blue")

text(100, 3.5*height, rownames(df)[2], col="blue")
text(250, 3.5*height, df[2,1], col="blue")
text(350, 3.5*height, df[2,2], col="blue")
text(450, 3.5*height, df[2,3], col="blue")

rect(0, 2*height, 200, 3*height, col = col1, border = "blue")
rect(200, 2*height, 300, 3*height, col = col1, border = "blue")
rect(300, 2*height, 400, 3*height, col = col1, border = "blue")
rect(400, 2*height, 500, 3*height, col = col1, border = "blue")

text(100, 2.5*height, rownames(df)[3], col="blue")
text(250, 2.5*height, df[3,1], col="blue")
text(350, 2.5*height, df[3,2], col="blue")
text(450, 2.5*height, df[3,3], col="blue")

rect(0, height, 200, 2*height, col = col2, border = "blue")
rect(200, height, 300, 2*height, col = col2, border = "blue")
rect(300, height, 400, 2*height, col = col2, border = "blue")
rect(400, height, 500, 2*height, col = col2, border = "blue")

text(100, 1.5*height, rownames(df)[4], col="blue")
text(250, 1.5*height, df[4,1], col="blue")
text(350, 1.5*height, df[4,2], col="blue")
text(450, 1.5*height, df[4,3], col="blue")

rect(0, 0, 200, height, col = col1, border = "blue")
rect(200, 0, 300, height, col = col1, border = "blue")
rect(300, 0, 400, height, col = col1, border = "blue")
rect(400, 0, 500, height, col = col1, border = "blue")

text(100, .5*height, rownames(df)[5], col="blue")
text(250, .5*height, paste(df[5,1], "%"), col="black", cex=1.5)
text(350, .5*height, paste(df[5,2], "%"), col="black", cex=1.5)
text(450, .5*height, paste(df[5,3], "%"), col="black", cex=2, font=2)

par(op)

# END

#######################################

Thursday, July 18, 2024

Radar charts and five-tool baseball players

Radar charts and five-tool baseball players by Jerry Tuttle

I was looking for an opportunity to practice with radar charts and I came across an article on five-tool baseball players, so this seemed like a perfect application for this kind of chart.

A radar chart is an alternative to a column chart to display three or more quantitative variables. The chart graphs the values in a circular manner around a center point.

The five tools in baseball are: (1) hitting for average; (2) hitting for power; (3) defense; (4) throwing; and (5) speed. A five-tool player excels in all five of these.

Among current players, Mike Trout is considered a five-tool player. The measurement of Trout’s five tools can be displayed in the following radar chart:

Trout is rated at 80 for hitting for average, 70 for hitting for power, and his lowest scores are 60 for defense, throwing and speed. This is based on a 20-to-80 rating system, where 80 is elite, 70 is plus-plus, and 60 is plus. Sorry - I could not get the points to line up with the concentric pentagons.

For comparison, here is a display of Aaron Judge's ratings.

Judge is rated at 80 for hitting for power, 70 for hitting for average, 60 for defense, 70 for throwing, and 50 for speed, where the 50 is average at the major league level.

The results of several players can be displayed in a single radar chart, but this becomes hard to read. Three players are probably the maximum for readability.

The alternative to visualizing several players is either to create several individual radar charts or else to create a bar (horizontal) chart or a column (vertical) chart.

Each of the five tools is generally rated on a 20-to-80 scale, where 50 is average (for a major leaguer), 80 is elite, and every 10 points is supposed to represent one standard deviation. I suspect the standard deviation concept is more judgmental than mathematical. There is not a single rating system; some use traditional baseball statistics, and others use modern motion tracking data.

The numerical data above was obtained from an article by Jake Mintz in 2022 for Fox Sports https://www.foxsports.com/stories/mlb/trout-betts-rodriguez-the-definition-of-mlbs-five-tool-players . In Mintz's data, all numbers are shown rounded to the nearest 10. Mintz only has five current players as five-tool players: Mike Trout, Mookie Betts, Trea Turner, Byron Buxton, and Julio Rodriguez. I tried graphing all five players in a single radar chart, but this was too hard to read. Mintz thinks a true five-tool player should have a grade of at least 60 in each of the five categories. By this measure, Aaron Judge is not quite a five-tool player due to a 50 in speed, and a number of elite major leaguers have at least one 50. Note that each category is considered separately. If there were some sort of weighting system, many people would weigh hitting with power as most important, followed by hitting for average, although perhaps the weights should vary by position with higher weights for defense and throwing for catcher, middle infielders, and center fielder. Pitchers have a different grading system.

What about Shohei Ohtani? At the time of his article, Mintz did not have sufficient data on Ohtani.

Mintz observes that Mike Trout worked one winter to improve his throwing, and Julio Rodriguez worked to increase his speed. This suggests that the ratings probably change over the life of a player and are dependent on when they are measured.

Other authors suggest that there is a sixth tool of exceptional players such as mental makeup and character. Another tool might be situational game awareness.

Modern motion tracking data by Statcast and others did not exist until fairly recently. Willie Mays is generally considered the greatest five-tool player. Using statistical measures, author Herm Krabbenhoft suggests Tris Speaker, Ty Cobb, and Honus Wagner should be considered as five tool players, although Krabbenhoft measures hitting for power with SLG (slugging percentage) and ISO (isolated power), not home runs https://sabr.org/journal/article/honus-wagner-baseballs-prototypical-five-tooler/ . A very different measure of hitting with power would be something like home run distance greater than 425 feet or launch angle and velocity.

What about Babe Ruth? We know Babe Ruth's career numbers are .342 batting average and 714 home runs. I have not read anything about his defense, throwing, or speed. He did steal 123 bases, including home 10 times; maybe he was faster than we realize. He is remembered for getting thrown out stealing second to end the 1926 World Series, but perhaps the hit-and-run play was on, and Bob Meusel, the batter, swung and missed the pitch? See https://baseballegg.com/2019/10/30/babe-ruths-failed-stolen-base-attempt-ended-the-1926-world-series-or-is-that-what-really-happened/ . Ruth had 204 assists as an outfielder, which sounds like a lot. I wonder how he would have ranked in defense, throwing, and speed?

Here is my R code. I do like radar charts for comparing one to three observations over five variables, as a change of pace from column charts. I used the fmsb library for the radar charts. There is also a ggradar library, but I did not like its visualization. One of the quirks of fmsb is that the axis for each variable can have its own scale. Originally I used each variable's max and min values, but the axes were out of sync, so I replaced this with the grand max and min. Also, I could not get the values, which are all multiples of ten, to line up on the concentric pentagons.

library(fmsb)
library(scales)

group = c("Hit_avg", "Hit_power", "Defense", "Throwing", "Speed")
player_names = c("Trout","Betts","Judge")
players <- data.frame(
   row.names = player_names,
   Hit_avg = c(80, 70, 70),
   Hit_power = c(70,60,80),
   Defense = c(60,70,60),
   Throwing = c(60,80,70),
   Speed = c(60,70,50))
players

# The row 1 should contain the maximum values for each variable
# The row 2 should contain the minimum values for each variable
# Data for cases or individuals should be given starting from row 3
# Define the variable ranges: maximum and minimum; however, want axes to have equal scales

max_min <- data.frame(
   Hit_avg = c(max(players), min(players)),
   Hit_power = c(max(players), min(players)),
   Defense = c(max(players), min(players)),
   Throwing = c(max(players), min(players)),
   Speed = c(max(players), min(players)))

rownames(max_min) <- c("Max", "Min") # row 1 has max's, row 2 has min's.
df <- rbind(max_min, players)
df

player1_data <- df[c("Max", "Min", player_names[1]), ]
player2_data <- df[c("Max", "Min", player_names[2]), ]
player3_data <- df[c("Max", "Min", player_names[3]), ]

chart <- function(data, color, title){
     radarchart(data, axistype = 0,
       pcol = color, pfcol = scales::alpha(color, 0.5), plwd = 2, plty = 1,
        vlabels = colnames(data), vlcex = 1.5,
       cglcol = "black", cglty = 1, cglwd = 0.8,
        caxislabels = NULL,
       title = title)
}

# Plot the data for players 1, 2, and 3 separately
chart(data=player1_data, color="#00AFBB", title="MIKE TROUT 5 Tools")
chart(data=player2_data, color="#E7B800", title="MOOKIE BETTS 5 Tools")
chart(data=player3_data, color="#FC4E07", title="AARON JUDGE 5 Tools")

# Plot the data for three players
chart(data=df, color=c("#00AFBB", "#E7B800", "#FC4E07"), # blue-green, red-green, red-green
     title="TROUT, BETTS, JUDGE 5 Tools")
legend(
     x = "bottom", legend = rownames(df[-c(1,2),]), horiz = FALSE,
     bty = "n", pch = 20 , col = c("#00AFBB", "#E7B800", "#FC4E07"),
     text.col = "black", cex = 1.25, pt.cex = 1.5)

###########################################

# column graphs

library(tibble)
library(tidyr)
library(ggplot2)
# Reshape data to long format
players_long <- players %>%
rownames_to_column("player") %>%
pivot_longer(cols = -player, names_to = "group", values_to = "value")

# Common theme for graphs
common_theme <- theme(
   legend.position="right",
   plot.title = element_text(size=15, face="bold"),
   axis.title = element_text(size=15, face="bold"),
   axis.text = element_text(size=15, face="bold"),
   legend.title = element_text(size=15, face="bold"),
   legend.text = element_text(size=15, face="bold"))

# Create column graph: Tool Ratings by Player
ggplot(players_long, aes(x = player, y = value, fill = group, title = "Tool Ratings by Player")) +
   geom_col(position = "dodge") +
   labs(x = "Player", y = "Rating", fill = "Group") +
   common_theme

# Create the column graph: Player Ratings for each Tool
ggplot(players_long, aes(x = group, y = value, fill = player)) +
   geom_col(position = "dodge") +
   labs(x = "Group", y = "Rating", fill = "Player", title = "Player Ratings for each Tool") +
   common_theme

### END

##################################################################################

Sunday, June 23, 2024

Something a llttle different: Hexbin maps

Something a llttle different: Hexbin maps by Jerry Tuttle

I recently became acquainted with hexbin maps, so I thought I would experiment with one. In a hexbin map, each geographical region is represented by an equally sized regular hexagon. (A regular hexagon can tessellate a plane, one of only three regular polygons that can do so besides an equilateral triangle and a square.) A hexbin map can be used to illustrate the distribution of a categorical variable by using colors, at a cost that geographical areas are distorted by size, shape, and orientation. Here is a hexbin map, followed by the same data in a conventional map. The four categories represent which of the two political parties had the larger popular vote in the 2020 presidential election and by what margin over the other party.

# Hexbin map

# Conventional map

A hexbin is based on a geoJson file that provides the hexagon boundaries as two-dimensional coordinates. I downloaded a file from https://team.carto.com/u/andrew/tables/andrew.us_states_hexgrid/public/map in geo json format containing the boundaries of the US states. I read it in with read_sf from the R sf (simple features spatial vector data) library. I used a datafile of 2020 presidential election popular votes by state scraped from https://www.presidency.ucsb.edu/statistics/elections/2020 . I binned percent Democraric votes to total votes minus percent Republican votes to total votes into four bins and joined to the sf file. I used ggplot to plot the hexbin map, where label = iso3166_2 provides a two-character state abbreviation.

For the conventional map, I used the R usmap library.

A hexbin map is a nice alternative to a conventional map. With a conventional map, it's hard to read those small states. Howwever, it is not the right visualization tool if the data is quantitative. It is also not the right tool if the size or geography is important. Utah is not the same size as Califirnia in either area or number of voters, and South Carolina is not east of North Carolina.

An alternative to hexagonal areas is to make them squares, but hexagons seem more interesting.

Here is the R code I used:

# plot the conventional map first:
library(usmap)
# Color maps with data; must have fips column
library(ggplot2)
library(readxl)
data <- read_excel("C:/Users/Jerry/Desktop/R_files/presidential_2020.xlsx")
data$party <- ifelse(data$Dem_Minus_Rep >=0 & data$Dem_Minus_Rep <= 0.10, 'Dem 0 to 10%',
      ifelse(data$Dem_Minus_Rep > 0.10, 'Dem > 10%',
      ifelse(data$Dem_Minus_Rep <0 & data$Dem_Minus_Rep >= -0.10, 'Rep 0 to 10%',
      ifelse(data$Dem_Minus_Rep < -0.10, 'Rep > 10%', "Error"))))
#
party_colors <- c("blue", "lightskyblue", "red", "#FF66CC")
mytitle="2020 Presidential Election Popular Vote by State"
#
plot_usmap(data = data, values = "party", labels = TRUE, label_color = "white") +
    labs(title=mytitle) +
    scale_fill_manual(values=party_colors) +
    theme(
      legend.position="bottom",
      plot.title = element_text(size=15, face="bold"),
      legend.title = element_text(size=12, face="bold"),
      legend.text = element_text(size=12, face="bold"))
#
# plot the hexbin map:
# Download coordinates from https://team.carto.com/u/andrew/tables/andrew.us_states_hexgrid/public/map
# in geo json format
my_sf <- read_sf("C:/Users/Jerry/Desktop/R_files/us_states_hexgrid.geojson")
my_sf <- my_sf %>%
    mutate(google_name = gsub(" \$United States\$", "", google_name))
#
# join sf file with data file
my_sf_data <- my_sf %>%
    left_join(data, by = c("google_name" = "STATE"))
#
ggplot(my_sf_data) +
    geom_sf(aes(fill = party), linewidth = 0, alpha = 0.9) +
    geom_sf_text(aes(label = iso3166_2), color = "white", size = 4, alpha = 1) +
    theme_void() +
    scale_fill_manual(values=party_colors) +
    ggtitle(mytitle) +
    theme(
      legend.position="bottom",
      plot.title = element_text(size=15, face="bold"),
      legend.title = element_text(size=12, face="bold"),
      legend.text = element_text(size=12, face="bold"))
# End

##################################################################################

Thursday, February 29, 2024

How easily can you be identified on the Internet?

Suppose you finish your meal at a restaurant, you are about to pay the check, and the manager asks, “Would you like a free dessert on your next birthday? Just fill out the bottom of the check with your date of birth, your email address and your zip code, and we’ll send you an email with a coupon just before your birthday.”

That sounds OK, but you ask, “You mean just the month and day, right?”

The manager replies, “No, we need the year too because we will send you other coupons that are age-appropriate. The zip code is just so we know how far our customers traveled.”

That sounds reasonable, but you are still a little skeptical. “Are you going to sell your mailing list to others with my name and home address?”

The manager replies, “We don’t need your name or your home address for this.” So you fill out the bottom of the check. Still skeptical, you pay with cash so the restaurant has no record of your name.

And of course, several months later you receive a coupon by snail mail at your home address, correctly addressed to you by name and home address.

Suppose you google your own name. I googled mine. Eliminating people with the same name who are not me, I found references to me at a number of free sites that state at least two of my age, mailing address, phone numbers, and family members.

Some of these have my correct birth date and some do not. Most of these have my correct age. Mailing addresses, phone numbers, email addresses, and family members may or may not be accurate, or current.

Computer scientist Latanya Sweeney, currently a dean at Harvard, is credited with a commonly quoted statement that 87% of the US population is uniquely identifiable from just 5-digit zip code, gender, and date of birth, (Sweeney, 2000.) She used many publicly available data sources in her research including voter registration lists.

Sweeney provides this quick calculation: 365 days in a year x 100 years x 2 genders = 73,000 unique permutations. (About My Info, 2013.) But many 5-digit zip codes are not that large. My zip code from my youth in New York City has a population of about 67,000. My current zip code is about 31,000. Sweeney’s work estimates that in my current 5-digit zip code, combined with some census data on total number of people with my age and gender, there is likely to be only one person in my zip code with my gender and date of birth.

So when that restaurant sells its customer database to some third party for a completely different use, that third party can probably identify me by name, even though I never gave my name or address to the restaurant. And that was the time I ordered the burger AND the fries.

About My Info. (2013.) How unique am I? Retrieved from https://aboutmyinfo.org/identity/about

Sweeney, L. (2000.) Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh. Retrieved from https://dataprivacylab.org/projects/identifiability/paper1.pdf

Sunday, February 25, 2024

Miss Google

Some time ago I reported that the voice of Google Maps, whom I call Miss Google, seemed to be getting lazy. I facetiously said that she stopped wearing mascara and lipstick, and that more importantly she stopped naming specific streets and exit names.

It turns out the real Miss Google was quite offended by this. She said she absolutely did not stop wearing mascara and lipstick. But more importantly, she told me to set my phone setting in Google Maps to DEFAULT English, not simply English. That did solve my Google Maps problem.

I was curious what Miss Google looks like. She refused to tell me. I thought I could cleverly get around this by asking Gemini, the Google AI tool, but Gemini said it would violate Google's AI principles to share a picture.

Undeterred, I tried several other AI tools. Gencraft was happy to answer my question of what Miss Google looks like, and here is the answer:

Surprisingly, I got a number of responses from my original comment about Miss Google and lipstick. Now that I have an authoritative picture, I went to another site to determine her lipstick shade. That site identifies it as #91454C, which Carol tells me is a burgundy.

Happy to be of service in the world of data analysis.

plot(-10:10, -10:10, type = 'n',
axes = FALSE, xlab = NA, ylab = NA)
rect(-8, -8, 18, -4, col = '#91454C', border="blue")

Sunday, February 11, 2024

Taylor Swift and Data Analysis

Taylor Swift and Data Analysis. by Jerry Tuttle

Who will be the most talked-about celebrity before, during, and after the Super Bowl?

She is an accomplished performer. songwriter, businesswoman, and philanthropist. I think she is very pretty. And those lips!

So what can a data analyst add to everything that has been said about her?

I was curious whether R could identify her lipstick color. I should preface this by saying I have some degree of color-challengedness, although I am not colorblind. I am also aware that you can Google something like "what lipstick shade does taylor swift use" and you will get many replies. But I am more interested in an answer like E41D4F. I do wonder if I could visit a cosmetics store and say, "I'd like to buy a lipstick for my wife. Do you have anything in E41D4F ?"

There are sites that take an image, allow you to hover over a particular point, and the site will attempt to identify the computer color. Here is one: RedKetchup But I want a more R-related approach. A note on computers and colors: A computer represents an image in units called pixels. Each pixel contains a combination of base sixteen numbers for red, green and blue. A base 16 number ranges from 0 through F. Each of red, green and blue is a two-digit base 16 number, so a full number is a six-digit base 16 number. There are 16^ 6 = 16,777,216 possible colors. E41D4F is one of those 16.8 million colors.

There are R packages that will take an image and identify the most frequent colors. I tried this with the image above, and I got unhelpful colors. This is because the image contains the background, her hair, her clothing, and lots of other things unrelated to her lips. If you think about it, the lips are really a small portion of a face anyway. So I need to narrow down the image to her lips.

I plotted the image on a rectangular grid, using the number of columns of the image file as the xlimit width, and the number of rows as the ylimit height. Then by trial and error I manually found the coordinates of a rectangle for the lips. The magick library allows you to crop an image, using this crop format:

<width>x<height>{+-}<xoffset>{+-}<yoffset> The y offset is from the top, not the bottom. The cropped image can be printed.

The package colouR will then identify the most frequent colors. I found it necessary to save the cropped image to my computer and then read it back in because colouR would not accept it otherwise. The getTopCol command will extract the top colors by frequency. I assume it is counting frequency of hex color codes among the pixel elements. Here is a histogram of the result:

Really? I'm disappointed. Although I am color-challenged, this can't be right.

I have tried this with other photos of Taylor. I do get that she wears more than one lipstick color. I have also learned that with 16.8 million colors, perhaps the color is not identical on the entire lip - maybe some of you lipstick aficionados have always known this. All of these attempts have been somewhat unsatisfactory. There are too many colors on the graph that seem absolutely wrong, and no one color seems to really capture her shade, at least as I perceive it. Any suggestions from the R community?

No matter who you root for in the Super Bowl - go Taylor.

Here is my R code:

library(png)
library(ggplot2)
library(grid)
library(colouR)
library(magick)

xpos <- c(0,0,0)
ypos <- c(0,0,0)
df <- data.frame(xpos = xpos, ypos = ypos)

# downloaded from
# https://img.etimg.com/thumb/msid-100921419,width-300,height-225,imgsize-50890,resizemode-75/taylor-swift-mitchell-taebel-from-indiana-arrested-for-stalking-threatening-singer.jpg

img <- "C:/Users/Jerry/Desktop/R_files/taylor/taylor_swift.png"
img <- readPNG(img, native=TRUE)
height <- nrow(img) # 457
width <- ncol(img) # 584
img <- rasterGrob (img, interpolate = TRUE)

# print onto grid
ggplot(data = df,
aes(xpos, ypos)) +
xlim(0, width) + ylim(0, height) +
geom_blank() +
annotation_custom(img, xmin=0, xmax=width, ymin=0, ymax=height)

#############################################
# choose dimensions of subset rectangle

width <- 105
height <- 47
x1 <- 215 # from left
y1 <- 300 # from top

library(magick)
# must read in as magick object
img <- image_read('C:/Users/Jerry/Desktop/R_files/taylor/taylor_swift.png')
# print(img)

# crop format: x{+-}{+-}
cropped_img <- image_crop(img, "105x47+215+300")
print(cropped_img) # lips only
image_write(cropped_img, path = "C:/Users/Jerry/Desktop/R_files/taylor/lips1.png", format = "png")

##############################################

# extract top colors of lips image

top10 <- colouR::getTopCol(path = "C:/Users/Jerry/Desktop/R_files/taylor/lips1.png",
n = 10, avgCols = FALSE, exclude = FALSE)
top10

# plot
ggplot(top10, aes(x = hex, y = freq, , fill = hex)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = top10$hex) + # added this line based on suggestion
labs(title="Top 10 colors by frequency") +
xlab("HEX colour code") + ylab("Frequency") +
theme(
legend.position="NULL",
plot.title = element_text(size=15, face="bold"),
axis.title = element_text(size=15, face="bold"),
axis.text.x = element_text(angle = 45, hjust = 1, size=12, face="bold"))

# End
##################################################################################

Tuesday, January 2, 2024

Using great circle distance to determine Twin Cities

In the US we think of Minneapolis and St. Paul as the Twin Cities, but Ben Olin, author of Math with Bad Drawings, https://mathwithbaddrawings.com/, posed this data analysis question: Which U.S. cities are the true twin cities? He imposed three conditions:

the cities must be at most 10 miles apart,
each city must have at least 200,000 people, and
the populations have to be within a ratio of 2:1.

This seemed like a nice data analysis problem, so I searched for a dataset containing both population and location. Interestingly, each of population and location has its nuances, and I learned a lot more from this problem than I expected.

I found https://simplemaps.com/data/us-cities has a dataset of 30,844 cities containing latitude, longitude, and population. However, when Minneapolis’ population came out as 2.9 million, Ben recognized that the population was shown for the broader metropolitan area, not the city proper. I got a second dataset of populations of city propers from https://www.census.gov/data/tables/time-series/demo/popest/2020s-total-cities-and-towns.html, I joined the two datasets, and I used the populations from the second database.

How do you measure the distance between two cities? This is not as simple a question as it sounds.

As a start, I used https://www.distancecalculator.net/ and entered two cities I am familiar with, New York, NY and Hoboken, NJ. That website calculated a distance of 2.39 miles, and it provided a map. The site further clarified that it used the great circle distance formula. So this raises two questions: How does it decide which two points to measure from, and what is a great circle distance?

Hoboken has an area of 1.97 square miles, so it probably doesn’t matter too much which point in Hoboken to use. New York City has an area of 472.43 square miles, so it does matter which point to use. It is not obvious which point it used, and it certainly did not use the closest point, but from other work, I believe it used City Hall or something close.

Although some sites will measure distance between two cities as driving distance, it is more common to use great circle distance, which is the shortest distance along the surface of a sphere. The earth is not exactly a sphere, but it is approximately a sphere.

Latitude and longitude is a coordinate system to describe any point on the earth’s surface. Lines of latitude are horizontal lines parallel to the Equator, and represent how far north or south a point is from the Equator. Lines of longitude represent how far a point is east or west from a vertical line called the Prime Meridian that runs through Greenwich, England. Both latitude and longitude are measured in degrees, which are broken down into smaller units called minutes and seconds. For convenience they are also expressed in decimal degrees. If D is degrees, M is minutes, and S is seconds, then the conversion to decimal degree uses D + M/60 + S/3600.

When we use trig functions to calculate distances, we need to convert decimal degrees to radians by multiplying by π / 180. We also need to know the radius of the earth which is 3963.0 miles.

If point A is (lat1, long1) in decimals and point B is (lat2, long2) in decimals, then the distance formula d is the great-circle distance on a perfect sphere using the haversine distance formula, which is derived from principles of three-dimensional spherical trigonometry including the spherical law of cosines. A haversine of an angle θ is defined as hav(θ) = sin²(θ/2), and this concept is used in the derivation.

d = ACOS(SIN(PI()*[Lat_start]/180.0)*SIN(PI()*[Lat_end]/180.0)+COS(PI()*[Lat_start]/180.0) *COS(PI()*[Lat_end]/180.0)*COS(PI()*[Long_start]/180.0-PI()*[Long_end]/180.0))*3963

As I mentioned, I used https://simplemaps.com/data/us-cities which contains cities with their latitude and longitude, and I applied the above distance formulas to pairs of cities. But I was still curious about the choice of a latitude and longitude for a particular city. That file lists New York City as (40.6943, -73.9249). Another website that finds a street address from a decimal latitude and longitude, https://www.mapdevelopers.com/reverse_geocode_tool.php , lists the address of (40.6943, -73.9249) as 871 Bushwick Avenue, Brooklyn, which is some distance from City Hall, but does not appear to be the centroid of New York City either. Wikipedia's choice of latitude and longitude for New York City is 42 Park Row which is close to City Hall.

The following map from https://www.mapdevelopers.com/reverse_geocode_tool.php?lat=40.694300&lng=-73.924900&zoom=12 shows the approximate location of 871 Bushwick Avenue, Brooklyn.

I joined the dataset of locations with the populations of city propers from the second dataset, and I applied Ben’s three conditions. This produced 8 pairs of cities, and of course this list uses the first dataset’s choice of a city’s latitude and longitude and the distances resulting from that. Different choices of latitude and longitude produce a different list.

Of these pairs, I actually like Hialeah and Miami as the true twin cities. Besides meeting the original three criteria, they both share the same large ethnic population and they share a public transportation system.

Wikipedia has a much larger list of twin cities https://en.wikipedia.org/wiki/Twin_cities, but they did not necessarily use Ben Olin's three criteria. Also, Ben's problem is for US cities only, and Wikipedia has several pairs of Canada-US and Mexico-US cities that I had not thought about.

Here is the R code I used:

df1 <- read.csv("https://raw.githubusercontent.com/fcas80/jt_files/main/uscities.csv")
df1 <- subset(df1, select = c(city, lat, lng, state_name))
n1 <- nrow(df1) # 30844

library("readxl")
df2 <- read_excel("https://raw.githubusercontent.com/fcas80/jt_files/main/censuspop.xlsx", mode = "wb", skip = 3)
df2 <- df2[ -c(1,4:6) ]
colnames(df2) <- c("city", "pop")
# city appears as format Los Angeles city, California
df2$state <- gsub(".*\\, ", "", df2$city) # extract state: everything after comma blank
df2$city <- gsub("\\,.*", "", df2$city) # extract everything before comma
df2$city <- gsub(" city*", "", df2$city) # delete: blank city

df = merge(x=df1, y=df2, by="city",all=TRUE)
df <- na.omit(df)
df <- df[df$pop >= 200000, ]
df <- df[df$state_name == df$state, ] # delete improper merge same city in 2 states
df <- subset(df, select = -c(state_name, state))
n <- nrow(df) # 112

kount <- 1
df11 <- data.frame()
for (i in 1:n){
Lat_start <- df[i,2]
Long_start <- df[i,3]
for (j in 1:n){
Lat_end <- df[j,2]
Long_end <- df[j,3]
dist_miles <- acos(sin(pi*(Lat_start)/180.0)*sin(pi*(Lat_end)/180.0)+
cos(pi*(Lat_start)/180.0)*cos(pi*(Lat_end)/180.0)*cos(pi*
(Long_start)/180.0-pi*(Long_end)/180.0))*3963
cos(pi*(Lat_start)/180.0)*cos(pi*(Lat_end)/180.0)*cos(pi*(Long_start)/180.0-pi*
(Long_end)/180.0))*3963
dist_miles <- round(dist_miles, 0)
pop_ratio <- round(max(df[i,4]/df[j,4], df[j,4]/df[i,4]),1)
if (df[i,1] != df[j,1] & dist_miles > 0 & dist_miles <= 10 & pop_ratio <= 2){
df11[kount,1] <- df[i,1]
df11[kount,2] <- df[j,1]
df11[kount,3] <- dist_miles
df11[kount,4] <- df[i,4]
df11[kount,5] <- df[j,4]
df11[kount,6] <- pop_ratio
df11[kount,7] <- df[i,4] + df[j,4]
kount <- kount + 1
}
}
}
colnames(df11) <- c("City1", "City2", "Dist", "Pop1", "Pop2", "Ratio", "TotPop")
df11 <- df11[!duplicated(df11$TotPop), ] # remove duplicates
df11 <- df11[ -c(7) ]
df11 <- data.frame(df11, row.names = NULL) # renumber rows consecutively
df11

Sunday, December 3, 2023

Ten Lords-a-Leaping

Just what is a lord-a-leaping? Well, what is a lord? A lord is a title of nobility, usually inherited, that exists in the UK and other countries. And those lords like to leap, especially during the twelve days of Christmas.

The song the Twelve Days of Christmas is a well-known Christmas song, whose earliest known publication was in London in 1780. There are various versions of the lyrics, various melodies, and meanings of the gifts. As usual, this is all nicely summarized in Wikipedia https://en.wikipedia.org/wiki/The_Twelve_Days_of_Christmas_(song) .

PNC Bank, one of the largest banks in the US, has been calculating the prices of the twelve gifts given by my true love since 1984, and has trademarked its PNC Christmas Price Index ® . Two senior executives at PNC calculate the prices, and many of the details are available at https://www.pnc.com/en/about-pnc/topics/pnc-christmas-price-index.html#about , especially in their FAQ. In particular, they note that the price of services has generally increased while the price of goods has slowed. The price index is a humorous proxy for the general cost of inflation.

On day one there was 1 gift (the partridge). On day two there were 3 gifts (2 doves + 1 partridge). On day three there were 6 gifts (3 hens + 2 doves + 1 partridge). On day twelve there were 78 gifts, and 78 is the sum of the first 12 natural numbers, whose general formula Σn = n(n+1)/2 was known by Gauss in the 1700’s.

The cumulative number of gifts is 1 + 3 + 6 + … + 78, whose sum is 364. (One fewer than the number of days in a year. Coincidence?) Each of these numbers is called a Triangular number T_i , and the general formula of their sum is Σ T_i = n(n+1)(n+2)/6.

The PNC Christmas Price Index ®, or the Total Cost of Christmas reflects the total cost of the 78 gifts: one set of each of the gifts. For 2023 that cost is $46,729.86, versus $45,523.33 in 2022, a change of + 2.7%. The prior year’s change was 10.5%. The largest individual item in the index is not the five gold rings as I had thought ($1,245), but rather those leaping lords ($14,539, up 4.0%), followed by the swimming swans ($13,125 and unchanged for many years).

PNC also calculates the True Cost of Christmas, which is the cost of 364 gifts. For 2023 that cost is $201,972.66, a change of 2.5% over a year ago.

And PNC calculates a core index excluding the swans, which some time ago had been the most volatile item, and also an e-commerce index buying all items online.

The overall Bureau of Labor Statistics CPI for All Urban Consumers (CPI-U) increased 3.2% for twelve months ending October 2023. October is the closest month for CPI-U to the PNC data. CPI-U of course is based on a broad market basket of goods including food, energy, medical care, housing, transportation, etc., which are not the gifts given in the song, but CPI-U is a common measure of inflation. The PNC index is based on a very specific twelve items and is heavily weighted toward the lords and the swans.

The PNC website contains detailed information on its calculations, but it does not contain historical information on CPI-U. I used twelve-month October historical CPI-U percent changes from https://www.bls.gov/regions/mid-atlantic/data/consumerpriceindexhistorical_us_table.htm . Then I graphed the percentage changes of the PNC Christmas Price Index® , the PNC True Cost of Christmas index, and the CPI.

With such a small number of items, the two PNC indices fluctuate drastically. 2014 reflects a one-time increase in the cost of the swans. 2020 was the unusual year during the pandemic when some of the gifts (including the lords!) were unavailable and so the cost that year was zero. The two PNC indices were fairly close to CPI-U for five years starting in 2015 and again for 2022 and 2023. Maybe these PNC indices are pretty good.

PNC uses the Philadelphia Ballet to calculate the cost of the lords-a-leaping.

Here is the R code I used:

library(readxl)
library(ggplot2)
df1 <- read_excel("C:/Users/Jerry/Desktop/R_files/xmas.xlsx", sheet = 1)
df2 <- read_excel("C:/Users/Jerry/Desktop/R_files/xmas.xlsx", sheet = 2)
cpi <- round(df2$Percent_change,3)
df1 <- df1[c(3:13)]
year <- as.numeric(colnames(df1)[2:11])
total_cost_dollars <- colSums(df1)
total_cost_index <- vector()
true_cost_dollars <- vector()
true_cost_index <- vector()
for(i in 1:length(total_cost_dollars)){
true_cost_dollars[i] <- 12*df1[1,i] + 11*df1[2,i] + 10*df1[3,i] + 9*df1[4,i] + 8*df1[5,i] +
7*df1[6,i] + 6*df1[7,i] + 5*df1[8,i] + 4*df1[9,i] + 3*df1[10,i] + 2*df1[11,i] + 1*df1[12,i]
}
true_cost_dollars <- unlist(true_cost_dollars)
for(i in 1:length(total_cost_dollars) - 1){
total_cost_index[i] <- round(100*(total_cost_dollars[i+1]/total_cost_dollars[i] - 1),1)
true_cost_index[i] <- round(100*(true_cost_dollars[i+1]/true_cost_dollars[i] - 1),1)
}
df <- data.frame(cbind(year, total_cost_index, true_cost_index, cpi))

colors <- c("total_cost_index" = "red", "true_cost_index" = "navy", "cpi" = "grey")
ggplot(df, aes(x=year)) +
geom_line(aes(y=total_cost_index, color="total_cost_index")) +
geom_line(aes(y=true_cost_index, color="true_cost_index"))
geom_line(aes(y=cpi, color="cpi")) +
labs(title = "12 Days of Christmas", x = "Year", y = "Percent change", color = "Legend") +
scale_color_manual(values = colors) +
# scale_y_continuous(labels = scales::percent_format(scale = 1, prefix = "", suffix = "%")) +
theme(
legend.position="right",
plot.title = element_text(size=15, face="bold"),
axis.title = element_text(size=15, face="bold"),
axis.text = element_text(size=15, face="bold"),
legend.title = element_text(size=15, face="bold"),
legend.text = element_text(size=15, face="bold"))

How do I count thee? Let me count the ways?

It's a linear world - approximately; and the "Rule of 72"

Sunday, August 25, 2024

Text analysis of 2024 US Presidential convention speeches

Sunday, August 11, 2024

Is the Mona Lisa thinking about irrational numbers?

Wednesday, July 24, 2024

The distribution has changed; and pretty tables in base R

Thursday, July 18, 2024

Radar charts and five-tool baseball players

Sunday, June 23, 2024

Something a llttle different: Hexbin maps

Thursday, February 29, 2024

How easily can you be identified on the Internet?

Sunday, February 25, 2024

Miss Google

Sunday, February 11, 2024

Taylor Swift and Data Analysis

Tuesday, January 2, 2024

Using great circle distance to determine Twin Cities

Sunday, December 3, 2023

Ten Lords-a-Leaping