Thursday, January 30, 2025

The Mathematics of Taylor Swift

      I confess to being fascinated by Taylor Swift for far more than her music. I think she is an extraordinary person for her philanthropy, her speaking out for victims of sexual assault, her advocacy of artists' ownership rights, and her urging her fans to vote.

      But of course there's much more, and so let's look at the mathematics of Taylor Swift.

      Let's start with her net worth. She is estimated to have $1.6 billion in assets. Her largest asset is her music catalog. She did not own the masters (first recordings) of her older music, but she re-recorded those albums and owns the masters to the re-recorded versions and to her future recordings.

      She donates large sums to a variety of organizations. Most of these donations are not public, but here is a sample of some recent donations:

      There are three parts to this mathematical analysis:
         1. her attractiveness,
         2. her lipstick shades, and
         3. her lyrics.
Each part contains a considerable amount of imprecision in measurements and judgments. This is more of a work in progress than the final word. Nevertheless, I think it is a non-routine use of math, and it was a lot of fun.

1. How pretty is she?

      You can have your opinion. But mine is based on math.

      The ancient Greeks discovered a particular number called the Golden Ratio, denoted by the Greek letter Φ (phi), that has many interesting mathematical properties, appears in some patterns of nature, and is considered by many to be aesthetically pleasing. The Golden Ratio results from finding the point on a line segment that splits the segment into two smaller segments with lengths a and b, such that (a + b)/a = a/b.

      That ratio a/b is the Golden Ratio, Φ. With a little algebra, Φ = (1 + √5)/2, which is an irrational number, so its decimal expansion is infinite and non-repeating; rounded to three decimal places it is 1.618.
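
      A quick sanity check in R, taking the shorter segment b = 1 so that a = Φ:

phi <- (1 + sqrt(5)) / 2
round(phi, 3)                     # 1.618
all.equal((phi + 1) / phi, phi)   # the defining equation (a + b)/a = a/b, with a = phi and b = 1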

      Renaissance artists, plastic surgeons, and makeup artists are among those who use Golden Ratios in various ways to create ideally proportioned faces. Gary Meisner has written extensively on the Golden Ratio, and he believes there are over 20 different ways that the Golden Ratio shows up in human faces and that "the Golden Ratio is also found very commonly in beautiful models of today across all ethnic groups". Biostatistics professor Dr. Kendra Schmid and her colleagues performed various measurements on many faces. They began with 17 potential Golden Ratios, and they decided only six of these ratios were predictors of facial attractiveness. See Schmid.

      This takes us to Taylor. I attempted to measure these six ratios on a picture of Taylor. There are many pictures of her, and she enjoys experimenting with different hairstyles, so I had to find one with a hairstyle that gave me the best chance of measuring from her hairline and also between her ears. The measurement is not exact for many reasons, and because we are using a two-dimensional photo of a three-dimensional object there is certainly some loss of accuracy. Nevertheless, here are the results:

Face length / Face width             1.4722
Mouth width / Interocular distance   1.4000
Mouth width / Nose width             1.5556
Lips to chin / Interocular distance  1.7000
Lips to chin / Nose width            1.8889
Ear length / Nose width              1.5556
Average                              1.5954
% Deviation from Φ                   -1.4%

      On average, Taylor's measurements are quite close to phi, the ideal measurement of facial attractiveness.

      (Note that I did the measurements manually, and they are inexact. I understand there are Python libraries such as dlib that can detect facial landmarks, and this might eliminate some of the imprecision.)

2. What lipstick shade is she wearing?

      A common question on the Internet is what her favorite lipsticks are. I think a more interesting mathematical question is: given a particular photo of Taylor, can we get the computer to identify her lipstick mathematically?

      First, a little background about image files on the computer. A computer image is a collection of tiny dots called pixels. Each dot is about 1/96 inch. Each pixel contains a color code, either as a 6-digit base 16 (hex) number or as a red-green-blue ordered triple of base 10 numbers. For example, 861A23 in base 16 equals (134, 26, 35) in RGB base 10. The first two hex digits, 86, are the red component: 86 in base 16 equals 8*16^1 + 6*16^0 = 134 in base 10. The second two hex digits, 1A, are the green component, and so on.
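
      This conversion is easy to check in R with base functions; a small sketch of the example above:

strtoi("86", base = 16L)   # 134, the red component
col2rgb("#861A23")         # the full conversion: red 134, green 26, blue 35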

      We can ask how similar two colors are by plotting them in 3D and calculating their distance.

D = √[ (134 - 174)^2 + (26 - 31)^2 + (35 - 61)^2 ] = 48.0
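
      In R, the same distance calculation for the two example colors is:

color1 <- c(134, 26, 35)
color2 <- c(174, 31, 61)
round(sqrt(sum((color1 - color2)^2)), 1)   # about 48, matching D above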

      I started with five different photos of Taylor, where it appears to me she is wearing a different shade in each. I cropped each photo to a rectangular area as close as I could get to just her lips, and I saved each cropped rectangle as a file. I used the R package colouR to find the most frequent colors. This was unsatisfactory, possibly because there was too much background color noise. I found a website, https://www.remove.bg/, that removes backgrounds, and I retried colouR with the background-removed files.

      I did the same process with online swatches of ten lipsticks that the Internet says are among Taylor's favorites.

      To my surprise, there were more than 50 different colors, most of them reasonably close, on both the Taylor photos and on the lipstick swatches.

Why are there so many colors?

  • there are 16^6 = 16,777,216 different computer colors
  • women tell me lipstick never looks the same on a person as in a tube
  • my dermatologist offered that lips have lines and grooves that create uneven coloring
  • it is hard to isolate the image of just lips in a rectangle
  • lighting creates distortions
  • there are image resolution and quality issues
      Nevertheless, I decided to do the following. For each of Taylor's ten favorite lipsticks, I found the ten most frequent colors. For each of the five Taylor lip photos, I found their ten most frequent colors. Then for each photo, I calculated the average color distance from the photo to each lipstick. The lipstick with the minimum distance from the photo was then the computer forecast matching a photo to a lipstick. Get it?
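
      Here is a compact sketch of that distance calculation, with made-up hex colors standing in for the colouR output (the full version, which works with the ten most frequent colors per photo and per shade, is in file 2 below):

photo_hex <- c("#8B1A2B", "#912033", "#7F1522")         # made-up stand-ins for a photo's most frequent lip colors
shade_hex <- c("#931B2F", "#8C1728", "#9A2134")         # made-up stand-ins for one lipstick's swatch colors
centroid <- rowMeans(col2rgb(shade_hex))                # the shade's centroid in RGB space
sum(sqrt(colSums((col2rgb(photo_hex) - centroid)^2)))   # total distance of the photo's colors to that centroid; smallest total wins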

      Here are the five Taylor photos with the computer's match:

      In summary, the computer thinks photo 1 has the minimum distance, 320, to its match. Do you agree this is the best match?

Photo   Shade        Distance
1       Ruby Woo     320
2       Elson 4      625
3       Blood Lust   388
4       Flame        392
5       Flame        399

3. Which of her songs are emotional?

      The R package ‘taylor’ contains lyrics for 11 albums and 240 songs. I will examine TS songs for their emotional content, as follows:

  • % emotional words per song
  • most frequent words per song
  • trend in emotions over time
  • examine theory that 5th song in every album is most emotional
      What makes a song emotional? Some possibilities are: emotional words, volume, tempo, pitch, rhythm, and instrumentation. A few musicians I spoke to have a longer list. But I don't have the data to measure any of these, except emotional words.

      Linguists have created lists of emotional words. Mohammad and Turney created the NRC Emotion Lexicon, a list of over 14,000 English words and their associations with eight basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) plus negative and positive sentiments. A word may have more than one association. For example, faith has the associations anticipation, joy, positive, and trust.
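
      The lexicon is easy to inspect in R. A small sketch of the lookup for "faith" (get_sentiments() is from tidytext; the NRC lexicon itself is downloaded by the textdata package the first time it is requested):

library(tidytext)
library(dplyr)
get_sentiments("nrc") %>% filter(word == "faith")   # anticipation, joy, positive, trust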

      Like most datasets, Taylor Swift songs required a considerable amount of data cleaning. In a text analysis like this, data cleaning includes removing punctuation and capitals, deleting stop words (a, oh, the, etc.) that add no value, and fixing informal speech and colloquialisms (dyin’, gonna, etc.). Taylor uses a large number of colloquialisms, often by dropping an ending g in an "ing" word. Also part of the cleaning is that words are stemmed so that love, lovable, loving, and lovely all reduce to "lov"; and then the stemmed words are lemmatized and returned to their root word love. The R packages SnowballC and textstem perform stemming and lemmatizing respectively. However, this combination did not always produce a valid lemmatized word, and I manually adjusted about 500 words which I placed into a file lem_replacements_df.csv. The final result is over 23,000 words from her 240 songs, which I am calling processed words.
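
      A small sketch of the stem-then-lemmatize step, mirroring the pro() function in file 3 below (outputs depend on the Snowball stemmer and the textstem dictionary, which is why the manual replacement file is still needed):

library(SnowballC)   # wordStem()
library(textstem)    # lemmatize_words()
w <- c("love", "lovable", "loving", "lovely")
stems <- wordStem(w)     # reduce each word to its stem
lemmatize_words(stems)   # map the stems back toward root words; imperfect results go into lem_replacements_df.csv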

      For all 240 songs combined, I calculated Taylor's most frequent words, her most frequent emotional words, her number of words in each of the 10 emotion and sentiment categories, and the ratio r = number of emotional words / number of processed words. The results are contained in the following three plots:

      The above is a baseline for all songs combined. The r value is .364. I asked a friend, whose teenage daughter is a Swiftie, to suggest a few Taylor songs that she considers especially emotional, and I created the same types of plots for some of these songs. I'll share the plots for the song "My Tears Ricochet", which had a greater-than-average r value:
      Note that in the above song, the counts are pretty small. Perhaps this is one explanation for why a number of songs that we thought were emotional did not have a higher-than-average r value.

      Another question I was curious about is whether Taylor's songs have become more emotional over time. I am using the album's original release date as the time variable. For songs that Taylor re-recorded, I am using the original release date under the assumption that her re-release may have changed the music but not the lyrics.

      The theory that her songs are becoming more emotional over time appears not to be true. To the contrary, the slope of a poorly fitting trend line is negative, but its F-statistic says the linear model is not significant.

      Finally, there is a theory that every fifth song on an album is intentionally more emotional than the others.

      This theory is discussed on the Internet, and even Taylor gives it some credence. Albums by release year provide 10 data points (there are 11 albums, but two were released in the same year). In 6 years out of 10 the track 5 r ratio exceeded that of all other tracks, and in 4 years out of 10 it did not. A one-tailed two-sample t-test concludes that the track 5 mean is not significantly greater than the mean for all other tracks, p = .4591. Note that track 5 has an outlier, "All You Had to Do Was Stay," which sounds like it should have a high r, but its r was .134.
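
      For the record, the test is an ordinary one-tailed two-sample t-test; a sketch with hypothetical yearly r values (the real values come from the per-album summaries computed in file 3):

track5_r <- c(.42, .38, .35, .41, .39, .37, .33, .36, .40, .134)   # hypothetical r for track 5, by release year
other_r  <- c(.37, .36, .35, .38, .36, .35, .34, .37, .36, .35)    # hypothetical r for all other tracks, by release year
t.test(track5_r, other_r, alternative = "greater")                 # one-tailed: is the track 5 mean greater?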

4. Bonus: &$!#%

      To the Taylor Tots parents: Our Taylor has been known to use an R-rated word in her songs, or two. Here is the frequency of some of her most frequent such words (and there are others), mostly from her Tortured Poets Department album:

      As I mentioned earlier, each of the three parts of this analysis - her attractiveness, her lipstick shades, and her lyrics - contains various imprecisions and judgments. Perhaps some day I will go back to this project and make some improvements. Comments are welcome. But I think it was a non-routine use of math. And it was fun!

      This project uses three R script files and a number of image files. The R code is shown below, but it is long. Both the R code files and the image files may be found at https://github.com/fcas80/The-Taylor-Swift-Project/tree/main


# file 1 of 3: lovely.txt

library(png)
library(ggplot2)
library(grid)

image_path <- "my_path/ts1_phi_removebg.png"
img <- readPNG(image_path, native=TRUE)   # width 584, height 457
height <- nrow(img)   # 383
width <- ncol(img)    # 331
raster_img <- grid::rasterGrob(img, interpolate = TRUE)
df <- data.frame(xpos = c(0, width), ypos = c(0, height))

# 1. plot photo on minimum grid
ggplot(data = df,
       aes(xpos, ypos)) +
  xlim(0, width) + ylim(0, height) +
  geom_blank() +
  annotation_custom(raster_img, xmin=0, xmax=width, ymin=0, ymax=height) + 
  theme(axis.title.x = element_blank(), 
      axis.title.y = element_blank())

# 2. plot photo on more detailed grid; measurements are manual and imprecise
ggplot(data = df,
       aes(xpos, ypos)) +
  xlim(0, width) + ylim(0, height) +
  geom_blank() +
  annotation_custom(raster_img, xmin=0, xmax=width, ymin=0, ymax=height) +
  geom_hline(yintercept = seq(0, height, by = 10), color = "gray", linewidth = 0.5) +
  geom_vline(xintercept = seq(0, width, by = 10), color = "gray", linewidth = 0.5) +
  scale_x_continuous(breaks = seq(0, width, by = 20)) +
  scale_y_continuous(breaks = seq(0, height, by = 20)) +
  annotate("segment", x = 90, y = 265, xend = 90, yend = 0, color = "red", linewidth = 3) +
  annotate("segment", x = 70, y = 90, xend = 250, yend = 90, color = "red", linewidth = 3) +
  annotate("segment", x = 130, y = 170, xend = 180, yend = 170, color = "red", linewidth = 3) +
  annotate("segment", x = 135, y = 105, xend = 180, yend = 105, color = "red", linewidth = 3) +
  annotate("segment", x = 125, y = 70, xend = 195, yend = 70, color = "blue", linewidth = 3) +
  annotate("segment", x = 160, y = 85, xend = 160, yend = 0, color = "blue", linewidth = 3) +
  annotate("segment", x = 50, y = 160, xend = 50, yend = 90, color = "red", linewidth = 3) +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())

segments_df <- data.frame(
  segment_name = c("face_length", "face_width", "dist_bet_eyes", 
                   "nose_width", "mouth_width", "lips_to_chin", "ear_length"),
  x = c(90, 70, 130, 135, 125, 160, 50),
  y = c(265, 90, 170, 105, 70, 85, 160),
  xend = c(90, 250, 180, 180, 195, 160, 50),
  yend = c(0, 90, 170, 105, 70, 0, 90))
segments_df$dist <- sqrt((segments_df$x - segments_df$xend)^2 + (segments_df$y - segments_df$yend)^2)
segments_df

ratios_df <- data.frame(
  ratio_name = c("face length / width", "mouth width / interocular", "mouth width / nose width", 
       "lips to chin / interocular", "lips to chin / nose width", "ear length / nose width"),
  ratio = rep(0, times=6))

ratios_df$ratio[1] <- round(segments_df$dist[1] / segments_df$dist[2], 4)   # face length / width
ratios_df$ratio[2] <- round(segments_df$dist[5] / segments_df$dist[3], 4)   # mouth width / interocular
ratios_df$ratio[3] <- round(segments_df$dist[5] / segments_df$dist[4], 4)   # mouth width / nose width
ratios_df$ratio[4] <- round(segments_df$dist[6] / segments_df$dist[3], 4)   # lips to chin / interocular
ratios_df$ratio[5] <- round(segments_df$dist[6] / segments_df$dist[4], 4)   # lips to chin / nose width
ratios_df$ratio[6] <- round(segments_df$dist[7] / segments_df$dist[4], 4)   # ear length / nose width 
ratios_df
m <- round(mean(ratios_df$ratio),3)
m
error <- round((m - (1+sqrt(5))/2)/m,3)
error
# END


# file 2 of 3: lipsticks.txt


setwd("my_path")
library(colouR)
library(ggplot2)

# These shades were determined from various Internet articles
shade <- c("Ruby Woo", "Morocco", "Dragon Girl", "Elson 4", "Always Red", "Kyoto Red", 
           "Eadie Scarlet", "Blood Lust", "Flame", "Red to Go")

# These swatches of shades were copied from various Internet stores
df_lipsticks <- data.frame(shade,
        file = c("Mac_Ruby_Woo.png", "NARS_morocco.png", "NARS_Dragon_Girl.png", "PatMcGrath_Elson4.png", 
           "Sephora_always_red.png", "Tatcha_Kyoto_Red.png", "Gucci_Velvet_Eadie_Scarlet.png", 
           "PatMcGrath_Blood_Lust.png", "TomFord_flame.png", "Armani_red_to_go.png" ))

# These files were created by copying various photos of TS, cropping rectangle of lips, then
# using https://www.remove.bg/ to remove backgrounds  
file = c("ts1_lips-removebg-preview.png", "ts2_lips-removebg-preview.png",
       "ts3_lips-removebg-preview.png", "ts4_lips-removebg-preview.png","ts5_lips-removebg-preview.png")
name = c("TS photo 1", "TS photo 2", "TS photo 3", "TS photo 4" ,"TS photo 5")
df_photos <- data.frame(file, name)

############

# p10 for ten lipstick shades, but some only have 1 color
p10 <- function(image, name) {
  # Get primary color
  primary_color <- colouR::getTopCol(image, n = 1, avgCols = FALSE, exclude = TRUE)
  
  top10 <- tryCatch(
    {
      colouR::getTopCol(image, n = 10, avgCols = FALSE, exclude = TRUE)
    },
    error = function(e) {
      primary_color
    }
  )
  
  if (nrow(top10) < 10) {
    top10 <- rbind(top10, top10[rep(1, 10 - nrow(top10)),])
  }
  
  plot <- ggplot(top10, aes(x = hex, y = freq, fill = hex)) +
    geom_bar(stat = 'identity') +
    scale_fill_manual(values = top10$hex) +
    labs(title = paste(name, "Top 10 colors by frequency")) +
    xlab("HEX color code") + ylab("Frequency") +
    theme(
      legend.position = "none",
      plot.title = element_text(size = 15, face = "bold"),
      axis.title = element_text(size = 15, face = "bold"),
      axis.text.x = element_text(angle = 45, hjust = 1, size = 12, face = "bold")
    )
  print(plot)
  return(top10)
}   # close p function

shade_dataframe <- data.frame(matrix(ncol = 10, nrow = 0))

for (i in 1:10) {
  result <- p10(df_lipsticks$file[i], df_lipsticks$shade[i])
  
  # Repeat the first color if there are fewer than 10 colors
  if (nrow(result) < 10) {
    result <- rbind(result, result[rep(1, 10 - nrow(result)),])
  }
  
  shade_dataframe <- rbind(shade_dataframe, t(result$hex))
}

colnames(shade_dataframe) <- paste0("Color", 1:10)
print(shade_dataframe)

#############

p5 <- function(image, name) {
   top10 <- colouR::getTopCol(image, n = 10, avgCols = FALSE, exclude = TRUE)
   return(top10)
}

df_ts_colors <- data.frame()
for(i in 1:nrow(df_photos)){
   top10_colors <- p5(df_photos$file[i], df_photos$name[i])
   top10_colors <- top10_colors$hex
   df_ts_colors <- if(i == 1) {
     df_ts_colors <- top10_colors
    } else {rbind(df_ts_colors, top10_colors)
   }
}
colnames(df_ts_colors) <- c("color1","color2","color3","color4","color5",
                         "color6","color7","color8","color9","color10")
rownames(df_ts_colors) <- df_photos$name
df_ts_colors <- t(df_ts_colors)
print(df_ts_colors)

# begin total RGB distance of a TS photo to centroid of lipstick shade

# sample 3d plot
library(scatterplot3d)
m <- data.frame(
     x = c(174,134),
     y = c(31,26),
     z = c(61,35))

s3d <- scatterplot3d(m$x, m$y, m$z, pch = 19, color = "blue", 
    xlim = c(100, 200), ylim = c(0, 50), zlim = c(0, 100), 
    xlab = expression(bold("R-axis")), ylab = expression(bold("G-axis")), zlab = expression(bold("B-axis")), 
    main = "Plot of two colors in RGB space", cex.main = 2)

s3d$points3d(m$x, m$y, m$z, type = "l", lty = 1, lwd = 2, col = "red")

# Use 'text' for adding labels with bold font
text(s3d$xyz.convert(m$x, m$y, m$z), labels = paste0("(", m$x, ",", m$y, ",", m$z, ")"), 
     pos = 3, cex = 2, col = "#00008B", font = 2)

A <- shade_dataframe$Color1
B <- shade_dataframe$Color2
C <- shade_dataframe$Color3
D <- shade_dataframe$Color4
E <- shade_dataframe$Color5
F <- shade_dataframe$Color6
G <- shade_dataframe$Color7
H <- shade_dataframe$Color8
I <- shade_dataframe$Color9
J <- shade_dataframe$Color10


hex_to_rgb <- function(hex) {
  rgb <- t(col2rgb(hex))
  return(data.frame(r=rgb[1,], g=rgb[2,], b=rgb[3,]))
}

# Function to calculate Euclidean distance between two RGB vectors
euclidean_distance <- function(rgb1, rgb2) {
  sqrt(sum((rgb1 - rgb2)^2))
}

# Function to calculate centroid of a vector of hex colors
calculate_centroid <- function(hex_colors) {
  rgb_values <- hex_to_rgb(hex_colors)
  centroid <- colMeans(rgb_values)
  return(centroid)
}

# Calculate centroids of shades, e.g., centroid_A <- calculate_centroid(A)
  centroids <- lapply(list(A, B, C, D, E, F, G, H, I, J), calculate_centroid)

for (j in 1:nrow(df_photos)) {
   x <- df_ts_colors[,j]
   centroid_x <- calculate_centroid(x)
# Calculate total distance of x to each centroid
   total_distance_to_centroid <- function(x, centroid) {
     x_rgb <- hex_to_rgb(x)
     total <- 0
     for (i in 1:nrow(x_rgb)) {
       total <- total + euclidean_distance(x_rgb[i, ], centroid)
     }
     return(total)
   }

  distances <- sapply(centroids, total_distance_to_centroid, x = x)
  w <- which(distances == min(distances))
  most_similar <- shade[w]
  cat("The lipstick most similar to", df_photos$name[j], "is:", most_similar, 
      "with distance of ", min(distances), "\n")
 }   #  end for loop on photos
  
# END


# file 3 of 3: lyrics.txt

library(taylor)
library(dplyr)
library(tidyverse)
library(tidytext)
data(stop_words)
library(textclean)
library(SnowballC) # For stemming
library(textstem) # For lemmatization

######################################
# functions, preliminaries

# function to print more rows
more_rows <- function(filename){
  options(tibble.print_max = nrow(filename), tibble.print_min = nrow(filename))
  print(filename)
}

# function to fix informal speech (colloquialisms); also contractions, punctuation, lower case
colloq <- function(df) {
  pattern <- "ahh|Ah-uh|ahah|eh|haah|haha|iiii|Oh|oh|Oh-oh|oh-oh|ooh-oh|oohah|Uh|Uh-oh|uh-oh-oh-oh|ya|La-la-la|lala|la-la-la|la-la-la-la|Mm-mm|Mmm-mm|mm-mm|mm-mm-mm-mm|Ha-ha-ha"
  df <- df %>%
    mutate(lyric = str_remove_all(lyric, paste0("\\b(", pattern, ")\\b"))) %>%
    mutate(lyric = str_replace_all(lyric, 
               c("ain't" = "is not", 
                 "Beggin'" = "begging", "beggin'" = "begging", "birth right" = "birthright", "blood-soaked" = "blood soaked",
                 "'bout" = "about", "burnin'" = "burning", "callin'" = "calling", "'Cause" = "because", "'cause" = "because", "Cept" = "Except", "cursee" = "cure",
                 "convers" = "conversation","crashin'" = "crashing", "doin'" = "doing", "driftin'" = "drifting", "dyin'" = "dying", "'em" = "them", "feelin'" = "feeling", "flyin'" = "flying","feverishlying" = "feverishly", "'fore" = "before", "'Fore" = "before", "foreyesgn" = "foreign",
                 "everythi-i-ing" = "everything","fuckin'"="fuck", "gettin'" = "getting", "gonna" = "going to", "gotta" = "got to", "happ'nin'" = "happening", "haven't" = "have not", "Holdin'" = "holding", "hero's" = "hero",
                 "Hopin'" = "hoping", "hopin'" = "hoping","I'ma" = "I am going to", "kinda"="kind of", "king's" = "king", "keepin" = "keeping", "Laughin'" = "laughing", "lookin'" = "looking", "losin" = "losing", "losingg" = "losing", "lovin'" = "loving", "lucki" = "lucky", "no-one" = "no one", "Nothin'" = "nothing", "nothin'" = "nothing","one-hand" = "one hand",
                 "mornin'" = "morning", "'nother"="another", "nothin'" = "nothing", "pickin'" = "picking", "post-mortem" = "postmortem", "prayin'" = "praying", "Prayin'" = "praying", "pretti" = "pretty", "ridin'" = "riding", "Sneakin'" = "sneaking",
                 "outta" = "out of", "'round" = "around", "self-implode" = "self implode", "shoulda" = "should have", "standin'" = "standing", "summer's" = "summer is", "There's" = "there is", "Thinkin'" = "thinking", 
                 "thankin'" = "thanking", "thinkin'" = "thinking", "'Til" = "until", "'til" = "until", "tryin'" = "trying", "tryna" = "trying to", "Tryna" = "trying to","twin-sized" = "twin sized", 
                 "waitin'" = "waiting", "white-knuckle" = "white knuckle",
                 "wanna" = "want to", "weepin'" = "weeping", "whatcha" = "what are you", "Where's" = "where is", "Why'd" = "why did",
                 "wide-eyed" = "wide eyed", "wonderin'" = "wondering", "Wonderin'" = "wondering"))) %>%
    mutate(lyric = map_chr(lyric, clean_text))
  
  return(df)
}


custom_stop_words <- bind_rows(stop_words, 
           tibble(word = c("ah", "dadadada", "ha", "haha", "hey", "hi", "huh", "la", "MMM", "mmm", "mmmm", "mm", "na", "oh", "okay","OK", "ok", "ooh", 
           "uh", "whoa", "woah", "yeah"), lexicon = "custom"))


# Function to clean text  (included in colloq function)
clean_text <- function(text) { 
  text %>% 
    replace_contraction() %>% 
    str_remove_all("[[:punct:]]") %>% 
    str_to_lower()
}

replacements_df <- read.csv("my_path/lem_replacements_df.csv", header = TRUE)

# Function to stem and lemmatize
pro <- function(words_df) {
  words_df <- words_df %>% filter(!is.na(word))
  processed_words <- words_df %>%
    mutate(stemmed_word = wordStem(word),
           lemmatized_word = lemmatize_words(stemmed_word)) %>%
    filter(lemmatized_word != "") %>%  # Filter out empty strings
    left_join(replacements_df, by = c("lemmatized_word" = "stemmed_word")) %>%
    mutate(lemmatized_word = ifelse(is.na(lemmatized_word.y), lemmatized_word, lemmatized_word.y)) %>%
    select(-lemmatized_word.y)  # Remove the extra column created during the join
}

nrc_sentiments <- get_sentiments("nrc") %>%
  filter(word != "count" & word != "heel" & word != "truck" & word != "wear" & word != "word")

common_theme <- theme(
  legend.position="NULL",
  plot.title = element_text(size = 15, face = "bold"),
  plot.subtitle = element_text(size = 12.5, face = "bold"),
  axis.title = element_text(size = 15, face = "bold"),
  axis.text = element_text(size = 15, face = "bold"),
  legend.title = element_text(size = 15, face = "bold"),
  legend.text = element_text(size = 15, face = "bold"))

sentiment_colors <- c("anger" = "red", "anticipation" = "green", "disgust" = "brown",
                      "fear" = "purple", "joy" = "yellow", "negative" = "gray", "positive" = "lightblue",
                      "sadness" = "blue", "surprise" = "pink", "trust" = "turquoise")

emotion_plot <- function(sentiment_summary, song, r){
  ggplot(sentiment_summary, aes(x = sentiment, y = word_count, fill = sentiment)) + 
    geom_col(color = "black") + 
    scale_fill_manual(values = sentiment_colors) +
    labs(title = paste(song, "# words by emotion"),
         subtitle = paste("emotional words / all (processed) words =", r), 
         x = "Emotion", 
         y = "Number of Words") + 
    common_theme + 
    theme(axis.text.x = element_text(angle = 45, hjust=1))
}

lemmatized_colors <- c("red", "blue", "green", "orange", 
                       "purple", "cyan", "magenta", "yellow", "brown", "pink")

word_plot <- function(word_summary, song, title){
  ggplot(word_summary, aes(x = lemmatized_word, y = total_count, fill = lemmatized_word)) + 
    geom_col(color = "black") + 
    scale_fill_manual(values = lemmatized_colors) +
    labs(title = paste(song, title),
         x = "", 
         y = "Frequency") + 
    common_theme +
    theme(axis.text.x = element_text(angle = 45, hjust=1))
}

all_albums <- c("Taylor Swift", "Fearless (Taylor's Version)", "Speak Now (Taylor's Version)", "Red (Taylor's Version)", "1989 (Taylor's Version)",
                "reputation", "Lover", "folklore", "evermore",  "Midnights", "THE TORTURED POETS DEPARTMENT")
number_of_tracks <- c(15, 26, 22, 30, 22, 15, 18, 17, 17, 19, 31)
original_release_year <- c(2006, 2008, 2010, 2012, 2014, 2017, 2019, 2020, 2020, 2022, 2024)
# The years are original release years, i.e., not "Taylor's version" years

#######################################

# Find song by word in title - could be lower case!
x <- "ricochet"
df1 <- taylor_album_songs %>%
   select(album_name, track_name) %>%
   filter(grepl(x, track_name))
n <- df1$album_name[1]
a <- which(all_albums == n)
a

# Find track number
x <- df1$track_name[1]
t <- taylor_album_songs %>%
  filter(album_name == n) %>%
  select(album_name, track_number, track_name) %>%
  filter(track_name == x) %>%
  select(track_number) %>%
  pull() %>% as.numeric()
t

# Or instead, choose album a & track t directly, e.g., Style: a = 5, t = 3
# a <- 8
# t <- 5

#######################################
song <- taylor_album_songs %>% 
  # filter(album_name == all_albums[a] ) %>%  # all songs in an album
  filter(album_name == all_albums[a] & track_number == t) %>% 
  pull(track_name[1])   # NOTE the track_name is always 1
if (length(song) == 0) print("No such track number/song") else print(song)
song <- gsub("\\(Taylor's Version\\)|\\[Taylor's Version\\]", '', song)
album <- all_albums[a]

df <- taylor_album_songs %>%
  filter(album_name == all_albums[a] & track_number == t) %>%
  select(lyrics) %>%
  unnest(lyrics) %>%
  select(line, lyric)
more_rows(df)   # Examine for colloquialisms; add to function colloq

df <- colloq(df)
more_rows(df)

words_df <- df %>%
  unnest_tokens(word, lyric) %>%
  anti_join(custom_stop_words, by = "word")
more_rows(words_df)

processed_words <- pro(words_df) 
more_rows(processed_words) # Examine for words incorrectly lemmatized; add to function pro

processed_words_count <- processed_words %>%
  select(lemmatized_word) %>%
  group_by(lemmatized_word) %>%
  summarize(total_count = n()) %>%
  arrange(desc(total_count)) %>%
  head(10)
more_rows(processed_words_count) 

total_processed_words <- nrow(processed_words) 

sentiment_analysis <- processed_words %>%
  inner_join(nrc_sentiments, by = c("lemmatized_word" = "word"),
             relationship = "many-to-many") %>%
  count(lemmatized_word, sentiment, sort = TRUE)

distinct_sentiment_words <- sentiment_analysis %>% 
  pull(lemmatized_word) %>%
  n_distinct()   

total_sentiment_words <- sentiment_analysis %>%
  select(lemmatized_word, n) %>%
  distinct() %>%
  group_by(lemmatized_word) %>%
  summarize(total_count = sum(n)) %>%
  summarize(total_count = sum(total_count)) %>%
  pull(total_count)   

r <- round(total_sentiment_words / total_processed_words, 3)
cat("emotional words = ", total_sentiment_words, ", total processed words = ", total_processed_words, ", ratio = ", r )

sentiment_summary <- sentiment_analysis %>%
  group_by(sentiment) %>%
  summarise(word_count = sum(n))
sentiment_summary

word_summary <- sentiment_analysis %>%
  select(lemmatized_word, n) %>%
  distinct() %>%
  group_by(lemmatized_word) %>%
  summarize(total_count = sum(n)) %>%
  arrange(desc(total_count)) %>%
  head(10)
word_summary   # check for word with unusual sentiment (truck = trust?); delete from NRC 

word_plot(processed_words_count, song, 
          title="10 most frequent processed words")
word_plot(word_summary, song, 
          title = "10 most frequent emotional words")
emotion_plot(sentiment_summary, song, r)

# End individual song analysis;  for all songs combined, in df <- taylor_album_songs %>%
# delete filter(album_name == all_albums[a] & track_number == t) %>%, and also song <- ""

dirty_words <- processed_words %>%
  select(lemmatized_word) %>%
  filter(lemmatized_word == "fuck" | lemmatized_word == "shit" | lemmatized_word == "slut" | lemmatized_word == "bitch") %>%
  group_by(lemmatized_word) %>%
  mutate(lemmatized_word = str_replace(lemmatized_word, "fuck", "f**k")) %>%
  mutate(lemmatized_word = str_replace(lemmatized_word, "shit", "sh*t")) %>%
  mutate(lemmatized_word = str_replace(lemmatized_word, "slut", "sl*t")) %>%
  mutate(lemmatized_word = str_replace(lemmatized_word, "bitch", "b*tch")) %>%
  summarize(total_count = n())
dirty_words   # limited to these four; there are others :)

dirty_colors <- c("#FF0000","#000000","#800080","#FFA500")
ggplot(dirty_words, aes(x = lemmatized_word, y = total_count, fill = lemmatized_word)) + 
  geom_col(color = "black") + 
  scale_fill_manual(values = dirty_colors) +
  labs(title = "TS # of R-rated Words",
       x = "", 
       y = "Number of Words") + 
  common_theme 

# END


Tuesday, October 29, 2024

Sheldon Cooper's favorite number

      If you are a fan of the television series "The Big Bang Theory", then you know Sheldon often wears a shirt with 73 on it.

      In the tenth episode of the fourth season of The Big Bang Theory, aired December 9, 2010, https://www.imdb.com/title/tt1632225/characters/nm1433588 , Sheldon asks what everyone's favorite number is.

      Raj suggests 5318008. Flip this number upside-down to see what Raj has in mind.

      Sheldon declares the best number is 73 because it satisfies the following two properties, the product property and the mirror property:

  1. p=73 is the i = 21st prime number. The product of the digits of p, 7 and 3, equals i.
  2. The reverse of the digits of 73 (its mirror) is 37. 37 is the 12th prime number. 12 is the reverse of 21, and 21 is the product of the digits of 73.
      We will call a prime number satisfying both the product property and mirror property a “Sheldon Prime”.

      Sheldon adds that 73 in base two, 1001001, is a palindrome number. However, many base two numbers are palindrome numbers; 5 in base two is 101.

      This episode was the 73rd episode of Big Bang. Actor Jim Parsons was born in 1973, and he was 37 years old when the episode aired. Coincidences?

      The 73rd day of a non-leap year is March 14, also known as Pi Day.

      73 is an emirp. An emirp (prime spelled backwards) is a prime number that results in a different prime when its digits are reversed.

      Mathematicians Pomerance and Spicer proved that 73 is the only prime satisfying both the product property and the mirror property. https://math.dartmouth.edu/~carlp/sheldon02132019.pdf . They also state that the only primes less than 10^10 satisfying the product property are p=17, i=7; p=73, i=21; and p=2475989, i=181440.

      The following R code: a. verifies that 73 is the only Sheldon Prime within the first 10000 primes. b. verifies that 73 in base two, 1001001, is a palindrome number. c. flips Raj’s 5318008 number upside-down.

# a. 73 is the only Sheldon prime in first 10000 primes:

library(RcppAlgos)
# Function to calculate the product of digits
digit_product <- function(x) {
  digits <- as.numeric(strsplit(as.character(x), NULL)[[1]])
  digits <- digits[!is.na(digits)]
  if (length(digits) == 0) return(NA)
  prod(digits)
}

# Function to reverse digits of a number
reverse_digits <- function(x) {
  as.numeric(paste(rev(strsplit(as.character(x), NULL)[[1]]), collapse = ""))
}

# Function to check if a number is prime
is_prime <- function(num) {
  if (num <= 1) return(FALSE)
  if (num <= 3) return(TRUE)          # 2 and 3 are prime
  if (num %% 2 == 0) return(FALSE)
  i <- 3
  while (i * i <= num) {              # test odd divisors up to sqrt(num)
    if (num %% i == 0) return(FALSE)
    i <- i + 2
  }
  return(TRUE)
}

# Generate the first 10000 primes (limited to 10000 for memory & speed)
primes <- primeSieve(104730) # Generates first 10000 primes; the 10,000th prime is 104729
results <- list()

for (i in seq_along(primes)) {
  if (digit_product(primes[i]) == i) { # product property
   reversed_prime <- reverse_digits(primes[i])
   if (is_prime(reversed_prime)) { # mirror property
    reversed_index <- which(primes == reversed_prime)
    if (reverse_digits(reversed_index) == digit_product(primes[i])){
     results <- append(results, list(c(prime = primes[i], index = i, reversed_prime = reversed_prime,     reversed_index = reversed_index)))
    }
   }
  }
}

if (length(results) > 0) {
  print(results)
} else {
  print("No such prime found.")
}

# b. verify that 73 in base two, 1001001, is a palindrome number.

library(gtools)
x <- 73
v <- baseOf(x, base = 2)
w <- rev(v)
if (all(v == w)) {
  cat(v, "is a palindrome number")
} else {
  cat(v, " is not a palindrome number")
}

# c. flips Raj’s 5318008 number upside-down.

flip_string <- function(x) {
  flip <- c('0' = '0', '1' = 'I', '2' = '2', '3' = 'E', '4' = '4', '5' = 'S', '6' = '9', '7' = '7', '8' = 'B', '9' = '6')
  paste(rev(sapply(unlist(strsplit(as.character(x), NULL)), function(ch) flip[ch])), collapse = "")
}

Raj <- '5318008'
flipped_Raj <- flip_string(Raj)
flipped_Raj

# End

Friday, September 27, 2024

Rene Descartes walks into a bar

Rene Descartes walks into a bar, by Jerry Tuttle
I recently told the old Rene Descartes joke to a math class: Rene Descartes walks into a bar. The bartender asks, "Would you like a beer?" Descartes pauses for a moment and then replies, "I think not." Then poof - he disappears.

Of course I naively assumed my students had been exposed to the Descartes quote, "I think, therefore I am." Philosopher and mathematician Rene Descartes wrote this in French in 1637, and later in Latin as "cogito, ergo sum."

After explaining the Descartes quote, I think the students understood the joke. Well, maybe it's not that funny.

But perhaps funnier to math people than you might realize: this joke is logically flawed because the punchline is the inverse of the original conditional statement, and an inverse is not logically equivalent to the original statement.

Let P and Q be declarative sentences that can be definitively classified as either true or false. Then define:

  • Conditional Statement: If P, then Q.
  • Converse: If Q, then P.
  • Inverse: If not P, then not Q.
  • Contrapositive: If not Q, then not P.

Two conditional statements are defined as logically equivalent when their truth values are identical for every possible combination of truth values for their individual declarative sentences.

P      Q      statement  converse  inverse  contrapositive
TRUE   TRUE   TRUE       TRUE      TRUE     TRUE
TRUE   FALSE  FALSE      TRUE      TRUE     FALSE
FALSE  TRUE   TRUE       FALSE     FALSE    TRUE
FALSE  FALSE  TRUE       TRUE      TRUE     TRUE

The above table shows statement and contrapositive have the same truth values in columns 3 and 6, and so are logically equivalent. Statement and inverse are not logically equivalent.

The Descartes quote is, "If I think, therefore I am", or "If P then Q". The punchline is, "If I don't think, therefore I am not", or "If not P, then not Q". The punchline is the inverse, and is not logically equivalent to the quote. If P is false, then "if P then Q" is true regardless of the value of Q. So Q can be either true or false.

Occasionally on television someone, often a police detective, will make a statement where they confuse a statement with its converse or inverse, and I have been known to yell at the television.

Descartes is known for developing analytic geometry, which uses algebra to describe geometry. Descartes' rule of signs counts the roots of a polynomial by examining sign changes in its coefficients.

And before someone else feels the need to say this, I will: "Don't put Descartes before the horse." This is perhaps the punchline to changing the original joke to "A horse walks into a bar ... "

The following is R code to create truth tables. Logical is a variable type in R. Conditional statements in R are created using the fact that “If P then Q” is equivalent to “Not P or Q”. I am defining the logic rules for statement, converse, inverse, contrapositive, but I could have defined the rules for more complicated statements as well.

# Define the possible values for P and Q
P <- c(TRUE, TRUE, FALSE, FALSE)
Q <- c(TRUE, FALSE, TRUE, FALSE)

# Calculate the 4 logical rules: statement, converse, inverse, contrapositive
# (Note that “if P then Q” is equivalent to “Not P or Q”.)
P_implies_Q <- !P | Q
Q_implies_P <- !Q | P
not_P_implies_not_Q <- P | !Q
not_Q_implies_not_P <- Q | !P

# expand.grid(P, Q) would also be a good start, but I wanted a specific ordering
# Create a data frame to display the truth table
truth_table <- data.frame(
P = P,
Q = Q,
`P -> Q` = P_implies_Q,
`Q -> P` = Q_implies_P,
`!P -> !Q` = not_P_implies_not_Q,
`!Q -> !P` = not_Q_implies_not_P
)

# Print the truth table
colnames(truth_table) <- c("P", "Q", "statement", "converse", "inverse", "contrapositive")
print(truth_table)

P_variable <- "I think"
Q_variable <- "I am"

colnames(truth_table) <- c(P_variable, Q_variable, "statement", "converse", "inverse", "contrapositive")
print(truth_table)

End

Sunday, September 15, 2024

The dead body problem, and the coffee with milk problem

      Here are two problems in life we all commonly face: What time did the victim die, and is it better to put milk in your coffee right away or to wait?
      In any police show with a murder, one of the detectives will always ask this math question: "What time did the victim die?" And yes, it is a math question, whose solution dates back over 300 years to Isaac Newton. For algebra and calculus students, it is probably more interesting than a lot of word problems they see.

      Newton’s Law of Cooling states that the rate of change of the temperature of an object is proportional to the difference between its own temperature and the surrounding temperature.

         Let the temperature of the object be T(t) at time t.
         Let T0 denote the initial temperature of the object at time 0.
         Let Ts denote the temperature of the surrounding medium.

      Newton's Law of Cooling states    dT/dt = -k (T - Ts),    where k is a cooling constant and where the negative sign is for the case where the body is getting cooler. The solution of the differential equation is:

T(t) = Ts + (T0 - Ts) * e^(-k*t)

      Newton published his Law of Cooling in 1701. Modern physicists believe Newton's Law holds up well and approximates the result in some but not all situations. Large temperature differences between the object and its surroundings, or non-constant temperatures, will cause difficulties. I leave the details to the physicists.

Problem 1: The dead body problem

      Example: A dead body was 80 degrees Fahrenheit when it was discovered. Two hours later it had cooled to 75 degrees. The room is 60 degrees. What time did the body die?

Assume T0 at time of death = 98.6
Assume Ts the temperature of surrounding room = 60
Let k be the unknown cooling constant. k depends on the characteristics of the object and its environment and is often between 0.1 and 0.2.

T(t) = Ts + (T0 − Ts​) * e^(−kt)

First equation, at time of discovery:
Let t = t_0 be time of discovery.    80 = 60 + (98.6 - 60) * e^(-k*t_0)

Second equation, two hours later:
At t = t_0 + 2,    75 = 60 + (98.6 - 60) * e^(-k*(t_0 + 2))

Solve first equation for k:
80 = 60 + (98.6 - 60) * e^(-k*t_0)
20 = 38.6 * e^(-k*t_0)
e^(-k*t_0) = 20 / 38.6 = .518

Solve second equation for k:
75 = 60 + (98.6 - 60) * e^(-k*(t_0 + 2))
15 = 38.6 * e^(-k*(t_0 + 2))
e^(-k*(t_0 + 2)) = 15 / 38.6 = .389

Solve for k by taking the ratio of the two equations, noting t_0 drops out:
[e^(-k*(t_0 + 2))] / [e^(-k*t_0)] = .389 / .518
e^(-2k) = .751
-2k = LN(.751)
k = - LN(.751) / 2
k = .143

Solve for time since death from first equation
80 = 60 + (98.6 - 60) * e^(-.143*t_0)
20 = 38.6 * e^(-.143*t_0)
e^(-.143*t_0) = 20 / 38.6 = .518
-.143*t_0 = LN(.518)
t_0 = - LN(.518) / .143
t_0 ≈ 4.6 hours

The body had been cooling for about 4.6 hours before it was discovered.
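
The same arithmetic in R, a short sketch using the numbers from this example:

T0 <- 98.6; Ts <- 60                      # temperature at death; room temperature
T1 <- 80;   T2 <- 75                      # temperature at discovery, and two hours later
k  <- -log((T2 - Ts) / (T1 - Ts)) / 2     # cooling constant, from the ratio of the two equations
t0 <- -log((T1 - Ts) / (T0 - Ts)) / k     # hours of cooling before discovery
round(c(k = k, hours = t0), 3)            # k is about 0.144, hours about 4.6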

Problem 2: The coffee with milk problem

      Here is a different application of Newton's Law of Cooling.

      Example: Who drinks the hotter coffee after five minutes: The person who puts milk in right away and then waits five minutes, or the person who waits five minutes and then puts in milk?

Assume initial temperature of coffee, T0 = 200 F
Assume temperature of the room, Ts = 72 F
Assume initial temperature of the milk, Tm = 40 F
Assume the mixture is 80% coffee and 20% milk
Assume cooling constant k = 0.1.

Case 1: Add cold milk immediately

mixture T(0) = .80*200 + .20*40 = 168
mixture T(5) = 72 + (168 − 72) * e^(−0.1*5) = 130.23

Case 2: Wait 5 minutes (the milk warms toward room temperature), then add the milk

coffee T(5) = 72 + (200 - 72)*e^(-0.1*5) = 149.64
milk T(5) = 72 + (40 - 72)*e^(-0.1*5) = 52.59
mixture T(5) = .80*149.64 + .20*52.59 = 130.23

The conclusion is that cases 1 and 2 give the same temperature! However, there is also a case 3.

Case 3: Wait 5 minutes, then add cold milk

coffee T(5) = 72 + (200 - 72)*e^(-0.1*5) = 149.64
mixture T(5) = .80*149.64 + .20*40 = 127.71
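
The three cases in a few lines of R, under the same assumptions:

newton <- function(T0, Ts, k, t) Ts + (T0 - Ts) * exp(-k * t)
Ts <- 72; k <- 0.1; coffee0 <- 200; milk0 <- 40
case1 <- newton(0.8 * coffee0 + 0.2 * milk0, Ts, k, 5)                    # mix first, then wait 5 minutes
case2 <- 0.8 * newton(coffee0, Ts, k, 5) + 0.2 * newton(milk0, Ts, k, 5)  # both sit 5 minutes, then mix
case3 <- 0.8 * newton(coffee0, Ts, k, 5) + 0.2 * milk0                    # coffee sits 5 minutes, milk stays cold
round(c(case1 = case1, case2 = case2, case3 = case3), 2)                  # 130.23, 130.23, 127.71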

For simplicity, I ignored a few things that probably have a minor effect on the final temperature: I assume the mixing process is instantaneous. I assume the cup does not absorb heat from the coffee. I assume stirring the mixture does not accelerate heat transfer. Perhaps there are others. I cannot measure the effect of these.

Thank you, Isaac Newton (who as a Brit, perhaps drank tea?).

End

Sunday, September 8, 2024

Mind reader game, and Unicode symbols

Mind reader game, and Unicode symbols, by Jerry Tuttle

Perhaps you've seen this Mind Reader game? Think of a two-digit positive whole number, such as 54. Subtract each of the two digits from your number, such as 54 - 5 - 4 = 45, and call 45 the RESULT. Examine the table of symbols below and find the SYMBOL that corresponds with your RESULT. Concentrate on the SYMBOL, and remember it. Then scroll down below, and I will read your mind to predict your SYMBOL.

Before I get to my prediction, let's talk about Unicode symbols.

Unicode symbols

Unicode is a standard international system that assigns a unique Unicode number to each letter, digit, or symbol. This permits thousands of symbols to be written across different platforms, programs, and languages. An example of a Unicode number is "U+" followed by a hex number, such as U+1F499.

The table of symbols in the Mind Reader game is based on plotting a variety of Unicode symbols by printing their Unicode numbers.

A Unicode symbol is printed in R with its Unicode number, but beginning with a backslash to escape the U, and omitting the plus sign. For example, here is how to print a heart symbol.

print("\U1F499")
"💙"

Example with Hebrew letters

Let's spell out the Hebrew word shalom , letter by letter, with each letter's Unicode number, and then use the R paste command to paste the letters together.

    shin <- "\U05E9"     # "ש"
    kamatz <- "\U05B7"     # vowel as two perpendicular lines "ַ"
    lamed <- "\U05DC"     # "ל"
    vav <- "\U05D5"     # "ו"
    final_mem <- "\U05DD"     # "ם"
    shin_with_kamatz <- "\U05E9\U05B7"     # "שַ"
    paste(shin_with_kamatz, lamed, vav, final_mem, sep = "")

"שַלום"

Note that the letters are entered in the paste statement in order of first Hebrew letter, second Hebrew letter, etc., but they are printed in Hebrew right-to-left. Also, a better choice than vav is cholam, which is vav with a dot above it, "\U05BA", but this doesn't print for me.

My prediction

Here is my prediction of your symbol:

Want to play again? Think of another two-digit positive whole number, such as 54. Subtract each of the two digits from your number, such as 54 - 5 - 4 = 45, and call 45 the RESULT. Examine the table of symbols below and find the SYMBOL that corresponds with your RESULT. Concentrate on the SYMBOL, and remember it. Then scroll down below, and I will read your mind to predict your SYMBOL.

Plotting with Unicode characters

Tired of plotting points with those boring 25 pch symbols? You can use Unicode symbols, but you can't simply pass them as a standard pch value. Here, for no good reason, I use a heart and a thumbs up.

library(ggplot2)
x <- seq(from=0,to=4, by=1)
y <- x^2
z <- exp(x)
df1 <- data.frame(x, y)
df2 <- data.frame(x, z)
heart <- "\U1F499"
thumbs_up <- "\U1F44D"
# Create a custom function to convert Unicode values to GeomPoint
custom_points <- function(data, mapping, ..., shape = heart) {
    ggplot2::geom_point(data = data, mapping = mapping, ..., shape = shape)
}
ggplot() +
    custom_points(data = df1, aes(x = x, y = y, color = "y = x^2"), shape = heart, size=5) +
    geom_line(data = df1, aes(x = x, y = y, color = "y = x^2")) +
    custom_points(data = df2, aes(x = x, y = z, color = "z = exp(x)"), shape = thumbs_up, size=5) +
    geom_line(data = df2, aes(x = x, y = z, color = "z = exp(x)")) +
    ggtitle("Plot with Unicode points") +
    labs(color = "Function") +
    theme(legend.position = c(0.15, 0.85),
      plot.title = element_text(color="black", size=14, face="bold"),
      legend.text = element_text(color="black", size=10, face="bold"))

My second prediction

Here is my second prediction:

Here is the code for the mind reader:

# start with arbitrary set of unicode symbols
description =
    c("a_bengali","a_gurmukhi","approximately_equal","biohazard","black_diamond",
    "black_heart","black_scissors","mercury",
    "mushroom","nya_gujarati",
    "section","snowflake",
    "snowman","teardrop_spoked_asterisk","thunderstorm",
    "umbrella_raindrops","white_cross","white_florette",
    "zhe_cyrillic", "airplane",
   "black_right_arrow","black_telephone","blue_heart",
    "two_xs","hot_beverage","green apple","pill",
    "trophy","thumbs_up")

unicode=
    c("\U0986","\U0A05","\U2248","\U2623","\U25C6","\U2665","\U2702","\U263F",
    "\U1F344","\U0A9E","\U00A7","\U2746",
    "\U26C4","\U273B","\U2608","\U2614","\U271E","\U2740",
    "\U04DC","\U2708","\U27A4","\U260E","\U1F499","\U1F9E0","\U2615","\U1F34F",
   "\U1F48A","\U1F3C6","\U1F44D")

df_uni <- data.frame(cbind(description, unicode))
n <- nrow(df_uni)
x <- seq(1,n,1)
s <- sample(x, 1)
diag <- unicode[s]
diag
x <- x[-x[s]]
y <- sample(x, 100, replace = TRUE)     # randomly choose 100 symbols

df <- data.frame(matrix(ncol = 20, nrow = 10))

for (i in seq(1, 19, by = 2)) {
    df[, i] <- seq(99, 9, -10) - (i - 1) / 2 # row i, odd columns
    df[, i + 1] <- y[((i - 1) / 2 * 10 + 1):((i - 1) / 2 * 10 + 10)] # row i, even columns
    df[, i + 1] <- unicode[df[, i + 1]]
}

for (i in 1:10) {
    df[i, 20 - 2*(i - 1)] <- diag
}

op <- par(bg = "thistle")
plot(x = c(0, 50*20), y = c(0, 50*10), type = "n", xlab = "", ylab = "", xaxt = 'n',
    yaxt = 'n', main = "Mind Reader # 1", cex=2, font=2)

# Loop through each cell of the dataframe to draw rectangles and add text
for (i in 1:20) {
    for (j in 1:10) {
      # Determine the color based on whether the column index is odd or even
      fill_color <- ifelse(i %% 2 == 1, "cornsilk", "lightblue")

      # Draw rectangle with the determined color
      rect(50*(i-1), 50*(10-j), 50*i, 50*(10-j+1), col = fill_color, border = "blue")

      # Add text in the center of the rectangle
      text(50*(i-1) + 25, 50*(10-j) + 25, df[j, i], col = "navyblue", cex = 1, font = 2)
      }
}

# Restore original graphics parameters
par(op)

Hint on prediction

The prediction relies on a little algebra. As a hint, your original two-digit whole number is of the form, 10*T + U. What happens when you subtract the digits?

End


Sunday, August 25, 2024

Text analysis of 2024 US Presidential convention speeches

Text analysis of 2024 US Presidential convention speeches, by Jerry Tuttle

      Can an actuary / mathematician / data analyst say anything objective and data-oriented about the 2024 US presidential campaign?

      Yes, if I confine my remarks to a numerical text analysis of the candidates' words, rather than attempt to comment on the candidates' political and economic views. This analysis is based solely on data that I collected from the two convention speeches.

      I performed a sentiment analysis of the candidates' emotional words. Here is a graphical summary, discussion to follow:

Sentiment analysis

      I also calculated the candidates' most frequent words, excluding stop words like "a", "and", "the" and so on that don't carry much useful information:

Most frequent words

      I calculated statistics of things like average sentence length and average word length:

Summary statistics

      This blog is my attempt at some objective and data-oriented analysis of the two presidential candidates from their convention speeches. This is a text analysis of their speeches. Text analysis is a branch of data analysis that takes unstructured text documents, breaks them into sentences and words, and attempts to find some meaning. Text analysis is used by marketers to evaluate customer opinions, by police departments to monitor potential or actual criminal activities, by researchers to evaluate things like whether Shakespeare really wrote all the plays attributed to him and which Beatles songs were mostly written by John versus which by Paul, and by many others.

      This is intended as an objective analysis. I am not trying to make either candidate look good, or bad. I have used these same text analysis techniques in other projects such as analyzing Hamlet, analyzing short stories, and analyzing the Twitter tweets of a radio talk show host.

      For each of Trump and Harris, I started with a transcript of their convention speeches. I believe the transcripts are their spoken transcripts, not their written transcripts, based on their very first few sentences. I used various computer packages in R such as tm and tidytext to tokenize the documents into individual sentences, words, and characters. I was guided by the works of Silge and Robinson in Text Mining with R and Williams in Data Science Desktop Survival Guide.

      A summary and comparison of the tokenization of the speeches is the following, repeated from before.

      Obviously Trump spoke for much longer than Harris, so he had many more total words. Harris had a larger average sentence length (18.06 words per sentence versus 10.41), but the two were close in average characters per word and average number of syllables per word. Their numerical Flesch scores were about equal. Flesch score is a measure of the grade level required to understand the text, and we would expect a low grade level in a political speech to the public, in contrast to say a student college paper. Harris had a larger percentage of unique or different words in her speech than Trump (22% to 12%). Trump had a larger percentage of negative words than Harris (43% to 37%), which I will discuss below.
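
      For reference, since the post does not show which readability implementation was used, here is the standard Flesch-Kincaid grade-level formula as a sketch, with made-up counts:

# Flesch-Kincaid grade level; the speeches' actual scores were computed separately
fk_grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}
fk_grade(words = 1000, sentences = 96, syllables = 1400)   # made-up counts; roughly grade 5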

      Sentiment analysis is analyzing text to determine its emotional tone. Linguistics experts have built dictionaries that associate a large list of words with eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and also negative and positive sentiments. For example, the word "ache" is associated with sadness. Some words have more than one emotion; for example "admirable" is associated with both joy and trust. Further, some words have either negative or positive associations.

      There are some limitations in sentiment analysis. The sentiment dictionary does not recognize sarcasm, and I am limiting my analysis to single words so I am not picking up negation (such as "not expensive") or other instances where the emotion requires several words. A conclusion from the sentiment distribution graph is that the candidates are surprisingly similar in most of these emotions. The biggest differences are that Trump has a greater portion of his words categorized as anger and negative than Harris has.

Most frequent positive and negative words

      Trump's frequent positive words include "love", adjectives "beautiful" and "wonderful", and also "money". His negative words include "tax", "invasion", "inflation", "war", "illegal", and "crime".

      Harris's frequent positive words include "freedom", "love", "opportunity", and "forward". Her negative words include "tax", "enforcement", "abuse", and "abortion". Interestingly "mother" is both a positive word and a negative word, as is "vote". (I rechecked the sentiment dictionary to confirm these.)

      Trump has a larger percentage of negative words (negative divided by positive plus negative) than Harris (43% to 37%). These positive and negative lists seem consistent with my memory of their speeches.

Most frequent words

      Word frequency provides some clue on what is important to each candidate. Not surprisingly, both candidates frequently used words like "America", "Americans", "country", "nation", and "border". Trump used adjectives like "beautiful" and "incredible". Harris spent considerable time talking about Trump and also about her mother. Harris frequently used "middle", "class", "women", and "law".

Distribution of word sizes

Distribution of syllables per word
      The distributions of size of words and of syllables per word are pretty similar for both candidates.

Final thoughts

      It is hard to be indifferent about the 2024 US presidential election. You have your opinion, and I have mine. Much of what the candidates say is their opinion, or their plan if elected, and these things are not things we can verify.

      Some things the candidates say as if they are facts, are stated in a way that is open to interpretation. A good example is that "You are better (or worse) off financially today than four years ago." I can choose one measure, collect some data, and show I am better off; or I can choose a very different measure, collect some data, and show I am worse off.

      Some things the candidates say as facts ARE verifiable. I am in no position to do such verifying, but a number of third parties do this. Here are a few links. I can not vouch for their reliability or bias.

      Text analysis is a popular data analysis technique, and I hope I have added some insight into the candidates in an objective manner.

      The following is my R code:

# Trump convention speech, July 19, 2024
# https://www.nytimes.com/2024/07/19/us/politics/trump-rnc-speech-transcript.html

# Harris convention speech, August 23, 2024
# https://singjupost.com/full-transcript-kamala-harriss-2024-dnc-speech/?singlepage=1

library(tidytext)
library(tm)
library(dplyr)
library(nsyllable)
library(SnowballC)
library(ggplot2)
library(forcats)
library(ggpubr)

speaker <- readline(prompt = "Enter Trump, or enter Harris: ")
if (speaker == "Trump" | speaker == "Harris") print(speaker) else print("Invalid input")
trump_file <- "C:/Users/Jerry/Desktop/Harris_Trump/trump_convention_speech.txt"
harris_file <- "C:/Users/Jerry/Desktop/Harris_Trump/harris_convention_speech.txt"
textfile <- ifelse(speaker=="Trump", trump_file, harris_file)
textfile <- readLines(textfile)
text_df <- data.frame(line = 1:length(textfile), text=textfile)
names(text_df)[2] <- "text"

docs <- Corpus(VectorSource(text_df$text))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation, ucp=TRUE)
docs <- tm_map(docs, stripWhitespace)
inspect(docs[1:8])

custom_colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
      "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
      "#6a3d9a", "#ff9e1b", "#f6c6c7", "#8dd3c7", "#ffffb3",
      "#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3e2cd",
      "#ccebc5")

common_theme <- theme(
      legend.position = "none",
      plot.title = element_text(size=15, face="bold"),
      plot.subtitle = element_text(size=12.5, face="bold"),
      axis.title = element_text(size=15, face="bold"),
      axis.text = element_text(size=15, face="bold"),
      legend.title = element_text(size=15, face="bold"),
      legend.text = element_text(size=15, face="bold"))

docs_df <- data.frame(text = sapply(docs, as.character)) # Convert Corpus to data.frame
wordfile <- unnest_tokens(docs_df, word, text, token = "words")
wordfile %>% count(word, sort=TRUE)
wordcountfile <- mutate(wordfile, numbchars = nchar(word)) # characters per word
long1 <- wordcountfile[which(wordcountfile$numbchars == max(wordcountfile$numbchars)),1] # longest word
long2 <- wordcountfile[which(wordcountfile$numbchars == max(wordcountfile$numbchars)),2] # its number of characters
numberchars <- sum(wordcountfile$numbchars)
numberwords <- sum(count(wordcountfile, word, sort=TRUE)$n) # no. words
avgcharperword <- round(numberchars / numberwords, digits=2)
sentencefile <- unnest_tokens(text_df, sentence, text, token = "sentences")
sentencecount <- sum(count(sentencefile, sentence, sort=TRUE)$n)
avgwordpersent <- round(numberwords/sentencecount,2)

wordcountdist <- wordcountfile %>% count(numbchars)
wordcountdist$numbchars <- as.factor(wordcountdist$numbchars)
title <- paste(speaker, "- Distribution of Word Size")
subtitle <- paste("Longest word: ", long1, long2, "characters")
ggplot(wordcountdist, aes(numbchars, n, fill=numbchars)) +
      geom_bar(stat="identity", position = "dodge", width=0.5) +
      labs(title=title, subtitle=subtitle) +
      xlab("number of characters per word") + ylab("") +
      scale_fill_manual(values = custom_colors) +
      theme(legend.position = "none") +
      common_theme

syls <- nsyllable(wordfile$word, language = "en")
syls[which(wordfile$word == "jd")] <- 2 # used because nsyllable generated error here
syls[which(wordfile$word == "nd")] <- 1 # used because nsyllable generated error here
syls[which(wordfile$word == "st")] <- 3 # s/b 21st; used because nsyllable generated error here
syls[which(wordfile$word == "gasolinepowered")] <- 5 # used because nsyllable erred here
long2 <- min(syls[(syls == max(syls, na.rm=TRUE))], na.rm=TRUE)
w <- min(which(syls == long2))
long1 <- wordfile$word[w]
avgsylperword <- round(sum(syls, na.rm=TRUE)/numberwords, digits = 2) # na.rm in case any words still have NA syllable counts
avgsylperword
syls <- data.frame(syls) %>% count(syls)
syls$syls <- as.factor(syls$syls)
colnames(syls) <- c("syllables", "n")
title <- paste(speaker, "- Distribution of No. Syllables per Word")
subtitle <- paste("Most syllables: ", long1, long2, "syllables")
ggplot(syls, aes(syllables, n, fill = syllables)) +
      geom_bar(stat="identity", position = "dodge", width=0.5) +
      labs(title=title, subtitle=subtitle) +
      xlab("number of syllables per word") + ylab("") +
      scale_fill_manual(values = custom_colors) +
      theme(legend.position = "none") +
      common_theme

# Flesch-Kincaid reading ease formula
flesch <- round(206.835 - 1.015*avgwordpersent - 84.6*avgsylperword, 2) # Flesch reading ease: words per sentence and syllables per word
flesch
flesch_df <- data.frame(score = c(0,30,50,60,70,80,90,100),
      grade = c("College graduate","College","10th - 12th grade","8th - 9th grade",
      "7th grade","6th grade","5th grade","below 5th grade"))

# Function to find the grade based on score; vlookup
find_grade <- function(score, flesch_df) {
  idx <- findInterval(score, flesch_df$score)
  if (idx == 0) {
   return("below 5th grade") # Handle case where score is below the minimum
  } else {
   return(flesch_df$grade[idx])
  }
}

score_to_find <- flesch
flesch_grade <- find_grade(score_to_find, flesch_df)
flesch_grade

# delete stop words
docs_df <- data.frame(text = sapply(docs, as.character)) # Convert Corpus to data.frame
wordfile <- unnest_tokens(docs_df, word, text, token = "words")
stop_words <- data.frame(tidytext::stop_words) # more words than tm
my_stop_words <- data.frame(word = c("theyre", "hes", "dont",
   "didnt","youre","cant", "im","whats", "weve", "theyve", "youve",
   "couldnt", "wont", "youd"))
wordfile <- anti_join(wordfile, stop_words)
wordfile <- anti_join(wordfile, my_stop_words)

wordfreq <- wordfile
wordfreq <- count(wordfreq, word, sort=TRUE) # word frequency excl stop words
wordfreqdf <- data.frame(wordfreq)

unique_words <- nrow(wordfreq)
portion_unique_words <- round(unique_words / numberwords, digits=2)
wordfreqdf20 <- wordfreqdf[1:21,] # Think about threshold
wordfreqdf20

graphtitle <- paste(speaker, "Word Frequency")
wordfreqdf20$word <- fct_reorder(wordfreqdf20$word, wordfreqdf20$n, .desc = FALSE)
ggplot(data=wordfreqdf20, aes(x=word, y=n, fill=word)) +
      geom_bar(stat="identity", position = "dodge", width=0.5) +
      coord_flip() +
      common_theme +
      xlab("") + ylab("Frequency") +
      ggtitle(graphtitle) +
      scale_fill_manual(values = custom_colors) +
      theme(legend.position = "none")

# sentiments; note mother is both positive and negative!
df1 <- data.frame(wordfile)
colnames(df1) <- "word"
df2 <- get_sentiments("nrc")
df3 <- merge(x=df1, y=df2, by="word", all.x=TRUE, stringsAsFactors=FALSE)
df3 <- subset(df3, !is.na(sentiment))
table(df3$sentiment)
w <- data.frame(table(df3$sentiment))
colnames(w) <- c("sentiment", "n")

sentiment_colors <- c(
   "Anger" = "red",
   "Anticipation" = "green",
   "Disgust" = "brown",
   "Fear" = "purple",
   "Joy" = "yellow",
   "Negative" = "gray",
   "Positive" = "lightblue",
   "Sadness" = "blue",
   "Surprise" = "pink",
   "Trust" = "turquoise")

title <- paste(speaker, "- Sentiment Plot")
ggplot(w, aes(sentiment, n)) +
      geom_bar(stat = "identity", position = "dodge", width = 0.5, fill = sentiment_colors) +
      ggtitle(title) +
      ylab("") +
      common_theme +
      theme(axis.text.x = element_text(angle = 45, hjust=1))

df4 <- df3 %>% filter(sentiment == "positive" | sentiment == "negative")
w <- with(df4, table(sentiment))
neg <- w[1]
pos <- w[2]
neg_ratio <- round(w[1] / (w[1] + w[2]), digits=2)
df5 <- df4 %>% group_by(sentiment) %>% count(word, sort=TRUE)
pos_freq <- df5 %>% filter(sentiment=="positive") %>% top_n(10, wt = n) %>% slice_head(n = 10)
neg_freq <- df5 %>% filter(sentiment=="negative") %>% top_n(10, wt = n) %>% slice_head(n = 10) # ties
pos_freq$word <- fct_reorder(pos_freq$word, pos_freq$n, .desc = FALSE)
neg_freq$word <- fct_reorder(neg_freq$word, neg_freq$n, .desc = FALSE)

title <- paste(speaker, "- Most Frequent Positive and Negative Words")
p1 <- ggplot(pos_freq, aes(word, n)) +
      geom_bar(stat="identity", position = "dodge", width=0.5, fill="darkgreen") +
      ggtitle("Positves") +
      common_theme +
      xlab("") +
      coord_flip()
p2 <- ggplot(neg_freq, aes(word, n)) +
      geom_bar(stat="identity", position = "dodge", width=0.5, fill="red") +
      ggtitle("Negatives") +
      common_theme +
      xlab("") +
      coord_flip()
plot <- ggarrange(p1,p2, ncol=2, nrow=1, legend=NULL)
annotate_figure(plot, top = text_grob(title,
    color = "black", face = "bold", size = 14))

if (speaker == "Trump"){
   t <- data_frame(speaker, numberwords, avgwordpersent, avgcharperword, avgsylperword, flesch, flesch_grade, portion_unique_words, neg_ratio)
   print(t)
   } else {h <- data_frame(speaker, numberwords, avgwordpersent, avgcharperword, avgsylperword, flesch, flesch_grade, portion_unique_words, neg_ratio)
    conclusion <- data.frame(rbind(t,h))
    conclusion <- t(conclusion)
    colnames(conclusion) <- c("Trump", "Harris")
    conclusion <- conclusion[-1,]
    print(conclusion)
}

# the results of stemming and lemmatizing were not used in the report
# stemming
wordfile <- wordfile %>%
    mutate(stem = wordStem(word)) %>%
    count(stem, sort = TRUE)

# lemmatize
df1 <- wordfile # df1 has col named stem
url <- "https://raw.githubusercontent.com/michmech/lemmatization-lists/master/lemmatization-en.txt"
df2 <- read.table(url, header = FALSE, sep = "\t", quote = "", stringsAsFactors = FALSE)
names(df2) <- c("stem", "word")
df3 <- merge(x = df1, y = df2, by = "stem", all.x = TRUE, stringsAsFactors=FALSE)

# if the stem has no dictionary match, keep the stem; otherwise use the dictionary word.
df3$word <- ifelse(is.na(df3$word), df3$stem, df3$word)

# End

Sunday, August 11, 2024

Is the Mona Lisa thinking about irrational numbers?

Is the Mona Lisa thinking about irrational numbers? by Jerry Tuttle


      As a math teacher I sometimes share the following problem-solving strategy: If you are really stuck on a problem, let it sit, come back the next day, and maybe you will have a fresh insight. This often works for me.

      58 years later ...

      I recently applied this strategy with a problem posed by my beloved Bryant High School math teacher Anthony Holman. However, he posed it in 1966, so it took me 58 years before I had any insight. Unfortunately, Mr. Holman passed away in 1985, so I never had the opportunity to discuss this with him.

      In calculus class Mr. Holman taught us about e, sometimes called Euler’s number, which is the base of the natural log function. He told us e, like π, would appear in many unlikely places in math, and of course he was right. Then he hypothesized that e and π and other famous irrational numbers were probably part of the great works of art like the Mona Lisa.

      That was the only time I remember him making that statement. I had no idea how to approach it. This was 1966, long before computers, even mainframe computers, were readily available to the public. I forgot about his statement for about 55 years.

      More recently I have been programming in R, and I have learned a little about computer images. A computer image can be represented as an array of many small elements called pixels, and each pixel's color can be written as a hexadecimal (base 16) number, its hex code. I decided to interpret Mr. Holman’s hypothesis as asking whether a finite sequence of the digits of a number like π is contained within the pixel hex codes of the Mona Lisa. I decided to separately test four famous irrational numbers: π, e, φ, and √2.
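
      As a small illustration of the pixel-to-hex idea (just a sketch with a made-up color, not part of the analysis below), base R's rgb() packs a pixel's red, green, and blue values into a hex code, and strtoi() reads those hex digits as a base 10 number:

# One pixel's color: pack R, G, B (each 0-255) into a "#RRGGBB" hex code,
# then read the same six hex digits as a base 10 number.
rgb(139, 69, 19, maxColorValue = 255)   # "#8B4513"
strtoi("8B4513", base = 16L)            # 9127187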

      A quick note about φ: This is the Golden Ratio phi, defined as the ratio of a/b such that φ = a/b = (a + b)/a. The Golden Ratio is well-known to appear in numerous artworks including the Mona Lisa, where various measures of the subject’s face appear in Golden ratio proportions. I am assuming this is not what Mr. Holman intended in his hypothesis.

      I am making a few assumptions here. An irrational number is an infinite, non-repeating decimal. I cannot test whether an infinite sequence of digits appears within some other sequence of numbers, so I am truncating these irrational numbers to ten decimal places each, which I think is sufficient for this exercise. The tenth decimal place is truncated, not rounded. Of course the test may pass at ten decimal places and then fail at the eleventh.
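
      For example, the ten-decimal truncation of π can be produced in a couple of lines (a sketch; R's double precision carries about 15 significant digits, which is plenty here), and it matches the pi_base10 string used in the code below:

# Truncate (not round) pi to ten decimal places, then drop the decimal point
# to get the 11-character digit string searched for below.
pi_str <- sprintf("%.15f", pi)      # "3.141592653589793"
pi_trunc <- substr(pi_str, 1, 12)   # "3.1415926535" -- ten decimal places, truncated
gsub("\\.", "", pi_trunc)           # "31415926535"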

      Another assumption is that I have a sufficiently high-quality image of the original Mona Lisa. The painting was completed in approximately 1517. It has aged and there has been some physical restoration. A computer image is the result of a photograph, and these photographs also contain some digital retouching. So the result may not be equivalent to the original. Finally, there is a variety of available resolutions including one that has a size of 90 MB. I do not have sufficient computer memory to handle that file size and the calculations of a file that large.

      I am also assuming it is sufficient to do this problem with just the Mona Lisa.

      In my first attempt, I was able to get the base 16 number of the hex code of every pixel of my Mona Lisa file. The file looks like this for the first six pixels. I show each pixel's base 16 hex code and its base 10 decimal equivalent.

      I converted each of the four irrational numbers from base 10 to base 16. Each irrational number is nine characters in base 16. I created one very long string of the pixel hex codes, and I searched whether each irrational number was contained in the hex string. The string has 1.7 million characters, and I was searching for a nine-character sequence. However, the result was that none of the irrationals was contained in the string.
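
      The conversion code from that first attempt is not included below, but here is one way to see why each number becomes nine hex characters: treat the 11-digit string as a whole number and convert it one hex digit at a time (a sketch; sprintf("%X", n) alone will not work because 31415926535 exceeds the 32-bit integer limit).

# Base 10 to base 16, one hex digit at a time.
dec_to_hex <- function(n) {
   hex <- character(0)
   while (n > 0) {
      hex <- c(sprintf("%X", n %% 16), hex)
      n <- n %/% 16
   }
   paste(hex, collapse = "")
}
dec_to_hex(31415926535)   # "75088FF07" -- nine hex characters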

      I decided the first attempt was faulty. The Mona Lisa was created in a base 10 world, and it didn’t make sense to force the irrationals into an artificial base 16 comparison.

      In my second attempt, I converted each base 16 pixel to base 10, to compare against base 10 irrationals. I think this is a more natural comparison, and if an irrational is going to be contained in a larger sequence then it would be as a base ten irrational.

      Unfortunately this also failed for each of the four irrationals. So sadly, I cannot confirm Mr. Holman’s hypothesis. Maybe I should let this problem sit for a few more years and come back when I have a fresh idea (or when I buy a computer with more memory). Maybe Mr. Holman has solved it in Heaven. Nevertheless, I’m sure he is smiling at my efforts.

      The following is my R code:

# Mona Lisa problem matching base 10 codes
library(imager)
# url1 is huge: 89.9 MB; url2 is 70 KB.
# use url2 unless you have a lot of memory.

url1 <- "https://upload.wikimedia.org/wikipedia/commons/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg"
url2 <- "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg/402px-Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg"
img <- load.image(url2)
plot(img)

img_data <- as.data.frame(img)
head(img_data) # x, y, cc(color channel: r, g, b, transparent), intensity (normalized 0,1)
dim(img_data) # 722,394 x 4
library(dplyr)
img_data <- arrange(img_data, x, y, cc)

# Pivot the data to wide format so each pixel has its R, G, B values in separate columns
library(tidyr)
img_wide <- pivot_wider(img_data, names_from = cc, values_from = value, values_fill = list(value = 0)) # if there are missing values, fill them with 0
dim(img_wide) # 240,798 x 5

# Convert normalized values to 0-255 range
img_wide$R <- img_wide$`1` * 255
img_wide$G <- img_wide$`2` * 255
img_wide$B <- img_wide$`3` * 255

# Convert RGB values to hexadecimal format
img_wide$hex <- rgb(img_wide$R, img_wide$G, img_wide$B, maxColorValue = 255)
img_wide$hex <- gsub('#', '', img_wide$hex)

# Convert hex values to decimal format
library(Rmpfr) # extended floating point computations
options(scipen = 999)
img_wide$dec <- as.numeric(mpfr(img_wide$hex, base=16))

# Check the structure of img_wide
head(img_wide)
# Concatenate dec values into a single string
dec_string <- paste(img_wide$dec, collapse = "")
nchar(dec_string) # 1701600

test_value1 <- "04111665813" # starts in position 2
pi_base10 <- "31415926535" # https://www.angio.net/pi/digits/50.txt 10 decimal places, truncated not rounded
e_base10 <- "27182818284" # https://www.i4cy.com/euler/
phi_base10 <- "16180339887" # https://nerdparadise.com/math/reference/phi10000
root_two_base10 <- "14142135623" # https://nerdparadise.com/math/reference/2sqrt10000

# Check if the dec string contains the digits of pi, e, phi, root 2

check_irrational <- function(value, name) {
   if (grepl(value, dec_string)) {
      cat("The first ten digits of", name, "are present in the pixel data.\n")
      start_position <- regexpr(value, dec_string)
      cat("The match starts at position:", start_position, "\n")
   } else {
      cat("The first ten digits of", name, "are not present in the pixel data.\n")
   }
}

check_irrational(test_value1, "test1")
check_irrational(pi_base10, "π")
check_irrational(e_base10, "e")
check_irrational(phi_base10, "φ")
check_irrational(root_two_base10, "√2")

options(scipen = 0)


# END

#######################################

Wednesday, July 24, 2024

The distribution has changed; and pretty tables in base R

The distribution has changed; and pretty tables in base R, by Jerry Tuttle

      So you spent hours, or maybe days, cranking out thousands of numbers, you submit it to your boss just at the deadline, your boss quickly peruses your exhibit of numbers, points to a single number and says, "This number doesn't look right." Bosses have an uncanny ability to do this.

      Your boss is pointing to something like this: Your company sells property insurance on both personal and commercial properties. The average personal property premium increased 10% in 2024. The average commercial property premium increased 10% in 2024. But you say the combined average property premium decreased 3% in 2024. You realize that negative 3% does not look right.

      You might have made an input error or a calculation error, but you don't want to admit to that. So you blurt out, "That's because the distribution has changed." And to your relief, the boss buys into that.

      "The distribution has changed" is probably a pretty good answer in more instances than you realize. A more common example is if your investment portfolio starts at 90% stocks and 10% bonds, but you have a good year with stocks, at year-end the distribution of your portfolio has changed to 94% stocks and 6% bonds, and you may want to rebalance your portfolio. How about the distribution of population by state? It has definitely changed since the prior Census. It might be hard to think of a non-trivial example of something where the distribution has not changed.

      The key calculations are the 2023 average premium is (2000 * 90 + 10000 * 10) / (90 + 10) = 2800, the 2024 average premium is (2200 * 110 + 11000 * 7) / (110 + 7) = 2726, and so the average premium change is 2726 / 2800 = .97 (rounded) = - 3%. And the distribution between personal and commercial DID change, as measured either by the distribution of number of policies (was 90% personal, now 94%) or by the distribution of premium (was 64% personal, now 76%). So because the distribution has changed towards more smaller personal policies, this pulls the combined average premium down, even though the average premium for each of the two separate subgroups has increased. (There are alternatives to the key calculations, such as weighting the percentage changes, instead of weighting the average premiums.)
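
      The same arithmetic takes only a few lines of base R (using the figures from the exhibit above):

prem_2023 <- (2000 * 90 + 10000 * 10) / (90 + 10)   # 2800 combined average premium
prem_2024 <- (2200 * 110 + 11000 * 7) / (110 + 7)   # about 2726
round(prem_2024 / prem_2023 - 1, 2)                 # -0.03, i.e. a 3% decrease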

      Years ago I posed a dilemma like this during job interviews for actuarial trainees to see how well they would respond to a sort of non-routine problem, but I decided it was too difficult.

      I did the above exhibit in Excel because it was quick. It was also pretty easy to add custom colors to alternating rows, which I did via FONT > FILL > MORE COLORS > CUSTOM > enter HEX CODE. Here I chose cornsilk #fff8dc and cyan #00ffff for fun.

      Then I wondered how easy it would be to make a pretty table in R. If you google something like "pretty tables in R", you will find a number of R packages that create HTML type code that can be saved as an HTML file, a PDF file, or another file format. Much has been written about these packages, but they seem a little complicated for basic work, and further, I like the idea of staying exclusively within the R environment. When I realized a table is just a collection of rectangles, it occurred to me that the base R commands of rectangle and text are pretty much all I need.

      Here is a table of sample rectangles with text, written in R. The rectangle syntax is rect(xleft, ybottom, xright, ytop, col, border) and the text syntax is text(x, y, labels, col, cex, font). The numerical axis is helpful when first defining the rectangles, but can be deleted by adding axes = FALSE to the plot function for the final table.

# rectangle syntax: rect(xleft, ybottom, xright, ytop)
plot(x = c(0, 500), y = c(0, 500), type= "n", xlab = "", ylab = "", main = "Sample rectangles with text", cex=2, font=2)

rect(0, 0, 250, 250, col = "#E41A1C", border = "blue")
text(125, 125, "red rectangle, white font", col="white", cex=1.15, font=2)

rect(250, 0, 500, 250, col = "yellow", border = "blue")
text(375, 125, "yellow rectangle, blue font", col="navyblue", cex=1.15, font=2)

rect(0, 250, 250, 500, col = "cornsilk", border = "blue")
text(125, 375, "cornsilk rectangle, black font", col="black", cex=1., font=2)

rect(250, 250, 500, 500, col = "cyan", border = "blue")
text(375, 375, "cyan rectangle, purple font", col="purple", cex=1.15, font=2)

      The R equivalent of the Excel exhibit is the following. Note that all code is in base R.

title <- c("Subgroups increase, but the average decreases")
df <- data.frame(Personal = c(2000, 90, 2200, 110, 10),
Commercial = c(10000, 10, 11000, 7, 10),
Weighted = c(2800, 100, 2726, 117, -3))

rownames(df) = c("2023 Avg Prem", "2023 No. Policies","2024 Avg Prem", "2024 No. Policies", "Avg Prem % Change")
df

# rectangle syntax: rect(xleft, ybottom, xright, ytop)
op <- par(bg = "thistle")
col1 = "cornsilk"
col2 = "cyan"

plot(x = c(0, 500), y = c(0, 700), type= "n", xlab = "", ylab = "", axes = FALSE)
height = 100

rect(0, 6*height, 500, 7*height, col = col1, border = "blue")
text(250, 6.5*height, title, col="black", cex=1.25, font=2)

rect(0, 5*height, 200, 6*height, col = col2, border = "blue")
rect(200, 5*height, 300, 6*height, col = col2, border = "blue")
rect(300, 5*height, 400, 6*height, col = col2, border = "blue")
rect(400, 5*height, 500, 6*height, col = col2, border = "blue")

text(100, 5.5*height, "", col="blue")
text(250, 5.5*height, colnames(df)[1], col="blue")
text(350, 5.5*height, colnames(df)[2], col="blue")
text(450, 5.5*height, colnames(df)[3], col="blue")

rect(0, 4*height, 200, 5*height, col = col1, border = "blue")
rect(200, 4*height, 300, 5*height, col = col1, border = "blue")
rect(300, 4*height, 400, 5*height, col = col1, border = "blue")
rect(400, 4*height, 500, 5*height, col = col1, border = "blue")

text(100, 4.5*height, rownames(df)[1], col="blue")
text(250, 4.5*height, df[1,1], col="blue")
text(350, 4.5*height, df[1,2], col="blue")
text(450, 4.5*height, df[1,3], col="blue")

rect(0, 3*height, 200, 4*height, col = col2, border = "blue")
rect(200, 3*height, 300, 4*height, col = col2, border = "blue")
rect(300, 3*height, 400, 4*height, col = col2, border = "blue")
rect(400, 3*height, 500, 4*height, col = col2, border = "blue")

text(100, 3.5*height, rownames(df)[2], col="blue")
text(250, 3.5*height, df[2,1], col="blue")
text(350, 3.5*height, df[2,2], col="blue")
text(450, 3.5*height, df[2,3], col="blue")

rect(0, 2*height, 200, 3*height, col = col1, border = "blue")
rect(200, 2*height, 300, 3*height, col = col1, border = "blue")
rect(300, 2*height, 400, 3*height, col = col1, border = "blue")
rect(400, 2*height, 500, 3*height, col = col1, border = "blue")

text(100, 2.5*height, rownames(df)[3], col="blue")
text(250, 2.5*height, df[3,1], col="blue")
text(350, 2.5*height, df[3,2], col="blue")
text(450, 2.5*height, df[3,3], col="blue")

rect(0, height, 200, 2*height, col = col2, border = "blue")
rect(200, height, 300, 2*height, col = col2, border = "blue")
rect(300, height, 400, 2*height, col = col2, border = "blue")
rect(400, height, 500, 2*height, col = col2, border = "blue")

text(100, 1.5*height, rownames(df)[4], col="blue")
text(250, 1.5*height, df[4,1], col="blue")
text(350, 1.5*height, df[4,2], col="blue")
text(450, 1.5*height, df[4,3], col="blue")

rect(0, 0, 200, height, col = col1, border = "blue")
rect(200, 0, 300, height, col = col1, border = "blue")
rect(300, 0, 400, height, col = col1, border = "blue")
rect(400, 0, 500, height, col = col1, border = "blue")

text(100, .5*height, rownames(df)[5], col="blue")
text(250, .5*height, paste(df[5,1], "%"), col="black", cex=1.5)
text(350, .5*height, paste(df[5,2], "%"), col="black", cex=1.5)
text(450, .5*height, paste(df[5,3], "%"), col="black", cex=2, font=2)

par(op)

# END

#######################################

Thursday, July 18, 2024

Radar charts and five-tool baseball players

Radar charts and five-tool baseball players by Jerry Tuttle

      I was looking for an opportunity to practice with radar charts and I came across an article on five-tool baseball players, so this seemed like a perfect application for this kind of chart.

      A radar chart is an alternative to a column chart to display three or more quantitative variables. The chart graphs the values in a circular manner around a center point.

      The five tools in baseball are: (1) hitting for average; (2) hitting for power; (3) defense; (4) throwing; and (5) speed. A five-tool player excels in all five of these.

      Among current players, Mike Trout is considered a five-tool player. The measurement of Trout’s five tools can be displayed in the following radar chart:

      Trout is rated at 80 for hitting for average, 70 for hitting for power, and his lowest scores are 60 for defense, throwing and speed. This is based on a 20-to-80 rating system, where 80 is elite, 70 is plus-plus, and 60 is plus. Sorry - I could not get the points to line up with the concentric pentagons.

      For comparison, here is a display of Aaron Judge's ratings.

      Judge is rated at 80 for hitting for power, 70 for hitting for average, 60 for defense, 70 for throwing, and 50 for speed, where the 50 is average at the major league level.

      The results of several players can be displayed in a single radar chart, but this becomes hard to read. Three players are probably the maximum for readability.

      The alternative to visualizing several players is either to create several individual radar charts or else to create a bar (horizontal) chart or a column (vertical) chart.

      Each of the five tools is generally rated on a 20-to-80 scale, where 50 is average (for a major leaguer), 80 is elite, and every 10 points is supposed to represent one standard deviation. I suspect the standard deviation concept is more judgmental than mathematical. There is not a single rating system; some use traditional baseball statistics, and others use modern motion tracking data.
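
      If the one-standard-deviation-per-10-points idea were taken literally, a tool rating would just be a rescaled z-score capped at the ends of the scale. This is only an illustration of the scale's construction, not an actual rating system:

# Hypothetical illustration: map a z-score (standard deviations above or
# below the major league average) onto the 20-to-80 scouting scale.
tool_rating <- function(z) round(pmin(80, pmax(20, 50 + 10 * z)))
tool_rating(c(-2, 0, 1, 3))   # 30 50 60 80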

      The numerical data above was obtained from an article by Jake Mintz in 2022 for Fox Sports https://www.foxsports.com/stories/mlb/trout-betts-rodriguez-the-definition-of-mlbs-five-tool-players . In Mintz's data, all numbers are shown rounded to the nearest 10. Mintz only has five current players as five-tool players: Mike Trout, Mookie Betts, Trea Turner, Byron Buxton, and Julio Rodriguez. I tried graphing all five players in a single radar chart, but this was too hard to read. Mintz thinks a true five-tool player should have a grade of at least 60 in each of the five categories. By this measure, Aaron Judge is not quite a five-tool player due to a 50 in speed, and a number of elite major leaguers have at least one 50. Note that each category is considered separately. If there were some sort of weighting system, many people would weigh hitting with power as most important, followed by hitting for average, although perhaps the weights should vary by position with higher weights for defense and throwing for catcher, middle infielders, and center fielder. Pitchers have a different grading system.

      What about Shohei Ohtani? At the time of his article, Mintz did not have sufficient data on Ohtani.

      Mintz observes that Mike Trout worked one winter to improve his throwing, and Julio Rodriguez worked to increase his speed. This suggests that the ratings probably change over the life of a player and are dependent on when they are measured.

      Other authors suggest that there is a sixth tool of exceptional players such as mental makeup and character. Another tool might be situational game awareness.

      Modern motion tracking data by Statcast and others did not exist until fairly recently. Willie Mays is generally considered the greatest five-tool player. Using statistical measures, author Herm Krabbenhoft suggests Tris Speaker, Ty Cobb, and Honus Wagner should be considered as five tool players, although Krabbenhoft measures hitting for power with SLG (slugging percentage) and ISO (isolated power), not home runs https://sabr.org/journal/article/honus-wagner-baseballs-prototypical-five-tooler/ . A very different measure of hitting with power would be something like home run distance greater than 425 feet or launch angle and velocity.

      What about Babe Ruth? We know Babe Ruth's career numbers are .342 batting average and 714 home runs. I have not read anything about his defense, throwing, or speed. He did steal 123 bases, including home 10 times; maybe he was faster than we realize. He is remembered for getting thrown out stealing second to end the 1926 World Series, but perhaps the hit-and-run play was on, and Bob Meusel, the batter, swung and missed the pitch? See https://baseballegg.com/2019/10/30/babe-ruths-failed-stolen-base-attempt-ended-the-1926-world-series-or-is-that-what-really-happened/ . Ruth had 204 assists as an outfielder, which sounds like a lot. I wonder how he would have ranked in defense, throwing, and speed?

      Here is my R code. I do like radar charts for comparing one to three observations over five variables, as a change of pace from column charts. I used the fmsb library for the radar charts. There is also a ggradar library, but I did not like its visualization. One of the quirks of fmsb is that the axis for each variable can have its own scale. Originally I used each variable's max and min values, but the axes were out of sync, so I replaced this with the grand max and min. Also, I could not get the values, which are all multiples of ten, to line up on the concentric pentagons.

library(fmsb)
library(scales)

group = c("Hit_avg", "Hit_power", "Defense", "Throwing", "Speed")
player_names = c("Trout","Betts","Judge")
players <- data.frame(
   row.names = player_names,
   Hit_avg = c(80, 70, 70),
   Hit_power = c(70,60,80),
   Defense = c(60,70,60),
   Throwing = c(60,80,70),
   Speed = c(60,70,50))
players

# The row 1 should contain the maximum values for each variable
# The row 2 should contain the minimum values for each variable
# Data for cases or individuals should be given starting from row 3
# Define the variable ranges: maximum and minimum; however, want axes to have equal scales

max_min <- data.frame(
   Hit_avg = c(max(players), min(players)),
   Hit_power = c(max(players), min(players)),
   Defense = c(max(players), min(players)),
   Throwing = c(max(players), min(players)),
   Speed = c(max(players), min(players)))

rownames(max_min) <- c("Max", "Min") # row 1 has max's, row 2 has min's.
df <- rbind(max_min, players)
df

player1_data <- df[c("Max", "Min", player_names[1]), ]
player2_data <- df[c("Max", "Min", player_names[2]), ]
player3_data <- df[c("Max", "Min", player_names[3]), ]

chart <- function(data, color, title){
     radarchart(data, axistype = 0,
       pcol = color, pfcol = scales::alpha(color, 0.5), plwd = 2, plty = 1,
        vlabels = colnames(data), vlcex = 1.5,
       cglcol = "black", cglty = 1, cglwd = 0.8,
        caxislabels = NULL,
       title = title)
}

# Plot the data for players 1, 2, and 3 separately
chart(data=player1_data, color="#00AFBB", title="MIKE TROUT 5 Tools")
chart(data=player2_data, color="#E7B800", title="MOOKIE BETTS 5 Tools")
chart(data=player3_data, color="#FC4E07", title="AARON JUDGE 5 Tools")

# Plot the data for three players
chart(data=df, color=c("#00AFBB", "#E7B800", "#FC4E07"), # teal, gold, orange-red
     title="TROUT, BETTS, JUDGE 5 Tools")
legend(
     x = "bottom", legend = rownames(df[-c(1,2),]), horiz = FALSE,
     bty = "n", pch = 20 , col = c("#00AFBB", "#E7B800", "#FC4E07"),
     text.col = "black", cex = 1.25, pt.cex = 1.5)

###########################################

# column graphs

library(tibble)
library(tidyr)
library(ggplot2)
# Reshape data to long format
players_long <- players %>%
     rownames_to_column("player") %>%
     pivot_longer(cols = -player, names_to = "group", values_to = "value")

# Common theme for graphs
common_theme <- theme(
   legend.position="right",
   plot.title = element_text(size=15, face="bold"),
   axis.title = element_text(size=15, face="bold"),
   axis.text = element_text(size=15, face="bold"),
   legend.title = element_text(size=15, face="bold"),
   legend.text = element_text(size=15, face="bold"))

# Create column graph: Tool Ratings by Player
ggplot(players_long, aes(x = player, y = value, fill = group)) +
   geom_col(position = "dodge") +
   labs(x = "Player", y = "Rating", fill = "Group", title = "Tool Ratings by Player") +
   common_theme

# Create the column graph: Player Ratings for each Tool
ggplot(players_long, aes(x = group, y = value, fill = player)) +
   geom_col(position = "dodge") +
   labs(x = "Group", y = "Rating", fill = "Player", title = "Player Ratings for each Tool") +
   common_theme

### END

##################################################################################